Language Detection
The aim of this project is to analyse a text in order to detect its language. For that, we propose to find the occurrence frequencies of the characters in the text. We will interpret these frequencies as a probability distribution that will be used for several purposes:
- detect the language of the text by comparing its distribution with several probability distributions of known languages (English, French, Spanish, ...)
- compress the text by using a compression algorithm (Huffman coding) based on its probability distribution.
From a programming point of view, the first objectives of this project are to learn the basics of Python (variables, conditions, loops, ...). The concept of a function will also be introduced. At the same time, we will practice decomposing a problem into smaller sub-problems.
Table of contents
- Requirements
- Assignment
- Q1. Occurrence frequency analysis
- Q2. Enhanced occurrence frequency analysis
- Q3. Influence of the text length on the quality of detection
- Q4. Compression with Huffman
- Upload your work
- Fill in the quiz on the LMS
Requirements
You should have completed the following notebooks before starting this project:
Assignment
You and anyone looking at your code should be able to run the following command without getting any errors:
make test
💡 Note
For any question x, you can run
make qx
to see the output of this question, e.g. make q1 or make q4.2.1
Q1. Occurrence frequency analysis
❗ Important
There are no tests to run in this section, you can relax... for now 😉
Your first objective is to find the occurrence frequencies of the letters of the alphabet in a text.
Open the occurrence_frequencies.py script in your editor of choice and fill it in. This file already contains the steps to follow (a minimal sketch is given after the list):
- Create a dictionary containing the letters, with all occurrence counts equal to 0
- For each letter in the text, increment the corresponding entry of the dictionary
- Normalize the values of the dictionary in order to obtain frequencies (so that the sum is equal to 1)
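If it helps to see the three steps together, here is a minimal sketch (the variable names are illustrative and may differ from the ones expected in occurrence_frequencies.py):

```python
# Minimal sketch of the three steps above; variable names are illustrative.
text = "some lower case text to analyse"
alphabet = "abcdefghijklmnopqrstuvwxyz"

# 1. dictionary with all occurrence counts set to 0
occurrences = {letter: 0 for letter in alphabet}

# 2. for each letter of the text, increment the corresponding entry
for char in text:
    if char in occurrences:
        occurrences[char] += 1

# 3. normalize the values so that they sum to 1
total = sum(occurrences.values())
frequencies = {letter: count / total for letter, count in occurrences.items()}
```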
Q2. Enhanced occurrence frequency analysis
❗ Important
There are tests that should pass.
Now we will use the text of Wuthering Heights by Emily Brontë and analyze it.
You should now be able to answer the following questions. Your work must be written in a new Python program, called enhanced_occurrence_frequencies.py, in the src directory (a minimal sketch of the requested functions is given after the list):
- Write a function read_text to read a text file and return a string containing only lower case characters.
  - arguments:
    - filename: str
  - return: str
- Apply this function to the file data/english_wuthering_heights.txt and store its content in the variable my_text.
- Write a function compute_occurence_frequencies to compute the occurrence frequencies of the letters of a string alphabet in the string text. This function returns the corresponding dictionary.
  - arguments:
    - text: str
    - alphabet: str
  - return: Dict[str, float]
- Apply this function to the string characters = "abcdefghijklmnopqrstuvwxyz" and the string my_text. Store the output dictionary in the variable english_of (english occurrence frequency).
- Compute the occurrence frequencies of the French, Spanish and German languages by respectively using the files french_madame_bovary.txt, spanish_don_quijote.txt and german_kritik_der_reinen_vernunft.txt.
- The Kullback–Leibler (KL) divergence defines a kind of distance between two probability distributions. If we consider two distributions P = (p_1, \ldots, p_n) and Q = (q_1, \ldots, q_n):
  D_{KL}(P, Q) = \sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i}
💡 Note
All the q_i must be strictly positive. If some of them are equal to zero, do not consider the corresponding symbols.
Since the KL divergence is not symmetric, you can symmetrize it by computing the absolute value of the average of D_{KL}(P, Q) and D_{KL}(Q, P).
💡 Note
You need the log and fabs functions. Since they are not loaded by default, you must add the line
from math import log, fabs
at the beginning of your file.
- Write a function kl_divergence that takes two probability distributions in the form of previously computed dictionaries and returns the symmetric Kullback-Leibler divergence.
  - arguments:
    - first probability distribution: Dict[str, float]
    - second probability distribution: Dict[str, float]
  - return: float
- Compute the KL divergence between the different languages.
- Write a function detect_language which is able to determine the language used in a text file. The idea is simply to compute the occurrence frequencies of the characters in the text and then the KL divergences with the occurrence frequencies of the various reference languages.
  - arguments:
    - text: str
    - reference languages: Dict[str, Dict[str, float]]
    - alphabet: str
  - return: str
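As a reference point (not the reference implementation), here is a minimal sketch of how these functions could fit together. The parameter names p and q and the final __main__ block are illustrative, and zero-probability symbols are skipped as described in the note above:

```python
from math import log, fabs
from typing import Dict


def read_text(filename: str) -> str:
    # read the whole file and convert it to lower case
    with open(filename, "r", encoding="utf-8") as f:
        return f.read().lower()


def compute_occurence_frequencies(text: str, alphabet: str) -> Dict[str, float]:
    # count the occurrences of each symbol of the alphabet, then normalize
    occurrences = {letter: 0 for letter in alphabet}
    for char in text:
        if char in occurrences:
            occurrences[char] += 1
    total = sum(occurrences.values())
    return {letter: count / total for letter, count in occurrences.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    # symmetrized KL divergence: absolute value of the average of
    # D_KL(P, Q) and D_KL(Q, P), skipping symbols with zero probability
    d_pq = sum(p[s] * log(p[s] / q[s], 2) for s in p if p[s] > 0 and q.get(s, 0) > 0)
    d_qp = sum(q[s] * log(q[s] / p[s], 2) for s in q if q[s] > 0 and p.get(s, 0) > 0)
    return fabs((d_pq + d_qp) / 2)


def detect_language(text: str,
                    reference_languages: Dict[str, Dict[str, float]],
                    alphabet: str) -> str:
    # the detected language is the one whose reference distribution is
    # closest to the distribution of the text, in the KL sense
    frequencies = compute_occurence_frequencies(text, alphabet)
    return min(reference_languages,
               key=lambda lang: kl_divergence(frequencies, reference_languages[lang]))


if __name__ == "__main__":
    characters = "abcdefghijklmnopqrstuvwxyz"
    my_text = read_text("data/english_wuthering_heights.txt")
    english_of = compute_occurence_frequencies(my_text, characters)
```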
Q3. Influence of the text length on the quality of detection
❗ Important
Your code should be written in src/text_length.py.
The accuracy of the language detector depends on the length of the input text. The objective of this exercise is to evaluate this accuracy according to the length of the text.
Q3.1.
For each text length, evaluate the probability of good detection and plot the result with matplotlib.
💡 Note
The details of this question have been intentionally left as an exercise to the students. It's time for you to shine and experiment with what you think would be a good approach. If you get really stuck or have no idea where to begin, feel free to ask your professor for some insights 😉
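If you only want a starting point, one possible (and certainly not the only) skeleton is sketched below. Everything in it except the functions imported from enhanced_occurrence_frequencies.py is made up for the sketch: the data/ paths of the non-English files, the chosen lengths and the number of trials are assumptions to adapt:

```python
import random
import matplotlib.pyplot as plt

from enhanced_occurrence_frequencies import (
    read_text,
    compute_occurence_frequencies,
    detect_language,
)

# assumed locations of the language files (adapt to your repository)
files = {
    "english": "data/english_wuthering_heights.txt",
    "french": "data/french_madame_bovary.txt",
    "spanish": "data/spanish_don_quijote.txt",
    "german": "data/german_kritik_der_reinen_vernunft.txt",
}
characters = "abcdefghijklmnopqrstuvwxyz"

texts = {lang: read_text(path) for lang, path in files.items()}
reference_languages = {lang: compute_occurence_frequencies(t, characters)
                       for lang, t in texts.items()}

lengths = [10, 50, 100, 500, 1000]
n_trials = 100
accuracies = []

for length in lengths:
    correct = 0
    for _ in range(n_trials):
        # pick a random excerpt of the given length from a random language
        language = random.choice(list(texts))
        text = texts[language]
        start = random.randrange(len(text) - length)
        excerpt = text[start:start + length]
        if detect_language(excerpt, reference_languages, characters) == language:
            correct += 1
    accuracies.append(correct / n_trials)

plt.plot(lengths, accuracies, marker="o")
plt.xlabel("text length (characters)")
plt.ylabel("probability of good detection")
plt.show()
```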
Q3.2.
To improve this result, you can try to capture the redundancy of a language in sequences of 2 (or more) letters. For that, consider each pair of consecutive letters as a symbol (of course, the size of your alphabet of symbols will increase) and do the same analysis to check whether you can improve the detection results.
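As an illustration of the idea (the function name below is made up for this sketch; it is not imposed by the assignment), the pair frequencies could be computed like this, and the KL-based detection can then be reused unchanged on the resulting dictionaries:

```python
from typing import Dict


def compute_pair_frequencies(text: str, alphabet: str) -> Dict[str, float]:
    # count every pair of consecutive letters (both in the alphabet)
    counts: Dict[str, int] = {}
    for first, second in zip(text, text[1:]):
        if first in alphabet and second in alphabet:
            pair = first + second
            counts[pair] = counts.get(pair, 0) + 1
    # normalize the counts into frequencies
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in counts.items()}
```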
Q4. Compression with Huffman
❗ Important
Your code should be written in src/data_compression.py.
The analysis of the occurrence frequencies of the characters in the languages shows that there is a strong variability between the characters. In other words, there is some "redundancy" in the languages, which means that it is possible to compress the texts by exploiting these occurrence frequencies.
Roughly speaking, the maximum level of compression that can be obtained from the occurrence frequencies is given by the entropy. The entropy gives a bound on the average number of bits used to code a symbol. In this part, we will first compute the entropy, and then verify that we can reach such a level of compression with the Huffman algorithm.
- The entropy is a mathematical concept that measures the "amount of information" of a probability distribution. If we consider a variable X which takes its values in the set {a_1, a_2, \ldots, a_n} with the probabilities p_i = p(a_i), for i = 1, \ldots, n, then the entropy H(X) is defined by:
  H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i
  If you want to learn more about the entropy concept, which has strong applications in compression, error correcting codes and cryptography, have a look at this video :)
  By assuming that the occurrence frequencies obtained in english_of are representative of the occurrence probabilities of the characters in the English language, compute the entropy of the English language. For that:
  - create a new file data_compression.py in src/
  - add the line from enhanced_occurrence_frequencies import read_text, compute_occurence_frequencies, detect_language, which allows you to use the functions defined in the previous file.
  - In order to be fair in the evaluation of the compression, you need to take all the possible characters into account, instead of just the letters. So, to generate the variable characters which contains the considered characters, copy-paste the code characters = [chr(i) for i in range(128)] and re-compute the associated dictionaries.
  - write a generic function that computes the entropy of any probability distribution represented by a dictionary (see the sketch at the end of this section).
- Compute the entropies of the French, Spanish and German languages.
- The Huffman algorithm is a famous compression algorithm which is one of the components of current audio and video compression standards. A short presentation of this algorithm is given in this video. The objective here is not to program it (lack of time...) but rather to use an implementation. This implementation is provided in the file huffman.py. You can find in this file the main steps of the algorithm:
  - the construction of the tree according to the probabilities with build_huffman_tree
  - the generation of the code with generate_code
  - the compression of the text with compress
  - the decompression with decompress
  Use these functions to compress and decompress the language files (do not forget to import them). A sketch of this pipeline is given at the end of this section.
  - Verify that the entropy gives a lower bound on the average number of bits used to code the characters. This number of bits can be obtained from the output of the compression algorithm. Note that this output is not a binary file, but a string of characters '0' and '1'... So the average number of bits per character can be obtained by:
    \frac{\textrm{number of characters in the compressed file}}{\textrm{number of characters in the initial text file}}
    Observe the "beauty of the math" by verifying the accuracy of the entropy bound!
  - Evaluate the compression ratio by considering that each character of the text files is encoded on 7 bits. Thus the compression ratio is given by:
    \frac{7 \times \textrm{number of characters in the initial text file}}{\textrm{number of characters in the compressed file}}
- How can the level of compression be improved? The occurrence frequencies of each character only capture the redundancy of individual characters, but not the redundancy between consecutive characters... So, you can do the same exercise by analyzing the occurrence frequencies of all 2-letter (or longer) sequences. This should allow you to capture more of the redundancy of the text and thus improve the compression ratio.
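To make the steps above concrete, here is a minimal sketch of what data_compression.py could contain. The calls to huffman.py are assumptions about its interface: check the actual signatures of build_huffman_tree, generate_code, compress and decompress in the provided file and adapt the sketch accordingly.

```python
from math import log
from typing import Dict

from enhanced_occurrence_frequencies import read_text, compute_occurence_frequencies, detect_language
# the signatures below are assumptions about huffman.py; adapt if they differ
from huffman import build_huffman_tree, generate_code, compress, decompress


def entropy(distribution: Dict[str, float]) -> float:
    # H = -sum_i p_i * log2(p_i), ignoring symbols with zero probability
    return -sum(p * log(p, 2) for p in distribution.values() if p > 0)


if __name__ == "__main__":
    # all 128 ASCII characters, not just the letters
    characters = [chr(i) for i in range(128)]

    my_text = read_text("data/english_wuthering_heights.txt")
    english_of = compute_occurence_frequencies(my_text, characters)
    print("entropy of English:", entropy(english_of))

    # Huffman pipeline (assumed interface, see the note above)
    tree = build_huffman_tree(english_of)
    code = generate_code(tree)
    compressed = compress(my_text, code)
    assert decompress(compressed, tree) == my_text

    # average number of bits per character, to compare with the entropy
    bits_per_char = len(compressed) / len(my_text)
    print("bits per character:", bits_per_char)

    # compression ratio, assuming 7 bits per character in the original text
    ratio = 7 * len(my_text) / len(compressed)
    print("compression ratio:", ratio)
```

If everything is consistent, the bits-per-character value should not fall below the entropy and should typically stay within one bit of it.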
Upload your work
Compress the following files and upload them to the LMS:
- src/occurrence_frequencies.py
- src/enhanced_occurrence_frequencies.py
- src/text_length.py
- src/data_compression.py
To compress several files into one archive, select all the files, then right-click and choose compress in the menu. Choose the .zip extension for the new archive.
Alternatively, you can use one of the following commands from a shell:
tar -czvf language_detection_LASTNAME_FIRSTNAME.zip src/occurrence_frequencies.py src/enhanced_occurrence_frequencies.py src/text_length.py src/data_compression.py
zip language_detection_LASTNAME_FIRSTNAME.zip src/occurrence_frequencies.py src/enhanced_occurrence_frequencies.py src/text_length.py src/data_compression.py