Language Detection
The aim of this project is to analyse a text in order to detect its language. To do so, we propose to compute the occurrence frequencies of the characters in the text. We will interpret these frequencies as a probability distribution that will be used for several purposes:
- detect the language of the text by comparing its distribution with the probability distributions of several known languages (English, French, Spanish, ...)
- compress the text by using a compression algorithm (Huffman coding) based on its probability distribution.

From a programming point of view, the first objectives of this project are to learn the basics of Python (variables, conditions, loops, ...). The concept of a function will also be introduced. At the same time, we will learn to decompose a problem into several smaller problems.
Table of contents
- Requirements
- Running the project
- Assignment
- Q1. Occurrence frequency analysis
- Q2. Enhanced occurrence frequency analysis
- Q3. Influence of the text length on the quality of detection
- Q4. Compression with Huffman (optional)
- Upload your work to the LMS
Requirements

You should have completed the following notebooks before starting this project:
Running the project on ISAE's computers

You can use the following steps:
- open a Terminal on the computer
- clone this repository using the following command: `git clone https://gitlab.isae-supaero.fr/mae-ac/language-detection.git`
- load the Anaconda module using the following command: `module load anaconda3/2023`
- activate the Python environment using the following command: `source activate mae1-env`
- run your editor or your IDE. For VS Code, run: `code`
Assignment

You and anyone looking at your code should be able to run the following command without getting any errors: `make test`

💡 Note: for any question x, you can run `make qx` to see the output of this question, e.g. `make q1` or `make q4.2.1`.
Q1. Occurrence frequency analysis

❗ Important: there are no tests to run in this section, you can relax... for now 😉
Your first objective is to find the occurrence frequencies of the letters of the alphabet in a text.
📝 Open the `occurrence_frequencies.py` script in your editor of choice and fill it in. This file already contains the steps to follow (a minimal sketch is given after the list):
- Add the following line at the start of your script, which will allow the use of the type annotations below: `from typing import Dict`
- Create a dictionary containing the letters, with all occurrences equal to 0
- For each letter in the text, increment the corresponding entry of the dictionary
- Normalize the values of the dictionary in order to obtain frequencies (so that their sum is equal to 1)
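Here is a minimal sketch of these steps on a toy string; the variable names are only illustrative and your script may organize things differently:

```python
from typing import Dict

text = "a toy example text"  # in the real script, this is the text to analyse
alphabet = "abcdefghijklmnopqrstuvwxyz"

# dictionary mapping every letter of the alphabet to an occurrence count of 0
occurrences: Dict[str, int] = {letter: 0 for letter in alphabet}

# for each letter of the text, increment the corresponding entry
for character in text:
    if character in occurrences:
        occurrences[character] += 1

# normalize the values so that they sum to 1
total = sum(occurrences.values())
frequencies: Dict[str, float] = {letter: count / total for letter, count in occurrences.items()}
```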
Q2. Enhanced occurrence frequency analysis

📁 Create a new file called `enhanced_occurrence_frequencies.py` in the `src/` directory.

⚙️ You can run the `make test` command and see 6 failing tests. These tests should hopefully pass at the end of this question!

Now we will use the text of Wuthering Heights by Emily Brontë and analyze it.

📝 You should now be able to answer the following questions (a minimal sketch of the corresponding functions is given after the list):
- Write a function `read_text` to read a text file and return a string containing only lower case characters.
  - arguments:
    - filename: `str`
  - return: `str`
- Apply this function to the file `data/english_wuthering_heights.txt` and store its content in the variable `my_text`.
- Write a function `compute_occurence_frequencies` to compute the occurrence frequencies of the letters of a string `alphabet` in the string `text`. This function returns the corresponding dictionary.
  - arguments:
    - text: `str`
    - alphabet: `str`
  - return: `Dict[str, float]`
- Apply this function to the string `characters = "abcdefghijklmnopqrstuvwxyz"` and the string `my_text`. Store the output dictionary in the variable `english_of` (English occurrence frequencies).
- Compute the occurrence frequencies of the French, Spanish and German languages by respectively using the files `french_madame_bovary.txt`, `spanish_don_quijote.txt` and `german_kritik_der_reinen_vernunft.txt`.
- The Kullback–Leibler (KL) divergence defines a kind of distance between two probability distributions. If we consider two distributions $P=(p_1, \ldots, p_n)$ and $Q=(q_1, \ldots, q_n)$:
  $$D_{KL}(P,Q)=\sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i}$$
  💡 Note: all the $q_i$ must be strictly positive. If some of them are equal to zero, do not consider the corresponding symbol.

  Since the KL divergence is not symmetric, you can symmetrize it by computing the absolute value of the average of $D_{KL}(P,Q)$ and $D_{KL}(Q,P)$.
  💡 Note: you need the `log` and the `fabs` functions. Since they are not loaded by default, you must add the line `from math import log, fabs` at the beginning of your file.
- Write a function `kl_divergence` that takes two probability distributions in the form of previously computed dictionaries and returns the symmetric Kullback-Leibler divergence.
  - arguments:
    - first probability distribution: `Dict[str, float]`
    - second probability distribution: `Dict[str, float]`
  - return: `float`
- Compute the KL divergence between the different languages.
- Write a function `detect_language` which is able to determine the language used in a text file. The idea is simply to compute the occurrence frequencies of the characters in the text file and to compute the KL divergences with the occurrence frequencies of the various languages.
  - arguments:
    - text: `str`
    - reference languages: `Dict[str, Dict[str, float]]`
    - alphabet: `str`
  - return: `str`
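Here is a minimal sketch of what these functions could look like. The signatures follow the list above, but the implementation details (file encoding, handling of zero frequencies, tie-breaking between languages, ...) are assumptions and may differ from what the tests expect:

```python
from math import fabs, log
from typing import Dict


def read_text(filename: str) -> str:
    """Read a text file and return its content in lower case."""
    with open(filename, "r", encoding="utf-8") as file:
        return file.read().lower()


def compute_occurence_frequencies(text: str, alphabet: str) -> Dict[str, float]:
    """Occurrence frequencies of the letters of `alphabet` in `text` (values sum to 1)."""
    occurrences = {letter: 0 for letter in alphabet}
    for character in text:
        if character in occurrences:
            occurrences[character] += 1
    total = sum(occurrences.values())
    return {letter: count / total for letter, count in occurrences.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Symmetric Kullback-Leibler divergence between two distributions."""

    def one_way(first: Dict[str, float], second: Dict[str, float]) -> float:
        divergence = 0.0
        for symbol, probability in first.items():
            # skip symbols whose probability is zero on either side
            if probability > 0 and second.get(symbol, 0) > 0:
                divergence += probability * log(probability / second[symbol], 2)
        return divergence

    return fabs((one_way(p, q) + one_way(q, p)) / 2)


def detect_language(text: str, references: Dict[str, Dict[str, float]], alphabet: str) -> str:
    """Return the name of the reference language whose distribution is closest to the text."""
    frequencies = compute_occurence_frequencies(text, alphabet)
    return min(references, key=lambda language: kl_divergence(frequencies, references[language]))
```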
Q3. Influence of the text length on the quality of detection

📁 Create a new file `src/text_length.py`.
The accuracy of the language detector depends on the length of the input text. The objective of this exercise is to evaluate this accuracy according to the length of the text.
Q3.1.

For each text length, evaluate the probability of correct detection and plot the result with `matplotlib` (a minimal plotting sketch is given after the note below).

💡 Note: the details of this question have been intentionally left as an exercise to the students. It's time for you to shine and experiment with what you think would be a good approach.
If you get really stuck or have no idea where to begin, feel free to ask your professor for some insights 😉
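Only the plotting part is sketched here, since the experimental protocol is up to you; the lists `text_lengths` and `detection_rates` are hypothetical placeholders that your own experiments should produce:

```python
import matplotlib.pyplot as plt

# placeholder results: detection_rates[i] is the fraction of text samples of length
# text_lengths[i] whose language was correctly detected (values to be computed by you)
text_lengths = [10, 50, 100, 500, 1000]
detection_rates = [0.4, 0.7, 0.85, 0.99, 1.0]

plt.plot(text_lengths, detection_rates, marker="o")
plt.xlabel("text length (characters)")
plt.ylabel("probability of correct detection")
plt.title("Detection accuracy as a function of text length")
plt.show()
```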
Q3.2.

To try to improve this result, you can capture the redundancy of a language in the sequences of 2 (or more) letters. For that, consider each pair of consecutive letters as a symbol (of course, the size of your alphabet of symbols will increase) and do the same analysis to check whether you can improve the detection results. A sketch of how pair frequencies could be computed is given below.
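Here is one possible way to compute pair (bigram) frequencies; the function name and the details are only a suggestion:

```python
from typing import Dict


def compute_pair_frequencies(text: str, alphabet: str) -> Dict[str, float]:
    """Occurrence frequencies of pairs of consecutive letters of `alphabet` in `text`."""
    occurrences: Dict[str, int] = {}
    for first, second in zip(text, text[1:]):
        if first in alphabet and second in alphabet:
            pair = first + second
            occurrences[pair] = occurrences.get(pair, 0) + 1
    total = sum(occurrences.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in occurrences.items()}
```

Note that only the pairs actually observed in the text appear in the dictionary, so the remark about zero probabilities in the KL divergence applies here as well.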
Q4. Compression with Huffman (optional)

📁 Create a new file `src/data_compression.py`.
The analysis of the occurrence frequencies of the characters in the languages shows that there is a strong variability between the characters. This shows that there is some "redundancy" in the languages. In other words, this means that it is possible to compress the languages by using these occurrence frequencies.
Roughly speaking, the maximum level of compression that can be obtained with the occurrence frequencies is given by the entropy. The entropy gives a bound on the average number of bits used to code a symbol. In this part, we will first compute the entropy, and then verify that we can reach such a level of compression with the Huffman algorithm.
- The entropy is a mathematical concept that measures the "amount of information" of a probability distribution. If we consider a variable $X$ which takes its values in the set $\{a_1, a_2, \ldots, a_n\}$ with the probabilities $p_i=p(a_i)$, for $i=1,\ldots,n$, then the entropy $H(X)$ is defined by:
  $$H(X)=-\sum_{i=1}^{n} p_i \log_2 p_i$$
  If you want to learn more about the entropy concept, which has strong applications in compression, error correcting codes and cryptography, have a look at this video :)
  By assuming that the occurrence frequencies obtained in `english_of` are representative of the occurrence probabilities of the characters in the English language, compute the entropy of the English language. For that:
  - add the line `from enhanced_occurrence_frequencies import read_text, compute_occurence_frequencies, detect_language`, which allows you to use the functions defined in the previous file.
  - In order to be fair in the evaluation of the compression, you need to include all the possible characters, instead of just the letters. So, to generate the string `characters` which contains the considered characters, copy-paste the code `characters = [chr(i) for i in range(128)]` and re-compute the associated dictionaries.
  - write a generic function that computes the entropy of any probability distribution represented by a dictionary.
- Compute the entropies of the French, Spanish and German languages.
- The Huffman algorithm is a famous compression algorithm which is one of the components of current audio and video compression standards. A short presentation of this algorithm is given in this video. The objective here is not to program it (lack of time...) but rather to use an implementation. This implementation is provided in the file `huffman.py`. You can find in this file the main steps of the algorithm:
  - the construction of the tree according to the probabilities with `build_huffman_tree`
  - the generation of the code with `generate_code`
  - the compression of the text with `compress`
  - the decompression with `decompress`

  Use these functions to compress and decompress the language files (do not forget to `import` them).
- Verify that the entropy gives a lower bound on the average number of bits used to code the characters. This number of bits can be obtained from the output of the compression algorithm. Note that this output is not a binary file, but a string of characters '0' and '1'... So the average number of bits per character can be obtained by:
  $$\frac{\textrm{number of characters in the compressed file}}{\textrm{number of characters in the initial text file}}$$
  Observe the "beauty of the math" by verifying the accuracy of the entropy bound!
- Evaluate the compression ratio by considering that each character of the text files is encoded on 7 bits. The compression ratio is thus given by:
  $$\frac{7\times \textrm{number of characters in the initial text file}}{\textrm{number of characters in the compressed file}}$$
- How to improve the level of compression? The occurrence frequencies of each character only capture the redundancy of the individual characters, but not the redundancy between consecutive characters... So, you can do the same exercise by analyzing the occurrence frequencies of all 2-letter (or more) words. This should allow you to capture more redundancy of the text and thus improve the compression ratio. A small entropy sketch is given after this list.
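As a starting point, here is a minimal sketch of the entropy computation and of the two measurements above. The call to `compress` is left as a comment because its exact signature is defined in `huffman.py`, which you should check:

```python
from math import log
from typing import Dict


def entropy(distribution: Dict[str, float]) -> float:
    """Entropy (in bits per symbol) of a probability distribution given as a dictionary."""
    return -sum(p * log(p, 2) for p in distribution.values() if p > 0)


# Hypothetical usage, assuming `text` is the original text read from a language file and
# `compressed` is the string of '0' and '1' characters returned by the `compress`
# function of huffman.py (check that file for the actual signature):
#
#     compressed = compress(...)  # see huffman.py
#     bits_per_character = len(compressed) / len(text)
#     compression_ratio = 7 * len(text) / len(compressed)
#     print(entropy(english_of), bits_per_character, compression_ratio)
```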
Upload your work to the LMS

- open a terminal
- go into the folder containing your project
- use the `zip` command to compress your project: `zip -r project.zip . -x "venv/**" ".git/**"`
- upload the ZIP archive to the LMS