Language Detection
The aim of this project is to analyse a text in order to detect its language. To do so, we propose to compute the occurrence frequencies of the characters in the text. We will interpret these frequencies as a probability distribution that will be used for several purposes:
- detect the language of the text by comparing its distribution with the probability distributions of several known languages (English, French, Spanish, ...)
- compress the text by using a compression algorithm (Huffman coding) based on its probability distribution.

From a programming point of view, the first objectives of this project are to learn the basics of Python (variables, conditions, loops, ...). The concept of a function will also be introduced. At the same time, we will learn to decompose a problem into several smaller problems.
Table of contents
- Requirements
- Running the project
- Assignment
- Q1. Occurrence frequency analysis
- Q2. Enhanced occurrence frequency analysis
- Q3. Influence of the text length on the quality of detection
- Q4. Compression with Huffman (optional)
- Upload your work to the LMS
Requirements

You should have completed the following notebooks before starting this project:
Running the project on ISAE's computers

You can use the following steps:
- open a Terminal on the computer
- clone this repository using the following command: `git clone https://gitlab.isae-supaero.fr/mae-ac/language-detection.git`
- load the Anaconda module using the following command: `module load anaconda3/2023`
- activate the Python environment using the following command: `source activate mae1-env`
- run your editor or your IDE. For VS Code, run: `code`
Assignment

You and anyone looking at your code should be able to run the following command without getting any errors: `make test`

💡 Note: for any question x, you can run `make qx` to see the output of this question, e.g. `make q1` or `make q4.2.1`.
Q1. Occurrence frequency analysis

❗ Important: there are no tests to run in this section, you can relax... for now 😉
Your first objective is to find the occurrence frequencies of the letters of the alphabet in a text.
📝 Open the `occurrence_frequencies.py` script in your editor of choice and fill it in. This file already contains the steps to follow (a minimal sketch is given after the list):
- Add the following line at the start of your script, which will allow the use of the type annotations below: `from typing import Dict`
- Create a dictionary containing the letters, with all occurrences equal to 0
- For each letter in the text, increment the corresponding entry of the dictionary
- Normalize the values of the dictionary in order to obtain frequencies (so that their sum is equal to 1)
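Here is a minimal sketch of these steps on a toy string; the variable names are only illustrative and your script may organize things differently:

```python
from typing import Dict

text = "a toy example text"  # in the real script, this is the text to analyse
alphabet = "abcdefghijklmnopqrstuvwxyz"

# dictionary mapping every letter of the alphabet to an occurrence count of 0
occurrences: Dict[str, int] = {letter: 0 for letter in alphabet}

# for each letter of the text, increment the corresponding entry
for character in text:
    if character in occurrences:
        occurrences[character] += 1

# normalize the values so that they sum to 1
total = sum(occurrences.values())
frequencies: Dict[str, float] = {letter: count / total for letter, count in occurrences.items()}
```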
Q2. Enhanced occurrence frequency analysis

📁 Create a new file called `enhanced_occurrence_frequencies.py` in the `src/` directory.

⚙️ You can run the `make test` command and see 6 failing tests. These tests should hopefully pass at the end of this question!

Now we will use the text of Wuthering Heights by Emily Brontë and analyze it.

📝 You should now be able to answer the following questions (a minimal sketch of the corresponding functions is given after the list):
- Write a function `read_text` to read a text file and return a string containing only lower case characters.
  - arguments:
    - filename: `str`
  - return: `str`
- Apply this function to the file `data/english_wuthering_heights.txt` and store its content in the variable `my_text`.
- Write a function `compute_occurence_frequencies` to compute the occurrence frequencies of the letters of a string `alphabet` in the string `text`. This function returns the corresponding dictionary.
  - arguments:
    - text: `str`
    - alphabet: `str`
  - return: `Dict[str, float]`
- Apply this function to the string `characters = "abcdefghijklmnopqrstuvwxyz"` and the string `my_text`. Store the output dictionary in the variable `english_of` (English occurrence frequencies).
- Compute the occurrence frequencies of the French, Spanish and German languages by respectively using the files `french_madame_bovary.txt`, `spanish_don_quijote.txt` and `german_kritik_der_reinen_vernunft.txt`.
- The Kullback–Leibler (KL) divergence defines a kind of distance between two probability distributions. If we consider two distributions $P=(p_1, \ldots, p_n)$ and $Q=(q_1, \ldots, q_n)$:
  $$D_{KL}(P,Q)=\sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i}$$
  💡 Note: all the $q_i$ must be strictly positive. If some of them are equal to zero, do not consider the corresponding symbol.

  Since the KL divergence is not symmetric, you can symmetrize it by computing the absolute value of the average of $D_{KL}(P,Q)$ and $D_{KL}(Q,P)$.
  💡 Note: you need the `log` and the `fabs` functions. Since they are not loaded by default, you must add the line `from math import log, fabs` at the beginning of your file.
- Write a function `kl_divergence` that takes two probability distributions in the form of previously computed dictionaries and returns the symmetric Kullback-Leibler divergence.
  - arguments:
    - first probability distribution: `Dict[str, float]`
    - second probability distribution: `Dict[str, float]`
  - return: `float`
- Compute the KL divergence between the different languages.
- Write a function `detect_language` which is able to determine the language used in a text file. The idea is simply to compute the occurrence frequencies of the characters in the text file and to compute the KL divergences with the occurrence frequencies of the various languages.
  - arguments:
    - text: `str`
    - reference languages: `Dict[str, Dict[str, float]]`
    - alphabet: `str`
  - return: `str`
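Here is a minimal sketch of what these functions could look like. The signatures follow the list above, but the implementation details (file encoding, handling of zero frequencies, tie-breaking between languages, ...) are assumptions and may differ from what the tests expect:

```python
from math import fabs, log
from typing import Dict


def read_text(filename: str) -> str:
    """Read a text file and return its content in lower case."""
    with open(filename, "r", encoding="utf-8") as file:
        return file.read().lower()


def compute_occurence_frequencies(text: str, alphabet: str) -> Dict[str, float]:
    """Occurrence frequencies of the letters of `alphabet` in `text` (values sum to 1)."""
    occurrences = {letter: 0 for letter in alphabet}
    for character in text:
        if character in occurrences:
            occurrences[character] += 1
    total = sum(occurrences.values())
    return {letter: count / total for letter, count in occurrences.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Symmetric Kullback-Leibler divergence between two distributions."""

    def one_way(first: Dict[str, float], second: Dict[str, float]) -> float:
        divergence = 0.0
        for symbol, probability in first.items():
            # skip symbols whose probability is zero on either side
            if probability > 0 and second.get(symbol, 0) > 0:
                divergence += probability * log(probability / second[symbol], 2)
        return divergence

    return fabs((one_way(p, q) + one_way(q, p)) / 2)


def detect_language(text: str, references: Dict[str, Dict[str, float]], alphabet: str) -> str:
    """Return the name of the reference language whose distribution is closest to the text."""
    frequencies = compute_occurence_frequencies(text, alphabet)
    return min(references, key=lambda language: kl_divergence(frequencies, references[language]))
```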
Q3. Influence of the text length on the quality of detection

📁 Create a new file `src/text_length.py`.
The accuracy of the language detector depends on the length of the input text. The objective of this exercise is to evaluate this accuracy according to the length of the text.
Q3.1.

For each text length, evaluate the probability of correct detection and plot the result with `matplotlib` (a minimal plotting sketch is given after the note below).

💡 Note: the details of this question have been intentionally left as an exercise to the students. It's time for you to shine and experiment with what you think would be a good approach.
If you get really stuck or have no idea where to begin, feel free to ask your professor for some insights 😉
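Only the plotting part is sketched here, since the experimental protocol is up to you; the lists `text_lengths` and `detection_rates` are hypothetical placeholders that your own experiments should produce:

```python
import matplotlib.pyplot as plt

# placeholder results: detection_rates[i] is the fraction of text samples of length
# text_lengths[i] whose language was correctly detected (values to be computed by you)
text_lengths = [10, 50, 100, 500, 1000]
detection_rates = [0.4, 0.7, 0.85, 0.99, 1.0]

plt.plot(text_lengths, detection_rates, marker="o")
plt.xlabel("text length (characters)")
plt.ylabel("probability of correct detection")
plt.title("Detection accuracy as a function of text length")
plt.show()
```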
Q3.2.

To try to improve this result, you can capture the redundancy of a language in the sequences of 2 (or more) letters. For that, consider each pair of consecutive letters as a symbol (of course, the size of your alphabet of symbols will increase) and do the same analysis to check whether you can improve the detection results. A sketch of how pair frequencies could be computed is given below.
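Here is one possible way to compute pair (bigram) frequencies; the function name and the details are only a suggestion:

```python
from typing import Dict


def compute_pair_frequencies(text: str, alphabet: str) -> Dict[str, float]:
    """Occurrence frequencies of pairs of consecutive letters of `alphabet` in `text`."""
    occurrences: Dict[str, int] = {}
    for first, second in zip(text, text[1:]):
        if first in alphabet and second in alphabet:
            pair = first + second
            occurrences[pair] = occurrences.get(pair, 0) + 1
    total = sum(occurrences.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in occurrences.items()}
```

Note that only the pairs actually observed in the text appear in the dictionary, so the remark about zero probabilities in the KL divergence applies here as well.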
Q4. Compression with Huffman (optional)

📁 Create a new file `src/data_compression.py`.
The analysis of the occurrence frequencies of the characters in the languages shows that there is a strong variability between the characters. This shows that there is some "redundancy" in the languages. In other words, this means that it is possible to compress the languages by using these occurrence frequencies.
Roughly speaking, the maximum level of compression that can be obtained with the occurrence frequencies is given by the entropy. The entropy gives a bound on the average number of bits used to code a symbol. In this part, we will first compute the entropy, and then verify that we can reach such a level of compression with the Huffman algorithm.
- The entropy is a mathematical concept that measures the "amount of information" of a probability distribution. If we consider a variable $X$ which takes its values in the set $\{a_1, a_2, \ldots, a_n\}$ with the probabilities $p_i=p(a_i)$, for $i=1,\ldots,n$, then the entropy $H(X)$ is defined by:
  $$H(X)=-\sum_{i=1}^{n} p_i \log_2 p_i$$
  If you want to learn more about the entropy concept, which has strong applications in compression, error correcting codes and cryptography, have a look at this video :)
  By assuming that the occurrence frequencies obtained in `english_of` are representative of the occurrence probabilities of the characters in the English language, compute the entropy of the English language. For that:
  - add the line `from enhanced_occurrence_frequencies import read_text, compute_occurence_frequencies, detect_language`, which allows you to use the functions defined in the previous file.
  - In order to be fair in the evaluation of the compression, you need to include all the possible characters, instead of just the letters. So, to generate the string `characters` which contains the considered characters, copy-paste the code `characters = [chr(i) for i in range(128)]` and re-compute the associated dictionaries.
  - write a generic function that computes the entropy of any probability distribution represented by a dictionary.
- Compute the entropies of the French, Spanish and German languages.
- The Huffman algorithm is a famous compression algorithm which is one of the components of current audio and video compression standards. A short presentation of this algorithm is given in this video. The objective here is not to program it (lack of time...) but rather to use an implementation. This implementation is provided in the file `huffman.py`. You can find in this file the main steps of the algorithm:
  - the construction of the tree according to the probabilities with `build_huffman_tree`
  - the generation of the code with `generate_code`
  - the compression of the text with `compress`
  - the decompression with `decompress`

  Use these functions to compress and decompress the language files (do not forget to `import` them).
- Verify that the entropy gives a lower bound on the average number of bits used to code the characters. This number of bits can be obtained from the output of the compression algorithm. Note that this output is not a binary file, but a string of characters '0' and '1'... So the average number of bits per character can be obtained by:
  $$\frac{\textrm{number of characters in the compressed file}}{\textrm{number of characters in the initial text file}}$$
  Observe the "beauty of the math" by verifying the accuracy of the entropy bound!
- Evaluate the compression ratio by considering that each character of the text files is encoded on 7 bits. The compression ratio is thus given by:
  $$\frac{7\times \textrm{number of characters in the initial text file}}{\textrm{number of characters in the compressed file}}$$
- How to improve the level of compression? The occurrence frequencies of each character only capture the redundancy of the individual characters, but not the redundancy between consecutive characters... So, you can do the same exercise by analyzing the occurrence frequencies of all 2-letter (or more) words. This should allow you to capture more redundancy of the text and thus improve the compression ratio. A small entropy sketch is given after this list.
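As a starting point, here is a minimal sketch of the entropy computation and of the two measurements above. The call to `compress` is left as a comment because its exact signature is defined in `huffman.py`, which you should check:

```python
from math import log
from typing import Dict


def entropy(distribution: Dict[str, float]) -> float:
    """Entropy (in bits per symbol) of a probability distribution given as a dictionary."""
    return -sum(p * log(p, 2) for p in distribution.values() if p > 0)


# Hypothetical usage, assuming `text` is the original text read from a language file and
# `compressed` is the string of '0' and '1' characters returned by the `compress`
# function of huffman.py (check that file for the actual signature):
#
#     compressed = compress(...)  # see huffman.py
#     bits_per_character = len(compressed) / len(text)
#     compression_ratio = 7 * len(text) / len(compressed)
#     print(entropy(english_of), bits_per_character, compression_ratio)
```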
Upload your work to the LMS

- open a terminal
- go into the folder containing your project
- use the `zip` command to compress your project: `zip -r project.zip . -x "venv/**" ".git/**"`
- upload the ZIP archive to the LMS