huffman: add documentation and hints

If $X$ is a random variable taking the values $a_i$ with probabilities $p_i$,
for $i = 1, \ldots, n$, then the entropy $H(X)$ is defined by:
$$H(X) = - \sum\limits_{i = 1}^{n}p_i \log_2 p_i$$
The entropy gives a lower bound on the average number of bits needed to code a
symbol: it is impossible to compress a text to a size, in bits, that is smaller
than its entropy times its number of symbols.
In this part, we will first compute the entropy, and then verify that we can
get close to this level of compression with the Huffman algorithm.
In the rest of this document, we will take the following as an example input text:
```text
this is an example of a huffman tree
```
You can define it in `huffman.py` and use it later to make sure your implementation is correct.
#### Question 12. [[toc](README.md#table-of-content)]
:pencil: Write a function `compute_distribution` that will compute the
probability distribution of the symbols in a text. It should have the
following signature:
- return: the probability distribution
of all the symbols in the text, represented as a dictionary where the keys are
the $(a_i)$ and the values are the associated $(p_i)$
> :bulb: **Note**
>
> another version of this would be to simply compute the number of occurrences, as the rest of the
> Huffman algorithm only needs to be able to sort symbols by "_frequency_" and the two quantities
> are equal up to a multiplicative constant.
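As a rough sketch of one possible implementation, using `collections.Counter`
(the parameter name `text` is an assumption of this sketch, not part of the
subject):

```python
from collections import Counter


def compute_distribution(text: str) -> dict:
    # count the occurrences of each symbol, then normalize by the total
    # number of symbols to turn counts into frequencies
    counts = Counter(text)
    total = len(text)
    return {symbol: count / total for symbol, count in counts.items()}
```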
:bulb: The distribution for our example text is the following:
| symbol | frequency (3 decimals) |
| ------ | ---------------------- |
| t | $0.056$ |
| h | $0.056$ |
| i | $0.056$ |
| s | $0.056$ |
| (space) | $0.194$ |
| a | $0.111$ |
| n | $0.056$ |
| e | $0.111$ |
| x | $0.028$ |
| m | $0.056$ |
| p | $0.028$ |
| l | $0.028$ |
| o | $0.028$ |
| f | $0.083$ |
| u | $0.028$ |
| r | $0.028$ |
#### Question 13. [[toc](README.md#table-of-content)]
:pencil: Write a function `entropy` that will compute the entropy of a
probability distribution. It should have the following signature:
- arguments:
  - `x`: a probability distribution, represented as a dictionary where
the keys are the $(a_i)$ and the values are the associated $(p_i)$
- return: the entropy of `x`, i.e. $H(X)$
:bulb: The entropy of our example input is $3.714192447093237$.
:question: What is the entropy of `README.md`?
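A minimal sketch of such a function, directly following the formula above:

```python
from math import log2


def entropy(x: dict) -> float:
    # H(X) = - sum_i p_i * log2(p_i); the distribution only contains
    # symbols that actually occur, so every p is strictly positive
    return -sum(p * log2(p) for p in x.values())
```

On the example distribution above, this should return $\approx 3.714$.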
Each symbol of the text will be encoded as a
binary sequence of $0$s and $1$s.
There are two properties that we want to have:
- (1) the more common a symbol in the text, the shorter the encoding. And
vice-versa with the least common symbols, e.g. in English, "_e_" should have
the shortest encoding because it is the most common letter and "_z_" should
have the longest one.
- (2) we want to use an encoding where no encoded symbol is the prefix of
another one
Intuitively, property (1) will help reduce the size of the text file by
assigning shorter encodings to the most common symbols in the text.
Property (2) will make sure there is no ambiguity in the encoding and greatly
help the decoding process.
To satisfy these two properties, the _Huffman_ algorithm builds a binary
tree.
The construction of the tree is as follows:
- compute the probability distribution of the symbols in the text
- put all the symbols with their "weights", i.e. their frequencies at the start,
in a _priority queue_
- pop the two least frequent symbols from the _queue_
- merge them into a tree where they are the two children
- set their "weight" to the sum of the "weights" of the two children
- push this tree back into the queue
- rinse and repeat as long as there are strictly more than $1$ item in the _queue_
When there is only one item left in the queue, you can extract it and you should
have the _Huffman_ tree!
#### Question 14. [[toc](README.md#table-of-content)]
:pencil: Write a function `build_huffman_tree` that will build the _Huffman_
tree from a probability distribution. It should have the
following signature:
- arguments:
- `p`: a probability distribution of all the symbols in the text
- return: the _Huffman_ tree, i.e. a nested tuple (see the example tree below for what this can mean)
> :bulb: **Note**
>
> the [`src/priority_queue.py`](src/priority_queue.py) provides a _naive_ implementation of a
> _priority queue_ that should be good enough for this class.
>
> you are encouraged to use it, or feel free to use the built-in `heapq` module of Python.
>
> there is an example usage of the `src.priority_queue.PriorityQueue` class at the bottom of the module.
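As a sketch of how the loop described above can be written with this queue (the
nested-tuple leaf representation matches the example tree below; this is one
possible way, not the only one):

```python
from src.priority_queue import PriorityQueue


def build_huffman_tree(p: dict):
    # start with one (weight, symbol) leaf per symbol
    queue = PriorityQueue([(w, s) for s, w in p.items()])

    while len(queue) > 1:
        # pop the two least frequent items, merge them into a node whose
        # weight is the sum of its children's weights, and push it back
        w1, left = queue.pop()
        w2, right = queue.pop()
        queue.push((w1 + w2, (left, right)))

    _, tree = queue.pop()
    return tree
```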
:question: Test your `build_huffman_tree` on our example text. Do you get the same tree as below?
> :bulb: **Note**
>
> the weights are expressed as the number of occurrences, which is equivalent to
> the frequencies you have computed: you just need to divide or multiply by the
> total number of symbols to go from one to the other.
_(figure: the example Huffman tree as a flowchart, with each node labeled by
its weight; for instance, the weight-12 node is the parent of the space leaf)_
In terms of Python code, your tree might look like the following _nested tuple_:
```python
# a nested tuple is simply tuples inside other tuples
(  # this is the root
    (  # this is the left child of the root
        ('a', 'e'),
        (
            ('t', 'h'),  # t and h are siblings
            ('i', 's'),
        ),
    ),
    (  # this is the right child of the root
        (
            ('n', 'm'),
            (
                ('x', 'p'),
                ('l', 'o'),
            ),
        ),
        (
            (
                ('u', 'r'),
                'f',
            ),
            ' ',
        ),
    ),
)
```
> :bulb: **Note**
>
> the exact placement of the nodes in this tree is not important. However, the
> depth at which each node sits is really important and should depend on the
> frequency of its symbol compared to the other symbols.
>
> e.g. _a_ and _e_ could be swapped, depending on how they are sorted in the
> _priority queue_; however, they need to remain at the same depth, in order to
> have an encoding of the correct length!
### Coding books [[toc](README.md#table-of-content)]
The _Huffman_ tree contains all the information we need about the input text.
The idea is to traverse the tree from the root to each one of the leaves, our
symbols. When going to the left, a $0$ will be added to the sequence, going to
the right will add a $1$.
This will ensure, by construction of the _Huffman_ tree based on the frequencies
of the symbols, that the two _Huffman_ properties defined above are satisfied!
#### Question 15. [[toc](README.md#table-of-content)]
:pencil: Write a function `build_coding_book` that will compute the coding book
of any _Huffman_ tree. It should have the following signature:
- return: the coding book, represented as a dictionary where the keys are the
symbols and the values are the associated binary sequences
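As a sketch, the traversal can be written as a recursion over the nested tuples
(the parameter names are assumptions of this sketch; leaves are plain symbols,
internal nodes are pairs):

```python
def build_coding_book(tree, prefix: str = "") -> dict:
    # a leaf: the accumulated path of 0s and 1s is this symbol's encoding
    if isinstance(tree, str):
        return {tree: prefix}

    # an internal node: going left appends a 0, going right appends a 1
    left, right = tree
    book = build_coding_book(left, prefix + "0")
    book.update(build_coding_book(right, prefix + "1"))
    return book
```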
:question: Do you get the expected encoding for all symbols with our example
sentence?
Verify that you get the correct encoding, as below:
_(figure: the same Huffman tree with every edge labeled 0 (left) or 1 (right);
for instance, the weight-5 node's 1-edge leads to `f`, and the weight-12 node's
1-edge leads to the space leaf)_
This translates to the following coding book in Python:
```python
{
    'a': '000',
    'e': '001',
    't': '0100',
    'h': '0101',
    'i': '0110',
    's': '0111',
    'n': '1000',
    'm': '1001',
    'x': '10100',
    'p': '10101',
    'l': '10110',
    'o': '10111',
    'u': '11000',
    'r': '11001',
    'f': '1101',
    ' ': '111',
}
```
:question: Can you find any encoded sequence of bits that is the prefix of
another one?
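One way to check programmatically, as a small sketch over a coding book stored
in a `book` dictionary (a hypothetical name):

```python
def find_prefix_pairs(book: dict) -> list:
    # list every pair of distinct symbols where the encoding of the first
    # is a prefix of the encoding of the second
    return [
        (s1, s2)
        for s1, c1 in book.items()
        for s2, c2 in book.items()
        if s1 != s2 and c2.startswith(c1)
    ]


# with a valid Huffman coding book, this list should be empty
```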
It is now time to put together the compression algorithm. The steps are as follows:
- compute the frequencies of all symbols
- build the _Huffman_ tree
- build the coding book
- build the compressed sequence of $0$s and $1$s by replacing all symbols by
their associated binary encoding
- convert the sequence to real binary
- return the coding book, the binary sequence and the number of _padding_ bits
But what is _padding_? Well, due to the nature of the _Huffman_ tree, we don't
know in advance what the size of the encoding of each symbol will be. And we
know even less what the final size of the compressed bits will be! In
particular, we don't know whether the number of compressed bits will be a
multiple of $8$. To make sure the compressed data fits perfectly into a whole
number of bytes, we'll be adding some _padding_ bits, e.g. if the bits are
`0101101011`, we need to add $6$ bits of _padding_ to get to $2$ bytes, i.e.
`0101101011111111` if we pad with $1$s.
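As a small sketch of the padding arithmetic on this example:

```python
bits = "0101101011"

# bits missing to reach a whole number of bytes; the outer `% 8` avoids
# adding a full byte of padding when the length is already a multiple of 8
padding = (8 - len(bits) % 8) % 8
assert padding == 6

# here we pad with 1s as in the example above; padding with 0s works the
# same way, as long as compression and decompression agree on a convention
assert bits + "1" * padding == "0101101011111111"  # exactly 2 bytes
```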
#### Question 16. [[toc](README.md#table-of-content)]
:pencil: Write a function `compress` that implements the _Huffman_ compression.
It should have the following signature:
> :bulb: **Note**
>
> you can use the `into_bytes` function from [`src.helpers`](src/helpers.py) to
> convert a string of $0$s and $1$s into real bytes.
>
> there is documentation and you can find some simple examples at the bottom of
> the module.
:question: Test your `compress` function by compressing a bunch of text.
For instance, on our example text from the start, the bits, arranged in groups
of $8$, should be:
```
01000101
01100111
11101100
11111100
01000111
00110100
00010011
01011011
00011111
01111101
11100011
10101110
00110111
01100100
01000111
01001100
1001001
```
and there should be 1 bit of padding.
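Note that this output is $16 \times 8 + 7 = 135$ bits before padding, i.e. an
average of $135 / 36 = 3.75$ bits per symbol, just above the entropy of
$\approx 3.714$ bits per symbol computed earlier.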
:question: Is the average number of bits used in the compressed output always
higher than the entropy of the input text?
You can plot your results for
- samples of different lengths from a real text file, e.g. one written in
English, to make use of the structure of the language
- random character strings of various lengths, i.e. without structure
### Text decompression [[toc](README.md#table-of-content)]
To decode, we will need to invert the coding book into a "_decode book_", i.e.
swap its keys and values: the binary sequences become the keys and the symbols
become the values, and
vice-versa
> :bulb: **Note**
>
> this dictionary inversion only makes sense for dictionaries where all values
> are unique.
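As a sketch, such an inversion is a one-line dictionary comprehension:

```python
coding_book = {'a': '000', 'e': '001'}  # an excerpt of the book above

# swap keys and values; this is only lossless when all values are unique
decode_book = {code: symbol for symbol, code in coding_book.items()}

assert decode_book == {'000': 'a', '001': 'e'}
```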
#### Question 18. [[toc](README.md#table-of-content)]
:pencil: Write a function `decompress` that implements the decompression. It
should have the following signature:
> :bulb: **Hint**
>
> First, you can use the `from_bytes` function from [`src.helpers`](src/helpers.py) that will
> convert bytes into their string representation with $0$s and $1$s.
>
> there is documentation and you can find some simple examples at the bottom of
> the module.
>
> Second, you need to iterate over the $0$s and $1$s and read the "decode book"
> to rebuild the original sequence of symbols.
:question: Can you reconstruct our original example text with that last function?
As the first byte of this example is `01000101`, the only encoding in our
coding book that fits the start of these bits, thanks to the second _Huffman_
property, is `0100`. So we know the first character is `t`.
Then we can remove the first $4$ bits of the compressed bits and search for the
next symbol.
Without the bits of `t`, the first remaining $8$ bits are `01010110`. We see
that the only encoding that fits these first bits is `0101` and corresponds to
the symbol `h`.
And so on...
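The walkthrough above can be sketched as a simple loop (assuming `bits` is the
unpadded string of $0$s and $1$s and `decode_book` maps encodings back to
symbols):

```python
def decode(bits: str, decode_book: dict) -> str:
    symbols = []
    current = ""
    for bit in bits:
        # grow the current candidate encoding one bit at a time until it
        # matches an entry of the decode book, then start over
        current += bit
        if current in decode_book:
            symbols.append(decode_book[current])
            current = ""
    return "".join(symbols)
```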
### Writing a CLI compression tool [[toc](README.md#table-of-content)]
##### Question 19. [[toc](README.md#table-of-content)]
:pencil: Write a function `write` that will write compression information to
disk following the file format above. It should have the following signature:
- `compressed`: the compressed binary data
- `padding`: the number of padding bits
> :bulb: **Note**
>
> below is a small example of how to write newline-separated binary data to a file
>
> ```python
> with open("my_file.bin", "wb") as handle:
>     handle.write("hello world\n".encode())
>     handle.write(b"abc")
> ```
##### Question 20. [[toc](README.md#table-of-content)]
:pencil: Write a function `read` that will read compression information from
disk following the file format above. It should have the following signature:
- return: the codebook that comes from the compression, the compressed binary
data and the number of padding bits
> :bulb: **Note**
>
> below is a small example of how to read newline-separated binary data from a file
>
> ```python
> with open("my_file.bin", "rb") as handle:
>     hello_world = handle.readline().decode()
>     abc = handle.readline()
> ```
#### Wrapping up in a CLI application [[toc](README.md#table-of-content)]
In order to write a CLI application, we will be using the `argparse` module.
`src/helpers.py`:

```python
from typing import Iterable


def read_text(filename: str) -> str:
    """
    Read text from a file on disk by taking care of the decoding.
    """
    with open(filename, "rb") as handle:
        return handle.read().decode()


def __chunks(it: Iterable, n: int) -> Iterable:
    """Yield successive n-sized chunks from it."""
    for i in range(0, len(it), n):
        yield it[i:i + n]


def into_bytes(bits: str) -> (bytes, int):
    """
    Convert a sequence of 0s and 1s, represented as a simple string, into real
    bytes, possibly with padding.

    The padding bits are 0s.
    """
    padding = (8 - len(bits) % 8) % 8
    bin = bytes(map(
        lambda n: int(n, 2),
        __chunks(bits + '0' * padding, 8),
    ))
    return bin, padding


def from_bytes(bin: bytes, padding: int) -> str:
    """
    Convert a sequence of bytes to its string representation with 0s and 1s,
    taking care of padding.
    """
    decoded = ''.join(format(b, '#010b').removeprefix("0b") for b in bin)
    return decoded if padding == 0 else decoded[:-padding]


if __name__ == "__main__":
    assert into_bytes("0101") == (b'P', 4)
    assert into_bytes("01000001") == (b'A', 0)
    assert into_bytes("010000010100001") == (b'AB', 1)

    assert "0101" == from_bytes(b'P', 4)
    assert "01000001" == from_bytes(b'A', 0)
    assert "010000010100001" == from_bytes(b'AB', 1)
```
`src/priority_queue.py`:

```python
from typing import Any, List, Optional, Tuple

Value = Any


class PriorityQueue:
    """
    A naive implementation of a _priority queue_.

    An element is said to have a "_higher priority_" if its value is smaller.
    """

    def __init__(self, values: Optional[List[Tuple[float, Value]]] = None):
        """
        Initialize a _priority queue_ with optional pre-inserted values.
        """
        # use None instead of a mutable `[]` default, which would be shared
        # between all instances created without arguments
        self._values = values if values is not None else []

    def is_empty(self) -> bool:
        """
        Determine if the _queue_ is empty.
        """
        return len(self._values) == 0

    def __len__(self) -> int:
        """
        Give the number of remaining elements in the _queue_.
        """
        return len(self._values)

    def push(self, v: Tuple[float, Value]):
        """
        Push a value into the _queue_; it will be popped last, unless it has
        a _higher priority_.
        """
        self._values.append(v)

    def pop(self) -> Tuple[float, Value]:
        """
        Pop the element of the _queue_ with the _highest priority_.
        """
        sort_key = [w for w, _ in self._values]
        return self._values.pop(sort_key.index(min(sort_key)))


if __name__ == "__main__":
    from random import randint, random

    q = PriorityQueue()

    print("inserting 10 items with random weights")
    for _ in range(10):
        x = (random(), randint(0, 100))
        print(x)
        q.push(x)

    print("popping every item out of the queue, in order")
    while not q.is_empty():
        print(q.pop())
```