# 1MAE002 - ALGORITHM AND COMPUTING

Commit `5546a27d` (verified), authored 10 months ago by STEVAN Antoine, parent `a4378114`:

> huffman: add documentation and hints

Showing 3 changed files, with 275 additions and 35 deletions:

- `huffman.md` (+208, −30)
- `src/helpers.py` (+28, −4)
- `src/priority_queue.py` (+39, −1)

## huffman.md (+208, −30)
@@ -25,8 +25,15 @@ $i = 1, \ldots, n$, then the entropy $H(X)$ is defined by:
$$H(X) = -\sum\limits_{i = 1}^{n} p_i \log_2 p_i$$
The entropy gives a lower bound on the average number of bits used to code a
symbol. That is, it is impossible to compress any binary data to a size smaller
than its entropy times the number of bytes.
In this part, we will first compute the entropy, and then verify that we can
approach such a level of compression with the Huffman algorithm.
In the rest of this document, we will take the following as an example input text:

```text
this is an example of a huffman tree
```

You can define it in `huffman.py` and use it later to make sure your
implementation is correct.
#### Question 12. [[toc](README.md#table-of-content)]
:pencil: Write a function `compute_distribution` that will compute the
@@ -38,6 +45,32 @@ following signature:
of all the symbols in the text, represented as a dictionary where the keys are
the $(a_i)$ and the values are the associated $(p_i)$
> :bulb: **Note**
>
> another version of this would be to simply compute the number of occurrences, as the rest of the
> Huffman algorithm only needs to be able to sort symbols by "_frequency_" and the two quantities
> are equal up to a multiplicative constant.
:bulb: The distribution for our example text is the following:
| symbol | frequency (3 decimals) |
| ------ | ---------------------- |
| t | $0.056$ |
| h | $0.056$ |
| i | $0.056$ |
| s | $0.056$ |
| (space) | $0.194$ |
| a | $0.111$ |
| n | $0.056$ |
| e | $0.111$ |
| x | $0.028$ |
| m | $0.056$ |
| p | $0.028$ |
| l | $0.028$ |
| o | $0.028$ |
| f | $0.083$ |
| u | $0.028$ |
| r | $0.028$ |
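The frequency computation can be sketched with the standard library (the
function name matches the question above; the exact signature expected by the
grader may differ):

```python
from collections import Counter


def compute_distribution(text: str) -> dict:
    """Map each symbol a_i of `text` to its frequency p_i (values sum to 1)."""
    counts = Counter(text)
    return {symbol: count / len(text) for symbol, count in counts.items()}
```

For the example text, `round(compute_distribution(text)["t"], 3)` gives the
$0.056$ from the table above.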
#### Question 13. [[toc](README.md#table-of-content)]
:pencil: Write a function `entropy` that will compute the entropy of a
probability distribution. It should have the following signature:
@@ -46,6 +79,8 @@ probability distribution. It should have the following signature:
the keys are the $(a_i)$ and the values are the associated $(p_i)$
- return: the entropy of `x`, i.e. $H(X)$
:bulb: The entropy of our example input is $3.714192447093237$.
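A quick way to check that value, directly from the formula above (a minimal
sketch, assuming `x` is the distribution dictionary described in the signature):

```python
import math


def entropy(x: dict) -> float:
    """Compute H(X) = -sum(p_i * log2(p_i)) over the distribution `x`."""
    return -sum(p * math.log2(p) for p in x.values() if p > 0)
```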
:question: What is the entropy of `README.md`?
@@ -59,13 +94,14 @@ binary sequence of $0$s and $1$s.
There are two properties that we want to have:
- (1) the more common a symbol in the text, the shorter the encoding, and
  vice-versa with the least common symbols, e.g. in English, "_e_" should have
  the smallest encoding because it is the most common letter and "_z_" should
  have the longest one.
- (2) we want to use an encoding where no encoded symbol is the prefix of
  another one

Intuitively, property (1) will help reduce the size of the text file by
assigning shorter encodings to the most common symbols in the text.
Property (2) will make sure there is no ambiguity in the encoding and greatly
help the decoding process.
@@ -76,12 +112,12 @@ tree.
The construction of the tree is as follows:
- compute the probability distribution of the symbols in the text
- put all the symbols with their "weights", i.e. their frequencies at the
  start, in a _priority queue_
- pop the two least frequent symbols from the _queue_
- merge them into a tree where they are the two children
- set their "weight" to the sum of the "weights" of the two children
- push this tree back into the _queue_
- rinse and repeat as long as there are strictly more than $1$ item in the
  _queue_
When there is only one item left in the queue, you can extract it and you should
have the _Huffman_ tree!
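The steps above can be sketched with the built-in `heapq` module (tie-breaking
may differ from the naive priority queue, so the exact tree shape can vary, but
any Huffman tree gives the same total encoded length):

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(text: str):
    """Build a Huffman tree, as a nested tuple, from the symbol counts."""
    tie = count()  # tie-breaker, so heapq never has to compare two trees
    queue = [(weight, next(tie), symbol) for symbol, weight in Counter(text).items()]
    heapq.heapify(queue)
    while len(queue) > 1:
        # pop the two least frequent items, merge them into a tree and push
        # the merged tree back with the sum of the two weights
        w1, _, left = heapq.heappop(queue)
        w2, _, right = heapq.heappop(queue)
        heapq.heappush(queue, (w1 + w2, next(tie), (left, right)))
    return queue[0][2]
```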
...
...
@@ -91,14 +127,22 @@ have the _Huffman_ tree!
following signature:
- arguments:
  - `p`: a probability distribution of all the symbols in the text
- return: the _Huffman_ tree, i.e. a nested tuple (see the example tree below
  for what this can mean)
> :bulb: **Note**
>
> the [`src/priority_queue.py`](src/priority_queue.py) provides a _naive_ implementation of a
> _priority queue_ that should be good enough for this class.
>
> you are encouraged to use it or feel free to use the built-in `heapq` module of Python.
>
> there is an example usage of the `src.priority_queue.PriorityQueue` class at the bottom of the module.
:question: Test your `build_huffman_tree` on our example text. Do you get the
same tree as below?
> :bulb: **Note**
>
> the weights are expressed as the number of occurrences, which is equivalent
> to the frequencies you have computed; you just need to divide / multiply by
> the total number of symbols to get from one to the other.
@@ -174,6 +218,47 @@ flowchart TD
12 --- spc
```
In terms of Python code, your tree might look like the following _nested tuple_
```python
# a nested tuple is simply tuples inside other tuples
(  # this is the root
    (  # this is the left child of the root
        ('a', 'e'),
        (
            ('t', 'h'),  # t and h are siblings
            ('i', 's'),
        ),
    ),
    (  # this is the right child of the root
        (
            ('n', 'm'),
            (
                ('x', 'p'),
                ('l', 'o'),
            ),
        ),
        (
            (('u', 'r'), 'f'),
            ' ',
        ),
    ),
)
```
> :bulb: **Note**
>
> the exact placement of the nodes in this tree is not important. However, the
> depth at which the nodes are is really important and should depend on the
> frequency of each symbol compared to the other symbols.
>
> e.g. a and e could be swapped, depending on how they are sorted in the
> _priority queue_; however, they need to remain at the same depth, in order to
> have an encoding of the correct length!
### Coding books [[toc](README.md#table-of-content)]
The _Huffman_ tree contains all the information we need about the input text.
...
...
@@ -188,8 +273,8 @@ The idea is to traverse the tree from the root to each one of the leaves, our
symbols. When going to the left, a $0$ will be added to the sequence, going to
the right will add a $1$.
This will ensure, by construction of the _Huffman_ tree based on the
frequencies of the symbols, that the two _Huffman_ properties defined above are
satisfied!
#### Question 15. [[toc](README.md#table-of-content)]
:pencil: Write a function `build_coding_book` that will compute the coding book
...
...
@@ -199,8 +284,8 @@ of any _Huffman_ tree. It should have the following signature:
- return: the coding book, represented as a dictionary where the keys are the
  symbols and the values are the associated binary sequences
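The traversal described above can be sketched recursively (here a leaf is
anything that is not a tuple, matching the nested-tuple tree shown earlier):

```python
def build_coding_book(tree, prefix: str = "") -> dict:
    """Walk the nested-tuple tree: going left appends a '0', right a '1'."""
    if not isinstance(tree, tuple):  # a leaf holds a single symbol
        return {tree: prefix}
    left, right = tree
    book = build_coding_book(left, prefix + "0")
    book.update(build_coding_book(right, prefix + "1"))
    return book
```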
:question: Do you get the expected encoding for all symbols with our example
sentence?
Verify that you get the correct encoding, as below:
```mermaid
@@ -273,6 +358,27 @@ flowchart TD
n5 -- 1 --- f
n12 -- 1 --- spc
```
Which translates to the following in Python
```python
{
    'a': '000',
    'e': '001',
    't': '0100',
    'h': '0101',
    'i': '0110',
    's': '0111',
    'n': '1000',
    'm': '1001',
    'x': '10100',
    'p': '10101',
    'l': '10110',
    'o': '10111',
    'u': '11000',
    'r': '11001',
    'f': '1101',
    ' ': '111',
}
```
:question: Can you find any encoded sequence of bits that is the prefix of
another one?
@@ -287,9 +393,18 @@ It is now time to put together the compression algorithm. The steps are as follo
- compute the frequencies of all symbols
- build the _Huffman_ tree
- build the coding book
- build the compressed sequence of $0$s and $1$s by replacing all symbols by
  their associated binary encoding
- convert the sequence to real binary
- return the coding book, the binary sequence and the number of _padding_ bits
But what is _padding_? Well, due to the nature of the _Huffman_ tree, we don't
know what the size of the encodings for each symbol will be. And we know even
less what the final size of the compressed bits will be! In particular, we
don't know if the number of compressed bits will be a multiple of $8$. To make
sure the compressed data fits perfectly into a whole number of bytes, we'll be
adding some _padding_ bits, e.g. let's say the bits are `0101101011`, we need
to add $6$ bits of _padding_ to get to $2$ bytes, i.e. `0101101011111111` if we
pad with $1$s.
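Putting the steps together, a minimal sketch (here the coding book is passed in
rather than rebuilt from the text, which may differ from the expected
signature, and we pad with $0$s like the `into_bytes` helper does):

```python
def compress(text: str, coding_book: dict) -> (dict, bytes, int):
    """Encode `text` with `coding_book` and pack the bits into bytes."""
    bits = "".join(coding_book[symbol] for symbol in text)
    padding = (8 - len(bits) % 8) % 8  # bits needed to reach a whole byte
    bits += "0" * padding  # here the padding bits are 0s
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return coding_book, data, padding
```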
#### Question 16. [[toc](README.md#table-of-content)]
:pencil: Write a function `compress` that implements the _Huffman_ compression.
@@ -301,12 +416,44 @@ It should have the following signature:
> :bulb: **Note**
>
> you can use the `into_bytes` function from [`src.helpers`](src/helpers.py) to
> convert a string of $0$s and $1$s into real bytes.
>
> there is documentation and you can find some simple examples at the bottom of
> the module.
:question: Test your `compress` function by compressing a bunch of text.

For instance, on our example text from the start, the bits, arranged in groups
of $8$, should be

```
01000101
01100111
11101100
11111100
01000111
00110100
00010011
01011011
00011111
01111101
11100011
10101110
00110111
01100100
01000111
01001100
1001001
```

and there should be $1$ bit of padding.

:question: Is the average number of bits used in the compressed output always
higher than the entropy of the input text?
You can plot your results for

- samples of different lengths from a real text file, e.g. written in English,
  to use the structure of the language
- samples of random character strings of various lengths, i.e. without
  structure
### Text decompression [[toc](README.md#table-of-content)]
@@ -333,7 +480,7 @@ vice-versa
> :bulb: **Note**
>
> this dictionary inversion only makes sense for dictionaries where all values
> are unique.
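Such an inversion can be sketched as a one-line dictionary comprehension (the
function name is only illustrative):

```python
def invert(book: dict) -> dict:
    """Swap the keys and values of `book`; valid only for unique values."""
    return {value: key for key, value in book.items()}
```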
#### Question 18. [[toc](README.md#table-of-content)]
:pencil: Write a function `decompress` that implements the decompression. It
@@ -346,15 +493,28 @@ should have the following signature:
> :bulb: **Hint**
>
> First, you can use the `from_bytes` function from
> [`src.helpers`](src/helpers.py) that will convert bytes into their string of
> $0$s and $1$s.
>
> there is documentation and you can find some simple examples at the bottom of
> the module.
>
> Second, you need to iterate over the $0$s and $1$s and read the "decode book"
> to rebuild the original sequence of symbols.
:question: Can you reconstruct our original example text with that last
function?
As the first byte of this example is `01000101`, the only symbol in our coding
book that fits, thanks to the _prefix_ property (2), is `t` with encoding
`0100`. So we know the first character is `t`.
Then we can remove the first $4$ bits of the compressed bits and search for the
next symbol.
Without the bits of `t`, the first remaining $8$ bits are `01010110`. We see
that the only encoding that fits these first bits is `0101`, which corresponds
to the symbol `h`.
And so on...
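That greedy matching can be sketched as follows (the argument order here is a
guess, not necessarily the required signature):

```python
def decompress(data: bytes, coding_book: dict, padding: int) -> str:
    """Rebuild the original text by matching prefix-free codes bit by bit."""
    bits = "".join(format(b, "08b") for b in data)
    if padding:
        bits = bits[:-padding]  # drop the padding bits added by compression
    decode_book = {code: symbol for symbol, code in coding_book.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        # thanks to the prefix property, the first match is the right symbol
        if current in decode_book:
            symbols.append(decode_book[current])
            current = ""
    return "".join(symbols)
```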
### Writing a CLI compression tool [[toc](README.md#table-of-content)]
@@ -408,6 +568,15 @@ disk following the file format above. It should have the following signature:
- `compressed`: the compressed binary data
- `padding`: the number of padding bits
> :bulb: **Note**
>
> below is a small example of how to write newline-separated binary data to a file
> ```python
> with open("my_file.bin", "wb") as handle:
> handle.write("hello world\n".encode())
> handle.write(b"abc")
> ```
##### Question 20. [[toc](README.md#table-of-content)]
:pencil: Write a function `read` that will read compression information from
disk following the file format above. It should have the following signature:
@@ -416,6 +585,15 @@ disk following the file format above. It should have the following signature:
- return: the codebook that comes from the compression, the compressed binary
  data and the number of padding bits
> :bulb: **Note**
>
> below is a small example of how to read newline-separated binary data from a file
> ```python
> with open("my_file.bin", 'rb') as handle:
> hello_world = handle.readline().decode()
> abc = handle.readline()
> ```
#### Wrapping up in a CLI application [[toc](README.md#table-of-content)]
In order to write a CLI application, we will be using the `argparse` module.
## src/helpers.py (+28, −4)

The file after this commit:

```python
from typing import Iterable


def read_text(filename: str) -> str:
    """
    Read text from a file on disk by taking care of the decoding
    """
    with open(filename, "rb") as handle:
        return handle.read().decode()


def __chunks(it: Iterable, n: int) -> Iterable:
    """
    Yield successive n-sized chunks from it.
    """
    for i in range(0, len(it), n):
        yield it[i:i + n]


def into_bytes(bits: str) -> (bytes, int):
    """
    Convert a sequence of 0s and 1s, represented as a simple string, into real
    bytes, possibly with padding

    The padding bits are 0s.
    """
    padding = (8 - len(bits) % 8) % 8
    bin = bytes(map(
        lambda n: int(n, 2),
        __chunks(bits + '0' * padding, 8),
    ))
    return bin, padding


def from_bytes(bin: bytes, padding: int) -> str:
    """
    Convert a sequence of bytes to its string representation with 0s and 1s,
    taking care of padding
    """
    decoded = ''.join(format(b, '#010b').removeprefix("0b") for b in bin)
    return decoded if padding == 0 else decoded[:-padding]


if __name__ == "__main__":
    assert into_bytes("0101") == (b'P', 4)
    assert into_bytes("01000001") == (b'A', 0)
    assert into_bytes("010000010100001") == (b'AB', 1)

    assert "0101" == from_bytes(b'P', 4)
    assert "01000001" == from_bytes(b'A', 0)
    assert "010000010100001" == from_bytes(b'AB', 1)
```
## src/priority_queue.py (+39, −1)

The file after this commit (the `typing` import on the first lines is
reconstructed from the names used below):

```python
from typing import Any, List, Tuple

Value = Any


class PriorityQueue:
    """
    A naive implementation of a _priority queue_.

    An element is said to have a "_higher priority_" if its value is smaller.
    """

    def __init__(self, values: List[Tuple[float, Value]] = None):
        """
        Initialize a _priority queue_ with optional pre-inserted values.
        """
        # note: defaulting to `None` instead of `[]` avoids the classic Python
        # pitfall of a mutable default argument shared between all instances
        self._values = values if values is not None else []

    def is_empty(self) -> bool:
        """
        Determine if the _queue_ is empty.
        """
        return len(self._values) == 0

    def __len__(self) -> int:
        """
        Give the number of remaining elements in the _queue_.
        """
        return len(self._values)

    def push(self, v: Tuple[float, Value]):
        """
        Push a value into the _queue_, it will be popped last, unless it has a
        _higher priority_.
        """
        self._values.append(v)

    def pop(self) -> Tuple[float, Value]:
        """
        Pop the element of the _queue_ with the _highest priority_.
        """
        sort_key = [w for w, _ in self._values]
        return self._values.pop(sort_key.index(min(sort_key)))


if __name__ == "__main__":
    from random import randint, random

    q = PriorityQueue()

    print("inserting 10 items with random weights")
    for _ in range(10):
        x = (random(), randint(0, 100))
        print(x)
        q.push(x)

    print("popping every item out of the queue, in order")
    while not q.is_empty():
        print(q.pop())
```