huffman: add documentation and hints

If $X$ is a random variable taking the values $a_i$ with probabilities $p_i$,
for $i = 1, \ldots, n$, then the entropy $H(X)$ is defined by:
$$H(X) = - \sum\limits_{i = 1}^{n}p_i \log_2 p_i$$
The entropy gives a lower bound on the average number of bits needed to code a
symbol: it is impossible to compress a text to a size, in bits, that is smaller
than its entropy times its number of symbols.
In this part, we will first compute the entropy, and then verify that we can
get close to this level of compression with the Huffman algorithm.
In the rest of this document, we will take the following as an example input text:
```text
this is an example of a huffman tree
```
You can define it in `huffman.py` and use it later to make sure your implementation is correct.
#### Question 12. [[toc](README.md#table-of-content)]
:pencil: Write a function `compute_distribution` that will compute the
probability distribution of the symbols in a text. It should have the
following signature:
- return: the probability distribution
of all the symbols in the text, represented as a dictionary where the keys are
the $(a_i)$ and the values are the associated $(p_i)$
> :bulb: **Note**
>
> another version of this would be to simply compute the number of occurrences, as the rest of the
> Huffman algorithm only needs to be able to sort symbols by "_frequency_" and the two quantities
> are equal up to a multiplicative constant.
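As a rough sketch of one possible implementation, using `collections.Counter`
(the parameter name `text` is an assumption of this sketch, not part of the
subject):

```python
from collections import Counter


def compute_distribution(text: str) -> dict:
    # count the occurrences of each symbol, then normalize by the total
    # number of symbols to turn counts into frequencies
    counts = Counter(text)
    total = len(text)
    return {symbol: count / total for symbol, count in counts.items()}
```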
:bulb: The distribution for our example text is the following:
| symbol | frequency (3 decimals) |
| ------ | ---------------------- |
| t | $0.056$ |
| h | $0.056$ |
| i | $0.056$ |
| s | $0.056$ |
| (space) | $0.194$ |
| a | $0.111$ |
| n | $0.056$ |
| e | $0.111$ |
| x | $0.028$ |
| m | $0.056$ |
| p | $0.028$ |
| l | $0.028$ |
| o | $0.028$ |
| f | $0.083$ |
| u | $0.028$ |
| r | $0.028$ |
#### Question 13. [[toc](README.md#table-of-content)]
:pencil: Write a function `entropy` that will compute the entropy of a
probability distribution. It should have the following signature:
- arguments:
  - `x`: a probability distribution, represented as a dictionary where
the keys are the $(a_i)$ and the values are the associated $(p_i)$
- return: the entropy of `x`, i.e. $H(X)$
:bulb: The entropy of our example input is $3.714192447093237$.
:question: What is the entropy of `README.md`?
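A minimal sketch of such a function, directly following the formula above:

```python
from math import log2


def entropy(x: dict) -> float:
    # H(X) = - sum_i p_i * log2(p_i); the distribution only contains
    # symbols that actually occur, so every p is strictly positive
    return -sum(p * log2(p) for p in x.values())
```

On the example distribution above, this should return $\approx 3.714$.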
Each symbol of the text will be encoded as a
binary sequence of $0$s and $1$s.
There are two properties that we want to have:
- (1) the more common a symbol in the text, the shorter the encoding. And
vice-versa with the least common symbols, e.g. in English, "_e_" should have
the shortest encoding because it is the most common letter and "_z_" should
have the longest one.
- (2) we want to use an encoding where no encoded symbol is the prefix of
another one
Intuitively, property (1) will help reduce the size of the text file by
assigning shorter encodings to the most common symbols in the text.
Property (2) will make sure there is no ambiguity in the encoding and greatly
help the decoding process.
To satisfy these two properties, the _Huffman_ algorithm builds a binary
tree.
The construction of the tree is as follows:
- compute the probability distribution of the symbols in the text
- put all the symbols with their "weights", i.e. their frequencies at the start,
in a _priority queue_
- pop the two least frequent symbols from the _queue_
- merge them into a tree where they are the two children
- set their "weight" to the sum of the "weights" of the two children
- push this tree back into the queue
- rinse and repeat as long as there are strictly more than $1$ item in the _queue_
When there is only one item left in the queue, you can extract it and you should
have the _Huffman_ tree!
#### Question 14. [[toc](README.md#table-of-content)]
:pencil: Write a function `build_huffman_tree` that will build the _Huffman_
tree from a probability distribution. It should have the
following signature:
- arguments:
- `p`: a probability distribution of all the symbols in the text
- return: the _Huffman_ tree, i.e. a nested tuple (see the example tree below for what this can mean)
> :bulb: **Note**
>
> the [`src/priority_queue.py`](src/priority_queue.py) provides a _naive_ implementation of a
> _priority queue_ that should be good enough for this class.
>
> you are encouraged to use it, or feel free to use the built-in `heapq` module of Python.
>
> there is an example usage of the `src.priority_queue.PriorityQueue` class at the bottom of the module.
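As a sketch of how the loop described above can be written with this queue (the
nested-tuple leaf representation matches the example tree below; this is one
possible way, not the only one):

```python
from src.priority_queue import PriorityQueue


def build_huffman_tree(p: dict):
    # start with one (weight, symbol) leaf per symbol
    queue = PriorityQueue([(w, s) for s, w in p.items()])

    while len(queue) > 1:
        # pop the two least frequent items, merge them into a node whose
        # weight is the sum of its children's weights, and push it back
        w1, left = queue.pop()
        w2, right = queue.pop()
        queue.push((w1 + w2, (left, right)))

    _, tree = queue.pop()
    return tree
```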
:question: Test your `build_huffman_tree` on our example text. Do you get the same tree as below?
> :bulb: **Note**
>
> the weights are expressed as the number of occurrences, which is equivalent to
> the frequencies you have computed: you just need to divide or multiply by the
> total number of symbols to go from one to the other.
_(figure: the example Huffman tree as a flowchart, with each node labeled by
its weight; for instance, the weight-12 node is the parent of the space leaf)_
In terms of Python code, your tree might look like the following _nested tuple_:
```python
# a nested tuple is simply tuples inside other tuples
(  # this is the root
    (  # this is the left child of the root
        ('a', 'e'),
        (
            ('t', 'h'),  # t and h are siblings
            ('i', 's'),
        ),
    ),
    (  # this is the right child of the root
        (
            ('n', 'm'),
            (
                ('x', 'p'),
                ('l', 'o'),
            ),
        ),
        (
            (
                ('u', 'r'),
                'f',
            ),
            ' ',
        ),
    ),
)
```
> :bulb: **Note**
>
> the exact placement of the nodes in this tree is not important. However, the
> depth at which each node sits is really important and should depend on the
> frequency of its symbol compared to the other symbols.
>
> e.g. _a_ and _e_ could be swapped, depending on how they are sorted in the
> _priority queue_; however, they need to remain at the same depth, in order to
> have an encoding of the correct length!
### Coding books [[toc](README.md#table-of-content)]
The _Huffman_ tree contains all the information we need about the input text.
The idea is to traverse the tree from the root to each one of the leaves, our
symbols. When going to the left, a $0$ will be added to the sequence, going to
the right will add a $1$.
This will ensure, by construction of the _Huffman_ tree based on the frequencies
of the symbols, that the two _Huffman_ properties defined above are satisfied!
#### Question 15. [[toc](README.md#table-of-content)]
:pencil: Write a function `build_coding_book` that will compute the coding book
of any _Huffman_ tree. It should have the following signature:
- return: the coding book, represented as a dictionary where the keys are the
symbols and the values are the associated binary sequences
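As a sketch, the traversal can be written as a recursion over the nested tuples
(the parameter names are assumptions of this sketch; leaves are plain symbols,
internal nodes are pairs):

```python
def build_coding_book(tree, prefix: str = "") -> dict:
    # a leaf: the accumulated path of 0s and 1s is this symbol's encoding
    if isinstance(tree, str):
        return {tree: prefix}

    # an internal node: going left appends a 0, going right appends a 1
    left, right = tree
    book = build_coding_book(left, prefix + "0")
    book.update(build_coding_book(right, prefix + "1"))
    return book
```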
:question: Do you get the expected encoding for all symbols with our example
sentence?
Verify that you get the correct encoding, as below:
_(figure: the same Huffman tree with every edge labeled 0 (left) or 1 (right);
for instance, the weight-5 node's 1-edge leads to `f`, and the weight-12 node's
1-edge leads to the space leaf)_
This translates to the following coding book in Python:
```python
{
    'a': '000',
    'e': '001',
    't': '0100',
    'h': '0101',
    'i': '0110',
    's': '0111',
    'n': '1000',
    'm': '1001',
    'x': '10100',
    'p': '10101',
    'l': '10110',
    'o': '10111',
    'u': '11000',
    'r': '11001',
    'f': '1101',
    ' ': '111',
}
```
:question: Can you find any encoded sequence of bits that is the prefix of
another one?
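One way to check programmatically, as a small sketch over a coding book stored
in a `book` dictionary (a hypothetical name):

```python
def find_prefix_pairs(book: dict) -> list:
    # list every pair of distinct symbols where the encoding of the first
    # is a prefix of the encoding of the second
    return [
        (s1, s2)
        for s1, c1 in book.items()
        for s2, c2 in book.items()
        if s1 != s2 and c2.startswith(c1)
    ]


# with a valid Huffman coding book, this list should be empty
```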
It is now time to put together the compression algorithm. The steps are as follows:
- compute the frequencies of all symbols
- build the _Huffman_ tree
- build the coding book
- build the compressed sequence of $0$s and $1$s by replacing all symbols by
their associated binary encoding
- convert the sequence to real binary
- return the coding book, the binary sequence and the number of _padding_ bits
But what is _padding_? Well, due to the nature of the _Huffman_ tree, we don't
know in advance what the size of the encoding of each symbol will be. And we
know even less what the final size of the compressed bits will be! In
particular, we don't know whether the number of compressed bits will be a
multiple of $8$. To make sure the compressed data fits perfectly into a whole
number of bytes, we'll be adding some _padding_ bits, e.g. if the bits are
`0101101011`, we need to add $6$ bits of _padding_ to get to $2$ bytes, i.e.
`0101101011111111` if we pad with $1$s.
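As a small sketch of the padding arithmetic on this example:

```python
bits = "0101101011"

# bits missing to reach a whole number of bytes; the outer `% 8` avoids
# adding a full byte of padding when the length is already a multiple of 8
padding = (8 - len(bits) % 8) % 8
assert padding == 6

# here we pad with 1s as in the example above; padding with 0s works the
# same way, as long as compression and decompression agree on a convention
assert bits + "1" * padding == "0101101011111111"  # exactly 2 bytes
```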
#### Question 16. [[toc](README.md#table-of-content)]
:pencil: Write a function `compress` that implements the _Huffman_ compression.
It should have the following signature:
> :bulb: **Note**
>
> you can use the `into_bytes` function from [`src.helpers`](src/helpers.py) to
> convert a string of $0$s and $1$s into real bytes.
>
> there is documentation and you can find some simple examples at the bottom of
> the module.
:question: Test your `compress` function by compressing a bunch of text.
For instance, on our example text from the start, the bits, arranged in groups
of $8$, should be:
```
01000101
01100111
11101100
11111100
01000111
00110100
00010011
01011011
00011111
01111101
11100011
10101110
00110111
01100100
01000111
01001100
1001001
```
and there should be 1 bit of padding.
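Note that this output is $16 \times 8 + 7 = 135$ bits before padding, i.e. an
average of $135 / 36 = 3.75$ bits per symbol, just above the entropy of
$\approx 3.714$ bits per symbol computed earlier.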
:question: Is the average number of bits used in the compressed output always
higher than the entropy of the input text?
You can plot your results for
- samples of different lengths from a real text file, e.g. one written in
English, to make use of the structure of the language
- random character strings of various lengths, i.e. without structure
### Text decompression [[toc](README.md#table-of-content)]
To decode, we will need to invert the coding book into a "_decode book_", i.e.
swap its keys and values: the binary sequences become the keys and the symbols
become the values, and
vice-versa
> :bulb: **Note**
>
> this dictionary inversion only makes sense for dictionaries where all values
> are unique.
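As a sketch, such an inversion is a one-line dictionary comprehension:

```python
coding_book = {'a': '000', 'e': '001'}  # an excerpt of the book above

# swap keys and values; this is only lossless when all values are unique
decode_book = {code: symbol for symbol, code in coding_book.items()}

assert decode_book == {'000': 'a', '001': 'e'}
```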
#### Question 18. [[toc](README.md#table-of-content)]
:pencil: Write a function `decompress` that implements the decompression. It
should have the following signature:
> :bulb: **Hint**
>
> First, you can use the `from_bytes` function from [`src.helpers`](src/helpers.py) that will
> convert bytes into their string representation with $0$s and $1$s.
>
> there is documentation and you can find some simple examples at the bottom of
> the module.
>
> Second, you need to iterate over the $0$s and $1$s and read the "decode book"
> to rebuild the original sequence of symbols.
:question: Can you reconstruct our original example text with that last function?
As the first byte of this example is `01000101`, the only encoding in our
coding book that fits the start of these bits, thanks to the second _Huffman_
property, is `0100`. So we know the first character is `t`.
Then we can remove the first $4$ bits of the compressed bits and search for the
next symbol.
Without the bits of `t`, the first remaining $8$ bits are `01010110`. We see
that the only encoding that fits these first bits is `0101` and corresponds to
the symbol `h`.
And so on...
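The walkthrough above can be sketched as a simple loop (assuming `bits` is the
unpadded string of $0$s and $1$s and `decode_book` maps encodings back to
symbols):

```python
def decode(bits: str, decode_book: dict) -> str:
    symbols = []
    current = ""
    for bit in bits:
        # grow the current candidate encoding one bit at a time until it
        # matches an entry of the decode book, then start over
        current += bit
        if current in decode_book:
            symbols.append(decode_book[current])
            current = ""
    return "".join(symbols)
```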
### Writing a CLI compression tool [[toc](README.md#table-of-content)]
##### Question 19. [[toc](README.md#table-of-content)]
:pencil: Write a function `write` that will write compression information to
disk following the file format above. It should have the following signature:
- `compressed`: the compressed binary data
- `padding`: the number of padding bits
> :bulb: **Note**
>
> below is a small example of how to write newline-separated binary data to a file
>
> ```python
> with open("my_file.bin", "wb") as handle:
>     handle.write("hello world\n".encode())
>     handle.write(b"abc")
> ```
##### Question 20. [[toc](README.md#table-of-content)]
:pencil: Write a function `read` that will read compression information from
disk following the file format above. It should have the following signature:
- return: the codebook that comes from the compression, the compressed binary
data and the number of padding bits
> :bulb: **Note**
>
> below is a small example of how to read newline-separated binary data from a file
>
> ```python
> with open("my_file.bin", "rb") as handle:
>     hello_world = handle.readline().decode()
>     abc = handle.readline()
> ```
#### Wrapping up in a CLI application [[toc](README.md#table-of-content)]
In order to write a CLI application, we will be using the `argparse` module.
`src/helpers.py`:

```python
from typing import Iterable


def read_text(filename: str) -> str:
    """
    Read text from a file on disk by taking care of the decoding.
    """
    with open(filename, "rb") as handle:
        return handle.read().decode()


def __chunks(it: Iterable, n: int) -> Iterable:
    """Yield successive n-sized chunks from it."""
    for i in range(0, len(it), n):
        yield it[i:i + n]


def into_bytes(bits: str) -> (bytes, int):
    """
    Convert a sequence of 0s and 1s, represented as a simple string, into real
    bytes, possibly with padding.

    The padding bits are 0s.
    """
    padding = (8 - len(bits) % 8) % 8
    bin = bytes(map(
        lambda n: int(n, 2),
        __chunks(bits + '0' * padding, 8),
    ))
    return bin, padding


def from_bytes(bin: bytes, padding: int) -> str:
    """
    Convert a sequence of bytes to its string representation with 0s and 1s,
    taking care of padding.
    """
    decoded = ''.join(format(b, '#010b').removeprefix("0b") for b in bin)
    return decoded if padding == 0 else decoded[:-padding]


if __name__ == "__main__":
    assert into_bytes("0101") == (b'P', 4)
    assert into_bytes("01000001") == (b'A', 0)
    assert into_bytes("010000010100001") == (b'AB', 1)

    assert "0101" == from_bytes(b'P', 4)
    assert "01000001" == from_bytes(b'A', 0)
    assert "010000010100001" == from_bytes(b'AB', 1)
```
`src/priority_queue.py`:

```python
from typing import Any, List, Optional, Tuple

Value = Any


class PriorityQueue:
    """
    A naive implementation of a _priority queue_.

    An element is said to have a "_higher priority_" if its value is smaller.
    """

    def __init__(self, values: Optional[List[Tuple[float, Value]]] = None):
        """
        Initialize a _priority queue_ with optional pre-inserted values.
        """
        # use None instead of a mutable `[]` default, which would be shared
        # between all instances created without arguments
        self._values = values if values is not None else []

    def is_empty(self) -> bool:
        """
        Determine if the _queue_ is empty.
        """
        return len(self._values) == 0

    def __len__(self) -> int:
        """
        Give the number of remaining elements in the _queue_.
        """
        return len(self._values)

    def push(self, v: Tuple[float, Value]):
        """
        Push a value into the _queue_; it will be popped last, unless it has
        a _higher priority_.
        """
        self._values.append(v)

    def pop(self) -> Tuple[float, Value]:
        """
        Pop the element of the _queue_ with the _highest priority_.
        """
        sort_key = [w for w, _ in self._values]
        return self._values.pop(sort_key.index(min(sort_key)))


if __name__ == "__main__":
    from random import randint, random

    q = PriorityQueue()

    print("inserting 10 items with random weights")
    for _ in range(10):
        x = (random(), randint(0, 100))
        print(x)
        q.push(x)

    print("popping every item out of the queue, in order")
    while not q.is_empty():
        print(q.pop())
```