Entropy in information theory

Second part on entropy. Previous : Thermodynamics and classical entropy. Next : Entropy in statistical physics - proof of the ideal gas law.

The file compression problem

Imagine a list of separate elementary systems (say, molecules). The state of each system is assumed to be one in a known finite list of possible states (which may be the same list or different lists from one system to the next). Let us look for a method to specify all these states by digital information stored in a memory stick, in the most compressed possible format (to take small space in memory). (The case of an infinite list of possible states of each elementary system is not really excluded as it can be seen as the limit of the finite case, with probabilities converging to zero fast enough when states are listed from the few most frequent to the infinity of most exceptional ones).

Imagine the following storing method.
The states of successive systems are recorded as successive files in memory. Each possible state of an elementary system is encoded as "a file", that is a succession of bits, that is one in a given list of "possible files" (successions of bits), where "possible files" bijectively correspond to possible states of a system. For the sake of compression, these files will be just put aside each other, with neither any unused space between them nor any directory of files to distinguish their separation. To prevent ambiguities while the lengths of files may differ, the separation between files will be determined by the following method.

The state of the first given elementary system will be encoded by the first file (the first few bits) in memory, whose content is one in a list L of possibilities (in bijective correspondence with the possible states i of this system : let us denote this file as A_i, with length l_i bits. Then, whatever comes after this file will describe the next elementary systems (starting with the second one) in the same manner.
The condition to avoid any risk of ambiguity, is that for any possible memory content, only one element A_i of L (describing the first system) can be seen as its start (coincides with the sequence from the beginning with the same length). The forbidden case, where 2 files in L are each identical to the beginning of the same memory content with different lengths, would mean that one element of L coincides with the beginning of another.

So, the size of the first file can be inferred from a given memory content, by reading it from its first bits until they are found to form a sequence identical to one of the files in L. Then we know that the second file starts just after this.

Re-interpreting all possible memory contents as the binary expansions of real numbers between 0 and 1, each possible state i of the first system corresponds to the interval between A_i 2^-l_i and (A_i+1)2^-l_i, with length 2^-l_i, included in [0,1]. Now, the non-ambiguity condition is expressed as the non-overlap of these intervals. Therefore, the sum of all their lengths cannot exceed 1:

∑_i 2^-l_i ≤ 1

Here we chose binary digits to represent things. Why not use digits in another base instead of 2 ? For example, a digit with 8 values gives as much information as 3 binary digits, so that the size of files is counted by packs of 3 = log₂(8) bits. Now, to be neutral with respect to the choice of base and forget the particular case of binary representation, let us use ln(x) instead of log₂(x) to count the "objective size" that a digit in base x (i.e. with x possible values) is worth.
So, the "objective size" of a file with l_i bits, whose content is one of 2^l_i possible sequences of l_i bits, is the real number

S_i = ln(2^l_i) = ln(2).l_i

The above inequality is now written

∑_i e^-S_i ≤ 1.

Received entropy (average size of the file)

Now if each state i has probability p_i to occur (where ∑_ip_i = 1), the average size of the file is

S' = ∑_ip_iS_i

The quality of a compression is its ability to reduce this quantity. How may it depend on the choice of encoding ?

For any probabilistic state (p_i), let l_i = ⌈-log₂(p_i)⌉> (ceiling function), that is an idea of a file size "just less than 1 bit longer than -ln(p_i)".

Let p'_i = 2^-l_i = e^-S_i, so that S_i = ln(2).l_i = -ln p'_i
We have ∑_ip'_i ≤ ∑_i p_i = 1, thus we can choose a binary encoding where each state i is encoded by a file with length l_i, forming disjoint intervals in [0,1] with respective lengths p'_i.

In the particular case when all p_i are powers of 2, the encoding given by the above method is naturally expected to be a perfect one: it defines a partition of [0,1] into intervals with respective lengths p'_i = p_i. Thus, each p_i coincides with the probability that a perfectly random memory content, seen as a real number in [0,1], will fall in that interval. (Such a random real variable corresponds to a long random memory content, representing the limit for a long sequence of systems all satisfying this perfection condition).
Then the formula of the average size of the file, becomes the general definition of the entropy of any probabilistic state (p_i) of a system (where ∑_i p_i = 1):

S = - ∑_ip_i ln p_i

More generally, for any 2 probability laws p_i, p'_i on a given list of states, the above formula of S' defines the received entropy (*) of the probabilistic state (p_i) relative to (p'_i) :

S' = - ∑_i p_i ln p'_i

We can think of p'_i as an expected probability (a wrong expectation, since the correct one is p_i) : it is the probability for which this encoding would be perfect, thus the "expectation" implicitly expressed by the chosen encoding. It does not matter whether the condition put on (p'_i) is set as ∑_ip'_i ≤ 1 or as ∑_ip'_i = 1 because the former case can be reduced to the latter by adding one more state j to the list, with p_j = 0 and p'_j = 1−∑_ip'_i.

In particular for our above construction, -log₂(p_i) ≤ l_i< -log₂(p_i) + 1 gives S ≤ S' < S + ln 2.

Why is received entropy higher than real entropy

Let us show the (perhaps) obvious : that a storage format relative to wrong expectations (p'≠p) is sub-optimal, i.e. gives an S' > S.
For this, let us fix the p_i, and look at the variations of S' when the p'_i vary (keeping ∑_ip_i=∑_ip'_i = 1, thus ∑_i dp'_i =0):

dS'= - ∑_i (p_i/p'_i)dp'_i = ∑_i (1−(p_i/p'_i))dp'_i

The condition for this derivative to cancel is that p_i=p'_ifor all i : otherwise there must be an i for which p_i>p'_i, and a j for which p_j<p'_j ; then we see from this formula that in the case dp'_j = -dp'_i > 0 (while other variations of p' cancel), which means that p' goes further away from p, the value of S' increases (dS'>0).
In particular, for any binary encoding of a probabilistic state whose probabilities are not powers of 2, the average file size will always be higher than the entropy.

Additivity of entropy for independent systems

If 2 systems are independent (uncorrelated), i.e. any combined state (i,j) has probability p_i,j = a_ib_j (where (a_i) and (b_j) are the probabilities of states of each subsystem, independent of the other subsystem) then entropy is additive, as it is the average of

-ln p_i,j = (-ln a_i) + (-ln b_j)

This can be written completely formally, as

∑_i,j p_i,j ln p_i,j = ∑_i,ja_ib_j (ln a_i+ln b_j)
= ∑_ia_i(ln a_i) (∑_jb_j) + (∑_ia_i)(∑_jb_j ln b_j)
=∑_i a_i ln a_i+∑_j b_j ln b_j

Thus, even the one bit margin of extra entropy (+ln(2)) in the above compression method, can be almost eliminated (made to approach 0 per file), by gathering a large number of files representing uncorrelated systems, together as a big one: by looking for a compression, not of the state of one system, but of a large succession of independent systems. When treating the global system as one like above, we can find a compression with average size less than

(entropy of the global system + ln(2)) = (∑ entropies of individual systems) + ln(2)

Thus, the ln(2) margin comes only once for all instead of being repeated for each subsystem.
In conclusion, by such methods, even when probabilities are not powers of 2, the entropy keeps essentially representing the "size of information in its most compressed form".

Entropy of correlated systems

The case of correlated systems can be analyzed as follows:
Let a_i = ∑_j p_i,j and b_i,j = p_i,j/a_i (where the i such that a_i = 0 are eliminated from the list) so that p_i,j = a_i b_i,j and ∑_j b_i,j=1 for all i. Then

S = - ∑_i,j a_i b_i,j (ln a_i + ln b_i,j) = S_A + ∑_i a_i S_Bi

where S_Bi = - ∑_j b_i,j ln b_i,j is the entropy of the probabilistic state (b_i,j) of B (for fixed i and variable j).
Entropy is a concave function of the probabilistic state (sum of concave functions -x ln(x)), thus an average of entropies is lower than the entropy of the average of probabilistic states with the same weights :

∑_i a_i S_Bi ≤ S_B = -∑_i,j b_j ln b_j

where b_j=∑_i p_i,j = ∑_i a_i b_i,j.

We conclude

S ≤ S_A + S_B

(*) I had to invent the expression "received entropy" as I did not find this quantity defined and named elsewhere.

Links on entropy

Next page: Entropy in statistical physics
Table of contents : Foundations of physics