# Entropy in information theory

Second part on entropy. Previous : Thermodynamics and classical entropy. Next : Entropy in statistical physics - proof of the ideal gas law.

### The file compression problem

Imagine a list of separate elementary systems (say, molecules). The state of each system is assumed to be one in a known finite list of possible states (which may be the same list or different lists from one system to the next). Let us look for a method to specify all these states by digital information stored in a memory stick, in the most compressed possible format (to take small space in memory). (The case of an infinite list of possible states of each elementary system is not really excluded as it can be seen as the limit of the finite case, with probabilities converging to zero fast enough when states are listed from the few most frequent to the infinity of most exceptional ones).

Imagine the following storing method.
The states of successive systems are recorded as successive files in memory. Each possible state of an elementary system is encoded as "a file", that is a succession of bits, that is one in a given list of "possible files" (successions of bits), where "possible files" bijectively correspond to possible states of a system. For the sake of compression, these files will be just put aside each other, with neither any unused space between them nor any directory of files to distinguish their separation. To prevent ambiguities while the lengths of files may differ, the separation between files will be determined by the following method.

The state of the first given elementary system will be encoded by the first file (the first few bits) in memory, whose content is one in a list L of possibilities (in bijective correspondence with the possible states i of this system : let us denote this file as Ai, with length li bits. Then, whatever comes after this file will describe the next elementary systems (starting with the second one) in the same manner.
The condition to avoid any risk of ambiguity, is that for any possible memory content, only one element Ai of L (describing the first system) can be seen as its start (coincides with the sequence from the beginning with the same length). The forbidden case, where 2 files in L are each identical to the beginning of the same memory content with different lengths, would mean that one element of L coincides with the beginning of another.

So, the size of the first file can be inferred from a given memory content, by reading it from its first bits until they are found to form a sequence identical to one of the files in L. Then we know that the second file starts just after this.

Re-interpreting all possible memory contents as the binary expansions of real numbers between 0 and 1, each possible state i of the first system corresponds to the interval between Ai 2-li and (Ai+1)2-li, with length 2-li, included in [0,1]. Now, the non-ambiguity condition is expressed as the non-overlap of these intervals. Therefore, the sum of all their lengths cannot exceed 1:
i 2-li ≤ 1

Here we chose binary digits to represent things. Why not use digits in another base instead of 2 ? For example, a digit with 8 values gives as much information as 3 binary digits, so that the size of files is counted by packs of 3 = log2(8) bits. Now, to be neutral with respect to the choice of base and forget the particular case of binary representation, let us use ln(x) instead of log2(x) to count the "objective size" that a digit in base x (i.e. with x possible values) is worth.
So, the "objective size" of a file with li bits, whose content is one of 2li possible sequences of li bits, is the real number
Si = ln(2li) = ln(2).li
The above inequality is now written
i e-Si ≤ 1.

### Received entropy (average size of the file)

Now if each state i has probability pi to occur (where i pi = 1), the average size of the file is
S' = ipiSi

The quality of a compression is its ability to reduce this quantity. How may it depend on the choice of encoding ?

For any probabilistic state (pi), let li = ⌈-log2(pi)⌉> (ceiling function), that is an idea of a file size "just less than 1 bit longer than -ln(pi)".

Let p'i = 2-li = e-Si, so that Si = ln(2).li = -ln p'i
We have ∑ip'i ≤ ∑i pi = 1, thus we can choose a binary encoding where each state i is encoded by a file with length li, forming disjoint intervals in [0,1] with respective lengths p'i.

In the particular case when all pi are powers of 2, the encoding given by the above method is naturally expected to be a perfect one: it defines a partition of [0,1] into intervals with respective lengths p'i = pi. Thus, each pi coincides with the probability that a perfectly random memory content, seen as a real number in [0,1], will fall in that interval. (Such a random real variable corresponds to a long random memory content, representing the limit for a long sequence of systems all satisfying this perfection condition).
Then the formula of the average size of the file, becomes the general definition of the entropy of any probabilistic state (pi) of a system (where i pi = 1):

S = - i pi ln pi

More generally, for any 2 probability laws pi, p'i on a given list of states, the above formula of S' defines the received entropy (*) of the probabilistic state (pi) relative to (p'i) :

S' = - i pi ln p'i

We can think of p'i as an expected probability (a wrong expectation, since the correct one is pi) : it is the probability for which this encoding would be perfect, thus the "expectation" implicitly expressed by the chosen encoding. It does not matter whether the condition put on (p'i) is set as i p'i ≤ 1 or as i p'i  = 1 because the former case can be reduced to the latter by adding one more state j to the list, with pj = 0 and p'j = 1−i p'i.

In particular for our above construction, -log2(pi) ≤ li < -log2(pi) + 1 gives SS' < S + ln 2.

### Why is received entropy higher than real entropy

Let us show the (perhaps) obvious : that a storage format relative to wrong expectations (p'p) is sub-optimal, i.e. gives an S' > S.
For this, let us fix the pi, and look at the variations of S' when the p'i vary (keeping ipi=ip'i = 1, thus i dp'i =0):

dS'= - i (pi/p'i)dp'i = i (1−(pi/p'i))dp'i

The condition for this derivative to cancel is that pi=p'i for all i : otherwise there must be an i for which pi>p'i, and a j for which pj<p'j ; then we see from this formula that in the case dp'j = -dp'i > 0 (while other variations of p' cancel), which means that p' goes further away from p, the value of S' increases (dS'>0).
In particular, for any binary encoding of a probabilistic state whose probabilities are not powers of 2, the average file size will always be higher than the entropy.

### Additivity of entropy for independent systems

If 2 systems are independent (uncorrelated), i.e. any combined state (i,j) has probability pi,j = aibj (where (ai) and (bj) are the probabilities of states of each subsystem, independent of the other subsystem) then entropy is additive, as it is the average of
-ln pi,j = (-ln ai) + (-ln bj)
This can be written completely formally, as
i,j pi,j ln pi,j =i,j aibj (ln ai+ln bj)
= iai (ln ai) (jbj) + (∑iai)(jbj ln bj)
=i ai ln ai+j bj ln bj
Thus, even the one bit margin of extra entropy (+ln(2)) in the above compression method, can be almost eliminated (made to approach 0 per file), by gathering a large number of files representing uncorrelated systems, together as a big one: by looking for a compression, not of the state of one system, but of a large succession of independent systems. When treating the global system as one like above, we can find a compression with average size less than

(entropy of the global system + ln(2)) = ( entropies of individual systems) + ln(2)

Thus, the ln(2) margin comes only once for all instead of being repeated for each subsystem.
In conclusion, by such methods, even when probabilities are not powers of 2, the entropy keeps essentially representing the "size of information in its most compressed form".

### Entropy of correlated systems

The case of correlated systems can be analyzed as follows:
Let ai = j pi,j and bi,j = pi,j/ai (where the i such that ai = 0 are eliminated from the list) so that pi,j = ai bi,j and j bi,j=1 for all i. Then

S = - i,j ai bi,j (ln ai + ln bi,j) = SA + i ai SBi

where SBi = - j bi,j ln bi,j is the entropy of the probabilistic state (bi,j) of B (for fixed i and variable j).
Entropy is a concave function of the probabilistic state (sum of concave functions -x ln(x)), thus an average of entropies is lower than the entropy of the average of probabilistic states with the same weights :
i ai SBiSB = -i,j  bj ln bj
where bj=i pi,j = i ai bi,j.

We conclude
SSA + SB

(*) I had to invent the expression "received entropy" as I did not find this quantity defined and named elsewhere.