Entropy in information theory
Second part on entropy. Previous :
Thermodynamics and classical entropy. Next : Entropy in statistical physics - proof of the ideal
The file compression problem
Imagine a list of separate elementary systems (say, molecules). The
state of each system is assumed to be one in a known finite list of
possible states (which may be the same list or different lists from
one system to the next). Let us look for a method to specify all
these states by digital information stored in a memory stick, in the
most compressed possible format (to take small space in memory).
(The case of an infinite list of possible states of each elementary
system is not really excluded as it can be seen as the limit of the
finite case, with probabilities converging to zero fast enough when
states are listed from the few most frequent to the infinity of most
Imagine the following storing method.
The states of successive systems are recorded as successive files in
memory. Each possible state of an elementary system is encoded as "a
file", that is a succession of bits, that is one in a given list of
"possible files" (successions of bits), where "possible files"
bijectively correspond to possible states of a system. For the sake
of compression, these files will be just put aside each other, with
neither any unused space between them nor any directory of files to
distinguish their separation. To prevent ambiguities while the
lengths of files may differ, the separation between files will be
determined by the following method.
The state of the first given elementary system will be encoded by
the first file (the first few bits) in memory, whose content is one
in a list L of possibilities (in bijective correspondence
with the possible states i of this system : let us denote
this file as Ai, with length li
bits. Then, whatever comes after this file will describe the next
elementary systems (starting with the second one) in the same
The condition to avoid any risk of ambiguity, is that for any
possible memory content, only one element Ai of L
(describing the first system) can be seen as its start (coincides
with the sequence from the beginning with the same length). The
forbidden case, where 2 files in L are each identical to the
beginning of the same memory content with different lengths, would
mean that one element of L coincides with the beginning of
So, the size of the first file can be inferred from a given memory
content, by reading it from its first bits until they are found to
form a sequence identical to one of the files in L. Then we
know that the second file starts just after this.
Re-interpreting all possible memory contents as the binary
expansions of real numbers between 0 and 1, each possible state i
of the first system corresponds to the interval between Ai
2-li and (Ai+1)2-li,
with length 2-li, included in [0,1].
Now, the non-ambiguity condition is expressed as the non-overlap of
these intervals. Therefore, the sum of all their lengths cannot
Here we chose binary digits to represent things. Why not use digits
in another base instead of 2 ? For example, a digit with 8 values
gives as much information as 3 binary digits, so that the size of
files is counted by packs of 3 = log2(8) bits. Now, to be
neutral with respect to the choice of base and forget the particular
case of binary representation, let us use ln(x) instead of
log2(x) to count the "objective size" that a digit
in base x (i.e. with x possible values) is worth.
So, the "objective size" of a file with li bits,
whose content is one of 2li possible
sequences of li bits, is the real number
Si = ln(2li)
The above inequality is now written
Received entropy (average size of the file)
Now if each state i has probability pi to
occur (where ∑i pi
= 1), the average size of the file is
S' = ∑ipiSi
The quality of a compression is its ability to reduce this
quantity. How may it depend on the choice of encoding ?
For any probabilistic state (pi), let li
= ⌈-log2(pi)⌉> (ceiling
function), that is an idea of a file size
"just less than 1 bit longer than -ln(pi)".
Let p'i = 2-li = e-Si,
so that Si = ln(2).li = -ln p'i
We have ∑ip'i ≤ ∑i
pi = 1, thus we can choose a binary encoding
where each state i is encoded by a file with length li,
forming disjoint intervals in [0,1] with respective lengths p'i.
In the particular case when all pi are powers of
2, the encoding given by the above method is naturally expected to
be a perfect one: it defines a partition of [0,1] into intervals
with respective lengths p'i = pi.
Thus, each pi coincides with the probability that
a perfectly random memory content, seen as a real number in [0,1],
will fall in that interval. (Such a random real variable corresponds
to a long random memory content, representing the limit for a long
sequence of systems all satisfying this perfection condition).
Then the formula of the average size of the file, becomes the
general definition of the entropy of any probabilistic state
(pi) of a system (where ∑i
pi = 1):
S = - ∑i pi
More generally, for any 2 probability laws pi, p'i
on a given list of states, the above formula of S' defines
the received entropy (*) of the probabilistic state (pi)
relative to (p'i) :
S' = - ∑i
pi ln p'i
We can think of p'i as an expected
probability (a wrong expectation, since the correct one is pi)
: it is the probability for which this encoding would be perfect,
thus the "expectation" implicitly expressed by the chosen encoding.
It does not matter whether the condition put on (p'i)
is set as ∑i p'i ≤
1 or as ∑i p'i
= 1 because the former case can be reduced to the latter by adding
one more state j to the list, with pj = 0
and p'j = 1−∑i p'i.
In particular for our above construction, -log2(pi)
≤ li < -log2(pi)
+ 1 gives S ≤ S' < S + ln 2.
Why is received entropy higher than real entropy
Let us show the (perhaps) obvious : that a storage format relative
to wrong expectations (p'≠p) is sub-optimal, i.e. gives an S'
For this, let us fix the pi, and look at
the variations of S' when the p'i vary
= 1, thus ∑i dp'i
The condition for this derivative to cancel is that pi=p'i
for all i : otherwise there must be an i
for which pi>p'i, and a j
for which pj<p'j ;
then we see from this formula that in the case dp'j
= -dp'i > 0 (while other variations of p'
cancel), which means that p' goes further away from p,
the value of S' increases (dS'>0).
dS'= - ∑i (pi/p'i)dp'i
= ∑i (1−(pi/p'i))dp'i
In particular, for any binary encoding of a probabilistic state
whose probabilities are not powers of 2, the average file size will
always be higher than the entropy.
Additivity of entropy for independent systems
If 2 systems are independent (uncorrelated), i.e. any combined state
(i,j) has probability pi,j = aibj
(where (ai) and (bj) are the
probabilities of states of each subsystem, independent of the other
subsystem) then entropy is additive, as it is the average of
-ln pi,j = (-ln ai)
+ (-ln bj)
This can be written completely formally, as
ln pi,j = ∑i,j aibj
(ln ai+ln bj)
Thus, even the one bit margin of extra entropy (+ln(2)) in the above
compression method, can be almost eliminated (made to approach 0 per
file), by gathering a large number of files representing
uncorrelated systems, together as a big one: by looking for a
compression, not of the state of one system, but of a large
succession of independent systems. When treating the global system
as one like above, we can find a compression with average size less
= ∑iai (ln ai)
(∑jbj) + (∑iai)(∑jbj
=∑i ai ln ai+∑j
bj ln bj
(entropy of the global system + ln(2)) = (∑ entropies
of individual systems) + ln(2)
Thus, the ln(2) margin comes only once for all instead of being
repeated for each subsystem.
In conclusion, by such methods, even when probabilities are not
powers of 2, the entropy keeps essentially representing the "size of
information in its most compressed form".
Entropy of correlated systems
The case of correlated systems can be analyzed as follows:
Let ai = ∑j pi,j
and bi,j = pi,j/ai
(where the i such that ai = 0 are
eliminated from the list) so that pi,j = ai
bi,j and ∑j bi,j=1
for all i. Then
where SBi = - ∑j bi,j
ln bi,j is the entropy of the probabilistic state
(bi,j) of B (for fixed i and
S = - ∑i,j ai bi,j
(ln ai + ln bi,j) = SA
+ ∑i ai SBi
Entropy is a concave function of the probabilistic state (sum of
concave functions -x ln(x)), thus an average of
entropies is lower than the entropy of the average of probabilistic
states with the same weights :
∑i ai SBi
≤ SB = -∑i,j bj
where bj=∑i pi,j
= ∑i ai bi,j.
S ≤ SA + SB
(*) I had to invent the expression "received entropy" as I did
not find this quantity defined and named elsewhere.
Links on entropy
Next page: Entropy in statistical physics
Table of contents : Foundations