Information entropy
Encyclopedia : I : IN : INF : Information entropy
Entropy is a concept in thermodynamics (see thermodynamic entropy), statistical mechanics and information theory. The concepts of information and entropy have deep links with one another, although it took many years for the development of the theories of statistical mechanics and information theory to make this apparent. This article is about information entropy, the information-theoretic formulation of entropy. Information entropy is occasionally called Shannon's entropy in honor of Claude E. Shannon.
Introduction
The concept of entropy in information theory describes with how much randomness (or, alternatively, 'uncertainty') there is in a signal or random event. An alternative way to look at this is to talk about how much information is carried by the signal.
For example, consider some English text, encoded as a string of letters, spaces, and punctuation (so our signal is a string of characters). Since letter frequency for some characters is not very high (e.g. 'z'), while other letters are very common (e.g. 'e'), the string of characters is not really as random as it might be. On the other hand, since we cannot predict what the next character will be, it is, to some degree, 'random'. Entropy is a measure of this randomness, suggested by Shannon in his 1948 paper "[A Mathematical Theory of Communication]".
Shannon offers a definition of entropy which satisfies the assumptions that:
- The measure should be proportional (continuous) - i.e. changing the value of one of the probabilities by a very small amount should only change the entropy by a small amount.
- If all the outcomes (letters in the example above) are equally likely then increasing the number of letters should always increase the entropy.
- We should be able to make the choice (in our example of a letter) in two steps, in which case the entropy of the final result should be a weighted sum of the entropies of the two steps.
Formal definitions
Claude E. Shannon defines entropy in terms of a discrete random event x, with possible states (or outcomes) 1...n as:
- [H(x)=\sum_^np(i)\log_2 \left(\frac\right)=-\sum_^np(i)\log_2 p(i).\,\!]
Shannon shows that any definition of entropy satisfying his assumptions will be of the form:
- :[-K\sum_^np(i)\log p(i).\,\!]
Shannon defined a measure of entropy (H = − p1 log2 p1 − … − pn log2 pn) that, when applied to an information source, could determine the minimum channel capacity required to reliably transmit the source as encoded binary digits. The formula can be derived by calculating the mathematical expectation of the amount of information contained in a digit from the information source. Shannon's entropy measure came to be taken as a measure of the uncertainty about the realization of a random variable. It thus served as a proxy capturing the concept of information contained in a message as opposed to the portion of the message that is strictly determined (hence predictable) by inherent structures. For example, redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc. See Markov chain.
Shannon's definition of entropy is closely related to thermodynamic entropy as defined by physicists and many chemists. Boltzmann and Gibbs did considerable work on statistical thermodynamics, which became the inspiration for adopting the word entropy in information theory. There are relationships between thermodynamic and informational entropy. In fact, in the view of Jaynes (1957), thermodynamics should be seen as an application of Shannon's information theory: the thermodynamic entropy is interpreted as being an estimate of the amount of further Shannon information (needed to define the detailed microscopic state of the system) that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics. (See article: MaxEnt thermodynamics). Similarly, Maxwell's demon reverses thermodynamic entropy with information; but if it is itself bound by the laws of thermodynamics, getting rid of that information exactly balances out the thermodynamic gain the demon would otherwise achieve.
It is important to remember that entropy is a quantity defined in the context of a probabilistic model for a data source. Independent fair coin flips have an entropy of 1 bit per flip. A source that always generates a long string of A's has an entropy of 0, since the next character will always be an 'A'.
The entropy rate of a data source means the average number of bits per symbol needed to encode it. Empirically, it seems that entropy of English text is between 1.1 and 1.6 bits per character, though clearly that will vary from text source to text source. Experiments with human predictors show an information rate of 1.1 or 1.6 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.
From the preceding example, note the following points:
- The amount of entropy is not always an integer number of bits.
- Many data bits may not convey information. For example, data structures often store information redundantly, or have identical sections regardless of the information in the data structure.
A common way to define entropy for text is based on the Markov model of text. For an order-0 source (each character is selected independent of the last characters), the binary entropy is:
- [H(\mathcal) = - \sum p_i \log_2 p_i, \,\!]
- [H(\mathcal) = - \sum_i p_i \sum_j \ p_i (j) \log_2 p_i (j), \,\!]
For a second order Markov source, the entropy rate is
- [ H(\mathcal) = -\sum_i p_i \sum_j p_i(j) \sum_k p_(k)\ \log \ p_(k). \,\!]
- [ H_b(\mathcal) = - \sum_^n p_i \log_b p_i \,\!]
Another way to define the entropy function H (not using the Markov model) is by proving that H is uniquely defined (as earlier mentioned) iff H satisfies 1) - 3):
1) H(p1, …, pn) is defined and continuous for all p1, …, pn where pi [\in][0,1] for all i = 1, …, n and p1 + … + pn = 1. (Remark that the function solely depends on the probability distribution, not the alphabet.)
2) For all positive integers n, H satisfies
- [H\underbrace, \ldots, \frac\right)}_} < H\underbrace, \ldots, \frac\right).}_}]
- [H\underbrace, \ldots, \frac\right)}_n = H\underbrace, \ldots, \frac\right)}_k + \sum_^k \frac H\underbrace, \ldots, \frac\right)}_.]
Efficiency
A source alphabet encountered in practice should be found to have a probability distribution which is less than optimal. If the source alphabet has n symbols, then it can be compared to an "optimized alphabet" with n symbols, whose probability distribution is uniform. The ratio of the entropy of the source alphabet with the entropy of its optimized version is the efficiency of the source alphabet, which can be expressed as a percentage.This implies that the efficiency of a source alphabet with n symbols can be defined simply as being equal to its n-ary entropy. See also Redundancy (information theory).
Derivation of Shannon's entropy
Since the entropy was given as a definition, it does not need to be derived. On the other hand, a "derivation" can be given which gives a sense of the motivation for the definition as well as the link to thermodynamic entropy.Q. Given a roulette with n pockets which are all equally likely to be landed on by the ball, what is the probability of obtaining a distribution (A1, A2, …, An) where Ai is the number of times pocket i was landed on and
- [ P = \sum_^n A_i \,\!]
A. The probability is a multinomial distribution, viz.
- [ p = = \left(\frac1n\right)^P \,\!]
- [ \Omega = \,\!]
- [ \Tau = n^P \ ]
Q. And what is the entropy?
A. The entropy of the distribution is obtained from the logarithm of Ω:
- [ H = \log \Omega = \log \frac \,\!]
- :[ = \log P! - \log A_1! - \log A_2! - \log A_3! - \cdots - \log A_n! \ ]
- :[ = \sum_i^P \log i - \sum_i^ \log i - \sum_i^ \log i - \cdots - \sum_i^ \log i \,\!]
- [ H = \int_1^P \log x \, dx - \int_1^ \log x \, dx - \int_1^ \log x \, dx - \cdots - \int_1^ \log x \, dx. \,\!]
- [ \int \log x \, dx = x \log x - \int x \, = x \log x - x. \,\!]
- [ H = (P \log P - P + 1) - (A_1 \log A_1 - A_1 + 1) - (A_2 \log A_2 - A_2 + 1) - \cdots - (A_n \log A_n - A_n + 1) ]
- :[ = (P \log P + 1) - (A_1 \log A_1 + 1) - (A_2 \log A_2 + 1) - \cdots - (A_n \log A_n + 1) ]
- :[ = P \log P - \sum_^n A_x \log A_x + (1 - n) \,\!]
- [ H = (1 - n) - \sum_^n p_x \log p_x \,\!]
- [ H = - \sum_^n p_x \log p_x \,\!].
- [ H = \log \Omega \ ]
- [ \mathcal = k \ln \Omega ],
Properties of Shannon's information entropy
We write H(X) as Hn(p1,...,pn). The Shannon entropy satisfies the following properties:
- For any n, Hn(p1,...,pn) is a continuous and symmetric function on variables p1, p2,...,pn.
- Event of probability zero does not contribute to the entropy, i.e. for any n,
- [H_(p_1,\ldots,p_n,0) = H_n(p_1,\ldots,p_n)].
- If [p_, 1\leq i \leq m, 1\leq j \leq n] are non-negative real numbers summing up to one, and [q_i = \sum_^n p_], then
Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form
- [H_n(p_1,\ldots,p_n) = -k \sum_^n p_i \log p_i],
Extending discrete entropy to the continuous case: differential entropy
The Shannon entropy is restricted to finite sets. It seems that the formula
- [h[f] = -\int_^ f(x) \log f(x)\, dx,\quad (*)]
We wish to obtain a generally finite measure as the bin size goes to zero. In the discrete case, the bin size is the (implicit) width of each of the n (finite or infinite) bins whose probabilities are denoted by pn. As we generalize to the continuous domain, we must make this width explicit.
To do this, start with a continuous function f discretized as shown in the figure. As the figure indicates, by the mean-value theorem there exists a value xi in each bin such that
- [f(x_i) \Delta = \int_^ f(x)\, dx]
- [\int_^ f(x)\, dx = \lim_ \sum_^ f(x_i) \Delta]
We will denote
- [H^ :=- \sum_^ \Delta f(x_i) \log \Delta f(x_i)]
- [H^ = - \sum_^ \Delta f(x_i) \log \Delta f(x_i)]
- [ = - \sum_^ \Delta f(x_i) \log f(x_i) -\sum_^ f(x_i) \Delta \log \Delta.]
- [\sum_^ f(x_i) \Delta \to \int f(x)\, dx = 1]
- [\sum_^ \Delta f(x_i) \log f(x_i) \to \int f(x) \log f(x)\, dx.]
- [h[f] = \lim_ \left[H^ + log Deltaright] = -\int_^ f(x) \log f(x)\,dx,]
It turns out as a result that, unlike the Shannon entropy, the differential entropy is not in general a good measure of uncertainty or information. For example, the differential entropy can be negative; also it is not invariant under continuous co-ordinate transformations.
More useful for the continuous case is the relative entropy of a distribution, defined as the Kullback-Leibler divergence from the distribution to a reference measure m(x),
- [D_}(f(x)\|m(x)) = \int f(x)\log\frac\,dx]
This article incorporates material from on PlanetMath, which is licensed under the [Text of the GNU Free Documentation LicenseGFDL].
See also
- Entropy encoding
- Kolmogorov-Sinai entropy in dynamical systems
- Rényi entropy
- Perplexity
- Theil index
External links
- [Information is not entropy, information is not uncertainty !] - a discussion of the use of the terms "information" and "entropy".
- [I'm Confused: How Could Information Equal Entropy?] - a similar discussion on the bionet.info-theory FAQ.
- [Java "entropy pool" for cryptographically-secure unguessable random numbers]
- [Description of information entropy from "Tools for Thought" by Howard Rheingold]
- [A cool little java applet representing Shannon's Experiment to Calculate the Entropy of English]
From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.
