Mutual information
In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. The most common unit of measurement of mutual information is the bit, in which case the logarithms below should be taken to the base 2.
Intuitively, mutual information measures the information about X that is shared by Y. If X and Y are independent, then X contains no information about Y and vice versa, so their mutual information is zero. If X and Y are identical then all information conveyed by X is shared with Y: knowing X reveals nothing new about Y and vice versa, therefore the mutual information is the same as the information conveyed by X (or Y) alone, namely the entropy of X. In a specific sense (see below), mutual information quantifies the distance between the joint distribution of X and Y and the product of their marginal distributions.
Formally, the mutual information of two discrete random variables X and Y can be defined as:
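- [ I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}, ]
where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.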
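The discrete definition can be evaluated directly from a table of joint probabilities. The following is a minimal sketch, not a reference implementation: the function name `mutual_information`, the dictionary layout, and the example distribution are illustrative assumptions.

```python
import math

def mutual_information(joint, base=2.0):
    """Mutual information of a discrete joint distribution.

    `joint` maps (x, y) pairs to probabilities p(x, y) that sum to 1.
    With base=2 the result is in bits.
    """
    # Marginals p(x) and p(y): sum the joint over the other variable.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    total = 0.0
    for (x, y), p in joint.items():
        if p > 0.0:  # terms with p(x, y) = 0 contribute nothing
            total += p * math.log(p / (px[x] * py[y]), base)
    return total

# Illustrative joint distribution: two binary variables that agree 80% of the time.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
mi_bits = mutual_information(joint)  # about 0.278 bits
```

Zero-probability cells are skipped, following the usual convention that 0 log 0 = 0.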
In the continuous case, we replace summation by a definite double integral:
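- [ I(X;Y) = \int_Y \int_X p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \; dx \,dy, \!]
where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.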
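As a check on the continuous definition, the double integral can be approximated numerically for a case with a known answer. The sketch below assumes a standard bivariate Gaussian with correlation rho, for which the well-known closed form is I(X;Y) = -(1/2) ln(1 - rho^2) nats. The Gaussian example, the function name `gaussian_mi_numeric`, the integration limits, and the step size are all illustrative choices that do not come from the article.

```python
import math

def gaussian_mi_numeric(rho, limit=8.0, step=0.05):
    """Approximate I(X;Y) in nats for a standard bivariate Gaussian
    with correlation rho, using a midpoint Riemann sum over
    the square [-limit, limit] x [-limit, limit]."""
    n = int(round(2 * limit / step))
    one_minus = 1.0 - rho * rho
    norm = 1.0 / (2.0 * math.pi * math.sqrt(one_minus))
    total = 0.0
    for i in range(n):
        x = -limit + (i + 0.5) * step
        for j in range(n):
            y = -limit + (j + 0.5) * step
            # Exponent of the joint density p(x, y).
            q = (x * x - 2.0 * rho * x * y + y * y) / (2.0 * one_minus)
            p_xy = norm * math.exp(-q)
            # log[p(x,y) / (p(x) p(y))], expanded analytically so that
            # no tiny densities are divided by each other.
            log_ratio = -0.5 * math.log(one_minus) - q + 0.5 * (x * x + y * y)
            total += p_xy * log_ratio * step * step
    return total

rho = 0.6
numeric = gaussian_mi_numeric(rho)
closed_form = -0.5 * math.log(1.0 - rho * rho)  # about 0.223 nats
```

The natural logarithm is used here, so both values are in nats; dividing by ln 2 converts them to bits.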
Mutual information is a measure of dependence in the following sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = p(x) × p(y), and therefore:
- [ \log \frac{p(x,y)}{p(x)\,p(y)} = \log 1 = 0. \!]
Several generalizations of mutual information to more than two random variables have been proposed, but no widely agreed-upon definition has yet emerged.
Relation to other quantities
Mutual information can be equivalently expressed as
- [ I(X;Y) = H(X) - H(X|Y) \,]
- :[ = H(Y) - H(Y|X) \,]
- :[ = H(X) + H(Y) - H(X,Y) \,]
Note that H(X|X) = 0 and therefore H(X) = I(X;X). This is the reason why entropy is often called self-information. Thus I(X;X) ≥ I(X;Y), and one can formulate the basic principle that a variable contains more information about itself than any other variable can provide.
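The three entropy expressions, and the identity I(X;X) = H(X), can be checked numerically on a small example. This is a sketch under stated assumptions: the helper names `entropy` and `conditional_entropy` and the joint distribution are hypothetical, and all quantities are in bits.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def conditional_entropy(joint, marginal, given):
    """H(A|B) in bits, computed directly from p(a|b) = p(a,b) / p(b).

    `given` is the tuple index (0 or 1) of the conditioning variable B,
    and `marginal` is the distribution of B.
    """
    return -sum(p * math.log2(p / marginal[key[given]])
                for key, p in joint.items() if p > 0.0)

# Illustrative joint distribution of two binary variables.
joint = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

h_x, h_y, h_xy = entropy(px), entropy(py), entropy(joint)

form_1 = h_x - conditional_entropy(joint, py, given=1)  # H(X) - H(X|Y)
form_2 = h_y - conditional_entropy(joint, px, given=0)  # H(Y) - H(Y|X)
form_3 = h_x + h_y - h_xy                               # H(X) + H(Y) - H(X,Y)

# Pairing X with itself puts all probability mass on the diagonal,
# so H(X,X) = H(X) and I(X;X) = H(X).
self_joint = {(x, x): p for x, p in px.items()}
i_xx = h_x + h_x - entropy(self_joint)
```

All three forms agree up to floating-point rounding, and `i_xx` equals `h_x`, which is at least as large as the mutual information between X and Y.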
Mutual information can also be expressed as the Kullback-Leibler divergence of the product p(x) × p(y) of the marginal distributions of X and Y from their joint distribution p(x,y):
- [ I(X;Y) = D_{\mathrm{KL}}(p(x,y)\|p(x)p(y)). ]
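Furthermore, writing p(x|y) = p(x,y) / p(y) for the conditional distribution of X given Y:
- [ I(X;Y) = \sum_y p(y) \sum_x p(x|y) \log \frac{p(x|y)}{p(x)} \!]
- : [ = \sum_y p(y) \; D_{\mathrm{KL}}(p(x|y)\|p(x)) \!]
- : [ = \mathbb{E}_Y\{D_{\mathrm{KL}}(p(x|y)\|p(x))\}. \!]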
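Both Kullback-Leibler forms can be compared on a small table: the divergence of the joint distribution from the product of the marginals, and the expectation over Y of the divergence of p(x|y) from p(x). The helper `kl_divergence` and the example distribution below are illustrative assumptions, and the result is in bits.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same outcomes."""
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0.0)

# Illustrative joint distribution of two binary variables.
joint = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# Form 1: divergence between the joint and the product of the marginals.
product = {(x, y): px[x] * py[y] for x in px for y in py}
mi_joint_form = kl_divergence(joint, product)

# Form 2: average over y of the divergence between p(x|y) and p(x).
mi_expected_form = 0.0
for y, p_y in py.items():
    conditional = {x: joint.get((x, y), 0.0) / p_y for x in px}
    mi_expected_form += p_y * kl_divergence(conditional, px)
```

The two values coincide up to rounding, and both are non-negative because a Kullback-Leibler divergence is never negative.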
Applications of mutual information
In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:
- Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
- Mutual information has been used as a criterion for feature selection and feature transformations in machine learning.
- Mutual information is often used as a significance function for the computation of collocations in corpus linguistics.
- Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan) and a second image that needs to be put into the same coordinate system as the reference image, the second image is deformed until the mutual information between it and the reference image is maximized.
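To illustrate the feature-selection use mentioned above, the sketch below ranks a few discrete features by their estimated mutual information with a class label. It uses a simple plug-in estimate from observed counts. The function name `empirical_mi`, the feature names, and the toy data are all hypothetical, and practical systems typically need more care with small samples and continuous features.

```python
import math
from collections import Counter

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired samples,
    using relative frequencies in place of probabilities."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    # (c/n) / ((cx/n) * (cy/n)) simplifies to c * n / (cx * cy).
    return sum((c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in joint.items())

# Hypothetical toy data: eight samples with a binary label.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "informative": [0, 0, 0, 1, 1, 1, 1, 1],  # mostly follows the label
    "noisy":       [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the label
    "copy":        list(labels),              # identical to the label
}

scores = {name: empirical_mi(values, labels) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

On this data the feature identical to the label scores highest (its score equals the label entropy of 1 bit), and the unrelated feature scores zero.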
