# Information theory

Information theory is the mathematical theory of data communication and storage founded in 1948 by Claude E. Shannon. Modern information theory is concerned with error-correction, data compression, cryptography, communications systems, and related topics.

## Scope

Claude E. Shannon (19162001) founded information theory with his classic paper "A Mathematical Theory of Communication", published in the Bell System Technical Journal in July and October of 1948. At the beginning of his paper, Shannon asserted that

"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."

His theory for the first time considered communication as a rigorously stated mathematical problem in statistics and gave communications engineers a way to determine the capacity of a communication channel in terms of the common currency of bits. This problem is called the channel coding problem. The transmission part of the theory is not concerned with the meaning (semantics) of the message conveyed.

A second set of ideas in information theory relates to data compression. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data. There are two formulations for the compression problem -- in lossless data compression the data must be reconstructed exactly, whereas lossy data compression examines how many bits are needed to reconstruct the data to within a specified fidelity level. This fidelity level is measured by a function called a distortion function. In information theory this is called rate distortion theory. Both lossless and lossy source codes produce bits at the output which can be used as the inputs to the channel codes mentioned above.

This division of information theory into compression and transmission is justified by the information transmission theorems, or source-channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. Network information theory refers to these multi-agent communication models.

## History

It is generally believed that the modern discipline of information theory began with the publication of Shannon's article "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October of 1948. This work drew on earlier publications by Harry Nyquist and Ralph Hartley. In the process of working out a theory of communications that could be applied by electrical engineers to design better telecommunications systems, Shannon defined a measure of information content of a message

$I(m_{i})=-\log p(m_{i})\$

and entropy, or mean amount of information per message:

$H=\sum _{i}p(m_{i})I(m_{i})=-\sum _{i}p(m_{i})\log p(m_{i})\$

(where p(mi) is the probability of message mi) that, when applied to an information source, could determine the capacity of the channel required to transmit the source. If the logarithm in the formula is taken to base 2, then it gives a measures of message information and entropy in bits. Shannon's measure of entropy came to be taken as a measure of the information contained in a message, as opposed to the portion of the message that is strictly determined (hence predictable) by inherent structures, such as redundancy in the structure of languages or the statistical properties of a language relating to the frequencies of occurrence of different letter or word pairs, triplets, etc. See Markov chains.

Recently, however, it has emerged that entropy was defined and used during the Second World War by Alan Turing at Bletchley Park. Turing named it "weight of evidence" and measured it in units called bans and decibans. (This is not to be confused with the weight of evidence defined by I.J. Good and described in the article statistical inference, which Turing also tackled and named "log-odds" or "lods".) Turing and Shannon collaborated during the war but it appears that they independently created the concept. (References are given in Alan Turing: The Enigma by Andrew Hodges.)

## Relation with thermodynamic entropy

There are close links between information entropy as defined by Shannon and thermodynamic entropy, and in particular its statistical interpretation, established by Ludwig Boltzmann and J. Willard Gibbs in the 1870s. A quantity defined by the entropy formula was first introduced by Boltzmann in 1872, in the context of his H-theorem.

The story goes that when Shannon was deciding what to call the quantity, fearing that 'information' was already over-used, John von Neumann told him: "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage."

At a theoretical level, as E. T. Jaynes pointed out in 1957, equilibrium statistical mechanics gives the prescription that the probability distribution which should be assigned for the unknown microstate of a thermodynamic system is that which has maximum Shannon entropy, given that it must also satisfy the macroscopic description of the system. But this is just an application of a quite general rule in information theory, if one wishes to a maximally uninformative distribution. The thermodynamic entropy, measuring the disorder of this equilibrium distribution, is then just this maximum Shannon entropy, multiplied for historical reasons by Boltzmann's constant. (See article: MaxEnt thermodynamics).

At a physical level, the connection between physics and information theory establishes physical limits to computation: for example, it is not possible to physically erase one bit of information without realising kB ln 2 J k-1 of heat. This has been applied to the paradox of Maxwell's demon which would need to process information to reverse thermodynamic entropy; but erasing that information, to begin again, exactly balances out the thermodynamic gain that the demon would otherwise achieve.

## Related quantities

Among other useful measures of information is mutual information, a measure of the statistical dependence between two random variables. Mutual information is defined for two events $X$ and $Y$ as

$I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)\,$

where $H(X,Y)$ is the joint entropy, defined as

$H(X,Y)=-\sum _{{x,y}}p(x,y)\log p(x,y)\,$

and $H(X|Y)$ is the conditional entropy of $X$ conditioned on observing $Y$. As such, the mutual information can be intuitively considered the amount of uncertainty in $X$ that is eliminated by observations of $Y$ and vice versa. When the random variables in question are continuous rather than discrete, the sums can be replaced with integrals and densities used in place of probability mass functions.

Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the Multinomial distribution and to Pearson's χ2 test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution. Also, mutual information can be expressed through the Kullback-Leibler divergence:

$I(X;Y)=D(P(X,Y)\|P(X)P(Y))\,$

Shannon information is appropriate for measuring uncertainty over an unordered space. An alternative measure of information was created by Fisher for measuring uncertainty over an ordered space. For example, Shannon information is used over the space of alphabetic letters, as letters do not have 'distances' between them. For information about the value of a continuous parameter such as a person's height, Fisher information is used, as estimated heights do have a well-defined distance.

A. N. Kolmogorov introduced an information measure that is based on the shortest algorithm that can recreate it; see Kolmogorov complexity.

## Extensions in progress

In 1995, Tim Palmer signalled two unwritten assumptions about Shannon's definition of information that made it inapplicable as such to quantum mechanics:

• The supposition that there is such a thing as an observable state (for instance the upper face a die or a coin) before the observation begins
• The fact that knowing this state does not depend on the order in which observations are made (commutativity)

The article Conceptual inadequacy of the Shannon information in quantum measurement [1], published in 2001 by Anton Zeilinger [2] and Caslav Brukner, synthesized and developed these remarks. The so-called Zeilinger's principle suggests that the quantization observed in QM could be bound to information quantification (one cannot observe less than one bit, and what is not observed is by definition "random").

But these claims remain highly controversial. For a detailed discussion of the applicability of the Shannon information in quantum mechanics and an argument that Zeilinger's principle cannot explain quantization, see Timpson [3] 2003 [4] and also Hall 2000 [5] and Mana 2004 [6]

For a tutorial on quantum information see [7].

## Controversies

Dr. William Dembski, a proponent of Intelligent Design, controversially suggested that what he called "specified" Shannon information is relevant to making a "Design inference" that is an inference that something was in some sense planned by a intelligent agent. The theory is almost universally rejected by the scientific community, though some feel it might be able to create algorithms which could detect intelligence in purely naturalistic settings, and that Dembski's idea might actually have some utility, though not in the way he intended.

## Applications

• Information theory is widely used in
• Composer James Tenney, among others such as his teacher Lejaren Hiller, has used information theory in the composition of musical works such as Ergodos.