Information Theory
Shannon's framework for quantifying information — entropy, bits, KL divergence, and why it underlies compression, communication, and machine learning.
The Central Question
How do you measure information? Claude Shannon asked this in 1948 and answered it with a mathematical framework that reshaped communications, cryptography, and now machine learning.
The key insight: information is about surprise. A message you already expected carries no information. A message you didn’t expect carries a lot. The rarer the event, the more information its occurrence conveys.
Self-Information
The information content of a single event with probability p:
I(p) = log₂(1/p) = −log₂(p) (in bits)
- p = 1 (certain): I = log₂(1) = 0 bits — no surprise, no information
- p = 1/2: I = 1 bit — one fair coin flip
- p = 1/4: I = 2 bits — two fair coin flips
- p = 1/8: I = 3 bits
- p = 1/1024: I = 10 bits
Why logarithm? Because information should be additive for independent events. Seeing two independent events with probabilities p and q gives I(pq) = −log(pq) = −log(p) − log(q) = I(p) + I(q). Only the log function satisfies this.
Bit: one binary digit. The minimum unit of information — the answer to one perfectly balanced yes/no question.
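A quick sketch of these values in plain Python (the self_information helper is just for illustration):

```python
import math

def self_information(p: float) -> float:
    """Information content of an event with probability p, in bits: I(p) = log2(1/p)."""
    return math.log2(1 / p)

for p in [1.0, 0.5, 0.25, 0.125, 1 / 1024]:
    print(f"p = {p:g}  ->  I = {self_information(p):g} bits")

# Additivity for independent events: I(p*q) == I(p) + I(q)
p, q = 0.5, 0.25
assert math.isclose(self_information(p * q), self_information(p) + self_information(q))
```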
Entropy — Average Information
Entropy H(X) is the expected (average) information content of a random variable X:
H(X) = −Σ p(x) log₂ p(x)
It measures the average surprise — or equivalently, the uncertainty — in the distribution.
Coin flip examples:
Fair coin (p = 0.5 heads):
H = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5)
= −(0.5 × −1 + 0.5 × −1)
= 1 bit
Biased coin (p = 0.9 heads):
H = −(0.9 log₂ 0.9 + 0.1 log₂ 0.1)
= −(0.9 × −0.152 + 0.1 × −3.322)
≈ 0.469 bits
The fair coin has maximum entropy — maximum uncertainty. The biased coin has lower entropy — you’re less surprised on average, because heads is very likely.
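A minimal sketch of the two coin calculations (the entropy helper here is illustrative, not a library function):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p*log2(p), in bits (terms with p = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin   -> 1.0
print(entropy([0.9, 0.1]))   # biased coin -> ~0.469
```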
Properties of entropy
- H(X) ≥ 0 always
- H(X) = 0 iff the outcome is certain (one probability = 1)
- H(X) is maximised by the uniform distribution — equal probabilities
- For n outcomes: maximum H = log₂(n) bits
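The uniform-maximum property can be checked numerically with the same kind of helper:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform over n = 4 outcomes hits the maximum log2(4) = 2 bits; any skew lowers it.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.357
```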
Entropy as Compression Limit
Shannon’s source coding theorem: no lossless code can use fewer bits per symbol, on average, than the source’s entropy. Entropy is the theoretical minimum bits per symbol.
Example: a language where letters appear with probability p(a) = 0.5, p(b) = 0.25, p(c) = 0.125, p(d) = 0.125.
H = −(0.5 log₂ 0.5 + 0.25 log₂ 0.25 + 0.125 log₂ 0.125 + 0.125 log₂ 0.125)
= 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3
= 0.5 + 0.5 + 0.375 + 0.375
= 1.75 bits per symbol
Huffman coding assigns: a = 0 (1 bit), b = 10 (2 bits), c = 110 (3 bits), d = 111 (3 bits). Average: 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits per symbol. Achieves the entropy bound exactly.
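A quick numerical check of this example (the probs and code dictionaries simply transcribe the table above):

```python
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code  = {"a": "0", "b": "10", "c": "110", "d": "111"}

H = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(H, avg_len)   # 1.75 1.75 -- the code's average length meets the entropy bound
```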
ZIP and gzip are lossless compressors approaching this theoretical limit. Lossy formats like JPEG and MP3 go further by also discarding information the eye or ear barely notices, trading fidelity for fewer bits.
Joint and Conditional Entropy
Joint entropy: uncertainty of two variables together.
H(X, Y) = −Σ p(x, y) log₂ p(x, y)
Conditional entropy: remaining uncertainty in Y after knowing X.
H(Y|X) = H(X, Y) − H(X)
Chain rule of entropy:
H(X, Y) = H(X) + H(Y|X)
Total uncertainty = uncertainty in X + remaining uncertainty in Y given X.
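A small sketch with a made-up joint distribution shows the chain rule numerically (the weather/temperature values, and the names joint and H, are purely illustrative):

```python
import math
from collections import defaultdict

# A made-up joint distribution p(x, y): X = weather, Y = temperature.
joint = {("sun", "warm"): 0.4, ("sun", "cold"): 0.1,
         ("rain", "warm"): 0.1, ("rain", "cold"): 0.4}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_xy = H(joint.values())          # joint entropy H(X, Y)

px = defaultdict(float)           # marginal p(x)
for (x, _), p in joint.items():
    px[x] += p

H_x = H(px.values())
H_y_given_x = H_xy - H_x          # conditional entropy via the chain rule
print(H_xy, H_x, H_y_given_x)     # ~1.722, 1.0, ~0.722
```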
Mutual Information
How much knowing X reduces uncertainty about Y:
I(X; Y) = H(X) + H(Y) − H(X, Y)
= H(Y) − H(Y|X)
- I(X; Y) = 0: X and Y are independent — knowing one tells you nothing about the other
- I(X; Y) = H(Y): knowing X completely determines Y
Mutual information is symmetric: I(X; Y) = I(Y; X). It’s a general measure of statistical dependence — unlike correlation, it catches non-linear relationships.
In machine learning: mutual information measures feature relevance. A feature X is useful for predicting Y if I(X; Y) is large.
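A sketch of the same idea, reusing the toy weather distribution from the previous example:

```python
import math
from collections import defaultdict

joint = {("sun", "warm"): 0.4, ("sun", "cold"): 0.1,
         ("rain", "warm"): 0.1, ("rain", "cold"): 0.4}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# I(X; Y) = H(X) + H(Y) - H(X, Y)
mi = H(px.values()) + H(py.values()) - H(joint.values())
print(mi)   # ~0.278 bits: knowing X (weather) removes ~0.278 bits of uncertainty about Y
```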
KL Divergence
KL (Kullback-Leibler) divergence measures how much one probability distribution differs from another:
D_KL(P || Q) = Σ P(x) log₂(P(x) / Q(x))
Read: how much extra information is needed to encode samples from P using a code designed for Q, instead of the optimal code for P.
Properties:
- D_KL(P || Q) ≥ 0 always (Gibbs’ inequality)
- D_KL(P || Q) = 0 iff P = Q
- Not symmetric: D_KL(P || Q) ≠ D_KL(Q || P) in general
KL divergence is not a true distance: it is asymmetric and does not satisfy the triangle inequality. But it captures “how surprised you’d be if you expected Q but got P.”
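A minimal sketch showing both the computation and the asymmetry (the kl helper and the two coin distributions are illustrative):

```python
import math

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]   # e.g. the biased coin
Q = [0.5, 0.5]   # e.g. a model that assumes a fair coin

print(kl(P, Q))   # ~0.531 bits
print(kl(Q, P))   # ~0.737 bits -- not symmetric
```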
Cross-Entropy
The cross-entropy of Q relative to P:
H(P, Q) = −Σ P(x) log₂ Q(x)
= H(P) + D_KL(P || Q)
Cross-entropy = entropy of P + KL divergence from P to Q.
When P is the true distribution and Q is your model:
- H(P) is irreducible — the inherent uncertainty
- D_KL(P || Q) is the extra cost from your model being wrong
Cross-entropy loss in machine learning: in classification, minimising cross-entropy between the true label distribution P and the model’s predicted distribution Q is equivalent to minimising D_KL(P || Q), because H(P) is fixed by the data. Training therefore pushes the model’s predictions as close to the true distribution as possible.
Loss = −Σ yᵢ log(ŷᵢ)
For a one-hot true label (one class is certain), this simplifies to −log(ŷ_true) — the negative log probability of the correct class. Maximising the probability of correct predictions = minimising cross-entropy.
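A sketch for a single example with a one-hot label (the prediction vector is made up; note that ML frameworks usually take the natural log, so the loss is in nats rather than bits):

```python
import math

# Hypothetical single example: class 1 is correct; y_pred is the model's softmax output.
y_true = [0, 1, 0]
y_pred = [0.1, 0.7, 0.2]

# Cross-entropy loss with natural log, as most frameworks use.
loss = -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
print(loss)   # -log(0.7) ~ 0.357
```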
Channel Capacity
Shannon’s channel coding theorem: every noisy communication channel has a capacity C (bits per second). You can send information at any rate below C with arbitrarily low error (by using smart encoding). Above C, errors are unavoidable.
C = B log₂(1 + S/N)
Where B = bandwidth, S/N = signal-to-noise ratio. This is the Shannon-Hartley theorem — it sets hard limits on any communication system: telephone, Wi-Fi, 5G.
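A quick illustration with made-up numbers, 20 MHz of bandwidth at 30 dB SNR:

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Illustrative numbers only: 20 MHz of bandwidth at 30 dB SNR (a factor of 1000).
print(capacity(20e6, 10 ** (30 / 10)) / 1e6)   # ~199.3 Mbit/s
```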
Why Entropy Appears Everywhere
Thermodynamics: Boltzmann’s entropy formula S = k log W is the same mathematical structure. Physical entropy and information entropy are deeply related — Maxwell’s demon (a thought experiment about using information to violate the second law) is resolved by recognising that erasing information costs physical entropy.
Language models: a language model’s perplexity is 2^H, where H is the cross-entropy of the model’s predictions against actual text. Lower perplexity = lower cross-entropy = better predictions = more compression. GPT’s training objective (next-token prediction) is cross-entropy minimisation.
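A sketch of the perplexity calculation, assuming a hypothetical list of probabilities the model assigned to the actual next tokens:

```python
import math

# Hypothetical probabilities a model assigned to the actual next tokens of a text.
token_probs = [0.2, 0.05, 0.5, 0.1]

# Cross-entropy in bits per token, then perplexity = 2^H.
H = -sum(math.log2(p) for p in token_probs) / len(token_probs)
print(H, 2 ** H)   # ~2.74 bits/token, perplexity ~6.7
```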
Decision trees: information gain — the reduction in entropy when splitting on a feature — is the criterion for choosing which feature to split on. ID3, C4.5, and variants all use entropy directly.
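A minimal sketch of information gain for one candidate split on a toy dataset (the labels and the split are invented):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# Toy example: class labels before the split, then the two branches a feature produces.
parent = ["yes"] * 6 + ["no"] * 4
left   = ["yes"] * 5 + ["no"] * 1
right  = ["yes"] * 1 + ["no"] * 3

weighted = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
gain = entropy(parent) - weighted
print(gain)   # ~0.256 bits of information gain for this split
```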
Neuroscience: the brain may be an efficient encoder, minimising the bits needed to represent the sensory world — efficient coding hypothesis.
Information theory is the mathematics of uncertainty and communication. Shannon built a framework in 1948 that now underlies virtually everything digital.