Information Theory
Shannon's framework for quantifying information — entropy, bits, KL divergence, and why it underlies compression, communication, and machine learning.
The Central Question
How do you measure information? Claude Shannon asked this in 1948 and answered it with a mathematical framework that reshaped communications, cryptography, and now machine learning.
The key insight: information is about surprise. A message you already expected carries no information. A message you didn’t expect carries a lot. The rarer the event, the more information its occurrence conveys.
Self-Information
The information content of a single event with probability p:
I(p) = log₂(1/p) = −log₂(p) (in bits)
- p = 1 (certain): I = log₂(1) = 0 bits — no surprise, no information
- p = 1/2: I = 1 bit — one fair coin flip
- p = 1/4: I = 2 bits — two fair coin flips
- p = 1/8: I = 3 bits
- p = 1/1024: I = 10 bits
Why logarithm? Because information should be additive for independent events. Seeing two independent events with probabilities p and q gives I(pq) = −log(pq) = −log(p) − log(q) = I(p) + I(q). Only the log function satisfies this.
Bit: one binary digit. The minimum unit of information — the answer to one perfectly balanced yes/no question.
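A quick sketch of these values in plain Python (the self_information helper is just for illustration):

```python
import math

def self_information(p: float) -> float:
    """Information content of an event with probability p, in bits: I(p) = log2(1/p)."""
    return math.log2(1 / p)

for p in [1.0, 0.5, 0.25, 0.125, 1 / 1024]:
    print(f"p = {p:g}  ->  I = {self_information(p):g} bits")

# Additivity for independent events: I(p*q) == I(p) + I(q)
p, q = 0.5, 0.25
assert math.isclose(self_information(p * q), self_information(p) + self_information(q))
```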
Entropy — Average Information
Entropy H(X) is the expected (average) information content of a random variable X:
H(X) = −Σ p(x) log₂ p(x)
It measures the average surprise — or equivalently, the uncertainty — in the distribution.
Coin flip examples:
Fair coin (p = 0.5 heads):
H = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5)
= −(0.5 × −1 + 0.5 × −1)
= 1 bit
Biased coin (p = 0.9 heads):
H = −(0.9 log₂ 0.9 + 0.1 log₂ 0.1)
= −(0.9 × −0.152 + 0.1 × −3.322)
≈ 0.469 bits
The fair coin has maximum entropy — maximum uncertainty. The biased coin has lower entropy — you’re less surprised on average, because heads is very likely.
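A minimal sketch of the two coin calculations (the entropy helper here is illustrative, not a library function):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p*log2(p), in bits (terms with p = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin   -> 1.0
print(entropy([0.9, 0.1]))   # biased coin -> ~0.469
```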
Properties of entropy
- H(X) ≥ 0 always
- H(X) = 0 iff the outcome is certain (one probability = 1)
- H(X) is maximised by the uniform distribution — equal probabilities
- For n outcomes: maximum H = log₂(n) bits
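The uniform-maximum property can be checked numerically with the same kind of helper:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform over n = 4 outcomes hits the maximum log2(4) = 2 bits; any skew lowers it.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.357
```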
Entropy as Compression Limit
Shannon’s source coding theorem: no lossless code can use fewer bits per symbol, on average, than the source’s entropy. Entropy is the theoretical minimum bits per symbol.
Example: a language where letters appear with probability p(a) = 0.5, p(b) = 0.25, p(c) = 0.125, p(d) = 0.125.
H = −(0.5 log₂ 0.5 + 0.25 log₂ 0.25 + 0.125 log₂ 0.125 + 0.125 log₂ 0.125)
= 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3
= 0.5 + 0.5 + 0.375 + 0.375
= 1.75 bits per symbol
Huffman coding assigns: a = 0 (1 bit), b = 10 (2 bits), c = 110 (3 bits), d = 111 (3 bits). Average: 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits per symbol. Achieves the entropy bound exactly.
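A quick numerical check of this example (the probs and code dictionaries simply transcribe the table above):

```python
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code  = {"a": "0", "b": "10", "c": "110", "d": "111"}

H = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(H, avg_len)   # 1.75 1.75 -- the code's average length meets the entropy bound
```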
ZIP and gzip are lossless compressors approaching this theoretical limit. Lossy formats like JPEG and MP3 go further by also discarding information the eye or ear barely notices, trading fidelity for fewer bits.
Joint and Conditional Entropy
Joint entropy: uncertainty of two variables together.
H(X, Y) = −Σ p(x, y) log₂ p(x, y)
Conditional entropy: remaining uncertainty in Y after knowing X.
H(Y|X) = H(X, Y) − H(X)
Chain rule of entropy:
H(X, Y) = H(X) + H(Y|X)
Total uncertainty = uncertainty in X + remaining uncertainty in Y given X.
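A small sketch with a made-up joint distribution shows the chain rule numerically (the weather/temperature values, and the names joint and H, are purely illustrative):

```python
import math
from collections import defaultdict

# A made-up joint distribution p(x, y): X = weather, Y = temperature.
joint = {("sun", "warm"): 0.4, ("sun", "cold"): 0.1,
         ("rain", "warm"): 0.1, ("rain", "cold"): 0.4}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_xy = H(joint.values())          # joint entropy H(X, Y)

px = defaultdict(float)           # marginal p(x)
for (x, _), p in joint.items():
    px[x] += p

H_x = H(px.values())
H_y_given_x = H_xy - H_x          # conditional entropy via the chain rule
print(H_xy, H_x, H_y_given_x)     # ~1.722, 1.0, ~0.722
```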
Mutual Information
How much knowing X reduces uncertainty about Y:
I(X; Y) = H(X) + H(Y) − H(X, Y)
= H(Y) − H(Y|X)
- I(X; Y) = 0: X and Y are independent — knowing one tells you nothing about the other
- I(X; Y) = H(Y): knowing X completely determines Y
Mutual information is symmetric: I(X; Y) = I(Y; X). It’s a general measure of statistical dependence — unlike correlation, it catches non-linear relationships.
In machine learning: mutual information measures feature relevance. A feature X is useful for predicting Y if I(X; Y) is large.
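A sketch of the same idea, reusing the toy weather distribution from the previous example:

```python
import math
from collections import defaultdict

joint = {("sun", "warm"): 0.4, ("sun", "cold"): 0.1,
         ("rain", "warm"): 0.1, ("rain", "cold"): 0.4}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# I(X; Y) = H(X) + H(Y) - H(X, Y)
mi = H(px.values()) + H(py.values()) - H(joint.values())
print(mi)   # ~0.278 bits: knowing X (weather) removes ~0.278 bits of uncertainty about Y
```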
KL Divergence
KL (Kullback-Leibler) divergence measures how much one probability distribution differs from another:
D_KL(P || Q) = Σ P(x) log₂(P(x) / Q(x))
Read: how much extra information is needed to encode samples from P using a code designed for Q, instead of the optimal code for P.
Properties:
- D_KL(P || Q) ≥ 0 always (Gibbs’ inequality)
- D_KL(P || Q) = 0 iff P = Q
- Not symmetric: D_KL(P || Q) ≠ D_KL(Q || P) in general
KL divergence is not a true distance: it is asymmetric and does not satisfy the triangle inequality. But it captures “how surprised you’d be if you expected Q but got P.”
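A minimal sketch showing both the computation and the asymmetry (the kl helper and the two coin distributions are illustrative):

```python
import math

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]   # e.g. the biased coin
Q = [0.5, 0.5]   # e.g. a model that assumes a fair coin

print(kl(P, Q))   # ~0.531 bits
print(kl(Q, P))   # ~0.737 bits -- not symmetric
```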
Cross-Entropy
The cross-entropy of Q relative to P:
H(P, Q) = −Σ P(x) log₂ Q(x)
= H(P) + D_KL(P || Q)
Cross-entropy = entropy of P + KL divergence from P to Q.
When P is the true distribution and Q is your model:
- H(P) is irreducible — the inherent uncertainty
- D_KL(P || Q) is the extra cost from your model being wrong
Cross-entropy loss in machine learning: in classification, minimising cross-entropy between the true label distribution P and the model’s predicted distribution Q is equivalent to minimising D_KL(P || Q), because H(P) is fixed by the data. Training therefore pushes the model’s predictions as close to the true distribution as possible.
Loss = −Σ yᵢ log(ŷᵢ)
For a one-hot true label (one class is certain), this simplifies to −log(ŷ_true) — the negative log probability of the correct class. Maximising the probability of correct predictions = minimising cross-entropy.
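A sketch for a single example with a one-hot label (the prediction vector is made up; note that ML frameworks usually take the natural log, so the loss is in nats rather than bits):

```python
import math

# Hypothetical single example: class 1 is correct; y_pred is the model's softmax output.
y_true = [0, 1, 0]
y_pred = [0.1, 0.7, 0.2]

# Cross-entropy loss with natural log, as most frameworks use.
loss = -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
print(loss)   # -log(0.7) ~ 0.357
```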
Channel Capacity
Shannon’s channel coding theorem: every noisy communication channel has a capacity C (bits per second). You can send information at any rate below C with arbitrarily low error (by using smart encoding). Above C, errors are unavoidable.
C = B log₂(1 + S/N)
Where B = bandwidth, S/N = signal-to-noise ratio. This is the Shannon-Hartley theorem — it sets hard limits on any communication system: telephone, Wi-Fi, 5G.
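A quick illustration with made-up numbers, 20 MHz of bandwidth at 30 dB SNR:

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Illustrative numbers only: 20 MHz of bandwidth at 30 dB SNR (a factor of 1000).
print(capacity(20e6, 10 ** (30 / 10)) / 1e6)   # ~199.3 Mbit/s
```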
Why Entropy Appears Everywhere
Thermodynamics: Boltzmann’s entropy formula S = k log W is the same mathematical structure. Physical entropy and information entropy are deeply related — Maxwell’s demon (a thought experiment about using information to violate the second law) is resolved by recognising that erasing information costs physical entropy.
Language models: a language model’s perplexity is 2^H, where H is the cross-entropy of the model’s predictions against actual text. Lower perplexity = lower cross-entropy = better predictions = more compression. GPT’s training objective (next-token prediction) is cross-entropy minimisation.
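A sketch of the perplexity calculation, assuming a hypothetical list of probabilities the model assigned to the actual next tokens:

```python
import math

# Hypothetical probabilities a model assigned to the actual next tokens of a text.
token_probs = [0.2, 0.05, 0.5, 0.1]

# Cross-entropy in bits per token, then perplexity = 2^H.
H = -sum(math.log2(p) for p in token_probs) / len(token_probs)
print(H, 2 ** H)   # ~2.74 bits/token, perplexity ~6.7
```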
Decision trees: information gain — the reduction in entropy when splitting on a feature — is the criterion for choosing which feature to split on. ID3, C4.5, and variants all use entropy directly.
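A minimal sketch of information gain for one candidate split on a toy dataset (the labels and the split are invented):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# Toy example: class labels before the split, then the two branches a feature produces.
parent = ["yes"] * 6 + ["no"] * 4
left   = ["yes"] * 5 + ["no"] * 1
right  = ["yes"] * 1 + ["no"] * 3

weighted = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
gain = entropy(parent) - weighted
print(gain)   # ~0.256 bits of information gain for this split
```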
Neuroscience: the brain may be an efficient encoder, minimising the bits needed to represent the sensory world — efficient coding hypothesis.
Information theory is the mathematics of uncertainty and communication. Shannon built a framework in 1948 that now underlies virtually everything digital.