LOGBOOK LOG-412
EXPLORING · MATHEMATICS · MARKOV · PROBABILITY · STOCHASTIC-PROCESSES · LINEAR-ALGEBRA · STATISTICS

Markov Chains

The mathematics of memoryless systems — states, transitions, steady-state distributions, and why Markov chains show up everywhere from PageRank to language models.

The Core Idea

A Markov chain is a system that moves between states where the next state depends only on the current state — not on how you got there. This is the Markov property, sometimes called memorylessness.

Formally: given the current state, the future is independent of the past.

P(Xₙ₊₁ = s | Xₙ = sₙ, Xₙ₋₁ = sₙ₋₁, ..., X₀ = s₀) = P(Xₙ₊₁ = s | Xₙ = sₙ)

The entire history collapses into a single piece of information: where you are now.


States and Transitions

A Markov chain is defined by:

  • A set of states — {s₁, s₂, …, sₙ}
  • A transition matrix P, where Pᵢⱼ = probability of moving from state i to state j

Each row of P sums to 1 (it’s a stochastic matrix — from any state, you have to go somewhere).

Example — weather model:

          Sunny   Rainy
  Sunny   0.8     0.2
  Rainy   0.4     0.6

If today is sunny, there’s an 80% chance tomorrow is sunny and 20% chance it’s rainy. Tomorrow’s weather depends only on today’s — not last week’s.
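
A minimal sketch of this chain in Python (state names and probabilities are just the example above), sampling each day from the current day alone:

    import random

    STATES = ["Sunny", "Rainy"]
    P = {
        "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
        "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
    }

    def step(state):
        # Sample the next state using only the current state: the Markov property.
        return random.choices(STATES, weights=[P[state][s] for s in STATES])[0]

    state = "Sunny"
    for day in range(7):
        print(day, state)
        state = step(state)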


Running the Chain Forward

If you know the current state distribution as a row vector π, the distribution after one step is:

π' = π · P

After n steps:

πₙ = π₀ · Pⁿ

This is just a matrix power. The chain evolves by repeated multiplication by P.

What this means practically: you can compute the probability of being in any state after any number of steps without simulating each step individually.
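
A sketch with NumPy, using the weather matrix from above; numpy.linalg.matrix_power computes Pⁿ directly:

    import numpy as np

    P = np.array([[0.8, 0.2],
                  [0.4, 0.6]])   # weather transition matrix from above

    pi0 = np.array([1.0, 0.0])   # start: definitely sunny

    # Distribution after 10 steps: pi_10 = pi_0 · P^10
    pi10 = pi0 @ np.linalg.matrix_power(P, 10)
    print(pi10)                  # approaches the stationary distribution (2/3, 1/3)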


Steady-State Distribution

For many chains, repeated multiplication by P converges — the distribution stops changing:

π = π · P

This is the stationary distribution (or steady-state). It’s a left eigenvector of P with eigenvalue 1.

For the weather example, solving π = π · P:

  • π(Sunny) = 2/3 ≈ 0.667
  • π(Rainy) = 1/3 ≈ 0.333

In the long run, regardless of where you start, the system spends two-thirds of its time in the sunny state. Starting conditions wash out.
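
One way to compute this numerically is to extract the left eigenvector for eigenvalue 1; a sketch with NumPy:

    import numpy as np

    P = np.array([[0.8, 0.2],
                  [0.4, 0.6]])

    # Left eigenvectors of P are right eigenvectors of P.T.
    vals, vecs = np.linalg.eig(P.T)
    # Pick the eigenvector for eigenvalue 1 and normalize it to sum to 1.
    i = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, i])
    pi = pi / pi.sum()
    print(pi)   # [0.6667 0.3333]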


When Steady-State Exists

Not all chains converge. Two conditions matter:

Irreducible — every state is reachable from every other state (the chain can’t get stuck in a subset). If some states are absorbing — you can enter but never leave — the chain doesn’t mix.

Aperiodic — the chain doesn’t oscillate in fixed cycles. A chain that alternates deterministically between two states never settles.

A chain that is both irreducible and aperiodic is ergodic — it has a unique stationary distribution, and the chain converges to it from any starting point.
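
A quick illustration of why aperiodicity matters: the deterministic flip chain below has stationary distribution (0.5, 0.5), yet started from a pure state the distribution never converges to it.

    import numpy as np

    flip = np.array([[0.0, 1.0],
                     [1.0, 0.0]])   # always switch states: period 2

    pi = np.array([1.0, 0.0])
    for n in range(5):
        print(n, pi)                # alternates between (1, 0) and (0, 1) forever
        pi = pi @ flip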


Absorbing States

An absorbing state is one where Pᵢᵢ = 1 — once entered, you never leave. Chains with absorbing states don’t have a stationary distribution in the mixing sense; instead you ask: what’s the probability of eventually being absorbed, and how long does it take?

Classic example: gambler’s ruin. A gambler starts with £k, wins or loses £1 each round. At £0 or £N, the game ends. Both endpoints are absorbing states. The chain answers: what’s the probability of reaching £N before going broke?
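
The absorption probabilities solve a small linear system: if h(k) is the probability of reaching £N from £k and p is the win probability, then h(k) = p·h(k+1) + (1-p)·h(k-1) for interior k, with boundary conditions h(0) = 0 and h(N) = 1. A sketch for the fair-coin case, where the known answer is h(k) = k/N:

    import numpy as np

    N, p = 10, 0.5
    # Build the system h(k) - p*h(k+1) - (1-p)*h(k-1) = 0 for interior states.
    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    A[0, 0] = 1.0              # h(0) = 0: broke is absorbing
    A[N, N] = 1.0; b[N] = 1.0  # h(N) = 1: target is absorbing
    for k in range(1, N):
        A[k, k] = 1.0
        A[k, k + 1] = -p
        A[k, k - 1] = -(1 - p)
    h = np.linalg.solve(A, b)
    print(h)   # for p = 0.5 this is k/N: [0, 0.1, 0.2, ..., 1.0]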


Continuous Time

Discrete-time Markov chains jump at fixed steps. Continuous-time Markov chains (CTMCs) jump at random times drawn from exponential distributions. Instead of a transition matrix, you have a rate matrix Q (also called a generator), where Qᵢⱼ gives the rate of jumping from i to j.
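
A sketch of simulating a CTMC from a rate matrix (the two-state Q below is made up for illustration): hold in state i for an exponential time with rate -Qᵢᵢ, then jump in proportion to the off-diagonal rates.

    import random

    # Hypothetical 2-state rate matrix: rows sum to 0, off-diagonals are jump rates.
    Q = [[-0.5,  0.5],
         [ 2.0, -2.0]]

    def simulate(state, t_end):
        t = 0.0
        while True:
            rate = -Q[state][state]          # total exit rate from this state
            t += random.expovariate(rate)    # exponential holding time
            if t >= t_end:
                return state
            # Choose the next state in proportion to the off-diagonal rates.
            targets = [j for j in range(len(Q)) if j != state]
            weights = [Q[state][j] for j in targets]
            state = random.choices(targets, weights=weights)[0]

    print(simulate(0, 100.0))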

Used extensively in queueing theory, chemical kinetics, epidemiology (SIR models), and reliability engineering.


Where Markov Chains Actually Show Up

PageRank — Google’s original algorithm models the web as a Markov chain. States are pages; transitions are links. A random surfer follows links with probability d and jumps to a random page with probability 1-d (the damping factor, preventing absorbing states). PageRank is the stationary distribution of this chain.
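
A sketch of PageRank as power iteration on a tiny hypothetical link graph, with the conventional damping factor d = 0.85:

    import numpy as np

    # Hypothetical 3-page web: page i links to the pages listed.
    links = {0: [1, 2], 1: [2], 2: [0]}
    n, d = 3, 0.85

    # Row-stochastic transition matrix of the link-following surfer.
    P = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            P[i, j] = 1.0 / len(outs)

    # Google matrix: follow a link with prob d, jump anywhere with prob 1-d.
    G = d * P + (1 - d) / n

    pi = np.full(n, 1.0 / n)
    for _ in range(100):   # power iteration toward the stationary distribution
        pi = pi @ G
    print(pi)              # PageRank scores, summing to 1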

Language models (pre-transformer) — n-gram models are Markov chains over words. A bigram model predicts the next word from only the previous word. A trigram model uses the previous two. Transformers broke this by attending to the full context — they violate the Markov property entirely.
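
A minimal bigram generator over a toy corpus (a real model would need smoothing for unseen pairs):

    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat the cat ran".split()

    # Count bigram transitions: which words follow each word.
    follow = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        follow[prev].append(nxt)

    # Generate: each word depends only on the previous one (a Markov chain over words).
    word = "the"
    out = [word]
    for _ in range(8):
        if not follow[word]:   # dead end: no observed successor
            break
        word = random.choice(follow[word])
        out.append(word)
    print(" ".join(out))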

MCMC (Markov Chain Monte Carlo) — a technique for sampling from complex probability distributions. You design a Markov chain whose stationary distribution is the target distribution, run it long enough to mix, and collect samples. Metropolis-Hastings and Gibbs sampling are the canonical algorithms. Used everywhere in Bayesian inference.
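
A sketch of Metropolis-Hastings with a symmetric random-walk proposal, sampling from an unnormalized standard normal target:

    import math, random

    def target(x):
        # Unnormalized standard normal density: the normalizing constant never matters.
        return math.exp(-0.5 * x * x)

    x, samples = 0.0, []
    for _ in range(10_000):
        proposal = x + random.gauss(0, 1.0)   # symmetric random-walk proposal
        # Accept with probability min(1, target(proposal) / target(x)).
        if random.random() < target(proposal) / target(x):
            x = proposal
        samples.append(x)

    burn = samples[1000:]             # discard burn-in before the chain has mixed
    print(sum(burn) / len(burn))      # sample mean, close to 0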

Hidden Markov Models — the underlying system is a Markov chain, but you can’t observe states directly. You observe emissions that depend probabilistically on the hidden state. Used in speech recognition, biological sequence analysis, and financial modeling.
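
A sketch generating from a toy HMM (hidden weather states, made-up emission probabilities over observable activities), to show the hidden-chain-plus-emissions structure:

    import random

    # Hidden chain: the weather model from earlier.
    trans = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},
             "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
    # Emissions: what you actually observe, conditioned on the hidden state.
    emit = {"Sunny": {"walk": 0.7, "shop": 0.3},
            "Rainy": {"walk": 0.2, "shop": 0.8}}

    def sample(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    state, hidden, observed = "Sunny", [], []
    for _ in range(5):
        hidden.append(state)
        observed.append(sample(emit[state]))   # only this sequence is visible
        state = sample(trans[state])

    print(hidden)     # unknown in practice
    print(observed)   # inference (e.g. Viterbi) recovers the hidden path from these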

Reinforcement learning — the environment in an RL problem is modeled as a Markov Decision Process (MDP) — a Markov chain augmented with actions and rewards. The agent’s policy determines which transitions occur. Value functions and Q-learning are rooted in Markov chain theory.
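
As a sketch, value iteration on a tiny made-up two-state MDP; the Bellman update needs only per-state transition probabilities, never histories:

    # Hypothetical MDP: states 0 and 1, actions "stay" and "go".
    # trans[s][a] = list of (probability, next_state, reward) outcomes.
    trans = {
        0: {"stay": [(1.0, 0, 0.0)],
            "go":   [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
        1: {"stay": [(1.0, 1, 2.0)],
            "go":   [(1.0, 0, 0.0)]},
    }
    gamma = 0.9   # discount factor

    V = {0: 0.0, 1: 0.0}
    for _ in range(100):   # Bellman optimality updates converge geometrically
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in trans[s].values())
             for s in trans}
    print(V)   # optimal state values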


The Linear Algebra Connection

Steady-state analysis is eigenvalue analysis. The stationary distribution is the left eigenvector of P corresponding to eigenvalue 1. The rate of convergence to steady-state is governed by the magnitude of the second-largest eigenvalue; the distance from that magnitude to 1 is the spectral gap. A large spectral gap means fast mixing; a small gap means slow mixing (the chain is nearly decomposable into weakly connected parts).
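
A sketch computing the spectral gap for the weather chain from earlier:

    import numpy as np

    P = np.array([[0.8, 0.2],
                  [0.4, 0.6]])

    # Sort eigenvalue magnitudes; the largest is always 1 for a stochastic matrix.
    mags = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    gap = 1.0 - mags[1]
    print(mags)   # [1.0, 0.4]
    print(gap)    # 0.6, a comfortably large gap, so the chain mixes fast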

This connects Markov chains to graph theory: the transition matrix of a random walk on a graph is a Markov chain, and spectral graph theory characterizes its mixing properties.