Statistics — Mean, Variance, Standard Deviation, and the CLT
Descriptive statistics, measures of spread, and the Central Limit Theorem — why averages behave so predictably.
Two Branches
Descriptive statistics: summarise and describe data you have. Inferential statistics: draw conclusions about a population from a sample.
They use the same tools but ask different questions. Descriptive statistics is about the data in hand; inferential statistics is about what you can conclude beyond it.
Measures of Centre
Mean (arithmetic average):
x̄ = (x₁ + x₂ + ... + xₙ) / n = Σxᵢ / n
Sensitive to outliers — one extreme value pulls the mean significantly.
Median: the middle value when sorted. For even n, average the two middle values.
Resistant to outliers — the median of {1, 2, 3, 4, 100} is 3; the mean is 22. For skewed data (income, house prices), median is the more representative “typical” value.
Mode: the most frequently occurring value. A dataset can have multiple modes (bimodal, multimodal) or none (every value unique). More useful for categorical data than numerical.
When to use which
- Symmetric distribution: mean ≈ median, either works
- Right-skewed (long tail right): mean > median — use median for “typical”
- Left-skewed: mean < median
- Categorical data: mode
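A minimal sketch of the outlier sensitivity described above, using Python's standard library and the same hypothetical dataset as below:

```python
from statistics import mean, median

# Small dataset with one extreme value
data = [1, 2, 3, 4, 100]

m = mean(data)     # pulled up by the outlier: 22
md = median(data)  # resistant: middle of sorted values is 3
```

One extreme value moves the mean from ~2.5 to 22, while the median barely notices.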
Measures of Spread
Range: max − min. Simple but very sensitive to outliers.
Interquartile range (IQR): Q3 − Q1, the spread of the middle 50%.
- Q1: 25th percentile (median of lower half)
- Q2: 50th percentile (median)
- Q3: 75th percentile (median of upper half)
- IQR = Q3 − Q1
Robust to outliers. Used in box plots.
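A quick sketch of computing quartiles and the IQR with NumPy. Note that quartile conventions differ between implementations (NumPy's default uses linear interpolation); the dataset here is hypothetical:

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# 25th, 50th, and 75th percentiles (linear interpolation, NumPy's default)
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # spread of the middle 50%
```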
Variance:
Population: σ² = Σ(xᵢ − μ)² / N
Sample: s² = Σ(xᵢ − x̄)² / (n−1)
Average squared deviation from the mean. Squaring ensures positive values and penalises large deviations more than small ones.
Why n−1 for samples (Bessel’s correction): Using n underestimates the true population variance because the sample mean is closer to sample data than the true population mean is. Dividing by n−1 corrects this — it makes s² an unbiased estimator of σ².
Standard deviation: σ (population) or s (sample) = √variance. Same units as the data. The typical distance from the mean.
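The population and sample formulas above map directly onto Python's standard library, which exposes both conventions. A sketch with a hypothetical dataset:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

pop_var = statistics.pvariance(data)   # divide by N: population variance
samp_var = statistics.variance(data)   # divide by n-1: Bessel's correction
pop_sd = statistics.pstdev(data)       # square root of population variance
```

The sample variance (32/7 ≈ 4.57) is slightly larger than the population variance (4), exactly the correction Bessel's n−1 provides.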
The Normal Distribution and Standard Deviations
For normally distributed data, standard deviations have predictable meaning:
x̄ ± 1s contains ≈ 68% of data
x̄ ± 2s contains ≈ 95% of data
x̄ ± 3s contains ≈ 99.7% of data
An observation 3 standard deviations from the mean is unusual (happens ~0.3% of the time). 6 standard deviations (“six sigma”) is extraordinarily rare.
Standardising (z-score):
z = (x − x̄) / s
Converts a value to “number of standard deviations from the mean.” Allows comparison across different scales.
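A sketch of cross-scale comparison with z-scores. The exam numbers are hypothetical:

```python
def z_score(x, mean, sd):
    # Number of standard deviations x sits from the mean
    return (x - mean) / sd

# Exam A: mean 70, sd 10. Exam B: mean 50, sd 5.
za = z_score(85, 70, 10)  # 1.5 sd above the mean
zb = z_score(58, 50, 5)   # 1.6 sd above the mean — relatively stronger
```

A raw 85 looks better than a raw 58, but the z-scores show the second result is further above its own mean.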
The Central Limit Theorem
The CLT: if you take sufficiently large random samples from any population with finite mean μ and variance σ², the distribution of sample means will be approximately normal:
X̄ ~ N(μ, σ²/n)
The mean of the sampling distribution is μ. The standard deviation is σ/√n — called the standard error.
What this means:
- Doesn’t matter what shape the original population has — uniform, skewed, bimodal
- The distribution of sample means becomes normal as n grows
- With n ≥ 30, the approximation is usually good enough
Why it’s powerful: you can apply the well-understood normal distribution to real-world data even when the underlying distribution is unknown. Almost all classical statistical tests are built on this.
Example: Roll a die (uniform distribution) 30 times and record the mean. Do this thousands of times. The distribution of those means is approximately N(3.5, (35/12)/30): a single roll has variance 35/12 ≈ 2.92, so the mean of 30 rolls has variance (35/12)/30 ≈ 0.097. A normal distribution, despite the original uniform distribution.
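The dice experiment can be simulated directly. A minimal sketch with the standard library (seeded for reproducibility):

```python
import random

random.seed(0)

# Roll a fair die 30 times, take the mean; repeat 10,000 times
means = [sum(random.randint(1, 6) for _ in range(30)) / 30
         for _ in range(10_000)]

grand_mean = sum(means) / len(means)  # should land near mu = 3.5
var_of_means = sum((m - grand_mean) ** 2 for m in means) / len(means)
# Theoretical value: variance of one roll is 35/12, so
# the variance of the sample mean is (35/12)/30 ≈ 0.097
```

Plotting `means` as a histogram would show the familiar bell shape, even though each individual roll is uniform.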
Standard Error vs Standard Deviation
These are frequently confused:
Standard deviation (s): how spread out individual data points are. A property of the data.
Standard error (SE = s/√n): how spread out sample means are. A property of your estimate.
As n increases, SE decreases (√n in denominator). Larger samples → more precise estimates of the mean. This is why bigger studies give tighter results.
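The √n relationship is worth seeing numerically. A sketch with a hypothetical s = 10:

```python
import math

s = 10  # sample standard deviation: a property of the data

# Standard error shrinks with sqrt(n): quadrupling n halves the SE
se_100 = s / math.sqrt(100)  # n = 100 -> SE = 1.0
se_400 = s / math.sqrt(400)  # n = 400 -> SE = 0.5
```

Halving the standard error requires four times the data, which is why precision gets expensive quickly.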
Confidence Intervals
A 95% confidence interval means: if you repeated this sampling procedure many times, 95% of the intervals constructed this way would contain the true population mean.
CI = x̄ ± z* × (s/√n)
For 95%: z* = 1.96. For 99%: z* = 2.576.
Example: sample mean = 50, s = 10, n = 100
SE = 10/√100 = 1
95% CI = 50 ± 1.96 × 1 = (48.04, 51.96)
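The worked example above, as a sketch in code:

```python
import math

x_bar, s, n = 50, 10, 100     # values from the worked example
se = s / math.sqrt(n)         # standard error = 1.0
z_star = 1.96                 # critical value for a 95% interval

ci_low = x_bar - z_star * se   # 48.04
ci_high = x_bar + z_star * se  # 51.96
```

Swapping `z_star` for 2.576 gives the wider 99% interval.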
Common misinterpretation: “95% probability that the true mean is in this interval” is wrong. The true mean is fixed — it’s either in the interval or not. The 95% refers to the procedure, not any particular interval.
Correlation
Pearson correlation coefficient r measures linear association between two variables:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / [(n−1) sₓ sᵧ]
Range: −1 to +1.
- r = 1: perfect positive linear relationship
- r = −1: perfect negative linear relationship
- r = 0: no linear relationship (but could be nonlinear)
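The formula can be computed directly. A sketch using sample standard deviations and a hypothetical perfectly linear dataset:

```python
import statistics

def pearson_r(xs, ys):
    # r = sum((x - x_bar)(y - y_bar)) / ((n-1) * s_x * s_y)
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cov_sum / ((n - 1) * sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # ys = 2 * xs: perfectly linear
r = pearson_r(xs, ys)   # 1.0
```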
Correlation ≠ causation. Two variables can correlate because one causes the other, because a third variable causes both (confounding), or by pure chance. The number alone tells you nothing about mechanism.
Anscombe’s Quartet: four datasets with identical means, variances, and correlations (r ≈ 0.816) but completely different shapes — one linear, one curved, one with an outlier. Always plot your data.
Outliers
An outlier is a value unusually far from the rest. Common rule: x is an outlier if it falls more than 1.5 × IQR below Q1 or above Q3.
Effect on statistics:
- Mean and variance: heavily affected by outliers
- Median and IQR: robust to outliers
What to do with outliers: investigate before removing. An outlier might be:
- A data entry error → correct or remove
- A genuine extreme observation → keep
- The most interesting finding in the dataset → investigate further
Removing outliers because they’re inconvenient is a form of data manipulation.
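The 1.5 × IQR rule described above (often attributed to Tukey) can be sketched as a simple flagging function; the dataset is the hypothetical one used earlier:

```python
import numpy as np

def iqr_outliers(data):
    # Flag values beyond 1.5 * IQR below Q1 or above Q3
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

outliers = iqr_outliers([1, 2, 3, 4, 100])  # flags 100
```

This only flags candidates; per the advice above, each flagged value still needs investigating before any action is taken.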
Distributions in Practice
| Statistic | Best with | Sensitive to |
|---|---|---|
| Mean | Symmetric data | Outliers |
| Median | Skewed data | Nothing much |
| Standard deviation | Normal data | Outliers |
| IQR | Any shape | Nothing much |
The choice of statistic is itself a modelling decision — it encodes assumptions about the shape of your data.