Statistics — Mean, Variance, Standard Deviation, and the CLT
Descriptive statistics, measures of spread, and the Central Limit Theorem — why averages behave so predictably.
Two Branches
Descriptive statistics: summarise and describe data you have. Inferential statistics: draw conclusions about a population from a sample.
They use the same tools but ask different questions. Descriptive statistics is about the data in hand; inferential statistics is about what you can conclude beyond it.
Measures of Centre
Mean (arithmetic average):
x̄ = (x₁ + x₂ + ... + xₙ) / n = Σxᵢ / n
Sensitive to outliers — one extreme value pulls the mean significantly.
Median: the middle value when sorted. For even n, average the two middle values.
Resistant to outliers — the median of {1, 2, 3, 4, 100} is 3; the mean is 22. For skewed data (income, house prices), median is the more representative “typical” value.
Mode: the most frequently occurring value. A dataset can have multiple modes (bimodal, multimodal) or none (every value unique). More useful for categorical data than numerical.
When to use which
- Symmetric distribution: mean ≈ median, either works
- Right-skewed (long tail right): mean > median — use median for “typical”
- Left-skewed: mean < median
- Categorical data: mode
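A minimal sketch of the outlier sensitivity described above, using Python's standard library and the same hypothetical dataset as below:

```python
from statistics import mean, median

# Small dataset with one extreme value
data = [1, 2, 3, 4, 100]

m = mean(data)     # pulled up by the outlier: 22
md = median(data)  # resistant: middle of sorted values is 3
```

One extreme value moves the mean from ~2.5 to 22, while the median barely notices.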
Measures of Spread
Range: max − min. Simple but very sensitive to outliers.
Interquartile range (IQR): Q3 − Q1, the spread of the middle 50%.
- Q1: 25th percentile (median of lower half)
- Q2: 50th percentile (median)
- Q3: 75th percentile (median of upper half)
- IQR = Q3 − Q1
Robust to outliers. Used in box plots.
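A quick sketch of computing quartiles and the IQR with NumPy. Note that quartile conventions differ between implementations (NumPy's default uses linear interpolation); the dataset here is hypothetical:

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# 25th, 50th, and 75th percentiles (linear interpolation, NumPy's default)
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # spread of the middle 50%
```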
Variance:
Population: σ² = Σ(xᵢ − μ)² / N
Sample: s² = Σ(xᵢ − x̄)² / (n−1)
Average squared deviation from the mean. Squaring ensures positive values and penalises large deviations more than small ones.
Why n−1 for samples (Bessel’s correction): Using n underestimates the true population variance because the sample mean is closer to sample data than the true population mean is. Dividing by n−1 corrects this — it makes s² an unbiased estimator of σ².
Standard deviation: σ (population) or s (sample) = √variance. Same units as the data. The typical distance from the mean.
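The population and sample formulas above map directly onto Python's standard library, which exposes both conventions. A sketch with a hypothetical dataset:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

pop_var = statistics.pvariance(data)   # divide by N: population variance
samp_var = statistics.variance(data)   # divide by n-1: Bessel's correction
pop_sd = statistics.pstdev(data)       # square root of population variance
```

The sample variance (32/7 ≈ 4.57) is slightly larger than the population variance (4), exactly the correction Bessel's n−1 provides.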
The Normal Distribution and Standard Deviations
For normally distributed data, standard deviations have predictable meaning:
x̄ ± 1s contains ≈ 68% of data
x̄ ± 2s contains ≈ 95% of data
x̄ ± 3s contains ≈ 99.7% of data
An observation 3 standard deviations from the mean is unusual (happens ~0.3% of the time). 6 standard deviations (“six sigma”) is extraordinarily rare.
Standardising (z-score):
z = (x − x̄) / s
Converts a value to “number of standard deviations from the mean.” Allows comparison across different scales.
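A sketch of cross-scale comparison with z-scores. The exam numbers are hypothetical:

```python
def z_score(x, mean, sd):
    # Number of standard deviations x sits from the mean
    return (x - mean) / sd

# Exam A: mean 70, sd 10. Exam B: mean 50, sd 5.
za = z_score(85, 70, 10)  # 1.5 sd above the mean
zb = z_score(58, 50, 5)   # 1.6 sd above the mean — relatively stronger
```

A raw 85 looks better than a raw 58, but the z-scores show the second result is further above its own mean.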
The Central Limit Theorem
The CLT: if you take sufficiently large random samples from any population with finite mean μ and variance σ², the distribution of sample means will be approximately normal:
X̄ ~ N(μ, σ²/n)
The mean of the sampling distribution is μ. The standard deviation is σ/√n — called the standard error.
What this means:
- Doesn’t matter what shape the original population has — uniform, skewed, bimodal
- The distribution of sample means becomes normal as n grows
- With n ≥ 30, the approximation is usually good enough
Why it’s powerful: you can apply the well-understood normal distribution to real-world data even when the underlying distribution is unknown. Almost all classical statistical tests are built on this.
Example: Roll a die (uniform distribution) 30 times and record the mean. Do this thousands of times. The distribution of those means is approximately N(3.5, (35/12)/30): a single roll has variance 35/12 ≈ 2.92, so the mean of 30 rolls has variance (35/12)/30 ≈ 0.097. A normal distribution, despite the original uniform distribution.
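The dice experiment can be simulated directly. A minimal sketch with the standard library (seeded for reproducibility):

```python
import random

random.seed(0)

# Roll a fair die 30 times, take the mean; repeat 10,000 times
means = [sum(random.randint(1, 6) for _ in range(30)) / 30
         for _ in range(10_000)]

grand_mean = sum(means) / len(means)  # should land near mu = 3.5
var_of_means = sum((m - grand_mean) ** 2 for m in means) / len(means)
# Theoretical value: variance of one roll is 35/12, so
# the variance of the sample mean is (35/12)/30 ≈ 0.097
```

Plotting `means` as a histogram would show the familiar bell shape, even though each individual roll is uniform.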
Standard Error vs Standard Deviation
These are frequently confused:
Standard deviation (s): how spread out individual data points are. A property of the data.
Standard error (SE = s/√n): how spread out sample means are. A property of your estimate.
As n increases, SE decreases (√n in denominator). Larger samples → more precise estimates of the mean. This is why bigger studies give tighter results.
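The √n relationship is worth seeing numerically. A sketch with a hypothetical s = 10:

```python
import math

s = 10  # sample standard deviation: a property of the data

# Standard error shrinks with sqrt(n): quadrupling n halves the SE
se_100 = s / math.sqrt(100)  # n = 100 -> SE = 1.0
se_400 = s / math.sqrt(400)  # n = 400 -> SE = 0.5
```

Halving the standard error requires four times the data, which is why precision gets expensive quickly.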
Confidence Intervals
A 95% confidence interval means: if you repeated this sampling procedure many times, 95% of the intervals constructed this way would contain the true population mean.
CI = x̄ ± z* × (s/√n)
For 95%: z* = 1.96. For 99%: z* = 2.576.
Example: sample mean = 50, s = 10, n = 100
SE = 10/√100 = 1
95% CI = 50 ± 1.96 × 1 = (48.04, 51.96)
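The worked example above, as a sketch in code:

```python
import math

x_bar, s, n = 50, 10, 100     # values from the worked example
se = s / math.sqrt(n)         # standard error = 1.0
z_star = 1.96                 # critical value for a 95% interval

ci_low = x_bar - z_star * se   # 48.04
ci_high = x_bar + z_star * se  # 51.96
```

Swapping `z_star` for 2.576 gives the wider 99% interval.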
Common misinterpretation: “95% probability that the true mean is in this interval” is wrong. The true mean is fixed — it’s either in the interval or not. The 95% refers to the procedure, not any particular interval.
Correlation
Pearson correlation coefficient r measures linear association between two variables:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / [(n−1) sₓ sᵧ]
Range: −1 to +1.
- r = 1: perfect positive linear relationship
- r = −1: perfect negative linear relationship
- r = 0: no linear relationship (but could be nonlinear)
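The formula can be computed directly. A sketch using sample standard deviations and a hypothetical perfectly linear dataset:

```python
import statistics

def pearson_r(xs, ys):
    # r = sum((x - x_bar)(y - y_bar)) / ((n-1) * s_x * s_y)
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cov_sum / ((n - 1) * sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # ys = 2 * xs: perfectly linear
r = pearson_r(xs, ys)   # 1.0
```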
Correlation ≠ causation. Two variables can correlate because one causes the other, because a third variable causes both (confounding), or by pure chance. The number alone tells you nothing about mechanism.
Anscombe’s Quartet: four datasets with identical means, variances, and correlations (r ≈ 0.816) but completely different shapes — one linear, one curved, one with an outlier. Always plot your data.
Outliers
An outlier is a value unusually far from the rest. Common rule: x is an outlier if it falls more than 1.5 × IQR below Q1 or above Q3.
Effect on statistics:
- Mean and variance: heavily affected by outliers
- Median and IQR: robust to outliers
What to do with outliers: investigate before removing. An outlier might be:
- A data entry error → correct or remove
- A genuine extreme observation → keep
- The most interesting finding in the dataset → investigate further
Removing outliers because they’re inconvenient is a form of data manipulation.
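The 1.5 × IQR rule described above (often attributed to Tukey) can be sketched as a simple flagging function; the dataset is the hypothetical one used earlier:

```python
import numpy as np

def iqr_outliers(data):
    # Flag values beyond 1.5 * IQR below Q1 or above Q3
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

outliers = iqr_outliers([1, 2, 3, 4, 100])  # flags 100
```

This only flags candidates; per the advice above, each flagged value still needs investigating before any action is taken.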
Distributions in Practice
| Statistic | Best with | Sensitive to |
|---|---|---|
| Mean | Symmetric data | Outliers |
| Median | Skewed data | Nothing much |
| Standard deviation | Normal data | Outliers |
| IQR | Any shape | Nothing much |
The choice of statistic is itself a modelling decision — it encodes assumptions about the shape of your data.