LOGBOOK LOG-315
EXPLORING · SOFTWARE · AI · MACHINE-LEARNING · NEURAL-NETWORKS · BACKPROPAGATION · GRADIENT-DESCENT · DEEP-LEARNING

Neural Networks and Backpropagation

A neural network is a function composition. Backpropagation is the chain rule applied to that composition. Understanding why gradient descent works — and when it doesn't — is the foundation everything else builds on.

What a Neural Network Actually Is

Strip away the biological metaphor — the “neurons,” the “synapses,” the “learning” — and a neural network is a parameterized function. Given an input vector x, it produces an output y through a sequence of linear transformations and nonlinear activation functions. The parameters — the weights and biases at each layer — determine what function it computes. Training is the process of adjusting those parameters until the function computes something useful.

A single layer applies a linear map: y = Wx + b, where W is the weight matrix and b is the bias vector. This is just a generalization of linear regression — it can only represent linear relationships between inputs and outputs. The entire point of depth is the nonlinearities between layers. After each linear map, an activation function is applied elementwise — historically sigmoid or tanh, today mostly ReLU (Rectified Linear Unit, f(x) = max(0, x)). Without nonlinearities, composing multiple linear layers still produces a linear function. With them, the composition can represent arbitrarily complex functions.
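As a concrete illustration, here is a minimal sketch of that composition in numpy: two affine maps with a ReLU between them. The layer sizes and random weights are arbitrary choices for illustration, not anything prescribed above.

```python
import numpy as np

def relu(z):
    # Elementwise nonlinearity: max(0, x)
    return np.maximum(z, 0.0)

def forward(x, W1, b1, W2, b2):
    # Dropping relu() here would collapse the two layers into a single affine map.
    h = relu(W1 @ x + b1)   # linear map followed by the elementwise nonlinearity
    return W2 @ h + b2      # final linear map (no activation on the output here)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # input dim 4 -> hidden dim 8
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # hidden dim 8 -> output dim 3
y = forward(rng.normal(size=4), W1, b1, W2, b2)  # a 3-dimensional output
```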

The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) states that a feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough neurons. This is a theorem about expressiveness — it says neural networks can in principle represent any such function. It says nothing about whether gradient descent will actually find the right parameters, or how many neurons “enough” requires in practice. The theorem is widely cited and widely misunderstood; it establishes that the hypothesis class is flexible, not that training is tractable.
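For reference, one standard form of the statement (a sketch that omits some hypotheses; here σ is a fixed sigmoidal activation and K a compact subset of ℝⁿ):

$$\forall f \in C(K),\ \forall \varepsilon > 0:\ \exists N \text{ and } \{\alpha_i, w_i, b_i\}_{i=1}^{N} \text{ such that } \sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^{\top} x + b_i) \right| < \varepsilon.$$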

Backpropagation

Given a loss function L that measures how wrong the network’s output is (e.g., mean squared error for regression, cross-entropy for classification), training requires computing the gradient of L with respect to every parameter in the network. Backpropagation is the algorithm that does this efficiently.
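As a sketch of those two losses, assuming logits (pre-softmax scores) as the network's output and an integer class index as the classification target:

```python
import numpy as np

def mse(y_hat, y):
    # Mean squared error for regression.
    return np.mean((y_hat - y) ** 2)

def cross_entropy(logits, target_index):
    # Cross-entropy for classification: softmax over the logits, then the
    # negative log-probability assigned to the correct class.
    logits = logits - np.max(logits)                     # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return -log_probs[target_index]
```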

The insight is the chain rule. If y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). For a network with many layers, the gradient of the loss with respect to weights in an early layer is a product of gradients through all the layers above it. Backpropagation computes these products layer by layer, starting from the output and propagating backward — hence the name.

The algorithm itself is not deep. The chain rule is applied in the obvious order: compute the gradient of the loss with respect to the final layer’s outputs (a simple derivative of the loss function), then use that to compute the gradient with respect to the final layer’s weights (a matrix multiply), then propagate backward through each layer applying the chain rule to the nonlinearities. The computational cost of a backward pass is roughly two to three times the cost of a forward pass — small enough to make gradient-based training practical.
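A minimal sketch of that backward pass for the two-layer network from earlier, on a single example with a squared-error loss (the layer sizes and loss choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer network: input dim 4 -> hidden dim 8 -> output dim 3.
W1, b1 = rng.normal(size=(8, 4)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)) * 0.1, np.zeros(3)
x, y = rng.normal(size=4), rng.normal(size=3)   # one input and its target

# Forward pass, caching the intermediates the backward pass needs.
z1 = W1 @ x + b1                 # hidden pre-activation
h = np.maximum(z1, 0.0)          # ReLU
y_hat = W2 @ h + b2              # network output
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: chain rule from the output back toward the input.
d_yhat = y_hat - y               # dL/dy_hat for squared error
dW2 = np.outer(d_yhat, h)        # dL/dW2
db2 = d_yhat                     # dL/db2
d_h = W2.T @ d_yhat              # propagate through the final linear map
d_z1 = d_h * (z1 > 0)            # ReLU gradient: 1 where z1 > 0, else 0
dW1 = np.outer(d_z1, x)          # dL/dW1
db1 = d_z1                       # dL/db1
```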

Backpropagation was known in the control theory literature in the 1960s and had been derived in closely related forms by the mid-1970s, but it was the 1986 Rumelhart, Hinton, and Williams paper in Nature — demonstrating it working on practical problems — that established it as the standard training algorithm. The algorithm didn’t change; the demonstration of its practical utility did.

Gradient Descent and Its Variants

Once gradients are computed, they’re used to update the parameters: θ ← θ − η∇L(θ), where η (the learning rate) controls the step size. This is vanilla gradient descent — computed on the entire dataset, taking a step in the direction that locally decreases the loss most steeply.

In practice, gradients are computed on small random subsets of the training data — minibatches of 32 to 512 examples. This is stochastic gradient descent (SGD). The gradient estimate from a minibatch is noisy relative to the full-data gradient, but the noise turns out to be beneficial: it keeps the optimizer from settling into sharp minima (which tend to generalize poorly) and nudges it toward flatter minima (which often generalize better). The noise acts as a regularizer.
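A minimal sketch of a minibatch SGD loop. The helper grad_fn (returning per-parameter gradients for a batch, e.g. via the backward pass sketched above), the dataset arrays X and y, and the hyperparameter values are all assumptions for illustration:

```python
import numpy as np

def sgd(params, X, y, grad_fn, lr=0.01, batch_size=64, epochs=10):
    rng = np.random.default_rng(0)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                   # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # one random minibatch
            grads = grad_fn(params, X[idx], y[idx])  # noisy gradient estimate
            for k in params:
                params[k] -= lr * grads[k]           # θ ← θ − η∇L(θ)
    return params
```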

Modern optimizers add momentum and adaptive learning rates. Adam (Adaptive Moment Estimation; Kingma and Ba, 2014) maintains running estimates of the first moment (mean) and second moment (uncentered variance) of the gradients and uses them to adapt the learning rate per parameter. Parameters whose gradients consistently point in the same direction get a larger effective step; parameters whose gradients are noisy get a smaller one. Adam converges faster than vanilla SGD on most problems and requires less learning-rate tuning.
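A sketch of the Adam update following that description, with params and grads as dicts of numpy arrays and the commonly used default hyperparameters (the dict layout is an assumption for illustration):

```python
import numpy as np

def adam_step(params, grads, state, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count, used for bias correction.
    for k in params:
        m, v = state.get(k, (np.zeros_like(params[k]), np.zeros_like(params[k])))
        m = beta1 * m + (1 - beta1) * grads[k]             # first moment (mean)
        v = beta2 * v + (1 - beta2) * grads[k] ** 2        # second moment (uncentered)
        m_hat = m / (1 - beta1 ** t)                       # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
        state[k] = (m, v)
    return params, state
```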

The Loss Landscape

Why gradient descent finds good solutions is not fully understood. The loss function of a large neural network is a high-dimensional surface with many local minima, saddle points, and flat regions. The naive worry is that gradient descent will get stuck in bad local minima that generalize poorly.

Empirically, this worry appears overstated for large overparameterized networks. Large networks trained on sufficient data consistently find solutions that generalize well, despite the non-convexity of the optimization problem. Theoretical and empirical work (Goodfellow, Vinyals, and Saxe; Zhang, Bengio, and Recht) suggests that in high-dimensional loss landscapes the local minima that do exist tend to be near-globally optimal, and that most other critical points are saddle points, which gradient descent with noise can escape.

The more empirically relevant failure modes are: learning rate too large (divergence), learning rate too small (slow convergence or settling into a bad region), vanishing gradients in deep networks (gradients near zero in early layers, so they don’t update), and exploding gradients (gradients that grow exponentially through layers, causing numerical instability).
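One common mitigation for the last of these (not discussed above; added here as an illustrative sketch) is gradient-norm clipping, which rescales the gradient whenever its overall norm exceeds a threshold:

```python
import numpy as np

def clip_gradients(grads, max_norm=1.0):
    # Global-norm clipping: if the combined gradient norm exceeds max_norm,
    # scale every gradient down by the same factor.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = {k: g * scale for k, g in grads.items()}
    return grads
```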

Vanishing gradients are why deep networks were impractical before ~2012. The sigmoid activation function saturates — its gradient approaches zero for large positive or negative inputs. Composing many sigmoid layers multiplicatively reduces gradients in early layers to near zero. ReLU avoids this: its gradient is exactly 1 for positive inputs, so gradients pass through unchanged (no saturation in the positive regime). The switch from sigmoid to ReLU, combined with better initialization schemes, made very deep networks trainable.
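A numerical sketch of that saturation argument; the depth and the pre-activation value are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 2.5        # a moderately large pre-activation
depth = 20     # number of stacked layers

# Product of sigmoid derivatives through the stack: shrinks toward zero.
sig_grad = (sigmoid(z) * (1 - sigmoid(z))) ** depth   # ~1e-23
# ReLU's derivative is exactly 1 for positive inputs, so the product stays 1.
relu_grad = 1.0 ** depth

print(sig_grad, relu_grad)
```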

What Networks Are Actually Learning

What learned representations actually mean is a question still being worked out. A network trained on image classification doesn’t explicitly learn edge detectors, then shape detectors, then object detectors — but it does appear to learn a hierarchy of features, with early layers responding to low-level structure (edges, colors, textures) and later layers responding to high-level semantic content.

This was shown compellingly by feature-visualization work (Zeiler and Fergus, 2013, and related activation-maximization methods): by mapping a unit’s activations back to input space, or by running gradient ascent on the input to find the input that maximally activates a given unit, you can see what each unit is “looking for.” Early-layer units in a convolutional image network respond to oriented edges and color blobs. Intermediate layers respond to textures and parts. Late layers respond to specific object types or combinations.
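A sketch of the gradient-ascent version of that idea using PyTorch autograd. The pretrained model, the target layer, and the unit index are hypothetical placeholders; real visualizations also add regularization (jitter, blurring, frequency penalties) that is omitted here:

```python
import torch

def visualize_unit(model, layer, unit, steps=200, lr=0.1):
    # Start from a random image and ascend the gradient of the unit's activation.
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    captured = {}

    def hook(module, inputs, output):
        captured["act"] = output                    # stash the layer's activations

    handle = layer.register_forward_hook(hook)
    for _ in range(steps):
        model(x)
        act = captured["act"][0, unit].mean()       # mean activation of one channel
        act.backward()
        with torch.no_grad():
            x += lr * x.grad                        # gradient *ascent* on the input
            x.grad.zero_()
    handle.remove()
    return x.detach()
```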

The deeper question — what is the network doing computationally when it recognizes an image? — is what mechanistic interpretability attempts to answer. The neurons and their activations are visible; the algorithm being implemented is not. This is the interpretability problem, and it’s why understanding trained neural networks requires a separate research program beyond simply running them.

The Foundation for Everything Else

Backpropagation and gradient descent are unchanged in the largest language models, image generators, and protein folding networks that dominate current AI. The ingredients scaled: more parameters, more data, more compute, better architectures. The training algorithm is recognizably the same as what Rumelhart, Hinton, and Williams demonstrated in 1986.

This is worth holding onto. The revolution in AI capability over the past decade was not primarily a revolution in learning algorithms. It was a revolution in scale — in the engineering systems that make very large gradient descent tractable — and in architecture, specifically the transformer, which changed what gradient descent is optimizing over. The foundation remained: a parameterized function, a loss, a gradient, a step.