How LLMs Work — via Llama 3
Grounding the fundamentals — tokens, embeddings, attention, and next-token prediction — using Llama 3 as the concrete reference.
The Core Loop
An LLM does one thing: given a sequence of tokens, predict the next token. That’s it. Everything else — coherence, reasoning, instruction-following — emerges from doing that one thing extremely well, across an enormous amount of text.
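In rough Python, the whole loop fits in a handful of lines. This is a minimal sketch, not a real pipeline: `model`, `tokenize`, and `detokenize` are placeholders for whatever stack actually runs Llama 3.

```python
import random
from typing import Callable, List

def generate(
    model: Callable[[List[int]], List[float]],   # token IDs in -> probabilities over the vocab (placeholder)
    tokenize: Callable[[str], List[int]],        # placeholder tokenizer
    detokenize: Callable[[List[int]], str],
    prompt: str,
    max_new_tokens: int = 50,
) -> str:
    """Predict one token, append it, repeat. That is the entire loop."""
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)                                          # one forward pass
        next_id = random.choices(range(len(probs)), weights=probs)[0]  # sample the next token
        tokens.append(next_id)                                         # feed it back in
    return detokenize(tokens)
```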
Tokens
Text is not fed in as characters or words — it’s split into tokens, subword chunks produced by a learned tokenizer (Llama 3 uses a BPE tokenizer with a 128k vocabulary).
- “hello” → 1 token
- “unbelievable” → 3 tokens
- Code, punctuation, and whitespace tokenize differently than prose
Why it matters: the model has no concept of characters. It never “sees” letters — only token IDs. Spelling errors, letter-counting tasks, and character manipulation are hard for LLMs because the structure they operate on is tokens, not glyphs.
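To see this concretely, the sketch below runs a few strings through the Llama 3 tokenizer via Hugging Face `transformers`. It assumes the library is installed and you have access to the gated `meta-llama/Meta-Llama-3-8B` repo; any BPE tokenizer shows the same behavior.

```python
# Assumes `transformers` is installed and the gated meta-llama repo is accessible.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

for text in ["hello", "unbelievable", "x = np.arange(10)"]:
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{text!r:22} -> {len(ids)} token(s): {tok.convert_ids_to_tokens(ids)}")
# The model only ever sees the integer IDs, never the characters inside them.
```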
Embeddings
Each token ID maps to a high-dimensional vector — its embedding. Llama 3 8B uses 4096-dimensional vectors. These vectors are learned during training and encode semantic meaning: similar concepts cluster together in this space.
The embedding layer is just a lookup table. Token ID → vector. That vector is what the model actually processes.
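A minimal PyTorch sketch of that lookup. Dimensions match Llama 3 8B; the weights here are random, not the real ones.

```python
import torch

vocab_size, d_model = 128_256, 4096                   # Llama 3's vocabulary and hidden size
embedding = torch.nn.Embedding(vocab_size, d_model)   # a vocab_size x d_model table

token_ids = torch.tensor([15339, 1917])               # two arbitrary token IDs
vectors = embedding(token_ids)                        # rows pulled straight out of the table
print(vectors.shape)                                  # torch.Size([2, 4096])
```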
The Transformer
The transformer is a stack of identical layers. Llama 3 8B has 32 of them. Each layer has two sub-components:
Self-Attention
Attention lets every token look at every other token in the context and decide how much to “attend” to it. For each token, the layer computes:
- Query (Q): what this token is looking for
- Key (K): what this token offers to others
- Value (V): what this token passes along if attended to
The attention score between two tokens is Q · K / √d: the dot product of one token's query with another's key, scaled by the square root of the head dimension, then softmaxed across the context into attention weights. High score = attend more, pull more of that token's value into this one's representation.
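That computation, for a single attention head with random weights, is a few lines of PyTorch. A real decoder also adds a causal mask so tokens can't attend to positions after them; this sketch leaves that out.

```python
import torch

seq_len, d = 6, 64
x = torch.randn(seq_len, d)                           # one row per token
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))    # learned projections (random here)

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / d**0.5                             # all pairwise dot products, scaled
weights = torch.softmax(scores, dim=-1)               # each row sums to 1: how much to attend
out = weights @ V                                     # weighted mix of value vectors
print(out.shape)                                      # torch.Size([6, 64])
```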
This is how “the cat sat on the mat — it was tired” knows that “it” refers to “the cat” — attention connects them.
Llama 3 uses Grouped Query Attention (GQA): the query heads are split into groups, and each group shares a single key/value head. Faster inference and a much smaller KV cache, with nearly identical quality.
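The grouping itself is simple. A sketch with Llama 3 8B's head counts (32 query heads, 8 key/value heads, so 4 query heads per K/V head), using random tensors:

```python
import torch

n_q_heads, n_kv_heads, seq_len, head_dim = 32, 8, 6, 128
q = torch.randn(n_q_heads, seq_len, head_dim)
k = torch.randn(n_kv_heads, seq_len, head_dim)        # only 8 K/V heads are ever stored
v = torch.randn(n_kv_heads, seq_len, head_dim)

group = n_q_heads // n_kv_heads                       # 4 query heads share each K/V head
k = k.repeat_interleave(group, dim=0)                 # reuse each K/V head across its group
v = v.repeat_interleave(group, dim=0)

weights = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
print((weights @ v).shape)                            # (32, 6, 128): full output, quarter-size KV cache
```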
Feed-Forward Network
After attention, each token’s representation passes through a feed-forward network independently: a gated MLP with a SwiGLU activation that expands to a wider hidden dimension and projects back down. This is where most of the model’s “knowledge” is thought to be stored, a compressed lookup of patterns seen during training.
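A sketch of the block's shape in PyTorch. Dimensions match Llama 3 8B (4096 in and out, 14336 hidden); the weights are random.

```python
import torch

class SwiGLU(torch.nn.Module):
    def __init__(self, d_model: int = 4096, d_hidden: int = 14336):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, d_hidden, bias=False)
        self.up = torch.nn.Linear(d_model, d_hidden, bias=False)
        self.down = torch.nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate(x)) acts as a learned gate on up(x), then project back down
        return self.down(torch.nn.functional.silu(self.gate(x)) * self.up(x))

print(SwiGLU()(torch.randn(6, 4096)).shape)           # torch.Size([6, 4096])
```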
Positional Encoding
Transformers have no built-in sense of order; attention on its own treats the context as an unordered set. Position is injected via RoPE (Rotary Position Embedding): query and key vectors are rotated in embedding space by an angle proportional to their position. Because the rotation depends on position, the attention score between two tokens ends up depending on how far apart they are, which lets the model weight recency and proximity naturally.
Llama 3.1’s context window is 128k tokens (the original Llama 3 release shipped with 8k); RoPE’s base frequency is scaled up so the rotations stay informative across that range.
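A minimal sketch of the rotation, using the common "split halves" convention and the original base of 10,000 (Llama's actual implementation differs in details, and Llama 3 uses a larger base):

```python
import torch

def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate pairs of dimensions of each row by an angle proportional to its position."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half) / half)                # one frequency per dimension pair
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]    # angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(6, 64)
print(rope(q).shape)   # same shape; position is now baked into the vector itself
```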
The Output
After all 32 layers, the final token’s vector is projected back out into vocabulary space: one raw score (a logit) for each of the 128k entries in the vocabulary. A softmax turns those logits into a probability distribution over all possible next tokens. Sampling picks one.
Temperature controls how peaked or flat that distribution is. Low temperature → greedy, high-probability picks. High temperature → more spread, more surprise.
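A sketch of sampling with temperature. The logits here are random; in the real model they come from that final projection.

```python
import torch

def sample(logits: torch.Tensor, temperature: float = 0.7) -> int:
    if temperature == 0:                                   # common convention: T=0 means greedy
        return int(logits.argmax())
    probs = torch.softmax(logits / temperature, dim=-1)    # low T sharpens, high T flattens
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(128_256)            # one raw score per vocabulary entry
print(sample(logits, temperature=0.2))   # near-greedy
print(sample(logits, temperature=1.5))   # flatter distribution, more surprise
```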
Training in One Line
The model is trained to minimize prediction error across trillions of tokens of text — next-token prediction as self-supervised learning. No labels needed. The signal is the text itself.
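The objective really is that short. A sketch with random tensors standing in for the model's outputs and the training text:

```python
import torch

vocab_size, seq_len = 128_256, 16
logits = torch.randn(seq_len, vocab_size)             # model's scores, one row per position
tokens = torch.randint(0, vocab_size, (seq_len,))     # the text itself provides the labels

# Cross-entropy between the prediction at position t and the actual token at t+1.
loss = torch.nn.functional.cross_entropy(logits[:-1], tokens[1:])
print(loss)   # minimize this across trillions of tokens; no human labels required
```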
The instruct variant (llama3.1:8b-instruct) is further trained with supervised fine-tuning and preference optimization (RLHF-style reward signals and DPO) to follow instructions and decline harmful requests.
What It Actually Is
A very deep function: tokens in → probability distribution over next tokens out. Repeated autoregressively to generate text. The “intelligence” is the geometry learned in those 4096-dimensional spaces — relationships, patterns, structure — compressed into 8 billion parameters.