How LLMs Work — via Llama 3
Grounding the fundamentals — tokens, embeddings, attention, and next-token prediction — using Llama 3 as the concrete reference.
The Core Loop
An LLM does one thing: given a sequence of tokens, predict the next token. That’s it. Everything else — coherence, reasoning, instruction-following — emerges from doing that one thing extremely well, across an enormous amount of text.
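In rough Python, the whole loop fits in a handful of lines. This is a minimal sketch, not a real pipeline: `model`, `tokenize`, and `detokenize` are placeholders for whatever stack actually runs Llama 3.

```python
import random
from typing import Callable, List

def generate(
    model: Callable[[List[int]], List[float]],   # token IDs in -> probabilities over the vocab (placeholder)
    tokenize: Callable[[str], List[int]],        # placeholder tokenizer
    detokenize: Callable[[List[int]], str],
    prompt: str,
    max_new_tokens: int = 50,
) -> str:
    """Predict one token, append it, repeat. That is the entire loop."""
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)                                          # one forward pass
        next_id = random.choices(range(len(probs)), weights=probs)[0]  # sample the next token
        tokens.append(next_id)                                         # feed it back in
    return detokenize(tokens)
```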
Tokens
Text is not fed in as characters or words — it’s split into tokens, subword chunks produced by a learned tokenizer (Llama 3 uses a BPE tokenizer with a 128k vocabulary).
- “hello” → 1 token
- “unbelievable” → 3 tokens
- Code, punctuation, and whitespace tokenize differently than prose
Why it matters: the model has no concept of characters. It never “sees” letters — only token IDs. Spelling errors, letter-counting tasks, and character manipulation are hard for LLMs because the structure they operate on is tokens, not glyphs.
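To see this concretely, the sketch below runs a few strings through the Llama 3 tokenizer via Hugging Face `transformers`. It assumes the library is installed and you have access to the gated `meta-llama/Meta-Llama-3-8B` repo; any BPE tokenizer shows the same behavior.

```python
# Assumes `transformers` is installed and the gated meta-llama repo is accessible.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

for text in ["hello", "unbelievable", "x = np.arange(10)"]:
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{text!r:22} -> {len(ids)} token(s): {tok.convert_ids_to_tokens(ids)}")
# The model only ever sees the integer IDs, never the characters inside them.
```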
Embeddings
Each token ID maps to a high-dimensional vector — its embedding. Llama 3 8B uses 4096-dimensional vectors. These vectors are learned during training and encode semantic meaning: similar concepts cluster together in this space.
The embedding layer is just a lookup table. Token ID → vector. That vector is what the model actually processes.
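A minimal PyTorch sketch of that lookup. Dimensions match Llama 3 8B; the weights here are random, not the real ones.

```python
import torch

vocab_size, d_model = 128_256, 4096                   # Llama 3's vocabulary and hidden size
embedding = torch.nn.Embedding(vocab_size, d_model)   # a vocab_size x d_model table

token_ids = torch.tensor([15339, 1917])               # two arbitrary token IDs
vectors = embedding(token_ids)                        # rows pulled straight out of the table
print(vectors.shape)                                  # torch.Size([2, 4096])
```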
The Transformer
The transformer is a stack of identical layers. Llama 3 8B has 32 of them. Each layer has two sub-components:
Self-Attention
Attention lets every token look at every other token in the context and decide how much to “attend” to it. For each token, the layer computes:
- Query (Q): what this token is looking for
- Key (K): what this token offers to others
- Value (V): what this token passes along if attended to
The attention score between two tokens is Q · K / √d: the dot product of one token's query with another's key, scaled by the square root of the head dimension, then softmaxed across the context into attention weights. High score = attend more, pull more of that token's value into this one's representation.
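That computation, for a single attention head with random weights, is a few lines of PyTorch. A real decoder also adds a causal mask so tokens can't attend to positions after them; this sketch leaves that out.

```python
import torch

seq_len, d = 6, 64
x = torch.randn(seq_len, d)                           # one row per token
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))    # learned projections (random here)

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / d**0.5                             # all pairwise dot products, scaled
weights = torch.softmax(scores, dim=-1)               # each row sums to 1: how much to attend
out = weights @ V                                     # weighted mix of value vectors
print(out.shape)                                      # torch.Size([6, 64])
```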
This is how “the cat sat on the mat — it was tired” knows that “it” refers to “the cat” — attention connects them.
Llama 3 uses Grouped Query Attention (GQA): the query heads are split into groups, and each group shares a single key/value head. Faster inference and a much smaller KV cache, with nearly identical quality.
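The grouping itself is simple. A sketch with Llama 3 8B's head counts (32 query heads, 8 key/value heads, so 4 query heads per K/V head), using random tensors:

```python
import torch

n_q_heads, n_kv_heads, seq_len, head_dim = 32, 8, 6, 128
q = torch.randn(n_q_heads, seq_len, head_dim)
k = torch.randn(n_kv_heads, seq_len, head_dim)        # only 8 K/V heads are ever stored
v = torch.randn(n_kv_heads, seq_len, head_dim)

group = n_q_heads // n_kv_heads                       # 4 query heads share each K/V head
k = k.repeat_interleave(group, dim=0)                 # reuse each K/V head across its group
v = v.repeat_interleave(group, dim=0)

weights = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
print((weights @ v).shape)                            # (32, 6, 128): full output, quarter-size KV cache
```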
Feed-Forward Network
After attention, each token’s representation passes through a feed-forward network independently: a gated MLP with a SwiGLU activation that expands to a wider hidden dimension and projects back down. This is where most of the model’s “knowledge” is thought to be stored, a compressed lookup of patterns seen during training.
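A sketch of the block's shape in PyTorch. Dimensions match Llama 3 8B (4096 in and out, 14336 hidden); the weights are random.

```python
import torch

class SwiGLU(torch.nn.Module):
    def __init__(self, d_model: int = 4096, d_hidden: int = 14336):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, d_hidden, bias=False)
        self.up = torch.nn.Linear(d_model, d_hidden, bias=False)
        self.down = torch.nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate(x)) acts as a learned gate on up(x), then project back down
        return self.down(torch.nn.functional.silu(self.gate(x)) * self.up(x))

print(SwiGLU()(torch.randn(6, 4096)).shape)           # torch.Size([6, 4096])
```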
Positional Encoding
Transformers have no built-in sense of order; attention on its own treats the context as an unordered set. Position is injected via RoPE (Rotary Position Embedding): query and key vectors are rotated in embedding space by an angle proportional to their position. Because the rotation depends on position, the attention score between two tokens ends up depending on how far apart they are, which lets the model weight recency and proximity naturally.
Llama 3.1’s context window is 128k tokens (the original Llama 3 release shipped with 8k); RoPE’s base frequency is scaled up so the rotations stay informative across that range.
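A minimal sketch of the rotation, using the common "split halves" convention and the original base of 10,000 (Llama's actual implementation differs in details, and Llama 3 uses a larger base):

```python
import torch

def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate pairs of dimensions of each row by an angle proportional to its position."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half) / half)                # one frequency per dimension pair
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]    # angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(6, 64)
print(rope(q).shape)   # same shape; position is now baked into the vector itself
```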
The Output
After all 32 layers, the final token’s vector is projected back out into vocabulary space: one raw score (a logit) for each of the 128k entries in the vocabulary. A softmax turns those logits into a probability distribution over all possible next tokens. Sampling picks one.
Temperature controls how peaked or flat that distribution is. Low temperature → greedy, high-probability picks. High temperature → more spread, more surprise.
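A sketch of sampling with temperature. The logits here are random; in the real model they come from that final projection.

```python
import torch

def sample(logits: torch.Tensor, temperature: float = 0.7) -> int:
    if temperature == 0:                                   # common convention: T=0 means greedy
        return int(logits.argmax())
    probs = torch.softmax(logits / temperature, dim=-1)    # low T sharpens, high T flattens
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(128_256)            # one raw score per vocabulary entry
print(sample(logits, temperature=0.2))   # near-greedy
print(sample(logits, temperature=1.5))   # flatter distribution, more surprise
```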
Training in One Line
The model is trained to minimize prediction error across trillions of tokens of text — next-token prediction as self-supervised learning. No labels needed. The signal is the text itself.
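The objective really is that short. A sketch with random tensors standing in for the model's outputs and the training text:

```python
import torch

vocab_size, seq_len = 128_256, 16
logits = torch.randn(seq_len, vocab_size)             # model's scores, one row per position
tokens = torch.randint(0, vocab_size, (seq_len,))     # the text itself provides the labels

# Cross-entropy between the prediction at position t and the actual token at t+1.
loss = torch.nn.functional.cross_entropy(logits[:-1], tokens[1:])
print(loss)   # minimize this across trillions of tokens; no human labels required
```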
The instruct variant (llama3.1:8b-instruct) is further trained with supervised fine-tuning and preference optimization (RLHF-style reward signals and DPO) to follow instructions and decline harmful requests.
What It Actually Is
A very deep function: tokens in → probability distribution over next tokens out. Repeated autoregressively to generate text. The “intelligence” is the geometry learned in those 4096-dimensional spaces — relationships, patterns, structure — compressed into 8 billion parameters.