
Transformers, attention, tokens, temperature — explained with code analogies instead of math papers. Finally understand the AI you use every day.
I used Claude and ChatGPT for 6 months before I actually understood what happens under the hood. Most explanations are either dumbed-down analogies ("it's like autocomplete!") or incomprehensible math papers.
Here's the developer-friendly middle ground. No PhD required.
An LLM is a massive function that takes a sequence of tokens (roughly, word chunks) and outputs a probability distribution over the next token. That's it. Everything else is optimization.
It doesn't "know" JavaScript is popular. It learned statistical patterns from billions of text examples.
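"Learned statistical patterns" sounds abstract, so here's the crudest possible version of the idea: a bigram model that predicts the next word purely from counts. The ten-word corpus and the `predict_next` helper are invented for illustration — real models learn from billions of examples and far richer context.

```python
from collections import Counter, defaultdict

# Tiny invented corpus — real models train on billions of examples.
corpus = "the cat sat on the mat because the cat was tired".split()

# Count which token follows which. This is the simplest possible "LLM".
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Return a probability distribution over the next token, learned purely from counts."""
    counts = follows[token]
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

print(predict_next("the"))  # "cat" wins with probability 2/3, "mat" gets 1/3
```

Swap the ten-word corpus for the internet and the counting for a trillion learned weights, and you have the gist of the whole enterprise.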
Think of it like a pipeline:
| Stage | What Happens | Developer Analogy |
|-------|-------------|-------------------|
| Tokenization | Split text into tokens | "Hello world".split() but smarter |
| Embedding | Convert tokens to vectors | Like a hash map: word → [numbers] |
| Attention | Find relationships between tokens | Like SQL JOINs between words |
| Feed-Forward | Transform the representations | Like middleware processing |
| Output | Predict next token probabilities | Like softmax() → probability distribution |
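The five stages above compose into one function. Here's a runnable toy version — every vocabulary entry, vector, and weight below is invented, and each stage is a drastic simplification; the point is the shape of the computation, not fidelity.

```python
import math

# Invented three-word vocabulary and 2-dimensional embeddings.
VOCAB = {"Hello": 0, "world": 1, "!": 2}
EMBED = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}

def tokenize(text):                       # stage 1: text -> token ids
    return [VOCAB[w] for w in text.split()]

def embed(ids):                           # stage 2: ids -> vectors
    return [EMBED[i] for i in ids]

def attention(vecs):                      # stage 3: mix info across positions
    # Real attention computes learned, weighted mixes; a plain mean over
    # positions is the crudest possible stand-in.
    n = len(vecs)
    return [[sum(v[d] for v in vecs) / n for d in range(2)] for _ in vecs]

def feed_forward(vecs):                   # stage 4: transform each position
    return [[max(0.0, 2 * x - 0.1) for x in v] for v in vecs]  # toy ReLU layer

def output_head(vec):                     # stage 5: vector -> probabilities
    logits = [vec[0], vec[1], vec[0] + vec[1]]  # one invented logit per word
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return {w: exps[i] / z for w, i in VOCAB.items()}

probs = output_head(feed_forward(attention(embed(tokenize("Hello world"))))[-1])
print(probs)  # a probability distribution over the next token
```

Note how only the last position's vector feeds the output head — the model predicts what comes after the final token.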
LLMs don't see words. They see tokens — chunks that might be words, parts of words, or single characters.
💡 This is why LLMs are bad at counting letters in words — they literally don't see individual letters. They see token chunks.
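To see why the letters disappear, here's a toy greedy longest-match tokenizer. The vocabulary is invented for illustration — real tokenizers (BPE) learn their chunks from data — but the effect is the same: the model receives opaque ids, not characters.

```python
# Invented vocabulary: chunk -> token id. Real vocabularies have ~100K entries.
vocab = {"straw": 101, "berry": 102, "s": 103, "t": 104, "r": 105}

def tokenize(word, vocab):
    """Greedy longest-match tokenization — a simplified stand-in for BPE.
    (Handling of text with no matching chunk is omitted for brevity.)"""
    tokens = []
    while word:
        # Take the longest vocab entry that prefixes the remaining text.
        match = max((p for p in vocab if word.startswith(p)), key=len)
        tokens.append(vocab[match])
        word = word[len(match):]
    return tokens

print(tokenize("strawberry", vocab))  # [101, 102] — two chunks, zero letters
```

Ask a model "how many r's are in strawberry?" and it's reasoning about ids 101 and 102 — the letters were gone before the model ever saw the question.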
The transformer attention mechanism is what made modern AI possible. Here's the intuition:
Take the sentence "The cat sat on the mat because it was tired." Attention lets the model figure out that "it" refers to "cat" (not "mat") by computing relevance scores between every pair of tokens.
| Token | Attention to "it" |
|-------|------------------|
| The | 0.02 |
| cat | 0.45 ← highest! |
| sat | 0.03 |
| on | 0.01 |
| the | 0.02 |
| mat | 0.12 |
| because | 0.05 |
| it | 0.20 |
| was | 0.08 |
| tired | 0.02 |
The model "attends" most to "cat" when processing "it." This is computed for every token pair in the input. That's why it's called "self-attention."
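The mechanics behind a table like that can be sketched in a few lines. In a transformer, the score between two tokens is a dot product of a "query" vector and a "key" vector, scaled by the square root of the dimension, then softmaxed into weights. The 2-dimensional vectors below are invented toy numbers, not real embeddings.

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Toy query vector for "it" and toy key vectors for a few candidate tokens.
query_it = [1.0, 0.5]
keys = {"cat": [2.0, 1.0], "mat": [0.5, 0.2], "sat": [-1.0, 0.3]}

# score = dot(query, key) / sqrt(dim), as in scaled dot-product attention.
dim = len(query_it)
scores = {tok: sum(q * k for q, k in zip(query_it, key)) / math.sqrt(dim)
          for tok, key in keys.items()}

weights = dict(zip(scores, softmax(list(scores.values()))))
print(weights)  # "cat" gets by far the largest weight
```

The model never hard-codes "it means cat" — the query and key vectors are learned so that co-referring tokens end up with large dot products.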
| Parameter | What It Does | Low Value | High Value |
|-----------|-------------|-----------|------------|
| Temperature | Controls randomness | Deterministic, focused | Creative, varied |
| Top-K | Only consider top K tokens | More focused | More diverse |
| Top-P | Only consider tokens summing to P probability | More focused | More diverse |
| Max Tokens | Output length limit | Short responses | Long responses |
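All four knobs act on the same probability distribution at the final step. Here's a sketch of how they compose — real API implementations differ in details (and the example logits are invented), but the mechanics are these:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sketch of standard sampling knobs. Temperature must be > 0 here."""
    # Temperature rescales logits: low T sharpens the distribution, high T flattens it.
    scaled = {t: l / temperature for t, l in logits.items()}
    # Softmax -> probabilities, sorted most likely first.
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()), key=lambda p: -p[1])
    # Top-K: keep only the K most probable tokens.
    if top_k is not None:
        probs = probs[:top_k]
    # Top-P (nucleus): keep the smallest prefix whose probabilities sum to P.
    if top_p is not None:
        kept, total = [], 0.0
        for t, p in probs:
            kept.append((t, p))
            total += p
            if total >= top_p:
                break
        probs = kept
    # Renormalize whatever survived the cutoffs, then draw.
    z = sum(p for _, p in probs)
    tokens, weights = zip(*[(t, p / z) for t, p in probs])
    return random.choices(tokens, weights=weights)[0]

logits = {"JavaScript": 2.0, "Python": 1.5, "COBOL": -1.0}
print(sample(logits, temperature=0.1))  # almost always "JavaScript"
print(sample(logits, temperature=2.0))  # much more varied
```

Max tokens isn't shown because it's simpler: generation just stops after N draws.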
| Model | Parameters | Training Data | Training Cost |
|-------|-----------|--------------|---------------|
| GPT-2 (2019) | 1.5B | 40GB text | ~$50K |
| GPT-3 (2020) | 175B | 570GB text | ~$4.6M |
| GPT-4 (2023) | ~1.8T (rumored) | ~13T tokens | ~$100M |
| Claude Opus | Undisclosed | Undisclosed | Undisclosed |
| Llama 3.1 405B | 405B | 15T tokens | ~$30M+ |
One "parameter" is one number the model learned during training. GPT-4 reportedly has roughly 1.8 trillion of them. Your brain has ~100 trillion synapses. We're getting closer.
| Limitation | Technical Reason |
|-----------|-----------------|
| Math errors | Tokens aren't numbers — "7 + 8" is three tokens, not arithmetic |
| Hallucination | Model maximizes probability, not truth |
| Can't count letters | Tokenization hides individual characters |
| No real-time knowledge | Training data has a cutoff date |
| Inconsistent outputs | Sampling is probabilistic by design |
Understanding how LLMs work makes you a better prompt engineer and a better builder of AI apps.
You don't need to build an LLM. But understanding the machine you're using every day? That's just being a good engineer. 🧠