The whole idea in one minute
An LLM, or large language model, takes your text, turns it into numbers, runs those numbers through many transformer layers, and predicts what text should come next.
That is the simple version. The useful version is this:
- Your prompt is split into tokens, which are small text pieces.
- Each token becomes a vector, which is a list of numbers that carries learned meaning.
- The model adds information about order, because "dog bites man" and "man bites dog" do not mean the same thing.
- Attention lets each token decide which earlier tokens matter.
- A feed-forward network does deeper processing for each token.
- Residual connections and normalization keep the many layers stable.
- The model outputs scores for the next possible token.
- One token is chosen, added to the text, and the loop repeats.
flowchart LR
A["You type a prompt"] --> B["Tokenizer<br>text pieces"]
B --> C["Embeddings<br>meaning as numbers"]
C --> D["Position signal<br>word order"]
D --> E["Attention<br>what should matter?"]
E --> F["Feed-forward layer<br>deeper processing"]
F --> G["Next-token scores"]
G --> H["Pick one token"]
H --> I["Add it to the text"]
I --> E
| Part | Plain-English job | Why it matters |
|---|---|---|
| Tokens | Break text into pieces | The model cannot read raw words or letters directly. |
| Embeddings | Turn pieces into meaning-shaped numbers | Similar ideas can sit near each other in number-space. |
| Position | Tell the model where each piece appears | Order changes meaning. |
| Attention | Let tokens look at useful previous tokens | This is how context flows through the sentence. |
| Feed-forward network | Process each token more deeply | A lot of learned structure lives here. |
| Next-token prediction | Score likely continuations | This is the generation loop behind every answer. |
1. Tokens: the model's alphabet is not your alphabet
Models do not see your sentence the way you do. You see words. The model sees token IDs.
A tokenizer might split a sentence like this:
Text: "The sleepy robot writes poetry."
Tokens: ["The", " sleepy", " robot", " writes", " poetry", "."]
IDs: [791, 47823, 11205, 13004, 24465, 13]
Those ID numbers are what enter the model. The specific numbers differ across model families, but the pattern is the same: text becomes a sequence of integers.
Why not just use whole words? Because language is messy. New names, typos, code, slang, and other languages would explode the vocabulary. Tokens sit between letters and words: flexible enough for rare text, efficient enough for common text.
Slightly technical: why the strawberry counting problem happens
When you ask a model how many letters are in a word, the model may not be looking at separate letters. It may see a word as one or a few tokens. That means character-level questions can be awkward unless the model deliberately reasons about spelling.
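To make the counting problem concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `greedy_tokenize` are invented for this example; real tokenizers learn their pieces from data (often with byte-pair encoding) and will split words differently.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces.
TOY_VOCAB = {"str": 0, "aw": 1, "berry": 2, "s": 3, "t": 4, "r": 5,
             "a": 6, "w": 7, "b": 8, "e": 9, "y": 10}

def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary piece that matches."""
    pieces = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

word = "strawberry"
pieces = greedy_tokenize(word, TOY_VOCAB)
token_ids = [TOY_VOCAB[piece] for piece in pieces]

print(pieces)           # ['str', 'aw', 'berry']
print(token_ids)        # [0, 1, 2]
print(word.count("r"))  # 3
```

Counting the letter is trivial on characters, but in this sketch the model would only receive three IDs, none of which is the letter "r" on its own.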
const vocabulary = {
  "The": 791,
  " sleepy": 47823,
  " robot": 11205,
  " writes": 13004,
  " poetry": 24465,
  ".": 13,
};
const prompt = ["The", " sleepy", " robot", " writes", " poetry", "."];
const tokenIds = prompt.map((piece) => vocabulary[piece]);
console.log(tokenIds);
// [791, 47823, 11205, 13004, 24465, 13]
2. Embeddings: IDs become meaning-shaped numbers
A token ID by itself is just a label. ID 11205 does not mean robot unless the model has a learned table that says what vector should represent that token.
That table is called the embedding matrix. Think of it as a huge spreadsheet:
- Every token ID gets one row.
- Every row contains many numbers.
- Those numbers are learned during training.
- The row becomes the token's starting representation.
If two tokens are used in similar situations, their vectors often end up close together. Words like doctor, nurse, and hospital tend to live near related medical concepts. This was not hand-labeled by a person; it emerges because those relationships help the model predict text.
Slightly technical: vector arithmetic
An embedding is a vector, meaning a list of numbers. With enough training, directions in vector space can behave like meaning shifts. That is why famous examples like king - man + woman ≈ queen can sometimes work. It is geometry, not a dictionary.
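A small sketch can show what "geometry, not a dictionary" means. The vectors in `EMBEDDINGS` below are hand-made for illustration (one axis loosely standing for royalty, the other for gender); real embeddings are learned, have hundreds or thousands of dimensions, and do not come with labeled axes.

```python
import math

# Hand-made 2-D vectors, for illustration only.
EMBEDDINGS = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude=()):
    """Return the word whose embedding points most nearly the same way."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

# king - man + woman, computed dimension by dimension.
analogy = [k - m + w for k, m, w in zip(
    EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])]

print(nearest(analogy, exclude=("king", "man", "woman")))  # queen
```

The analogy works here because the toy vectors were built to make it work. In a trained model the effect is approximate and does not hold for every word pair.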
3. Position: the model needs word order
A bag of tokens is not enough. These two sentences contain almost the same pieces but mean very different things:
The dog chased the boy.
The boy chased the dog.
The model therefore needs a position signal. Older transformers added a position vector to each token embedding. Many modern LLMs use RoPE, short for Rotary Position Embeddings, where position is encoded by rotating pairs of dimensions in each token's query and key vectors by an angle that depends on the token's position.
You do not need the math to understand the purpose: position makes the model aware that one token came before another, and roughly how far apart they are.
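For readers who do want a taste of the math, here is a simplified sketch of the rotation idea behind RoPE. It rotates a single 2-D pair with one fixed frequency `theta`, which is an assumption made to keep the example short; real implementations rotate many pairs, each with its own frequency. The names `rotate_pair` and `dot` are invented for this example.

```python
import math

def rotate_pair(pair, position, theta=1.0):
    """Rotate a 2-D slice of a vector by an angle proportional to its position."""
    x, y = pair
    angle = position * theta
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

query = (1.0, 0.5)
key = (0.3, 0.8)

# Same relative distance (3 positions apart), different absolute positions.
score_near_start = dot(rotate_pair(query, 5), rotate_pair(key, 2))
score_far_along = dot(rotate_pair(query, 105), rotate_pair(key, 102))

# A different relative distance (1 position apart).
score_closer = dot(rotate_pair(query, 5), rotate_pair(key, 4))

print(abs(score_near_start - score_far_along) < 1e-9)  # True
print(abs(score_near_start - score_closer) > 1e-6)     # True
```

The match score between a query and a key depends only on how far apart the two tokens are, not on where they sit in the prompt. That is the property that makes rotation a convenient way to carry relative position.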
Slightly technical: why long context is still hard
Even if a model can accept a huge prompt, that does not mean it uses every part equally well. Attention has to compare many tokens, and retrieval quality can drop when the answer is hidden in the middle of a long context window.
4. Attention: tokens decide what to pay attention to
Attention is the heart of the transformer. It lets each token ask: which previous tokens should shape my current meaning?
For each token, the model creates three learned views:
| Name | Question it answers | Everyday analogy |
|---|---|---|
| Query | What am I looking for? | A search request |
| Key | What do I match with? | A label on stored information |
| Value | What information should be passed along? | The content you copy after finding a match |
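The three views in the table combine into one computation, usually called scaled dot-product attention. The sketch below is a minimal pure-Python version with made-up 2-D vectors. In a real model the queries, keys, and values come from learned projections of each token's vector, and the function name `causal_attention` is just a label chosen for this example.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """For each position, mix the values of that position and earlier ones."""
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        # Score the query against every key up to and including position i.
        scores = [sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the matching value vectors.
        mixed = [sum(w * values[j][d] for j, w in enumerate(weights))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Made-up query/key/value vectors for three tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs = causal_attention(queries, keys, values)
for row in outputs:
    print([round(x, 2) for x in row])
```

The first token can only look at itself, so its output is exactly its own value vector. Later tokens get a blend weighted by how well their query matches each earlier key.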
Imagine the sentence:
The cat that I saw yesterday was sleeping.
When the model reaches was, it needs to know what was sleeping. Attention can give more weight to cat than to yesterday, because cat is more useful for understanding the verb.
import math
scores = {"cat": 3.0, "yesterday": 0.2, "saw": 0.7}
# Softmax turns raw scores into weights that add up to 1.
exp_scores = {word: math.exp(score) for word, score in scores.items()}
total = sum(exp_scores.values())
weights = {word: value / total for word, value in exp_scores.items()}
print(weights)
# cat gets most of the weight
5. Multi-head attention: many views at once
One attention pattern is not enough for language. A sentence can contain grammar, references, tone, code syntax, and long-range dependencies at the same time.
Multi-head attention runs several attention operations in parallel. One head might track subject-verb relationships. Another might follow quotation marks. Another might notice that a variable name in code was used earlier.
Slightly technical: heads are learned projections, not fixed slices
Each head learns its own projections from the full token vector into a smaller query/key/value space. So a head is not simply handed a pre-cut piece of the vector. It learns its own way to view the whole token representation.
The model then combines the outputs from all heads and sends the result onward.
Token representation
├─ attention head 1: grammar relationship
├─ attention head 2: nearby phrase structure
├─ attention head 3: repeated pattern
└─ attention head 4: reference or pronoun link
↓
Combined into one updated token representation
A practical detail: during generation, the model stores old key and value vectors in a KV cache. That way it does not need to recompute the entire conversation every time it adds one new token.
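The KV cache idea can be sketched in a few lines. This is an illustration under heavy simplification: `project` stands in for learned projection matrices by just scaling the vector, and the class name `KVCache` and its `step` method are invented for the example.

```python
import math

def attend(query, keys, values):
    """Single-query attention over whatever keys and values are supplied."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))]

def project(token_vector, scale):
    # Stand-in for a learned projection matrix: just scales the vector.
    return [scale * x for x in token_vector]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token_vector):
        """Process one new token: compute its key/value once, reuse the rest."""
        self.keys.append(project(token_vector, 0.5))
        self.values.append(project(token_vector, 2.0))
        query = project(token_vector, 1.5)
        return attend(query, self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [cache.step(token) for token in tokens]

# Recomputing every key and value from scratch gives the same last output.
all_keys = [project(token, 0.5) for token in tokens]
all_values = [project(token, 2.0) for token in tokens]
recomputed = attend(project(tokens[-1], 1.5), all_keys, all_values)

print(cached_outputs[-1] == recomputed)  # True
```

The cached path computes one key and one value per new token, while the from-scratch path redoes all of them. The saving grows with the length of the conversation, at the cost of the memory needed to hold the cache.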
6. Feed-forward networks: where a lot of learned structure lives
After attention mixes information between tokens, each token goes through a feed-forward network.
Attention is about tokens communicating. The feed-forward network is more like each token doing private thinking.
The rough pattern is:
- Expand the vector into a larger space.
- Apply a non-linear function.
- Compress it back down.
The non-linear step matters because it lets the model learn richer patterns. Without it, stacked linear layers would collapse into a single linear transformation, no matter how many of them there are.
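Here is a minimal sketch of the expand, activate, compress pattern. The weights `W_UP` and `W_DOWN` are made up, and ReLU is used as the simplest non-linear function; many current models use other activations.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

# Made-up weights: expand 2 -> 4 dimensions, then compress 4 -> 2.
W_UP = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5], [-0.5, -0.5]]
W_DOWN = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

def feed_forward(vector):
    return matvec(W_DOWN, relu(matvec(W_UP, vector)))

def linear_only(vector):
    # Same two matrices with the non-linear step removed.
    return matvec(W_DOWN, matvec(W_UP, vector))

x = [2.0, 1.0]
print(feed_forward(x))  # [1.0, 1.5]
print(linear_only(x))   # [0.0, 0.0]
```

With these particular weights, the two matrices multiply out to a single matrix of zeros, so the linear-only version erases the input. The ReLU in between is what lets the expanded dimensions carry information that a single matrix could not.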
Slightly technical: dense models vs mixture of experts
In a dense transformer, every token uses the same feed-forward network in a layer. In a mixture-of-experts model, a small router chooses only a few expert networks for each token. This can increase total model capacity without making every token run through every parameter.
7. Residual stream and normalization: keeping deep models trainable
A modern LLM can have dozens or even hundreds of layers. If each layer simply replaced the previous representation, training would be fragile.
Residual connections solve part of that problem. Instead of replacing the vector, a block adds its output back to the existing vector.
new_vector = old_vector + block_output
This creates a running stream of information through the network. Each layer can add a refinement without destroying everything that came before.
Layer normalization keeps the numbers stable. Without it, values can grow too large or shrink too much as they pass through many layers.
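A rough sketch of how the two ideas work together follows. It assumes a "pre-norm" arrangement (normalize, run the block, add the result back), which is one common layout rather than the only one, and `toy_block` is a made-up stand-in for attention or a feed-forward network. Learned scale and shift parameters are omitted.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def toy_block(vector):
    # Stand-in for attention or a feed-forward network.
    return [3.0 * x + 1.0 for x in vector]

def residual_step(vector):
    # Pre-norm: normalize, run the block, add the output to the stream.
    block_output = toy_block(layer_norm(vector))
    return [old + new for old, new in zip(vector, block_output)]

def unnormalized_step(vector):
    # Same residual add, but the block sees the raw, growing values.
    return [old + new for old, new in zip(vector, toy_block(vector))]

normalized = [1.0, 2.0, 3.0, 4.0]
raw = [1.0, 2.0, 3.0, 4.0]
for _ in range(50):
    normalized = residual_step(normalized)
    raw = unnormalized_step(raw)

print(max(abs(x) for x in normalized) < 1000)  # True
print(max(abs(x) for x in raw) > 1e20)         # True
```

After 50 toy layers the normalized stream has grown by a bounded amount per layer, while the unnormalized one has blown up by many orders of magnitude.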
8. Next-token prediction: the answer is built one piece at a time
At the end of the stack, the model turns the final vector into scores for possible next tokens. These raw scores are called logits. A softmax converts them into probabilities.
Then a decoding strategy chooses one token.
| Setting | Plain-English effect | When useful |
|---|---|---|
| Temperature | Controls randomness | Lower for precise answers, higher for creative drafts |
| Top-k | Only considers the k most likely tokens | Prevents very unlikely choices |
| Top-p | Considers the smallest set of most likely tokens whose probabilities add up to at least p | Flexible sampling without fixed k |
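The three settings in the table can be sketched directly. The logits below are made up, and the helper names `top_k_filter`, `top_p_filter`, and `sample` are chosen for this example rather than taken from any particular library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, running = [], 0.0
    for i in order:
        keep.append(i)
        running += probs[i]
        if running >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng):
    """Draw one token index according to the filtered probabilities."""
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens])[0]

logits = [4.0, 3.0, 1.0, 0.0]  # made-up scores for a 4-token vocabulary

sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
print(sharp[0] > flat[0])                          # True
print(sorted(top_k_filter(softmax(logits), 2)))    # [0, 1]
print(sorted(top_p_filter(softmax(logits), 0.9)))  # [0, 1]
print(sample(top_p_filter(softmax(logits), 0.9), random.Random(0)) in (0, 1))  # True
```

Lower temperature concentrates probability on the top token. Top-k and top-p both trim the unlikely tail before the random draw, one by a fixed count and the other by cumulative probability.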
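# Sketch only: tokenize, transformer, unembed, sample, detokenize, and
# END_OF_TEXT are placeholders, not functions from a real library.
text = "The capital of France is"
while True:
    token_ids = tokenize(text)                # text -> integer IDs
    vectors = transformer(token_ids)          # one output vector per token
    next_token_scores = unembed(vectors[-1])  # logits from the last position
    next_token = sample(next_token_scores, temperature=0.7)
    if next_token == END_OF_TEXT:             # stop when the model signals it is done
        break
    text += detokenize(next_token)
That loop is the machine behind the fluent paragraph. The model writes by repeatedly asking: given everything so far, what token should come next?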
9. Architecture vs weights: why models feel different
Many modern LLMs share the same broad transformer-family shape. What makes them feel different is usually a combination of:
- Training data: what they learned from.
- Scale: how many layers, heads, parameters, and tokens were used.
- Architecture choices: dense or mixture-of-experts, attention variants, context length, tokenizer.
- Post-training: instruction tuning, preference training, safety behavior, tool use, and product-level rules.
So when people compare GPT, Claude, Gemini, Llama, Mistral, Qwen, or Gemma, they are often comparing siblings in a broad transformer family rather than completely unrelated species of model.
Slightly technical: modern transformer vocabulary
RoPE: position through vector rotation.
RMSNorm: a cheaper normalization variant used in many modern open models.
SwiGLU: a popular activation/feed-forward design.
GQA: grouped-query attention, which reduces KV-cache memory.
MoE: mixture of experts, where only selected expert networks run for each token.
10. GPT-2 and MoE: two useful milestones
Two research threads make the mechanics above feel more concrete. GPT-2 showed how far plain next-token prediction could go when scaled. Mixture of Experts shows how a model can grow more capable without forcing every token to use every parameter.
GPT-2: scaling the next-token game
OpenAI's 2019 paper Language Models are Unsupervised Multitask Learners made a simple bet famous: train a transformer to continue internet text, then test whether that same model can handle many tasks by phrasing them as text continuation.
- It was autoregressive: it generated left to right, one token at a time.
- It was dense: every token passed through the same model weights.
- It helped popularize the idea that scale plus simple training can produce surprisingly general behavior.
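The "simple training" in the last point is next-token cross-entropy: the loss is the negative log of the probability the model gave to the token that actually came next. Here is a minimal numeric sketch with made-up logits over a three-token vocabulary. The function name `next_token_loss` is chosen for this example.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one prediction: -log P(true next token)."""
    return -math.log(softmax(logits)[target_index])

# Index 2 is the token that actually came next in the training text.
before_training = [2.0, 1.0, 0.0]  # model prefers the wrong token
after_training = [0.0, 1.0, 3.0]   # model prefers the right token

loss_before = next_token_loss(before_training, 2)
loss_after = next_token_loss(after_training, 2)

print(round(loss_before, 2), round(loss_after, 2))
print(loss_before > loss_after)  # True
```

Training repeats this over a very large number of positions, nudging the weights so the loss goes down on average.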
# Simplified GPT-2-style objective
prompt = "Translate to French: hello"
target_next_token = " bon"
# Training nudges the model so this next token becomes more likely.
loss = cross_entropy(model(prompt), target_next_token)
MoE: not every token needs the whole building
A dense transformer usually runs every token through the same feed-forward network. In a Mixture-of-Experts model, a small router chooses only a few expert networks for each token. The model can have many more total parameters, while each token activates only a subset.
| Concept | Dense LLM | MoE LLM |
|---|---|---|
| Work per token | Uses the same main blocks | Uses selected experts |
| Analogy | One big generalist team | Router plus specialist teams |
| Tradeoff | Simpler to train and serve | More capacity, more routing complexity |
Slightly technical: where the MoE papers fit
Switch Transformers simplified MoE routing by sending each token to one expert. Efficient Large Scale Language Modeling with Mixtures of Experts studied autoregressive MoE language models at scale. Mixtral of Experts is a modern sparse MoE example where each token is routed to two feed-forward experts.
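Top-2 routing of the kind described for Mixtral can be sketched as follows. Everything numeric here is made up: the four `EXPERTS` are simple functions standing in for feed-forward networks, and `ROUTER` is a hand-written weight matrix rather than a learned one.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Four toy "experts", each standing in for a feed-forward network.
EXPERTS = [
    lambda v: [x * 2.0 for x in v],
    lambda v: [x + 1.0 for x in v],
    lambda v: [-x for x in v],
    lambda v: [x * x for x in v],
]

# Made-up router weights: one row of scores per expert.
ROUTER = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]

def moe_layer(token_vector, top_k=2):
    """Route one token to its top_k experts and blend their outputs."""
    router_logits = [sum(w * x for w, x in zip(row, token_vector))
                     for row in ROUTER]
    chosen = sorted(range(len(EXPERTS)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Gate weights: softmax over the selected experts' scores only.
    gates = softmax([router_logits[i] for i in chosen])
    output = [0.0] * len(token_vector)
    for gate, index in zip(gates, chosen):
        expert_output = EXPERTS[index](token_vector)
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return chosen, output

chosen, output = moe_layer([2.0, 0.5])
print(chosen)  # [0, 3]
print([round(x, 3) for x in output])
```

Only two of the four experts run for this token, which is the source of the capacity-versus-compute tradeoff. Setting `top_k=1` gives the single-expert routing that Switch Transformers use.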
11. The AI ecosystem: MCP, tools, RAG, agents, and evals
The transformer is the engine, but real AI products usually add a stack around it. That stack gives the model fresh information, lets it take actions, checks its work, and keeps the system observable.
| Term | Simple meaning | What it helps with | Watch out for |
|---|---|---|---|
| Prompt | Instructions and context | Steering behavior without changing weights | Vague prompts create vague answers |
| Tool calling | The model asks your app to run a function | Weather, search, payments, calendars, databases | Validate every argument before doing anything real |
| MCP | A shared protocol for connecting AI apps to tools/data | Reusable integrations across different hosts | Permissions, auth, and tool descriptions matter |
| RAG | Retrieve relevant documents before answering | Fresh facts and private knowledge | Bad retrieval creates confident wrong answers |
| Embeddings | Meaning as searchable vectors | Semantic search and clustering | Similar does not always mean correct |
| Agent | A model inside a task loop | Planning, tool use, retries, handoffs | Needs limits, logs, and stop conditions |
| Fine-tuning | Training on examples of desired behavior | Style, format, classification, repeated edge cases | Do evals first; do not use it as a fact database |
| Evals | Tests for model behavior | Comparing prompts, tools, models, and releases | Tiny demo tests miss real-world messiness |
MCP: the USB-C idea for AI tools
MCP stands for Model Context Protocol. Instead of every AI app inventing a custom connector for every service, MCP defines a common client-server pattern. An AI app is the host. It creates an MCP client. That client connects to an MCP server, which exposes things like tools, resources, and prompts.
AI app / host
└─ MCP client
   └─ MCP server
      ├─ tools: actions the model may request
      ├─ resources: files, docs, database records, logs
      └─ prompts: reusable instruction templates
The key idea is not that MCP makes the model smarter by itself. It makes integrations more standard. A coding agent can connect to GitHub, a support assistant can connect to tickets, and a research assistant can connect to document stores using the same basic pattern.
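To give a feel for the client-server pattern, here is a toy in-process imitation. It is not the MCP SDK: the classes `ToyMCPServer` and `ToyMCPClient` are invented for this sketch, and the initialization handshake, transports, and most of the protocol are left out. The message envelope follows JSON-RPC 2.0, and the method names `tools/list` and `tools/call` mirror the ones MCP uses for tool discovery and invocation.

```python
import json

class ToyMCPServer:
    """In-process stand-in for an MCP server exposing one tool."""

    def __init__(self):
        self._tools = {
            "add": {
                "description": "Add two numbers.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"a": {"type": "number"},
                                   "b": {"type": "number"}},
                    "required": ["a", "b"],
                },
                "handler": lambda args: args["a"] + args["b"],
            }
        }

    def handle(self, raw_request):
        request = json.loads(raw_request)
        method, params = request["method"], request.get("params", {})
        if method == "tools/list":
            result = {"tools": [
                {"name": name, "description": tool["description"],
                 "inputSchema": tool["inputSchema"]}
                for name, tool in self._tools.items()]}
        elif method == "tools/call":
            tool = self._tools[params["name"]]
            value = tool["handler"](params["arguments"])
            result = {"content": [{"type": "text", "text": str(value)}]}
        else:
            return json.dumps({"jsonrpc": "2.0", "id": request["id"],
                               "error": {"code": -32601,
                                         "message": "Method not found"}})
        return json.dumps({"jsonrpc": "2.0", "id": request["id"],
                           "result": result})

class ToyMCPClient:
    """Sends JSON-RPC requests to the toy server and unwraps the result."""

    def __init__(self, server):
        self._server, self._next_id = server, 1

    def request(self, method, params=None):
        message = {"jsonrpc": "2.0", "id": self._next_id,
                   "method": method, "params": params or {}}
        self._next_id += 1
        response = json.loads(self._server.handle(json.dumps(message)))
        return response["result"]

client = ToyMCPClient(ToyMCPServer())
tool_names = [tool["name"] for tool in client.request("tools/list")["tools"]]
call_result = client.request("tools/call",
                             {"name": "add", "arguments": {"a": 2, "b": 3}})
print(tool_names)   # ['add']
print(call_result)  # {'content': [{'type': 'text', 'text': '5'}]}
```

The host never needs to know how the tool is implemented. It discovers a name, a description, and an input schema, then sends a call in a standard shape.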
RAG: giving the model an open book
RAG means Retrieval-Augmented Generation. The model does not rely only on what it learned during training. Your app first searches a knowledge base, pulls the most relevant chunks into the prompt, and asks the model to answer using that context.
- Split documents into chunks.
- Turn each chunk into an embedding vector.
- Store those vectors in a search index or vector database.
- When the user asks something, search for similar chunks.
- Put the best chunks into the model context and ask for a grounded answer.
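The steps above can be sketched end to end with a toy embedding. Here `embed` is just a bag-of-words counter, which is an assumption made so the example runs without any model; real systems use a learned embedding model and a proper vector index. The policy chunks are invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. Real systems use a learned model."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Chunk, embed, and store. The chunks are invented for this example.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Shipping takes five business days.",
    "Support is open on weekdays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def vector_search(question, top_k=1):
    """Rank stored chunks by similarity to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(question):
    """Put the best chunks into the context the model will see."""
    context = "\n".join(vector_search(question))
    return f"Answer only from this context:\n{context}\n\nQuestion: {question}"

hits = vector_search("Are refunds available after purchase?")
print(hits)  # ['Refunds are available within 30 days of purchase.']
print(build_prompt("Are refunds available after purchase?"))
```

Word counts only match exact words, so a question about a "refund" would miss a chunk that says "refunds". Learned embeddings exist to close that kind of gap.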
// vectorSearch, embed, and llm are placeholders for your own search and model clients.
const question = "What is our refund policy?";
const hits = await vectorSearch(embed(question), { topK: 5 });
const answer = await llm.generate({
  instructions: "Answer only from the provided policy snippets.",
  context: hits.map((hit) => hit.text),
  input: question,
});
Agents: the loop around the model
An agent is not a new kind of brain. It is usually an LLM plus an orchestration loop: read the goal, choose a next step, maybe call a tool, inspect the result, update the plan, and continue until done or stopped.
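A bare-bones version of that loop might look like this. `fake_model` is a scripted stand-in for an LLM call, and `TOOLS`, `run_agent`, and the decision format are all invented for the sketch. The parts worth noticing are the step limit and the argument check before any tool runs.

```python
def fake_model(goal, history):
    """Stand-in for an LLM call: picks the next step from what happened so far."""
    if not history:
        return {"action": "call_tool", "tool": "add",
                "arguments": {"a": 2, "b": 3}}
    return {"action": "finish",
            "answer": f"{goal} -> {history[-1]['result']}"}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(goal, max_steps=5):
    history = []
    for _ in range(max_steps):  # hard stop condition
        decision = fake_model(goal, history)
        if decision["action"] == "finish":
            return decision["answer"]
        tool = TOOLS.get(decision["tool"])
        if tool is None:
            history.append({"tool": decision["tool"],
                            "result": "error: unknown tool"})
            continue
        # Validate arguments before running anything real.
        args = decision["arguments"]
        if not all(isinstance(v, (int, float)) for v in args.values()):
            history.append({"tool": decision["tool"],
                            "result": "error: bad arguments"})
            continue
        history.append({"tool": decision["tool"], "result": tool(**args)})
    return "stopped: step limit reached"

answer = run_agent("What is 2 + 3?")
print(answer)  # What is 2 + 3? -> 5
```

Errors are written back into the history rather than raised, so the model gets a chance to adjust its plan on the next turn. The step limit guarantees the loop ends even if it never does.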
goal → think about next step → call tool → observe result → adjust plan → final answer
Slightly technical: how these pieces fit in one product
A production assistant might use MCP to discover tools, RAG to fetch private documents, tool calling to take controlled actions, structured outputs to return clean JSON, evals to measure quality, tracing to debug failures, and guardrails to block unsafe or unauthorized actions.
A friendly checklist for understanding any LLM answer
Further reading
- Kato, How LLMs Actually Work
- Vaswani et al., Attention Is All You Need
- Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts
- Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2)
- Fedus, Zoph, and Shazeer, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Artetxe et al., Efficient Large Scale Language Modeling with Mixtures of Experts
- Jiang et al., Mixtral of Experts
- Model Context Protocol, Architecture overview
- OpenAI, Function calling / tool calling guide
- OpenAI, Introducing text and code embeddings
- OpenAI Agents SDK, Agents guide
- OpenAI, Supervised fine-tuning guide