oreoro

How LLMs Actually Work: A Friendly Map for Humans

Sat, 06 Jun 2026 00:00:00 GMT

Note: This RSS feed strips out SVGs and embeds. You might want to read the post on the webpage here.

June 6, 2026

🧭

LLMs are not magic brains. They are prediction machines built from a few repeatable parts: tokens, vectors, attention, memory-like feed-forward layers, and a loop that keeps choosing the next likely piece of text.

✍️

Source note: this is an original, beginner-friendly rewrite inspired by Kato's article How LLMs Actually Work, with extra examples, code, tables, and Notion-native structure.

The whole idea in one minute

An LLM, or large language model, takes your text, turns it into numbers, runs those numbers through many transformer layers, and predicts what text should come next.

That is the simple version. The useful version is this:

Your prompt is split into tokens, which are small text pieces.
Each token becomes a vector, which is a list of numbers that carries learned meaning.
The model adds information about order, because dog bites man and man bites dog do not mean the same thing.
Attention lets each token decide which earlier tokens matter.
A feed-forward network does deeper processing for each token.
Residual connections and normalization keep the many layers stable.
The model outputs scores for the next possible token.
One token is chosen, added to the text, and the loop repeats.

flowchart LR
    A["You type a prompt"] --> B["Tokenizer<br>text pieces"]
    B --> C["Embeddings<br>meaning as numbers"]
    C --> D["Position signal<br>word order"]
    D --> E["Attention<br>what should matter?"]
    E --> F["Feed-forward layer<br>deeper processing"]
    F --> G["Next-token scores"]
    G --> H["Pick one token"]
    H --> I["Add it to the text"]
    I --> B

💡

A good mental model: an LLM is like an autocomplete system that has read a massive library and learned incredibly subtle patterns about what usually follows what.

Part	Plain-English job	Why it matters
Tokens	Break text into pieces	The model cannot read raw words or letters directly.
Embeddings	Turn pieces into meaning-shaped numbers	Similar ideas can sit near each other in number-space.
Position	Tell the model where each piece appears	Order changes meaning.
Attention	Let tokens look at useful previous tokens	This is how context flows through the sentence.
Feed-forward network	Process each token more deeply	A lot of learned structure lives here.
Next-token prediction	Score likely continuations	This is the generation loop behind every answer.

1. Tokens: the model's alphabet is not your alphabet

Models do not see your sentence the way you do. You see words. The model sees token IDs.

A tokenizer might split a sentence like this:

Text:   "The sleepy robot writes poetry."
Tokens: ["The", " sleepy", " robot", " writes", " poetry", "."]
IDs:    [791, 47823, 11205, 13004, 24465, 13]

Those ID numbers are what enter the model. The specific numbers differ across model families, but the pattern is the same: text becomes a sequence of integers.

Why not just use whole words? Because language is messy. New names, typos, code, slang, and other languages would explode the vocabulary. Tokens sit between letters and words: flexible enough for rare text, efficient enough for common text.

Slightly technical: why the strawberry counting problem happens

When you ask a model how many letters are in a word, the model may not be looking at separate letters. It may see a word as one or a few tokens. That means character-level questions can be awkward unless the model deliberately reasons about spelling.

const vocabulary = {
  "The": 791,
  " sleepy": 47823,
  " robot": 11205,
  " writes": 13004,
  " poetry": 24465,
  ".": 13,
};
const prompt = ["The", " sleepy", " robot", " writes", " poetry", "."];
const tokenIds = prompt.map((piece) => vocabulary[piece]);
console.log(tokenIds);
// [791, 47823, 11205, 13004, 24465, 13]
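To make the letter-counting problem concrete, here is a small sketch of what a person sees next to what the model receives. The three-piece split of "strawberry" and the ID numbers are invented for this example; a real tokenizer may split the word differently.

```python
# Hypothetical token split and IDs, invented for illustration only.
word = "strawberry"
pieces = ["str", "aw", "berry"]
toy_ids = {"str": 301, "aw": 302, "berry": 303}

# Ordinary string code can count characters directly.
letter_count = word.count("r")

# The model receives only the IDs; the individual letters are not visible in them.
model_input = [toy_ids[piece] for piece in pieces]

print(letter_count)  # 3
print(model_input)   # [301, 302, 303]
```

Nothing in `[301, 302, 303]` says how many times "r" appears, which is why spelling questions need the model to reason about letters explicitly.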

2. Embeddings: IDs become meaning-shaped numbers

A token ID by itself is just a label. ID 11205 does not mean robot unless the model has a learned table that says what vector should represent that token.

That table is called the embedding matrix. Think of it as a huge spreadsheet:

Every token ID gets one row.
Every row contains many numbers.
Those numbers are learned during training.
The row becomes the token's starting representation.
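The spreadsheet picture above can be sketched in a few lines. This is a toy version under stated assumptions: the table sizes are tiny, the numbers are random rather than trained, and the name `embed` is made up for the example.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

VOCAB_SIZE = 6  # toy value; real vocabularies are far larger
EMBED_DIM = 4   # toy value; real models use hundreds or thousands of dimensions

# One row per token ID. Training would adjust these numbers; here they stay random.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the starting vector for each token ID (a plain row lookup)."""
    return [embedding_matrix[token_id] for token_id in token_ids]

vectors = embed([0, 3, 5])
print(len(vectors), len(vectors[0]))  # 3 4
```

The lookup itself involves no computation. All the meaning sits in the numbers that training writes into the rows.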

If two tokens are used in similar situations, their vectors often end up close together. Words like doctor, nurse, and hospital tend to live near related medical concepts. This was not hand-labeled by a person; it emerges because those relationships help the model predict text.

🧠

Embeddings are not definitions. They are coordinates learned from usage. The model learns that concepts are related because they appear in related contexts.

Slightly technical: vector arithmetic

An embedding is a vector, meaning a list of numbers. With enough training, directions in vector space can behave like meaning shifts. That is why famous examples like king - man + woman ≈ queen can sometimes work. It is geometry, not a dictionary.
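A hand-built example shows the geometry. The 2-D vectors below are designed by hand (one axis for royalty, one for gender) purely for illustration; real embeddings are learned, have many more dimensions, and rarely line up this neatly.

```python
import math

# Hand-made 2-D vectors: [royalty, gender]. Real embeddings are learned, not designed.
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed one coordinate at a time
target = [k - m + w for k, m, w in zip(
    toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]

# The word whose vector points in the most similar direction
nearest = max(toy_vectors, key=lambda word: cosine(target, toy_vectors[word]))
print(target, nearest)  # [1.0, -1.0] queen
```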

3. Position: the model needs word order

A bag of tokens is not enough. These two sentences contain almost the same pieces but mean very different things:

The dog chased the boy.

The boy chased the dog.

The model therefore needs a position signal. Older transformers added a position vector to each token embedding. Many modern LLMs use RoPE, short for Rotary Position Embeddings, where position is represented by rotating parts of the vector.

You do not need the math to understand the purpose: position makes the model aware that one token came before another, and roughly how far apart they are.
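For readers who do want a peek at the mechanism, here is a minimal sketch of the rotation idea on a single 2-D slice of a vector. The frequency `theta` and the example vectors are arbitrary toy values, and real RoPE rotates many pairs at different frequencies.

```python
import math

def rotate_pair(pair, position, theta=0.5):
    """Rotate a 2-D slice of a vector by an angle proportional to its position."""
    angle = position * theta  # theta is an arbitrary toy frequency
    x, y = pair
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

query, key = (1.0, 0.5), (0.3, 0.8)  # made-up vectors

# The same gap of 2 positions, at two different places in the sequence
near_start = dot(rotate_pair(query, 3), rotate_pair(key, 1))
further_on = dot(rotate_pair(query, 10), rotate_pair(key, 8))
print(math.isclose(near_start, further_on))  # True
```

Because both vectors are rotated, their match score depends only on how far apart the two positions are, which is the "roughly how far apart" signal described above.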

📌

Practical takeaway: important context usually works best near the start or end of a long prompt. Many models are weaker at using information buried in the middle.

Slightly technical: why long context is still hard

Even if a model can accept a huge prompt, that does not mean it uses every part equally well. Attention has to compare many tokens, and retrieval quality can drop when the answer is hidden in the middle of a long context window.

4. Attention: tokens decide what to pay attention to

Attention is the heart of the transformer. It lets each token ask: which previous tokens should shape my current meaning?

For each token, the model creates three learned views:

Name	Question it answers	Everyday analogy
Query	What am I looking for?	A search request
Key	What do I match with?	A label on stored information
Value	What information should be passed along?	The content you copy after finding a match

Imagine the sentence:

The cat that I saw yesterday was sleeping.

When the model reaches was, it needs to know what was sleeping. Attention can give more weight to cat than to yesterday, because cat is more useful for understanding the verb.

import math
scores = {"cat": 3.0, "yesterday": 0.2, "saw": 0.7}
# Softmax turns raw scores into weights that add up to 1.
exp_scores = {word: math.exp(score) for word, score in scores.items()}
total = sum(exp_scores.values())
weights = {word: value / total for word, value in exp_scores.items()}
print(weights)
# cat gets most of the weight

🔒

GPT-style models use causal masking: while predicting the next token, they can look backward but not forward. Future text is hidden because it has not been generated yet.
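Causal masking can be sketched as "hide everything to the right, then softmax what is left." The scores below are made up, and the function name is hypothetical; production code does the same thing with tensors and a triangular mask.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask, then softmax each row.

    scores[i][j] is how strongly token i matches token j.
    """
    weights = []
    for i, row in enumerate(scores):
        # Token i may only use positions 0..i; later positions are hidden.
        visible = row[: i + 1]
        exps = [math.exp(s - max(visible)) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + hidden)
    return weights

toy_scores = [  # made-up raw scores for a 3-token sequence
    [1.0, 9.0, 9.0],
    [0.5, 0.5, 9.0],
    [0.0, 0.0, 0.0],
]
for row in causal_attention_weights(toy_scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

Notice that the large scores of 9.0 in the upper right never matter: those positions lie in the future, so they get zero weight.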

5. Multi-head attention: many views at once

One attention pattern is not enough for language. A sentence can contain grammar, references, tone, code syntax, and long-range dependencies at the same time.

Multi-head attention runs several attention operations in parallel. One head might track subject-verb relationships. Another might follow quotation marks. Another might notice that a variable name in code was used earlier.

Slightly technical: heads are learned projections, not fixed slices

Each head learns its own projections from the full token vector into a smaller query/key/value space. So a head is not simply handed a pre-cut piece of the vector. It learns its own way to view the whole token representation.

The model then combines the outputs from all heads and sends the result onward.

Token representation
   ├─ attention head 1: grammar relationship
   ├─ attention head 2: nearby phrase structure
   ├─ attention head 3: repeated pattern
   └─ attention head 4: reference or pronoun link
        ↓
Combined into one updated token representation

A practical detail: during generation, the model stores old key and value vectors in a KV cache. That way it does not need to recompute the entire conversation every time it adds one new token.
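The KV cache idea can be sketched with a tiny single-head example. Everything here is a simplification: the class name `KVCache`, the 2-D vectors, and the hand-written query/key/value triples are invented for illustration, and a real model would compute them from the token representations.

```python
import math

def attend(query, keys, values):
    """Single-head attention for one query over all cached keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s - max(scores)) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    """Keeps keys and values from earlier tokens so they are computed only once."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the newest token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

cache = KVCache()
# Made-up (query, key, value) triples standing in for three generated tokens
steps = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
         ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
         ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]
for q, k, v in steps:
    latest = cache.step(q, k, v)

print(len(cache.keys))  # 3
```

Each step does work for one new token only, yet the newest token still attends over everything that came before it.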

6. Feed-forward networks: where a lot of learned structure lives

After attention mixes information between tokens, each token goes through a feed-forward network.

Attention is about tokens communicating. The feed-forward network is more like each token doing private thinking.

The rough pattern is:

Expand the vector into a larger space.
Apply a non-linear function.
Compress it back down.

The non-linear step matters because it lets the model learn richer patterns. Without it, many stacked layers would collapse into something much simpler.
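The "collapse" claim is easy to check by hand. The sketch below uses made-up weights and ReLU as the non-linear function (one common choice; many modern models use other activations).

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(vector):
    return [max(0.0, x) for x in vector]

# Made-up weights: expand 2 -> 3, then compress 3 -> 2
W_up = [[1.0, -1.0], [0.5, 2.0], [-1.0, 1.0]]
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
x = [2.0, 1.0]

# Without a non-linearity, the two layers equal one merged matrix.
stacked_linear = matvec(W_down, matvec(W_up, x))
merged = matvec(matmul(W_down, W_up), x)

# With ReLU in between, the result differs from what the merged matrix gives.
with_relu = matvec(W_down, relu(matvec(W_up, x)))

print(stacked_linear, merged, with_relu)
# [0.0, 4.0] [0.0, 4.0] [1.0, 3.0]
```

The first two results match, so stacking purely linear layers buys nothing. The ReLU version gives a different answer, which is the extra expressive power the non-linear step provides.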

🧱

A lot of model parameters live in feed-forward layers. This is one reason they are often discussed as the model's learned store of patterns, facts, and associations.

Slightly technical: dense models vs mixture of experts

In a dense transformer, every token uses the same feed-forward network in a layer. In a mixture-of-experts model, a small router chooses only a few expert networks for each token. This can increase total model capacity without making every token run through every parameter.

7. Residual stream and normalization: keeping deep models trainable

A modern LLM can have dozens or even hundreds of layers. If each layer simply replaced the previous representation, training would be fragile.

Residual connections solve part of that problem. Instead of replacing the vector, a block adds its output back to the existing vector.

new_vector = old_vector + block_output

This creates a running stream of information through the network. Each layer can add a refinement without destroying everything that came before.

Layer normalization keeps the numbers stable. Without it, values can grow too large or shrink too much as they pass through many layers.
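Here is a rough sketch of both ideas together. It assumes a "pre-norm" ordering (normalize, run the block, add back), which is one common arrangement, and it leaves out the learned scale and shift that real layer normalization includes. The `toy_block` function is a stand-in, not a real layer.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to mean 0 and variance close to 1 (learned scale/shift omitted)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def toy_block(vector):
    """Stand-in for attention or a feed-forward layer: it just doubles each value."""
    return [2.0 * x for x in vector]

def residual_step(old_vector):
    # Pre-norm ordering (an assumption here): normalize, run the block, add back.
    block_output = toy_block(layer_norm(old_vector))
    return [old + out for old, out in zip(old_vector, block_output)]

stream = [100.0, 200.0, 300.0, 400.0]  # deliberately large made-up values
normalized = layer_norm(stream)
print([round(x, 3) for x in normalized])  # [-1.342, -0.447, 0.447, 1.342]
print(residual_step(stream))
```

The block sees tidy, small numbers no matter how large the stream has grown, and its output is added to the stream rather than replacing it.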

🛠️

The boring-sounding parts matter. Residual connections and normalization are major reasons very deep transformer stacks can actually train.

8. Next-token prediction: the answer is built one piece at a time

At the end of the stack, the model turns the final vector into scores for possible next tokens. These raw scores are called logits. A softmax converts them into probabilities.

Then a decoding strategy chooses one token.

Setting	Plain-English effect	When useful
Temperature	Controls randomness	Lower for precise answers, higher for creative drafts
Top-k	Only considers the k most likely tokens	Prevents very unlikely choices
Top-p	Considers the smallest likely group whose probabilities add up to p	Flexible sampling without fixed k

text = "The capital of France is"
while not done:
    token_ids = tokenize(text)
    vectors = transformer(token_ids)
    next_token_scores = unembed(vectors[-1])
    next_token = sample(next_token_scores, temperature=0.7)
    text += detokenize(next_token)

That loop is the machine behind the fluent paragraph. The model writes by repeatedly asking: given everything so far, what token should come next?
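The three decoding settings from the table can be sketched in one small function. This is an illustrative version, not how any particular library implements sampling; the function name, the toy logits, and the order in which top-k and top-p are applied are all assumptions.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a {token: logit} dict using common decoding settings."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    biggest = max(scaled.values())
    exps = {tok: math.exp(s - biggest) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    if top_k is not None:  # keep only the k most likely tokens
        ranked = ranked[:top_k]
    if top_p is not None:  # keep the smallest prefix whose probability mass reaches p
        kept, mass = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            mass += prob
            if mass >= top_p:
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    weights = [prob for _, prob in ranked]  # random.choices handles renormalization
    return rng.choices(tokens, weights=weights, k=1)[0]

toy_logits = {" Paris": 5.0, " Lyon": 2.0, " pizza": 0.5}  # made-up scores
print(repr(sample_next_token(toy_logits, temperature=0.7, top_k=1)))  # ' Paris'
```

With `top_k=1` the choice is deterministic. Loosening the settings lets less likely tokens through, which is where variety (and occasional nonsense) comes from.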

⚠️

This also explains hallucinations. The base training objective rewards plausible continuation, not guaranteed truth. Post-training, retrieval, tool use, and evaluation are added to make outputs more useful and reliable.

9. Architecture vs weights: why models feel different

Many modern LLMs share the same broad transformer-family shape. What makes them feel different is usually a combination of:

Training data: what they learned from.
Scale: how many layers, heads, parameters, and tokens were used.
Architecture choices: dense or mixture-of-experts, attention variants, context length, tokenizer.
Post-training: instruction tuning, preference training, safety behavior, tool use, and product-level rules.

So when people compare GPT, Claude, Gemini, Llama, Mistral, Qwen, or Gemma, they are often comparing siblings in a broad transformer family rather than completely unrelated species of model.

Slightly technical: modern transformer vocabulary

RoPE: position through vector rotation.

RMSNorm: a cheaper normalization variant used in many modern open models.

SwiGLU: a popular activation/feed-forward design.

GQA: grouped-query attention, which reduces KV-cache memory.

MoE: mixture of experts, where only selected expert networks run for each token.

10. GPT-2 and MoE: two useful milestones

Two research threads make the mechanics above feel more concrete. GPT-2 showed how far plain next-token prediction could go when scaled. Mixture of Experts shows how a model can grow more capable without forcing every token to use every parameter.

🧩

Plain-English mental model: GPT-2 is like one very large generalist team. MoE is like a building with specialist rooms, where a router sends each token to only the rooms that seem useful.

GPT-2: scaling the next-token game

OpenAI's 2019 paper Language Models are Unsupervised Multitask Learners made a simple bet famous: train a transformer to continue internet text, then test whether that same model can handle many tasks by phrasing them as text continuation.

It was autoregressive: it generated left to right, one token at a time.
It was dense: every token passed through the same model weights.
It helped popularize the idea that scale plus simple training can produce surprisingly general behavior.

# Simplified GPT-2-style objective
prompt = "Translate to French: hello"
target_next_token = " bon"
# Training nudges the model so this next token becomes more likely.
loss = cross_entropy(model(prompt), target_next_token)
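To see what "nudges the model" means numerically, here is a sketch of the loss for a single prediction. The two probability tables are invented to represent the model before and after some training; only the formula (minus the log of the probability assigned to the correct token) is standard.

```python
import math

def cross_entropy(predicted_probs, target_token):
    """Loss for one prediction: minus the log of the probability of the right token."""
    return -math.log(predicted_probs[target_token])

# Made-up model outputs before and after some training
before = {" bon": 0.10, " hello": 0.60, " the": 0.30}
after = {" bon": 0.80, " hello": 0.15, " the": 0.05}

print(round(cross_entropy(before, " bon"), 3))  # 2.303
print(round(cross_entropy(after, " bon"), 3))   # 0.223
```

The loss shrinks as the correct token becomes more likely, so lowering the loss and improving the prediction are the same thing.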

MoE: not every token needs the whole building

A dense transformer usually runs every token through the same feed-forward network. In a Mixture-of-Experts model, a small router chooses only a few expert networks for each token. The model can have many more total parameters, while each token activates only a subset.

Concept	Dense LLM	MoE LLM
Work per token	Uses the same main blocks	Uses selected experts
Analogy	One big generalist team	Router plus specialist teams
Tradeoff	Simpler to train and serve	More capacity, more routing complexity

Slightly technical: where the MoE papers fit

Switch Transformers simplified MoE routing by sending each token to one expert. Efficient Large Scale Language Modeling with Mixtures of Experts studied autoregressive MoE language models at scale. Mixtral of Experts is a modern sparse MoE example where each token is routed to two feed-forward experts.
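Top-2 routing of the kind described above can be sketched in a few lines. This is a toy illustration, not the code of any of these papers: the "experts" just multiply the vector by a constant, the router scores are made up, and details such as load balancing are left out.

```python
import math

def top_k_route(router_scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into mixing weights."""
    chosen = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:k]
    exps = [math.exp(router_scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Toy "experts": each just scales the token vector by a different factor.
experts = [lambda v, f=f: [f * x for x in v] for f in (1.0, 2.0, 3.0, 4.0)]

def moe_layer(token_vector, router_scores, k=2):
    output = [0.0] * len(token_vector)
    for expert_index, weight in top_k_route(router_scores, k):
        expert_out = experts[expert_index](token_vector)  # only chosen experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

routes = top_k_route([0.1, 2.0, -1.0, 2.0])          # made-up router scores
print([i for i, _ in routes])                        # [1, 3]
print(moe_layer([1.0, 1.0], [0.1, 2.0, -1.0, 2.0]))  # [3.0, 3.0]
```

Four experts exist, but only two run for this token. Setting `k=1` gives the one-expert-per-token routing that Switch Transformers used.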

⚖️

Important nuance: MoE does not automatically mean smarter. Data quality, routing balance, training stability, inference hardware, and post-training still matter.

11. The AI ecosystem: MCP, tools, RAG, agents, and evals

The transformer is the engine, but real AI products usually add a stack around it. That stack gives the model fresh information, lets it take actions, checks its work, and keeps the system observable.

🗺️

Plain-English map: the LLM is the text brain, tools are the hands, RAG is the open-book notes, MCP is a standard plug for external systems, agents are the loop that decides what to do next, and evals are the tests that tell you if any of it works.

Term	Simple meaning	What it helps with	Watch out for
Prompt	Instructions and context	Steering behavior without changing weights	Vague prompts create vague answers
Tool calling	The model asks your app to run a function	Weather, search, payments, calendars, databases	Validate every argument before doing anything real
MCP	A shared protocol for connecting AI apps to tools/data	Reusable integrations across different hosts	Permissions, auth, and tool descriptions matter
RAG	Retrieve relevant documents before answering	Fresh facts and private knowledge	Bad retrieval creates confident wrong answers
Embeddings	Meaning as searchable vectors	Semantic search and clustering	Similar does not always mean correct
Agent	A model inside a task loop	Planning, tool use, retries, handoffs	Needs limits, logs, and stop conditions
Fine-tuning	Training on examples of desired behavior	Style, format, classification, repeated edge cases	Do evals first; do not use it as a fact database
Evals	Tests for model behavior	Comparing prompts, tools, models, and releases	Tiny demo tests miss real-world messiness

MCP: the USB-C idea for AI tools

MCP stands for Model Context Protocol. Instead of every AI app inventing a custom connector for every service, MCP defines a common client-server pattern. An AI app is the host. It creates an MCP client. That client connects to an MCP server, which exposes things like tools, resources, and prompts.

AI app / host
  └─ MCP client
       └─ MCP server
            ├─ tools: actions the model may request
            ├─ resources: files, docs, database records, logs
            └─ prompts: reusable instruction templates

The key idea is not that MCP makes the model smarter by itself. It makes integrations more standard. A coding agent can connect to GitHub, a support assistant can connect to tickets, and a research assistant can connect to document stores using the same basic pattern.

🔐

Security rule: treat tools like real permissions, not decorations. If a tool can send email, delete files, spend money, or publish content, the app should require clear approval, scoped access, logging, and argument validation.
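As a rough sketch of that security rule, the check below sits between the model's tool request and the code that actually runs it. The tool names, the registry layout, and the `approved` flag are all hypothetical; a real app would add authentication, scoping, and logging on top.

```python
ALLOWED_TOOLS = {
    # Hypothetical tool registry: name -> required argument names and types
    "send_email": {"to": str, "subject": str, "body": str},
    "get_weather": {"city": str},
}
NEEDS_APPROVAL = {"send_email"}  # tools with real-world side effects

def validate_tool_call(name, arguments, approved=False):
    """Return (ok, reason). Reject anything unknown, malformed, or unapproved."""
    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        return False, f"unknown tool: {name}"
    if set(arguments) != set(schema):
        return False, "unexpected or missing arguments"
    for key, expected_type in schema.items():
        if not isinstance(arguments[key], expected_type):
            return False, f"argument {key!r} must be {expected_type.__name__}"
    if name in NEEDS_APPROVAL and not approved:
        return False, "user approval required"
    return True, "ok"

print(validate_tool_call("get_weather", {"city": "Karachi"}))
# (True, 'ok')
print(validate_tool_call("send_email", {"to": "a@example.com", "subject": "Hi", "body": "..."}))
# (False, 'user approval required')
```

The model only proposes a call. The app decides whether it runs.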

RAG: giving the model an open book

RAG means Retrieval-Augmented Generation. The model does not rely only on what it learned during training. Your app first searches a knowledge base, pulls the most relevant chunks into the prompt, and asks the model to answer using that context.

Split documents into chunks.
Turn each chunk into an embedding vector.
Store those vectors in a search index or vector database.
When the user asks something, search for similar chunks.
Put the best chunks into the model context and ask for a grounded answer.

const question = "What is our refund policy?";
const hits = await vectorSearch(embed(question), { topK: 5 });
const answer = await llm.generate({
  instructions: "Answer only from the provided policy snippets.",
  context: hits.map((hit) => hit.text),
  input: question,
});
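The first step in the list, splitting documents into chunks, is worth a small sketch of its own. This version cuts by characters with a little overlap so a sentence that straddles a boundary is not lost. The sizes and the function name are toy assumptions; real pipelines often chunk by tokens, sentences, or headings.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size character chunks that overlap slightly.

    Sizes here are toy values; real pipelines often chunk by tokens or sentences.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

policy = "Refunds are available within 30 days of purchase. Contact support to begin."
for chunk in chunk_text(policy):
    print(repr(chunk))
```

Each chunk would then be embedded and stored, and the search step later works over chunks rather than whole documents.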

Agents: the loop around the model

An agent is not a new kind of brain. It is usually an LLM plus an orchestration loop: read the goal, choose a next step, maybe call a tool, inspect the result, update the plan, and continue until done or stopped.

goal → think about next step → call tool → observe result → adjust plan → final answer

🧪

Evals are what turn AI from a cool demo into an engineering system. Before shipping a new prompt, model, tool, or agent flow, test it on examples that represent real users, failure cases, and edge cases.
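A minimal eval harness can be very small. The sketch below uses a canned `fake_assistant` so it runs offline, and it grades by simple substring match; both are assumptions for illustration, and real evals usually need richer graders and many more cases.

```python
def run_evals(answer_fn, cases):
    """Run each test case through answer_fn and report the pass rate."""
    results = []
    for case in cases:
        output = answer_fn(case["input"])
        passed = case["must_contain"].lower() in output.lower()
        results.append({"input": case["input"], "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

def fake_assistant(question):
    """Stand-in for a real model call, so the sketch runs offline."""
    canned = {"What is our refund window?": "Refunds are accepted within 30 days."}
    return canned.get(question, "I don't know.")

cases = [  # hypothetical cases: one the assistant handles, one it cannot answer
    {"input": "What is our refund window?", "must_contain": "30 days"},
    {"input": "Do you ship to Mars?", "must_contain": "we do not ship"},
]
pass_rate, results = run_evals(fake_assistant, cases)
print(pass_rate)  # 0.5
```

Running the same cases before and after a prompt or model change turns "it feels better" into a number you can compare.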

Slightly technical: how these pieces fit in one product

A production assistant might use MCP to discover tools, RAG to fetch private documents, tool calling to take controlled actions, structured outputs to return clean JSON, evals to measure quality, tracing to debug failures, and guardrails to block unsafe or unauthorized actions.

A friendly checklist for understanding any LLM answer

✅

Did the model receive the right information in the prompt?

✅

Was the important context near the beginning or end?

✅

Is the task asking for facts, reasoning, creativity, or formatting?

✅

Would retrieval or a tool make the answer more grounded?

✅

Should the output be checked against a source before trusting it?

✅

If you remember one thing, remember this: LLMs transform text into numbers, let those numbers exchange context through attention, and then predict the next token again and again until an answer appears.

The 2026 AI Landscape: A Hacker's Deep Dive 🤖

Thu, 04 Jun 2026 00:00:00 GMT


June 4, 2026

🤖

This deep dive is structured as a native Notion article: use the table of contents below to jump between architecture, agents, RAG, protocols, frameworks, prompting, vector databases, glossary, and code appendix.

Layer	Native Notion treatment
Navigation	Built-in table of contents plus semantic headings
Reference material	Native tables and collapsible glossary sections
Implementation detail	Language-aware code blocks and equation blocks
Long-form reading	Callouts, dividers, and structured sections

☕

Everything you need to understand the current AI moment — from transformers to agents, RAG pipelines to MCP, and every buzzword in between. Grab a coffee.

1. The Big Picture: Where AI Stands in 2026

The AI landscape of 2026 is defined by a single, sweeping shift: from chat to action. The previous era was dominated by raw model intelligence — who had the biggest, smartest LLM. The current era prioritizes orchestration layers that unify multiple models and tools to automate complex, end-to-end business workflows. We've moved from "AI that talks" to "AI that does."[1][2]

A few landmark data points paint the picture clearly:

700 million people use ChatGPT weekly as of mid-2025[3]
Anthropic now holds ~40% of enterprise LLM API spend; OpenAI has dropped to 27%, down from ~50% in 2023[4]
2026 is the year of autonomous AI agents — goal-driven systems combining reasoning, planning, and tool use, marking the biggest functional jump since GPT-3[1]
Open-source models (Meta's Llama 4 family with 10M token context windows) have narrowed the gap with proprietary models dramatically[4]

The defining theme is the Agentic Web — where AI agents serve as the primary gateway to the internet, navigating backends through APIs rather than humans switching between websites.[2]

2. The GPT Evolution — A Complete Timeline

The journey from GPT-1 to GPT-5.5 is arguably the fastest capability evolution in computing history — parameters grew from 117 million to 175+ billion, a 1,495× increase in two years (GPT-1 to GPT-3).[3]

Model	Date	Params	Key Leap
GPT-1	June 2018	117M	Proved unsupervised pre-training works[3]
GPT-2	Feb 2019	1.5B	Coherent long-form text; initially "too dangerous to release"[5]
GPT-3	May 2020	175B	First commercially viable model; in-context learning[5]
GPT-3.5 / InstructGPT	2022	~175B	RLHF introduced; gave us ChatGPT[5]
GPT-4	March 2023	Undisclosed	Multimodal (text + image); reasoning at scale[3]
GPT-4o	May 2024	Undisclosed	Omnimodal (text, image, audio natively); 2× speed[6]
GPT-4.5	Feb 2025	Undisclosed	Stronger world knowledge, fewer hallucinations[6]
GPT-5	Aug 2025	Undisclosed	94.6% on advanced math; 45% fewer hallucinations vs GPT-4o[3]
GPT-5.5	April 2026	Undisclosed	Native omnimodal; autonomous computer use; agentic coding[6]

The o-Series: Reasoning Models

Alongside the GPT series, OpenAI launched a separate line of reasoning-first models: o1, o3, and o4-mini. They still generate text one token at a time, but they are trained to "think before they speak," working through an internal chain of thought before producing the visible answer. o3 and o4-mini launched in April 2025 with significantly enhanced reasoning, particularly in STEM.[5][6]

3. How LLMs Actually Work: The Transformer Architecture

Every modern LLM is built on the Transformer, introduced in the landmark 2017 paper "Attention Is All You Need". Here's the architecture unwrapped:[7]

3.1 The Four Building Blocks

A transformer is built from four core components:[8]

Token Embeddings — Convert words/subwords into numerical vectors in high-dimensional space
Positional Encoding — Inject information about the order of tokens (since attention has no built-in notion of sequence)
Attention — The magic: lets each token "look at" every other token
Feed-Forward Block — A pair of linear transformations applied position-wise

3.2 Self-Attention: The Core Insight

Self-attention answers the question: "Which other words should I focus on to understand my own meaning?" For every token, three vectors are computed:[9][10]

Q (Query) — "What am I looking for?"
K (Key) — "What do I contain?"
V (Value) — "What information do I carry?"

The attention score is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the key dimension. The $\sqrt{d_k}$ scaling prevents the dot products from growing too large and making softmax gradients vanish.[11]
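The reason for dividing by the square root of the key dimension can be checked with a quick experiment. The sketch below assumes query and key entries are independent random numbers with mean 0 and variance 1, which is a simplification of what a trained model produces.

```python
import math
import random

def dot_product_spread(d_k, samples=2000, seed=0):
    """Estimate the standard deviation of q·k for random unit-variance vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / samples)

for d_k in (4, 64):
    raw = dot_product_spread(d_k)
    # columns: dimension, raw spread, spread after dividing by sqrt(d_k)
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```

Under these assumptions the raw spread grows roughly like the square root of the dimension, while the scaled spread stays near 1, so softmax receives scores in a similar range regardless of head size.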

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Core self-attention mechanism.
    Q, K, V: (batch_size, seq_len, d_k)
    """
    d_k = Q.size(-1)
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Apply optional mask (for decoder / causal attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

# Example: 2 sentences, 4 tokens, 8-dim embeddings
batch_size, seq_len, d_model = 2, 4, 8
d_k = 8
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")        # (2, 4, 8)
print(f"Attention weights: {weights.shape}")  # (2, 4, 4)

3.3 Multi-Head Attention

Instead of computing attention once, transformers run multiple attention heads in parallel — GPT-3 uses 96 attention heads per block. Each head learns a different "relevance function." The outputs are concatenated and projected:[7]

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        B, T, d = x.shape
        # (B, T, d) -> (B, num_heads, T, d_k)
        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x, mask=None):
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads: (B, num_heads, T, d_k) -> (B, T, d_model)
        B, H, T, d_k = attn_out.shape
        attn_out = attn_out.transpose(1, 2).contiguous().view(B, T, H * d_k)
        return self.W_o(attn_out)

3.4 Modern Architecture Improvements

The vanilla transformer has been significantly optimized:[11]

Innovation	What It Does
FlashAttention	Reduces memory traffic 2–4× by optimizing GPU SRAM access patterns
RoPE (Rotary Position Embeddings)	Better position encoding enabling longer context windows
MoE (Mixture of Experts)	Activates only a subset of parameters per token — enables huge models at lower compute cost
GQA (Grouped Query Attention)	Multiple query heads share key/value heads, reducing KV-cache memory
Linear Attention	Reduces complexity from O(n²) to O(n) for long documents
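The KV-cache saving from GQA is simple arithmetic, so a back-of-the-envelope sketch helps. The model shape below is hypothetical and chosen only to make the numbers round; the formula counts one key and one value vector per layer, per key/value head, per token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for cached keys and values: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only to make the arithmetic easy to follow
layers, query_heads, head_dim, seq_len = 32, 32, 128, 8192

# Standard multi-head attention: one key/value head per query head
full_mha = kv_cache_bytes(layers, num_kv_heads=query_heads, head_dim=head_dim, seq_len=seq_len)
# Grouped-query attention: 32 query heads share 8 key/value heads
grouped = kv_cache_bytes(layers, num_kv_heads=8, head_dim=head_dim, seq_len=seq_len)

print(full_mha / 2**30)  # 4.0  (GiB for one sequence with 16-bit values)
print(grouped / 2**30)   # 1.0
```

Sharing each key/value head across four query heads cuts the cache to a quarter in this example, which matters most for long contexts and large batches.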

4. The Agentic AI Revolution

The single biggest shift in 2026 is from generative AI (creates content) to agentic AI (autonomous systems that plan, decide, and execute). Where a generative model answers your question, an agentic AI accomplishes your goal.[12]

4.1 What Makes an AI Agent?

An agent has four capabilities that a plain chatbot lacks:[12]

Goal Understanding — Decompose a complex objective into sub-tasks
Multi-Step Planning — Create and revise a plan of action
Tool Use — Execute functions, call APIs, browse the web, write code
Self-Correction — Observe outcomes and adjust behavior in a loop

4.2 The ReAct Pattern

ReAct (Reasoning + Acting) is the foundational pattern for agents, introduced by Yao et al. The model interleaves reasoning traces with actions:[13]

Thought: I need to find the current population of Karachi.
Action: search("Karachi population 2026")
Observation: Karachi population is approximately 16.5 million.
Thought: Now I can answer the question.
Answer: Karachi has approximately 16.5 million people.

from openai import OpenAI
import json

client = OpenAI()

# Define tools the agent can use
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python code and return the result",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Python code to run"}
                },
                "required": ["code"]
            }
        }
    }
]

def execute_tool(name: str, arguments: dict) -> str:
    # Placeholder added so the example is complete: plug in real, sandboxed
    # implementations of web_search and run_python here.
    raise NotImplementedError(f"No implementation for tool {name!r}")

def run_react_agent(task: str, max_steps: int = 5):
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        # No tool call = final answer
        if not msg.tool_calls:
            return msg.content
        # Execute tool calls
        messages.append(msg)
        for tool_call in msg.tool_calls:
            result = execute_tool(tool_call.function.name,
                                  json.loads(tool_call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })
    return "Max steps reached"

4.3 Agentic Patterns

Four core patterns drive agent behavior:[14]

Pattern	Description	Example
Reflection	Agent critiques its own output and revises	Code reviewer that re-checks generated code
Planning	Decompose goal into ordered sub-tasks	Research agent building a structured outline
Tool Use	Call external APIs and functions	Weather agent calling a weather API
Multi-Agent	Spawn specialized sub-agents	Orchestrator delegates to coder + tester agents

4.4 Multi-Agent Systems (MAS)

By 2026, the field has moved beyond single-purpose agents to Multi-Agent Systems — AI "teams" where specialized agents collaborate to achieve a shared objective, mirroring microservice architecture in traditional software.[12]

                    ┌─────────────────┐
                    │  Orchestrator   │
                    │  (Planner LLM)  │
                    └────────┬────────┘
            ┌────────────────┼────────────────┐
            ▼                ▼                ▼
    ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
    │  Researcher  │ │    Coder     │ │   Reviewer   │
    │    Agent     │ │    Agent     │ │    Agent     │
    └──────────────┘ └──────────────┘ └──────────────┘

5. RAG: Retrieval-Augmented Generation Deep Dive

RAG is how you give an LLM access to your private data without retraining it. In 2026, RAG is default infrastructure for enterprise LLM applications.[1]

5.1 Naive RAG Pipeline

The vanilla RAG flow is simple:[15]

User Query
    │
    ▼
Embed Query (vector)
    │
    ▼
Similarity Search → Vector DB → Top-K Documents
    │
    ▼
Inject Context into LLM Prompt
    │
    ▼
LLM generates grounded response

from openai import OpenAI
import numpy as np

client = OpenAI()

# Step 1: Embed documents at ingestion time
def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Step 2: Simple cosine similarity search
def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query: str, documents: list[dict], top_k: int = 3):
    query_vec = embed_text(query)
    scored = [
        (doc, cosine_similarity(query_vec, doc["embedding"]))
        for doc in documents
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Step 3: Generate grounded answer
def rag_answer(query: str, docs: list[str]) -> str:
    context = "\n\n".join(docs)
    prompt = f"""Answer the question using ONLY the context below.
If the answer isn't in the context, say "I don't know."

Context:
{context}

Question: {query}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

5.2 Advanced RAG Techniques

Naive RAG breaks at scale. Production systems use:[16]

Technique	How It Helps
Chunking strategies	Semantic splitting preserves context better than fixed-size chunks
Hybrid search	BM25 (keyword) + vector search for better recall
Reranking	Cross-encoder reranks top-K results for precision
HyDE	Generate a hypothetical answer, embed it, then search — better for abstract queries
MMR (Maximal Marginal Relevance)	Diversify retrieved documents to avoid redundancy
Multi-vector retrieval	Store summary + detailed chunks separately
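
To make the MMR row concrete, here is a small pure-Python sketch. It is an illustration under simple assumptions: the vectors are hand-made 2-D toys rather than real embeddings, and `mmr` and `lambda_mult` are names chosen for this example (a lower `lambda_mult` favors diversity, 1.0 is plain relevance ranking).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],  # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of document 0
    [0.6, 0.80],  # 2: less relevant, but says something different
]

diverse = mmr(query, docs, k=2, lambda_mult=0.3)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With the diversity weight turned up, the near-duplicate is skipped in favor of the document that adds new information; with `lambda_mult=1.0` the two near-identical documents are returned together.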

5.3 Agentic RAG

Agentic RAG supercharges RAG by adding an agent layer that can iterate, re-retrieve, and validate before answering:[15]

import json

class AgenticRAGPipeline:
    """
    Agentic RAG: Agent decides WHEN and WHAT to retrieve,
    can iterate multiple retrieval rounds, and validates output.
    """

    def __init__(self, retriever, llm_client):
        self.retriever = retriever
        self.client = llm_client
        self.retrieved_docs = []

    def should_retrieve_more(self, current_answer: str, query: str) -> tuple[bool, str]:
        """Ask the LLM if it needs more context. Returns (needs_more, missing)."""
        check_prompt = f"""Query: {query}
Current draft answer: {current_answer}

Is this answer complete and well-supported? Reply with JSON: {{"complete": true/false, "missing": "what's missing"}}"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        return not result["complete"], result.get("missing", "")

    def run(self, query: str, max_iterations: int = 3) -> str:
        answer = ""
        missing = ""
        for i in range(max_iterations):
            # Retrieve relevant docs; after the first round, target the reported gap
            search_query = query if i == 0 else f"{query} - focusing on: {missing}"
            new_docs = self.retriever.search(search_query, top_k=5)
            self.retrieved_docs.extend(new_docs)

            # Generate answer with all accumulated context
            context = "\n---\n".join(self.retrieved_docs)
            answer = self._generate(query, context)

            # Check if we need more info
            needs_more, missing = self.should_retrieve_more(answer, query)
            if not needs_more:
                break
        return answer

    def _generate(self, query: str, context: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nAnswer: {query}"
            }]
        )
        return response.choices[0].message.content

6. MCP: The USB-C for AI Tools

Model Context Protocol (MCP) is an open standard introduced by Anthropic in November 2024 to standardize how AI systems integrate with external tools, data sources, and services. Think of it as the USB-C port for AI — one standard connector for everything.[17]

6.1 Why MCP Matters

Before MCP, every AI-tool integration was a custom one-off. MCP provides:[18]

Tools — Functions the AI can call (e.g., run_sql, read_file)
Resources — Data the AI can read (files, database records, API responses)
Prompts — Reusable prompt templates
Sampling — The server can ask the client to run an LLM query

┌──────────────────────────────────────────────────┐
│                 MCP Architecture                 │
│                                                  │
│  ┌──────────────┐      MCP Protocol              │
│  │  AI Client   │◄──────────────────►┐           │
│  │ (Claude/GPT) │                    │           │
│  └──────────────┘         ┌──────────┴────────┐  │
│                           │    MCP Server     │  │
│                           │  (your tools)     │  │
│                           │                   │  │
│                           │  ┌─────────────┐  │  │
│                           │  │  Tools      │  │  │
│                           │  │  Resources  │  │  │
│                           │  │  Prompts    │  │  │
│                           │  └─────────────┘  │  │
│                           └───────────────────┘  │
└──────────────────────────────────────────────────┘

6.2 Building an MCP Server

# Minimal MCP server using the official Python SDK
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

app = Server("my-mcp-server")

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="get_weather",
            description="Get the current weather for a city",
            inputSchema={
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["city"]
            }
        ),
        Tool(
            name="run_sql",
            description="Execute a read-only SQL query against our DB",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "get_weather":
        city = arguments["city"]
        # Call your actual weather API here
        return [TextContent(type="text", text=f"Weather in {city}: 28°C, sunny")]
    elif name == "run_sql":
        query = arguments["query"]
        # Execute query safely; execute_readonly_query is a helper you must supply
        results = execute_readonly_query(query)
        return [TextContent(type="text", text=str(results))]
    # Fail loudly instead of silently returning None for an unknown tool
    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

6.3 MCP vs Traditional APIs

Dimension	REST API	MCP Server
Discovery	Manual (read docs)	Auto-discovery via `list_tools()`
Schema	OpenAPI/Swagger	JSON Schema, AI-readable
Context sharing	Per-request	Stateful sessions with context
AI integration	Custom glue code	Native, standardized
Multi-tool	N integrations	One MCP layer
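
The discovery row is easier to picture on the wire. MCP messages are JSON-RPC 2.0, with `tools/list` for discovery and `tools/call` for invocation. The sketch below only builds and inspects those messages; the server reply is hand-written for illustration, and `jsonrpc_request` is a helper name invented here, not part of any SDK.

```python
import json
from typing import Optional


def jsonrpc_request(request_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request, the envelope MCP messages travel in."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# 1. Discovery: ask the server which tools it offers.
list_request = jsonrpc_request(1, "tools/list")

# 2. A hypothetical server reply, written by hand for this sketch.
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "inputSchema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ]
    },
}
tool_names = [tool["name"] for tool in list_response["result"]["tools"]]

# 3. Invocation: call a discovered tool with arguments shaped by its schema.
call_request = jsonrpc_request(
    2, "tools/call", {"name": tool_names[0], "arguments": {"city": "Lahore"}}
)

print(list_request)
print(call_request)
```

Because the schema arrives with the tool list, a client can learn how to call a tool at runtime instead of relying on hand-written glue for each API.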

7. A2A: The Protocol for Agent Collaboration

While MCP handles agent-to-tool communication, A2A (Agent2Agent) — announced by Google in April 2025 — handles agent-to-agent communication.[19]

7.1 The Problem A2A Solves

Imagine a travel booking agent that needs to coordinate with a payment agent from a different company, a hotel API agent, and an airline agent — all built on different frameworks. A2A enables them to discover each other, understand capabilities, and coordinate tasks without sharing internals.[19]

MCP Layer (Vertical):          A2A Layer (Horizontal):
Agent ◄──tools──► APIs         Agent ◄──collaborate──► Agent
Agent ◄──tools──► Databases    Agent ◄──delegate────► Agent
Agent ◄──tools──► Files        Agent ◄──coordinate──► Agent

7.2 Agent Cards: The Discovery Mechanism

Every A2A agent publishes an Agent Card — a JSON document at a well-known URL that describes the agent's capabilities:[20]

{
  "name": "payment-processor-agent",
  "version": "1.2.0",
  "description": "Handles payment processing for e-commerce workflows",
  "endpoint": "https://payments.example.com/a2a",
  "skills": [
    {
      "id": "process_payment",
      "name": "Process Payment",
      "description": "Charge a customer for a transaction",
      "inputModes": ["text", "json"],
      "outputModes": ["json"]
    },
    {
      "id": "refund",
      "name": "Issue Refund",
      "description": "Refund a previously processed payment"
    }
  ],
  "authentication": {
    "schemes": ["Bearer"]
  }
}
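
A calling agent might inspect a card like the one above before delegating work. The sketch below is a rough illustration that parses a trimmed copy of that card with the standard library; `find_skill` and `accepts_input` are hypothetical helper names, and the field names simply follow the example card rather than any particular spec version.

```python
import json
from typing import Optional

# Trimmed copy of the illustrative Agent Card shown above.
AGENT_CARD_JSON = """
{
  "name": "payment-processor-agent",
  "endpoint": "https://payments.example.com/a2a",
  "skills": [
    {"id": "process_payment", "inputModes": ["text", "json"], "outputModes": ["json"]},
    {"id": "refund"}
  ],
  "authentication": {"schemes": ["Bearer"]}
}
"""


def find_skill(card: dict, skill_id: str) -> Optional[dict]:
    """Return the skill entry with the given id, or None if the agent lacks it."""
    for skill in card.get("skills", []):
        if skill.get("id") == skill_id:
            return skill
    return None


def accepts_input(card: dict, skill_id: str, mode: str) -> bool:
    """True only if the skill exists and explicitly lists this input mode."""
    skill = find_skill(card, skill_id)
    return skill is not None and mode in skill.get("inputModes", [])


card = json.loads(AGENT_CARD_JSON)
print(accepts_input(card, "process_payment", "json"))
print(find_skill(card, "cancel_order"))
```

Treating a missing `inputModes` list as "not supported" is a conservative choice for this sketch; a real client would follow whatever default the protocol version defines.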

7.3 MCP + A2A: The Full Stack

Google positioned A2A as complementary to MCP, not competitive:[20]

┌──────────────────────────────────────────────────┐
│            Enterprise AI Architecture            │
│                                                  │
│    ┌─────────┐   A2A   ┌─────────┐               │
│    │ Agent A │◄───────►│ Agent B │               │
│    └────┬────┘         └────┬────┘               │
│         │ MCP               │ MCP                │
│    ┌────┴────┐         ┌────┴────┐               │
│    │ Tools & │         │ Tools & │               │
│    │  Data   │         │  Data   │               │
│    └─────────┘         └─────────┘               │
│                                                  │
│  MCP = vertical (agent ↔ tools)                  │
│  A2A = horizontal (agent ↔ agent)                │
└──────────────────────────────────────────────────┘

8. AI Agent Frameworks Compared

Choosing the wrong framework costs weeks. Here's the production-tested ranking for 2026:[21]

Framework	Best For	Architecture Style	MCP/A2A Support
LangGraph	Complex stateful production workflows	Graph-based, explicit state machines	✅ MCP
Claude Agent SDK	Anthropic-native production agents	Native Claude hooks + subagents	✅ MCP native
CrewAI	Fast multi-agent prototypes	Role-based crews	✅ MCP
AutoGen / AG2	Research-style conversational agents	Conversational multi-agent	✅ MCP
Semantic Kernel	Enterprise / .NET stacks	Plugin-based, Azure-first	✅
LlamaIndex	RAG-grounded agents	Data-layer first	✅
Pydantic AI	Type-safe Python agents	Pydantic validation throughout	✅

8.1 LangGraph: Production-Ready State Machines

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

# NOTE: vector_store and llm are placeholders for your own retriever and model client.

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    query: str
    documents: list[str]
    answer: str
    needs_more_info: bool

def retrieve_node(state: AgentState) -> dict:
    """Retrieve relevant documents. Returns a partial state update."""
    docs = vector_store.search(state["query"], k=5)
    return {"documents": docs}

def generate_node(state: AgentState) -> dict:
    """Generate answer from retrieved docs. Returns a partial state update."""
    context = "\n".join(state["documents"])
    answer = llm.invoke(f"Context: {context}\nQuestion: {state['query']}")
    return {"answer": answer, "needs_more_info": False}

def check_node(state: AgentState) -> str:
    """Route: done or need more retrieval?"""
    return "done" if not state["needs_more_info"] else "retrieve"

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve_node)
workflow.add_node("generate", generate_node)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_conditional_edges("generate", check_node, {
    "done": END,
    "retrieve": "retrieve"  # Loop back if needed
})
app = workflow.compile()

# Run it
result = app.invoke({"query": "What is the capital of Punjab?", "messages": []})
print(result["answer"])

8.2 CrewAI: Role-Based Multi-Agent Teams

from crewai import Agent, Task, Crew, Process

# NOTE: web_search_tool and wikipedia_tool are placeholders for tools you define.

# Define specialized agents
researcher = Agent(
    role="Senior Research Analyst",
    goal="Find accurate, up-to-date information on the given topic",
    backstory="Expert researcher with access to web search and databases",
    verbose=True,
    allow_delegation=False,
    tools=[web_search_tool, wikipedia_tool]
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear, engaging blog posts from research findings",
    backstory="Experienced tech blogger who makes complex topics accessible",
    verbose=True,
    allow_delegation=False
)

# Define tasks
research_task = Task(
    description="Research the latest developments in {topic}. "
                "Find key facts, statistics, and expert opinions.",
    expected_output="A structured research brief with citations",
    agent=researcher
)

writing_task = Task(
    description="Write a 1000-word blog post based on the research brief. "
                "Make it engaging for a technical audience.",
    expected_output="A complete, publication-ready blog post in Markdown",
    agent=writer,
    context=[research_task]  # Uses output from research_task
)

# Assemble and run the crew
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
    verbose=True  # recent CrewAI releases expect a bool here, not a level like 2
)

result = crew.kickoff(inputs={"topic": "MCP protocol for AI agents"})
print(result)

9. Prompt Engineering Playbook

Prompt engineering is the art of communicating precisely with LLMs. Here are the techniques every practitioner needs:[22]

9.1 Core Techniques

# ─── Zero-Shot ──────────────────────────────────────────────────────────────
# No examples. Rely on the model's training.
zero_shot = "Classify the sentiment of this review: 'The app crashes constantly.'"

# ─── Few-Shot ────────────────────────────────────────────────────────────────
# Provide examples to guide the model
few_shot = """Classify sentiment. Examples:
Input: "Love this product!" → Positive
Input: "Terrible experience." → Negative
Input: "It's okay, nothing special." → Neutral

Now classify: "The battery life is surprisingly good."
"""

# ─── Chain-of-Thought (CoT) ──────────────────────────────────────────────────
# Force step-by-step reasoning for complex tasks
cot = """Solve this step by step:
A store sells apples for Rs. 50 each. If Ali buys 12 apples with Rs. 700, how much change does he get?
Think through it step by step before giving the final answer."""

# ─── ReAct Pattern ───────────────────────────────────────────────────────────
# Alternate reasoning and action
react_system = """You are an agent. For each task:
1. Thought: reason about what to do next
2. Action: choose a tool [search | calculate | respond]
3. Observation: note what the tool returned
4. Repeat until you have the final answer."""

# ─── Self-Consistency ────────────────────────────────────────────────────────
# Sample multiple reasoning paths, pick the majority answer
import re
from collections import Counter

def self_consistent_answer(question: str, client, n_samples: int = 5) -> str:
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"{question}\nThink step by step."
            }],
            temperature=0.7  # Some variability for diversity
        )
        # Extract final answer (last number or key phrase)
        text = response.choices[0].message.content
        answers.append(text.strip().split("\n")[-1])
    # Return most common answer
    return Counter(answers).most_common(1)[0][0]

9.2 System Prompt Architecture

SYSTEM_PROMPT_TEMPLATE = """
## Role
You are {role_name}, a {expertise_level} specialist in {domain}.

## Objective
{primary_objective}

## Constraints
- Always cite sources when making factual claims
- If uncertain, say "I'm not sure" rather than guessing
- Keep responses under {max_length} words unless asked for detail
- Output format: {output_format}

## Context
Today's date: {date}
User's technical level: {user_level}

## Examples
{few_shot_examples}
"""

10. Vector Databases: The Memory Layer

Vector databases store embeddings — dense numerical representations of meaning — enabling semantic search (search by meaning, not keywords).[23]

10.1 How Embeddings Work

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding

# Semantic similarity demo
sentences = [
    "Karachi is the largest city in Pakistan",
    "The metropolitan area of Karachi has 16 million people",
    "I like to eat biryani",
    "Python is a programming language"
]
embeddings = [get_embedding(s) for s in sentences]

def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_emb = get_embedding("What is the population of Karachi?")
for s, emb in zip(sentences, embeddings):
    score = cosine_sim(query_emb, emb)
    print(f"{score:.3f} | {s}")

# Output:
# 0.812 | Karachi is the largest city in Pakistan  ← high
# 0.798 | The metropolitan area of Karachi...      ← high
# 0.312 | I like to eat biryani                    ← low
# 0.289 | Python is a programming language         ← low

10.2 Vector Database Comparison

DB	Best For	Hosting	Hybrid Search	Notes
Pinecone	Managed, production scale	Cloud-only	✅	Easiest setup
Weaviate	GraphQL + semantic queries	Self/Cloud	✅	MCP support in v3.0
Qdrant	High-performance, Rust core	Self/Cloud	✅	Best perf/$
Chroma	Local dev & testing	Self-host	Limited	Dead-simple Python API
Milvus	Massive scale (billions)	Self/Cloud	✅	GPU-accelerated
pgvector	Already using Postgres	Self-host	✅	No new infra needed

# Quick start: Chroma (local, perfect for prototyping)
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_KEY",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="knowledge_base",
    embedding_function=openai_ef
)

# Add documents
collection.add(
    documents=[
        "MCP is an open standard by Anthropic for AI tool integration",
        "RAG stands for Retrieval-Augmented Generation",
        "LangGraph is a framework for building stateful agent workflows"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["How do AI agents connect to external tools?"],
    n_results=2
)
print(results["documents"])
# [['MCP is an open standard by Anthropic...',
#   'LangGraph is a framework...']]

11. The Complete AI Dictionary

A comprehensive reference of every term you'll encounter, from beginner to deep-technical.

🔤 Foundational Concepts

Artificial Intelligence (AI) The broad field of building systems that perform tasks that typically require human intelligence — reasoning, learning, perception, language understanding.

Machine Learning (ML) A subfield of AI where systems learn from data rather than being explicitly programmed. The model improves with experience.

Deep Learning (DL) ML using neural networks with many layers ("deep"). Powers all modern LLMs, image models, and speech systems.

Neural Network A computational model loosely inspired by biological neurons. Consists of layers of mathematical functions that transform inputs into outputs.

Parameters / Weights The learnable numerical values in a neural network. GPT-3 has 175 billion parameters. More parameters generally bring more capability, but a bigger model is not always a better one.[3]

Training The process of adjusting a model's parameters on a large dataset to minimize prediction error. Requires massive compute (GPU clusters).

Inference Running a trained model to generate outputs. What happens when you type a prompt into ChatGPT.

Token The basic unit of text for LLMs. A token is roughly 0.75 words in English. "Hello, world!" = 4 tokens. LLMs process and generate text as token sequences.[16]

Context Window The maximum number of tokens an LLM can "see" at once. GPT-4 Turbo supports 128K tokens; Llama 4 Scout advertises 10 million. A larger window holds more context but costs more per request.[4]

Embedding A dense numerical vector (array of floats) representing the semantic meaning of text, images, or other data. Similar meanings cluster together in embedding space.[23]

🧠 LLM Architecture Terms

Transformer The neural network architecture underlying all major LLMs, introduced in 2017. Key innovation: the attention mechanism replaces sequential processing with parallel processing.[7]

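Attention Mechanism The core innovation of transformers. Lets each token attend to (draw information from) other tokens in the context, regardless of distance. Each token is projected into query, key, and value (Q/K/V) vectors, and query-key similarity decides how much of each value gets mixed in. In decoder-only LLMs, a causal mask limits each token to itself and earlier positions.[10]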

Self-Attention Attention where the query, key, and value all come from the same sequence. Enables a model to understand words in context of each other.[10]
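
Here is a tiny pure-Python sketch of scaled dot-product self-attention with a causal mask. It is deliberately simplified: a real layer first multiplies the inputs by learned projection matrices to get Q, K, and V, whereas this example reuses the same toy vectors for all three. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V, causal=True):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights). With causal=True, token i only sees tokens 0..i.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        visible = range(i + 1) if causal else range(len(K))
        # Similarity of this query to every visible key, scaled by sqrt(d_k).
        scores = [
            sum(qx * kx for qx, kx in zip(q, K[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        out = [
            sum(w * V[j][c] for w, j in zip(weights, visible))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors; real models use learned projections of embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X, causal=True)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its weight list has a single entry; each later token spreads its attention over everything before it.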

Multi-Head Attention Running multiple attention operations in parallel, each learning different relationships. GPT-3 uses 96 attention heads per layer.[7]

Positional Encoding A mechanism to inject token position information into embeddings, since attention is position-agnostic by default.[11]

RoPE (Rotary Position Embeddings) A modern positional encoding scheme that encodes position by rotating query and key vectors, so attention scores depend on the relative distance between tokens. This helps models generalize to contexts longer than those they were trained on.[11]
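
The rotation idea can be checked with a two-dimensional toy. This sketch assumes a single rotation frequency `theta`, whereas real RoPE applies a different frequency to each pair of dimensions; `rotate_pair` is a name made up for this example.

```python
import math


def rotate_pair(x, y, position, theta=1.0):
    """Rotate the 2-D vector (x, y) by an angle proportional to its position."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


q = (1.0, 2.0)   # toy query vector
k = (0.5, -1.0)  # toy key vector

# Same relative distance (2 positions apart), very different absolute positions.
score_near = dot(rotate_pair(*q, position=5), rotate_pair(*k, position=3))
score_far = dot(rotate_pair(*q, position=105), rotate_pair(*k, position=103))

print(round(score_near, 6), round(score_far, 6))
```

Both scores match because rotating the query by m and the key by n leaves a dot product that depends only on m - n, which is the property that makes the encoding relative.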

MoE (Mixture of Experts) Architecture where only a subset of model parameters ("experts") activate per token, enabling models to have far more total parameters at similar inference cost.[11]

Encoder / Decoder

Encoder-only (e.g., BERT): Builds rich representations; best for classification, NER
Decoder-only (e.g., GPT): Generates text autoregressively; best for generation tasks
Encoder-Decoder (e.g., T5): Good for translation and summarization

Autoregressive Generation How LLMs generate text: one token at a time, each new token conditioned on all previous tokens. This is why they can't "edit" — they always predict left-to-right.

Temperature Controls randomness in generation. Temperature=0: always pick the most likely token (deterministic). Temperature=1: sample proportionally. Temperature>1: more random/creative.

Top-P (Nucleus Sampling) Restricts sampling to the smallest set of tokens whose cumulative probability exceeds P. More robust than Temperature alone for controlling output quality.
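
Temperature and Top-P are easy to see on a handful of made-up scores. The sketch below is a simplified illustration, not any provider's sampler: the logits are invented, and `softmax_with_temperature`, `top_p_filter`, and `sample_token` are names chosen for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("use argmax for temperature 0 instead of dividing by it")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized


def sample_token(filtered, rng):
    """Draw one token index from the filtered distribution."""
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # invented scores for four candidate tokens

sharp = softmax_with_temperature(logits, temperature=0.5)
base = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)

nucleus = top_p_filter(base, p=0.8)
choice = sample_token(nucleus, random.Random(0))

print([round(x, 3) for x in sharp])
print([round(x, 3) for x in flat])
print(sorted(nucleus))
```

Lowering the temperature concentrates probability on the top token, raising it spreads probability out, and the Top-P filter then discards the unlikely tail before sampling.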

KV Cache Stores computed key and value matrices for previously processed tokens so they don't need to be recomputed during autoregressive generation. Critical for inference efficiency.

🏋️ Training & Alignment

Pre-training Initial training of an LLM on massive unlabeled text data (web, books, code). Learns statistical patterns of language. Requires enormous compute.

Fine-tuning Further training on a smaller, task-specific dataset to specialize a pre-trained model. Cheaper than pre-training.

SFT (Supervised Fine-Tuning) Fine-tuning on human-curated input-output pairs. "Given this input, produce this output."

RLHF (Reinforcement Learning from Human Feedback) The technique that transformed GPT-3 into ChatGPT. Human evaluators rate outputs; a reward model is trained on those ratings; the LLM is fine-tuned to maximize the reward. Dramatically improves alignment and reduces harmful outputs.[24]

PEFT (Parameter-Efficient Fine-Tuning) Fine-tuning techniques that update only a small fraction of parameters (e.g., LoRA, QLoRA), making fine-tuning feasible on consumer hardware.[1]

LoRA (Low-Rank Adaptation) Popular PEFT method that freezes the original weights and adds small trainable rank-decomposition matrices alongside them. It typically trains well under 1% of the parameters while approaching full fine-tuning quality on many tasks.
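
The savings come from simple arithmetic: a full update to a weight matrix of shape d_out × d_in has d_out · d_in values, while a rank-r pair of matrices has only r · (d_out + d_in). A quick sketch, using an assumed 4096 × 4096 layer and rank 8 as example numbers:

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Trainable values for a full update vs. a LoRA update of one weight matrix."""
    full = d_out * d_in            # updating W directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
ratio = lora / full

print(f"full update: {full:,} values")
print(f"LoRA update: {lora:,} values ({ratio:.2%} of full)")
```

For this layer the adapter trains 65,536 values instead of 16,777,216, about 0.39% of the full matrix. The fraction for a whole model depends on which layers get adapters and on the chosen rank.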

DPO (Direct Preference Optimization) A simpler alternative to RLHF that directly optimizes the model against human preference data without a separate reward model.

Constitutional AI Anthropic's technique for alignment: the model critiques its own responses against a set of principles and revises them.

Hallucination When an LLM generates confident, fluent, but factually wrong information. Caused by training on noisy data and the autoregressive generation process. Major active research area.[25]

Alignment Ensuring AI systems behave according to human values and intentions. Includes safety, helpfulness, and harmlessness.

🤖 Agentic AI Terms

AI Agent An AI system that autonomously perceives its environment, makes decisions, executes actions, and pursues goals over multiple steps — not just responds to a single prompt.[12]

Agentic AI The paradigm where LLMs act as autonomous agents that plan, use tools, and complete multi-step tasks without constant human direction.[12]

Tool Use / Function Calling The ability of an LLM to call external functions, APIs, and services based on user requests. Core capability enabling agents to "do things" rather than just "say things."[26]

Orchestrator The "brain" of a multi-agent system — the component that plans tasks, delegates to sub-agents, and synthesizes results. Often a more powerful LLM.[2]

Sub-agent / Worker Agent Specialized agents that execute specific tasks delegated by an orchestrator. Examples: a web-search agent, a code-execution agent, a database agent.

Human-in-the-Loop (HITL) A system design where a human can intervene, approve, or redirect an agent at key decision points. Critical for high-stakes workflows.

ReAct (Reasoning + Acting) A foundational prompting/architecture pattern for agents where the model interleaves reasoning ("Thought:") and actions ("Action:") in a loop.[13]

Planning The agent's ability to decompose a goal into an ordered sequence of sub-tasks. Types include: Plan-then-Execute, ReAct (interleaved), and Tree-of-Thought (branching).

Reflection An agentic pattern where the agent reviews and critiques its own previous output to improve it — a form of self-correction.[14]

Memory (Agent Memory) How agents retain information:

In-context: Within the current prompt window (ephemeral)
External: Stored in vector DBs or traditional DBs (persistent)
Episodic: Records of past interactions
Semantic: General knowledge/facts

Multi-Agent System (MAS) A network of specialized agents collaborating to solve problems that exceed any single agent's capability.[12]

📡 Protocols & Infrastructure

MCP (Model Context Protocol) Open standard by Anthropic (Nov 2024) for connecting AI agents to external tools and data sources through a standardized, AI-readable interface. The "USB-C for AI."[17]

A2A (Agent2Agent Protocol) Open protocol by Google (April 2025) enabling AI agents from different vendors to discover, communicate, and collaborate with each other.[19]

Agent Card A JSON document published by an A2A agent that describes its capabilities, endpoint, and authentication requirements — enabling other agents to discover it.[20]

RAG (Retrieval-Augmented Generation) Grounding LLM outputs in relevant documents retrieved from a knowledge base at inference time, reducing hallucination and enabling access to private/current data.[15]

Agentic RAG RAG enhanced with agent capabilities — the agent can iteratively retrieve, evaluate, and re-retrieve context before generating the final answer.[15]

Vector Database A database optimized for storing and querying high-dimensional vector embeddings via similarity search (ANN algorithms).[23]

Semantic Search Search that finds results by meaning rather than keyword matching, using embedding similarity.[27]

Hybrid Search Combining vector similarity search (semantic) with keyword-based search (BM25) in a single query for better recall and precision.[27]
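
One common way to merge the two result lists is Reciprocal Rank Fusion (RRF). The text above does not name a fusion method, so treat this as one possible choice rather than the standard one; the document IDs are invented and `k=60` is a commonly used default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from embedding similarity

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, and because only ranks are used, the BM25 and cosine scores never need to be put on the same scale.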

HNSW (Hierarchical Navigable Small World) A graph-based indexing algorithm used in vector databases for fast approximate nearest neighbor (ANN) search.[23]

Chunking Breaking documents into smaller pieces before indexing in RAG systems. Semantic chunking (by meaning) outperforms fixed-size chunking.[16]
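
For reference, this is roughly what the fixed-size baseline looks like, with overlap so that a sentence cut at a boundary still appears whole in one chunk. It splits on whitespace for simplicity (production systems usually count tokens), and `chunk_words` is a name invented for this sketch. Semantic chunking would replace the fixed window with boundaries chosen by meaning.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two words of the previous one, which is the overlap doing its job.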

Reranking A second-pass step in RAG pipelines where retrieved documents are rescored using a more accurate (but slower) cross-encoder model to improve precision.

📝 Prompting & Generation

Prompt Engineering The practice of designing input prompts to maximize LLM output quality. A rapidly evolving discipline with significant impact on model performance.[22]

Zero-Shot Prompting Instructing a model to perform a task with no examples — relying purely on the model's pre-trained knowledge.[22]

Few-Shot Prompting Providing a small number of input-output examples in the prompt to guide the model's behavior.[22]

Chain-of-Thought (CoT) Prompting Instructing the model to show its reasoning step-by-step before giving a final answer. Dramatically improves performance on math, logic, and multi-step tasks.[22]

System Prompt A special prompt (hidden from the user) that sets the model's role, persona, constraints, and behavior for an entire conversation.

Temperature / Sampling Parameters controlling the randomness and diversity of LLM outputs.

Prompt Injection An attack where malicious content in the environment (e.g., a webpage an agent reads) attempts to override the agent's instructions.

Jailbreaking Attempts to bypass an LLM's safety guardrails through cleverly crafted prompts.

📊 Evaluation & Safety

Benchmark A standardized test for measuring model capability. Examples: MMLU (knowledge), HumanEval (coding), MATH (mathematics), GPQA (PhD-level science).

Hallucination LLM-generated content that is factually incorrect but stated with confidence.[25]

Groundedness The extent to which an LLM's outputs are supported by provided context (e.g., retrieved documents in RAG).

Red-teaming Adversarially probing an AI system to find safety vulnerabilities, jailbreaks, and failure modes before deployment.

AI Safety The field studying how to build AI systems that reliably do what humans intend and avoid unintended harmful behaviors.

Guardrails Programmatic constraints applied to LLM inputs and outputs to enforce safety, content policies, and format requirements.

🏢 Model Families (2026 Landscape)

Family	Company	Notable Models
GPT	OpenAI	GPT-5, GPT-5.5, o3, o4-mini[6]
Claude	Anthropic	Claude 3.5 Sonnet, Claude 4[28]
Gemini	Google	Gemini 2.0 Flash, Gemini Ultra 2[28]
Llama	Meta	Llama 4 (10M context, multimodal)[4]
Mistral	Mistral AI	Mistral Large, Codestral, Mixtral MoE
Phi	Microsoft	Phi-4 (small, surprisingly capable)[4]

12. Code Appendix: Build It Yourself

12.1 A Complete RAG + Agent System

"""
Full-stack RAG + Agent system.
Stack: OpenAI GPT-4o + ChromaDB + Function Calling
"""
import os
import json
import chromadb
from openai import OpenAI
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ─── 1. Setup Vector Store ─────────────────────────────────────────────────
ef = OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)
chroma = chromadb.Client()
kb = chroma.get_or_create_collection("knowledge", embedding_function=ef)

def ingest_documents(docs: list[dict]):
    """docs: [{"id": "...", "text": "...", "metadata": {...}}]"""
    kb.upsert(
        ids=[d["id"] for d in docs],
        documents=[d["text"] for d in docs],
        # Some ChromaDB versions reject empty metadata dicts, so pass None instead
        metadatas=[d.get("metadata") or None for d in docs]
    )

def search_knowledge_base(query: str, n_results: int = 5) -> str:
    results = kb.query(query_texts=[query], n_results=n_results)
    docs = results["documents"][0]
    return "\n\n---\n\n".join(docs) if docs else "No relevant documents found."

# ─── 2. Define Agent Tools ─────────────────────────────────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search internal knowledge base for relevant information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "n_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Python math expression, e.g. '2 ** 10'"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

def execute_tool(name: str, args: dict) -> str:
    if name == "search_knowledge_base":
        return search_knowledge_base(args["query"], args.get("n_results", 5))
    elif name == "calculate":
        try:
            # Caution: eval with empty builtins is not a real sandbox.
            # Only use this with input you trust.
            return str(eval(args["expression"], {"__builtins__": {}}, {}))
        except Exception as e:
            return f"Error: {e}"
    return f"Unknown tool: {name}"

# ─── 3. ReAct Agent Loop ──────────────────────────────────────────────────
SYSTEM = """You are a helpful AI assistant with access to a knowledge base.
Use the search_knowledge_base tool to look up relevant information before answering.
Think through problems step by step. Always cite which documents informed your answer."""

def chat(user_message: str, history: list | None = None) -> str:
    if history is None:
        history = []
    messages = [{"role": "system", "content": SYSTEM}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)

        # No tool calls = final answer
        if not msg.tool_calls:
            return msg.content

        # Execute each tool call and feed the result back to the model
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = execute_tool(tc.function.name, args)
            print(f"[Tool: {tc.function.name}] → {result[:100]}...")
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result
            })

# ─── Usage ─────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Ingest some documents
    ingest_documents([
        {"id": "1", "text": "MCP (Model Context Protocol) was released by Anthropic in November 2024."},
        {"id": "2", "text": "LangGraph is the #1 ranked AI agent framework for production stateful workflows in 2026."},
        {"id": "3", "text": "GPT-5 was launched on August 7, 2025, achieving 94.6% on advanced math benchmarks."},
    ])
    answer = chat("What agent framework should I use for a production workflow, and when was GPT-5 released?")
    print(f"\nAnswer:\n{answer}")
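
The calculate tool above leans on eval, which is hard to lock down even with empty builtins. One possible alternative is to walk the expression's syntax tree and allow only arithmetic. This is a minimal sketch, not part of the original system: safe_calculate and the operator whitelist are names made up for illustration, and it does not cap the size of exponents.

```python
import ast
import operator

# Whitelist of arithmetic operators the calculator is willing to evaluate
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, /, //, %, ** on numeric literals only."""

    def _eval(node):
        # Plain numbers (bool is excluded even though it subclasses int)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, and everything else are rejected
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    try:
        tree = ast.parse(expression, mode="eval")
        return str(_eval(tree.body))
    except Exception as e:
        # Mirror the "Error: ..." convention used by execute_tool
        return f"Error: {e}"
```

Swapping it in would mean calling safe_calculate(args["expression"]) inside the calculate branch of execute_tool. Function calls such as __import__('os') come back as an "Error: ..." string instead of running.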

12.2 Minimal MCP Client

"""
Consuming an MCP server from a Python client.
Requires: pip install mcp anthropic
"""
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import anthropic

async def run_with_mcp(user_query: str):
    # Connect to an MCP server (e.g., filesystem, database)
    server_params = StdioServerParameters(
        command="python",
        args=["my_mcp_server.py"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools_response = await session.list_tools()
            tools = [
                {
                    "name": t.name,
                    "description": t.description,
                    "input_schema": t.inputSchema
                }
                for t in tools_response.tools
            ]
            print(f"Available tools: {[t['name'] for t in tools]}")

            # Use Claude with MCP tools
            anthropic_client = anthropic.Anthropic()
            messages = [{"role": "user", "content": user_query}]
            while True:
                response = anthropic_client.messages.create(
                    model="claude-3-5-sonnet-20241022",
                    max_tokens=4096,
                    tools=tools,
                    messages=messages
                )

                # Anything other than a tool request ends the loop
                # (this also avoids spinning forever on "max_tokens")
                if response.stop_reason != "tool_use":
                    return "".join(
                        block.text for block in response.content
                        if block.type == "text"
                    )

                # Record the assistant turn once, then answer every tool call in it
                messages.append({"role": "assistant", "content": response.content})
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        # Execute tool via MCP
                        result = await session.call_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": str(result.content)
                        })
                messages.append({"role": "user", "content": tool_results})

if __name__ == "__main__":
    print(asyncio.run(run_with_mcp("List all files in the current directory and summarize their contents")))

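The client above passes str(result.content) back to the model, which sends the Python repr of the content blocks rather than their text. Here is a small sketch of one way to flatten a tool result first. It assumes the result exposes a content list whose text blocks carry a text attribute; tool_result_to_text is a hypothetical helper, and the stand-in objects below only imitate that shape.

```python
from types import SimpleNamespace


def tool_result_to_text(result) -> str:
    """Join the text of every content block in a tool result."""
    parts = []
    for block in getattr(result, "content", None) or []:
        text = getattr(block, "text", None)
        if text is not None:
            parts.append(text)
        else:
            # Non-text block (image, embedded resource): keep a short marker
            parts.append(f"[{getattr(block, 'type', 'unknown')} content omitted]")
    return "\n".join(parts)


# Stand-in objects shaped like a tool result, for illustration only
fake_result = SimpleNamespace(content=[
    SimpleNamespace(type="text", text="README.md"),
    SimpleNamespace(type="image", data="..."),
])
print(tool_result_to_text(fake_result))
```

In the loop, the "content" field of each tool_result could then be tool_result_to_text(result) instead of str(result.content).
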
12.3 A2A Agent Discovery & Collaboration

"""
Simplified A2A protocol implementation.
Real A2A uses JSON-RPC 2.0 over HTTP with SSE for streaming.
"""
import httpx
import asyncio
from dataclasses import dataclass

@dataclass
class AgentCard:
    name: str
    endpoint: str
    skills: list[dict]
    version: str = "1.0"

class A2AClient:
    """Discovers and calls remote A2A agents."""

    async def discover_agent(self, agent_url: str) -> AgentCard:
        """Fetch agent card from a known URL."""
        async with httpx.AsyncClient() as client:
            response = await client.get(f"{agent_url}/.well-known/agent.json")
            response.raise_for_status()  # fail loudly on 4xx/5xx
            data = response.json()
            return AgentCard(
                name=data["name"],
                endpoint=data["endpoint"],
                skills=data["skills"],
                version=data.get("version", "1.0")
            )

    async def send_task(self, agent: AgentCard, skill_id: str,
                        message: str) -> str:
        """Send a task to a remote agent and get the result."""
        payload = {
            "jsonrpc": "2.0",
            "method": "tasks/send",
            "id": "req-1",
            "params": {
                "skill": skill_id,
                "message": {"role": "user", "parts": [{"text": message}]}
            }
        }
        async with httpx.AsyncClient() as client:
            response = await client.post(
                agent.endpoint,
                json=payload,
                headers={"Authorization": "Bearer my-token"}
            )
            response.raise_for_status()
            result = response.json()
            return result["result"]["output"]["message"]["parts"][0]["text"]

class OrchestratorAgent:
    """
    An orchestrator that discovers and delegates to specialist agents.
    This is the heart of a multi-agent A2A system.
    """

    def __init__(self):
        self.a2a = A2AClient()
        self.registry: dict[str, AgentCard] = {}

    async def register_agent(self, url: str):
        card = await self.a2a.discover_agent(url)
        self.registry[card.name] = card
        print(f"Registered agent: {card.name} with skills: {[s['id'] for s in card.skills]}")

    async def handle_request(self, user_request: str) -> str:
        # In a real system, an LLM would decide which agent to use
        # Here we do simple keyword routing
        text = user_request.lower()
        if "payment" in text:
            agent_name, skill_id = "payment-agent", "process_payment"
        elif "weather" in text:
            agent_name, skill_id = "weather-agent", "get_forecast"
        else:
            return "I don't have a specialist agent for this request."

        # The agent may not have been discovered under this name
        agent = self.registry.get(agent_name)
        if agent is None:
            return f"Agent '{agent_name}' is not registered."
        return await self.a2a.send_task(agent, skill_id, user_request)

async def main():
    orchestrator = OrchestratorAgent()
    # Discover available agents (in production: from a registry service)
    await orchestrator.register_agent("https://payments.example.com")
    await orchestrator.register_agent("https://weather.example.com")
    result = await orchestrator.handle_request(
        "Process a $50 payment for order #1234"
    )
    print(f"Result: {result}")

if __name__ == "__main__":
    asyncio.run(main())

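The send_task method above hardcodes the request id and assumes every reply contains a result. Here is a small sketch of how those two pieces might be separated out, using the same simplified payload shape as the code above (not the full A2A specification). build_task_request and extract_text are hypothetical helper names.

```python
import uuid


def build_task_request(skill_id: str, message: str) -> dict:
    """Build a JSON-RPC 2.0 request in the simplified shape used above."""
    return {
        "jsonrpc": "2.0",
        "method": "tasks/send",
        "id": str(uuid.uuid4()),  # unique per request instead of a fixed "req-1"
        "params": {
            "skill": skill_id,
            "message": {"role": "user", "parts": [{"text": message}]},
        },
    }


def extract_text(response: dict) -> str:
    """Return the reply text, or raise if the agent sent a JSON-RPC error."""
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"Agent error {err.get('code')}: {err.get('message')}")
    return response["result"]["output"]["message"]["parts"][0]["text"]
```

Keeping these as pure functions means the payload and the parsing can be checked without a network call or a running agent.
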
Built with care for the Rawalpindi/Islamabad dev community — and every hacker reading this on a Thursday afternoon. Go build something.

API pagination: cursor vs offset

Thu, 04 Jun 2026 00:00:00 GMT

June 4, 2026

Cursor pagination (recommended)

Stable under inserts/deletes
Uses an opaque cursor (e.g., last seen id + sort key)
Easy to cache and resume

SELECT *
FROM items
WHERE (created_at, id) < (:created_at, :id)
ORDER BY created_at DESC, id DESC
LIMIT 50;
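
To make the "opaque cursor" idea concrete, here is a rough sketch that mirrors the query above against an in-memory list instead of a database. Everything in it is an assumption for illustration: the ITEMS data, the base64-encoded JSON cursor format, and the helper names encode_cursor, decode_cursor, and fetch_page.

```python
import base64
import json
from typing import Optional

# Hypothetical stand-in for the items table; some rows share a created_at value
ITEMS = [
    {"id": i, "created_at": f"2026-06-{(i + 1) // 2:02d}"} for i in range(1, 11)
]


def encode_cursor(created_at: str, item_id: int) -> str:
    """Pack the last seen sort key and id into an opaque string."""
    raw = json.dumps([created_at, item_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> tuple:
    created_at, item_id = json.loads(base64.urlsafe_b64decode(cursor))
    return created_at, item_id


def fetch_page(cursor: Optional[str] = None, limit: int = 3):
    """Return (rows, next_cursor), newest first."""
    rows = sorted(ITEMS, key=lambda r: (r["created_at"], r["id"]), reverse=True)
    if cursor is not None:
        boundary = decode_cursor(cursor)
        # Tuple comparison plays the role of (created_at, id) < (:created_at, :id)
        rows = [r for r in rows if (r["created_at"], r["id"]) < boundary]
    page = rows[:limit]
    # A short page means there is nothing left to fetch
    next_cursor = (
        encode_cursor(page[-1]["created_at"], page[-1]["id"])
        if len(page) == limit
        else None
    )
    return page, next_cursor


page, cursor = fetch_page()
while cursor is not None:
    page, cursor = fetch_page(cursor)
```

Including id in the cursor matters because created_at alone is not unique: without the tie-breaker, rows that share a timestamp with the last seen row could be skipped.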

Offset pagination (avoid at scale)

Can skip/duplicate rows when data changes
Gets slower as offset grows
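
The skip/duplicate problem is easy to reproduce with plain lists. This is a toy simulation, assuming rows are just ids sorted newest first; offset_page and keyset_page are illustrative names, not a real API.

```python
def offset_page(rows, offset, limit):
    """Offset pagination: count rows from the top every time."""
    return rows[offset:offset + limit]


def keyset_page(rows, last_id, limit):
    """Cursor pagination: continue strictly after the last seen id."""
    remaining = [r for r in rows if last_id is None or r < last_id]
    return remaining[:limit]


# Case 1: a new row arrives between page 1 and page 2
rows = [5, 4, 3, 2, 1]
page1 = offset_page(rows, 0, 2)                    # [5, 4]
rows.insert(0, 6)                                  # insert shifts everything down
dup_offset = offset_page(rows, 2, 2)               # [4, 3], so 4 is shown twice
dup_keyset = keyset_page(rows, page1[-1], 2)       # [3, 2]

# Case 2: a row is deleted between page 1 and page 2
rows = [5, 4, 3, 2, 1]
page1 = offset_page(rows, 0, 2)                    # [5, 4]
rows.remove(5)                                     # delete shifts everything up
skip_offset = offset_page(rows, 2, 2)              # [2, 1], so 3 is never shown
skip_keyset = keyset_page(rows, page1[-1], 2)      # [3, 2]
```

In both cases the keyset version returns the same second page, because it depends on the last row the client saw rather than on how many rows sit above it.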

If you need “page numbers”, store cursors per page server-side.

TLS in plain English

Thu, 04 Jun 2026 00:00:00 GMT

June 4, 2026

The handshake

Client says: “Here are the cipher suites I support”
Server picks a cipher suite and replies with its certificate (which contains its public key)
Client verifies the certificate chain
They agree on session keys (usually via ECDHE)
After that: traffic is encrypted + authenticated
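
Here is a short sketch of those steps from the client side, using Python's standard ssl module. The function names make_context and describe_connection are made up for illustration, and the TLS 1.2 floor is an assumption about what you would want rather than something the handshake requires.

```python
import socket
import ssl


def make_context() -> ssl.SSLContext:
    # create_default_context() turns on certificate chain verification
    # and hostname checking against the system trust store
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy: no legacy TLS
    return ctx


def describe_connection(host: str, port: int = 443) -> dict:
    """Connect, run the handshake, and report what was negotiated."""
    ctx = make_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        # wrap_socket performs the handshake; it raises
        # ssl.SSLCertVerificationError if the chain or hostname is wrong
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return {
                "protocol": tls.version(),
                "cipher": tls.cipher()[0],
                "subject": tls.getpeercert().get("subject"),
            }
```

Calling describe_connection("example.com") needs network access. The returned dict shows the negotiated protocol version and cipher suite, which is a quick way to see the outcome of the "agree on session keys" step.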

What you get

Confidentiality (encryption)
Integrity (tamper detection)
Authenticity (you’re talking to the right server)

Common gotchas

Expired certs
Wrong hostname (SAN mismatch)
Missing intermediate certs
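
Expired certificates are the easiest of these to catch early. This is a small sketch that turns a certificate's notAfter string (the format returned by getpeercert() in Python's ssl module) into days remaining. days_until_expiry is a hypothetical helper, and the now parameter exists only so the function can be checked against fixed dates.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires (negative if already expired).

    not_after uses the certificate time format, e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


# Fixed reference time so the example does not depend on today's date
reference = ssl.cert_time_to_seconds("Dec 31 00:00:00 2029 GMT")
remaining = days_until_expiry("Jan  1 00:00:00 2030 GMT", now=reference)
```

A monitoring job could alert when the value drops below some threshold, such as 14 days. That threshold is a judgment call, not a standard.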

Docker layers: 6 rules for smaller images

Thu, 04 Jun 2026 00:00:00 GMT

June 4, 2026

6 rules

Use a small base (alpine/distroless when possible)
Copy only what you need (use .dockerignore)
Install deps before copying app source
Combine commands to reduce layers
Use multi-stage builds
Pin versions to avoid surprise rebuilds

FROM node:24-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM gcr.io/distroless/nodejs24-debian12
COPY --from=build /app/dist /app
CMD ["/app/index.js"]
