oreoro

Data Compression Explained: A Visual Guide to the Whole Book

Sat, 20 Jun 2026 00:00:00 GMT

Property	Value
Source	Data Compression Explained
Original author	Matt Mahoney
Source last update	Apr. 15, 2013
Draft type	Visual Notion blog post / study guide
Audience	Developers, technical writers, ML engineers, compression-curious readers
Core idea	Compression is prediction plus coding, with transforms and perception models doing the heavy lifting.

Source note: This post is an original visual study guide based on Matt Mahoney's book. It paraphrases and organizes the ideas for blog reading. It is not a redistributed copy of the book. Historical benchmark numbers and tool rankings should be read in the context of the source's 2013 update.

The Whole Book in One Picture

flowchart LR
    A[Raw data] --> B{Can we expose structure?}
    B -->|Yes| C[Transform]
    B -->|No| D[Model predicts what comes next]
    C --> D
    D --> E[Coder maps probability to bits]
    E --> F[Archive or stream]
    F --> G[Decoder]
    G --> H[Inverse transform]
    H --> I[Original data or acceptable approximation]

    J[Benchmarks] -. measure .-> C
    J -. measure .-> D
    J -. measure .-> E
    K[Human perception] -. lossy path .-> B

Compression looks like file shrinkage, but the book frames it as a deeper engineering problem:

Layer	Question	Main chapters
Theory	What is compressible at all?	1
Measurement	How do we compare compressors fairly?	2
Coding	Given probabilities, how many bits are needed?	3
Modeling	Where do good probabilities come from?	4
Transforms	How do we rearrange data so simple models work?	5
Lossy compression	What can we throw away without humans noticing?	6

The shortest honest summary:

Compression is the search for shorter descriptions. Coding is mostly solved. Modeling is the hard part. Transforms make modeling easier. Lossy compression adds a model of human perception.

Fast Mental Models

1. Bits Measure Surprise

If an event has probability p, the ideal code length is:

ideal bits = log2(1 / p) = -log2(p)

Probability	Surprise	Intuition
`1/2`	1 bit	A fair yes/no question
`1/4`	2 bits	One outcome among four equal choices
`1/256`	8 bits	One byte under a uniform byte model
Near 1	Near 0 bits	Almost expected
Near 0	Many bits	Very surprising

Visual rule:

common symbol      -> short code
rare symbol        -> long code
unknown pattern    -> expensive code
understood pattern -> tiny description
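As a quick check of the table above, here is a minimal sketch that evaluates the ideal code length formula. The function name `ideal_bits` is an illustrative choice, not something from the book.

```python
import math

def ideal_bits(p: float) -> float:
    """Ideal code length in bits for an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Probabilities from the table, plus one near 1 and one near 0.
for p in (1 / 2, 1 / 4, 1 / 256, 0.99, 0.001):
    print(f"p = {p:<10.6g} ideal bits = {ideal_bits(p):.3f}")
```

Real coders emit whole bits at the end of a message, so these fractional values are targets that a good coder approaches on average.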

2. Compression Is Prediction

flowchart TD
    H[History already decoded] --> M[Model]
    M --> P[Probability for next bit or symbol]
    P --> C[Coder]
    C --> O[Compressed output]
    O --> D[Decoder repeats same model]
    D --> H2[Recovered next bit or symbol]

The compressor and decompressor must make the same predictions from the same history. The compressed file mainly stores the information that the model could not predict.

3. Lossy Compression Is Perception-Aware Prediction

Lossless: recover exactly the same bits.
Lossy: recover something humans judge close enough.

That one change moves the problem from pure information theory into psychology, vision, hearing, language, and AI.

Chapter Map

Chapter	Visual handle	What it teaches
1. Information Theory	Limits map	Why random data cannot be compressed and why modeling matters more than coding.
2. Benchmarks	Tradeoff dashboard	How size, speed, memory, data set choice, and rules change compressor rankings.
3. Coding	Probability-to-bits machine	Huffman, arithmetic coding, asymmetric coding, numeric codes, archives, checksums, encryption.
4. Modeling	Prediction engine	Fixed-order models, variable-order models, context mixing, PAQ, ZPAQ, and why modeling is hard.
5. Transforms	Pattern-exposure tools	RLE, LZ77, LZW, BWT, filters, executable transforms, precompression.
6. Lossy Compression	Human sensor model	Images, video, audio, JPEG, MPEG, psychoacoustics, and recompression.

1. Information Theory

Compression Starts with a Count

There are 2^n different binary strings of length n. There are fewer than 2^n shorter binary strings. Therefore, no lossless compressor can make every n-bit input shorter while still allowing perfect decompression.

All n-bit inputs:
[000...000] [000...001] [000...010] ... [111...111]       count = 2^n

Possible shorter outputs:
[] [0] [1] [00] [01] ... [length < n]       count = 2^n - 1

One-to-one decoding cannot map more inputs into fewer outputs.
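The counting argument can be checked by brute force for small n. This is only a sketch: the helper names are made up for illustration, and the enumeration is deliberately naive so the counts are not assumed in advance.

```python
from itertools import product

def count_strings(length: int) -> int:
    """Count binary strings of exactly `length` bits by enumerating them."""
    return sum(1 for _ in product("01", repeat=length))

def count_shorter(n: int) -> int:
    """Count binary strings of length 0 .. n-1 (the empty string included)."""
    return sum(count_strings(k) for k in range(n))

for n in range(1, 9):
    inputs, outputs = count_strings(n), count_shorter(n)
    print(f"n={n}: {inputs} inputs, {outputs} shorter outputs, "
          f"shortfall {inputs - outputs}")
```

The shortfall is always at least one, so at least one n-bit input has no shorter output left to map to.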

The key result:

Claim	Meaning
No universal compression	A compressor that shrinks every file cannot exist.
Some files must expand	If a compressor shrinks some inputs, it must make other inputs longer or refuse them.
Random-looking data is usually incompressible	Most possible strings have no shorter description.
Useful data is often compressible	Human-created data usually has patterns, constraints, formats, repetition, and meaning.

Why Meaningful Data Compresses

Most possible strings are random. Most strings people store are not:

Data	Why it has structure
English text	Grammar, vocabulary, topic, repeated words, spelling patterns
Source code	Keywords, syntax, indentation, identifiers, libraries
Images	Neighboring pixels are correlated
Audio	Samples are correlated over time and filtered by human hearing
Executables	Instructions, addresses, headers, imported symbols
Backups	Files repeat across versions and machines

Compression works because our data is not drawn uniformly from all possible bit strings. It comes from processes with structure.

Coding Is Bounded

If a model says a symbol has probability p, the best possible code length is approximately -log2(p) bits. You can choose a bad coder and waste bits, but no coder can beat the model's information content on average over data drawn from that model.

flowchart LR
    A[Model says symbol is likely] --> B[Short code]
    C[Model says symbol is rare] --> D[Long code]
    E[Model is wrong] --> F[Compressed size penalty]

The lesson is subtle:

Part	Status
Turning probabilities into bits	Efficient, well-understood
Finding the right probabilities	Hard, open-ended, data-dependent
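To make the "model is wrong" penalty concrete, the sketch below computes the average code length when data follows one distribution but is coded with another. The three-symbol distributions are invented for illustration.

```python
import math

def expected_bits(true_p: dict, model_q: dict) -> float:
    """Average bits per symbol when data follows true_p but is coded with model_q."""
    return sum(p * -math.log2(model_q[s]) for s, p in true_p.items() if p > 0)

true_p = {"a": 0.7, "b": 0.2, "c": 0.1}          # hypothetical source
good_model = dict(true_p)                         # model matches the source
flat_model = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3} # model knows nothing
wrong_model = {"a": 0.1, "b": 0.2, "c": 0.7}      # model is confidently wrong

entropy = expected_bits(true_p, good_model)
for name, q in [("matched", good_model), ("uniform", flat_model), ("wrong", wrong_model)]:
    cost = expected_bits(true_p, q)
    print(f"{name:8s} {cost:.3f} bits/symbol (penalty {cost - entropy:.3f})")
```

The matched model reaches the entropy of the source. Every other model pays extra bits per symbol, no matter how good the coder is.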

Modeling Is Not Computable

A better model can turn a long string into a tiny description. For example, a million digits of pi can be treated as random-looking decimal digits, or as "compute the first million digits of pi." The second description is dramatically shorter, but it requires recognizing the source.

weak model:
314159265358979323846...
-> "digits look independent"
-> many bits

strong model:
314159265358979323846...
-> "this is pi"
-> short program or description

The book connects this to Kolmogorov complexity: the shortest program that outputs a string is an ideal compressed representation, but there is no general algorithm that can always find it.
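A rough way to see both models side by side: the sketch below generates digits of pi with a few lines of code (the "strong model") and then hands the same digits to zlib (standing in for a "weak model" that does not know the source). The generator is Gibbons' unbounded spigot algorithm; using it here is my choice, not something the book prescribes.

```python
import zlib

def pi_digits():
    """Yield decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            # Right-hand side uses the old q, r, n before reassignment.
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def first_digits(count: int) -> str:
    gen = pi_digits()
    return "".join(str(next(gen)) for _ in range(count))

digits = first_digits(1000).encode("ascii")
packed = zlib.compress(digits, 9)
print(len(digits), "digit characters")
print(len(packed), "bytes after zlib")
```

zlib can squeeze out the slack of storing a decimal digit in a whole byte, but not much more. The generator above stays the same size however many digits you ask for.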

Compression and AI

Prediction is a sign of understanding:

If a system understands...	It can predict...
English	likely next words
Images	likely neighboring pixels
Audio	likely future samples
Code	likely syntax and instruction patterns
File formats	likely headers, fields, and constraints

This is why compression and AI meet. A compressor that understands a data source can describe it more compactly. A perfect general compressor would need a very broad kind of understanding.

Chapter 1 Takeaways

Takeaway	Why it matters
Random data cannot be compressed	Do not expect magic from encrypted, already-compressed, or random data.
Compression is model plus coder	Separate probability estimation from bit representation.
Coding has mathematical limits	Better coding helps, but only up to the model's quality.
Modeling is the hard problem	Better compression usually comes from better prediction.
Understanding creates compression	The more structure you can exploit, the shorter the description.

2. Benchmarks

What Benchmarks Actually Measure

Compression benchmarks compare compressors on a chosen data set under chosen rules.

flowchart TD
    A[Benchmark] --> B[Data set]
    A --> C[Rules]
    A --> D[Metrics]
    D --> E[Compressed size]
    D --> F[Compression speed]
    D --> G[Decompression speed]
    D --> H[Memory use]
    C --> I[Can include decompressor?]
    C --> J[Can tune to files?]
    C --> K[Single file or archive?]

The big triangle:

         smaller output
               /\
              /  \
             /    \
            /      \
           /        \
less memory -------- faster speed

You usually cannot optimize all three at once. Maximum compression tools are often slow and memory-hungry. Practical formats often give up ratio to win speed, streaming, random access, or compatibility.

Bits Per Character

The book often uses bpc, or bits per character, for byte-oriented corpora.

bpc	Meaning
8.0	No compression for byte data
6.0	25 percent smaller than original
4.0	Half the original size
2.0	One quarter of original size
Lower	Better compression, assuming the same input data
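The bpc figure is simple arithmetic: compressed bits divided by original characters. A small sketch, using Python's standard `zlib` module as a stand-in compressor and a made-up sample text:

```python
import zlib

def bits_per_character(original: bytes, compressed: bytes) -> float:
    """Compressed size in bits divided by the original size in bytes."""
    if not original:
        raise ValueError("original data must not be empty")
    return 8.0 * len(compressed) / len(original)

# Hypothetical sample; highly repetitive, so it compresses far better than real text.
sample = b"the quick brown fox jumps over the lazy dog. " * 200
packed = zlib.compress(sample, 9)
print(f"{bits_per_character(sample, packed):.3f} bpc")
```

Because the sample is one sentence repeated, the number it prints says more about the sample than about zlib. That is the benchmark lesson in miniature.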

Benchmark Landscape

Benchmark	What it emphasizes	Why it matters
Calgary Corpus	Classic mixed small files	Historical baseline for text compression research.
Large Text Compression Benchmark	Large Wikipedia XML text	Natural language modeling and long-range structure.
Hutter Prize	Compression as AI research	Rewards improvements on a fixed text corpus with decompressor included.
Maximum Compression	Maximum ratio on mixed files	Encourages aggressive tuning for size.
Generic Compression Benchmark	Untuned universal prediction	Tests generality rather than file-type tricks.
Compression Ratings	Size and speed scoring	Makes tradeoffs adjustable by user preference.
Other public benchmarks	Multiple corpora and rule sets	Shows that rankings depend on test design.
File system studies	Real-world storage mix	Reveals what data actually exists on machines.

Why Rankings Shift

Two compressors can trade places when any of these changes:

Variable	Effect
Data type	Text, images, executables, backups, logs, and audio favor different methods.
File size	Small files make headers and model startup costs visible.
Archive rules	Solid archives can exploit similarity across files.
Decompressor inclusion	Including source or executable rewards simpler decoders.
Memory limit	Large models can dominate if memory is unrestricted.
Speed limit	Slow context mixers may lose to practical LZ-family tools.
Tuning policy	Per-file options can inflate benchmark-specific performance.

Visual: Benchmark as a Dashboard

+--------------------------------------------------+
| Compressor: example                              |
+-------------------+------------------------------+
| Size              | 1.95 bpc                     |
| Compression time  | slow                         |
| Decompression     | medium                       |
| Memory            | high                         |
| Decoder included  | yes                          |
| Data set          | text-heavy                   |
| Good use case     | archival / research          |
+-------------------+------------------------------+

Chapter 2 Takeaways

Takeaway	Why it matters
Benchmarks are not neutral	They encode assumptions about data and priorities.
Size is only one metric	Real systems care about speed, memory, streaming, and compatibility.
Historical leaderboards age	Use the book's rankings as context, not current product advice.
Data set choice dominates	A compressor can look brilliant on one corpus and ordinary on another.
A benchmark is a contract	Read the rules before interpreting the chart.

3. Coding

Coder Job Description

A coder receives probabilities from a model and emits bits close to the theoretical ideal.

flowchart LR
    A[Symbol] --> B[Model probability]
    B --> C[Coder]
    C --> D[Bitstream]
    D --> E[Decoder]
    E --> F[Same symbol]

The coder must be:

Requirement	Reason
Decodable	The original symbols must be recoverable.
Efficient	Common symbols should use fewer bits.
Synchronized	Decoder must reproduce the same boundaries and model states.
Practical	Real files need headers, error checks, and sometimes encryption.

Huffman Coding

Huffman coding builds a prefix tree. Frequent symbols sit near the root. Rare symbols sit deeper.

    root
   /    \
common   *
        / \
    medium rare

Core idea:

Symbol probability	Huffman effect
High	Shorter integer number of bits
Low	Longer integer number of bits
Exact powers of 1/2	Very efficient
Awkward probabilities	Wastes some space due to whole-bit code lengths

Strengths:

Simple.
Fast.
Widely used.
Good with static or block models.

Limits:

Code lengths are whole numbers of bits.
Binary alphabets cannot be compressed by basic Huffman alone.
A full table or canonical description may need to be stored.
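The strengths and limits above show up in a few lines of code. This is a minimal sketch that computes Huffman code lengths only (not the bit patterns), assuming symbol counts are already known. The function name and sample text are illustrative.

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(freqs: dict) -> dict:
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}  # a lone symbol still needs one bit
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # two lightest subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

text = "abracadabra alakazam"
freqs = Counter(text)
lengths = huffman_code_lengths(freqs)
total = sum(freqs.values())
avg_bits = sum(freqs[s] * lengths[s] for s in freqs) / total
entropy = -sum((c / total) * math.log2(c / total) for c in freqs.values())
print(f"Huffman average: {avg_bits:.3f} bits/symbol")
print(f"Order-0 entropy: {entropy:.3f} bits/symbol")
```

With probabilities that are exact powers of 1/2 the two numbers coincide. With a two-symbol alphabet every symbol costs one bit however skewed the counts are, which is the binary-alphabet limit listed above.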

Arithmetic Coding

Arithmetic coding represents an entire message as a subinterval inside [0, 1).

Initial interval:
[0 ------------------------------------------------ 1)

After likely symbol:
[0 -------- 0.7)

After next symbol:
[0.28 --- 0.42)

After more symbols:
[0.314159 ----------------)

Output a binary number inside the final interval.

Why it matters:

Feature	Benefit
Fractional bit efficiency	Avoids Huffman's whole-bit rounding.
Works well for binary prediction	Ideal for bitwise models.
Adapts naturally	Model can update after every symbol.
Near-Shannon performance	Strong practical coding method.
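The interval picture can be traced with exact fractions. This is a teaching sketch with a fixed three-symbol model that I made up. Production arithmetic and range coders use finite-precision integers with renormalization instead of unbounded fractions.

```python
from fractions import Fraction
import math

def narrow(message: str, probs: dict):
    """Return the final [low, high) interval for message under a fixed model."""
    # Give each symbol a slice of [0, 1), in dictionary order.
    starts, cum = {}, Fraction(0)
    for sym, p in probs.items():
        starts[sym] = cum
        cum += p
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        low += width * starts[sym]   # move to the symbol's slice
        width *= probs[sym]          # shrink by the symbol's probability
    return low, low + width

def decode(value: Fraction, length: int, probs: dict) -> str:
    """Recover `length` symbols from any number inside the final interval."""
    out = []
    for _ in range(length):
        cum = Fraction(0)
        for sym, p in probs.items():
            if cum <= value < cum + p:
                out.append(sym)
                value = (value - cum) / p   # rescale back to [0, 1)
                break
            cum += p
    return "".join(out)

probs = {"a": Fraction(7, 10), "b": Fraction(2, 10), "c": Fraction(1, 10)}
low, high = narrow("aabac", probs)
print("interval width:", float(high - low))
print("ideal bits:    ", -math.log2(high - low))
print("decoded:       ", decode((low + high) / 2, 5, probs))
```

The interval width equals the product of the symbol probabilities, so the number of bits needed to name a point inside it tracks the sum of the per-symbol surprises.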

Asymmetric Binary Coding

Asymmetric binary coding is another way to code predicted bits efficiently. It uses a single integer-like state rather than arithmetic coding's interval endpoints. It matters because the same theory can be implemented with different machine-level tradeoffs.

Coding family	Mental model
Huffman	Tree of prefix codes
Arithmetic/range	Shrinking probability interval
Asymmetric binary	State machine that packs bits according to probability

Numeric Codes

Some values are not arbitrary symbols. They are counts, offsets, lengths, or prediction errors. Numeric codes exploit common number distributions.

Code	Good for	Visual shape
Unary	Very small positive integers	`0`, `10`, `110`, `1110`
Rice	Geometric-like distributions with power-of-two parameter	quotient plus remainder
Golomb	Geometric-like distributions with flexible parameter	quotient plus bounded remainder
Extra-bit codes	Ranges with extra low bits	length classes and offset details

These appear in systems where the model says "small numbers are common, large numbers are rare."
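A minimal sketch of the unary and Rice rows from the table, written with bit strings for readability. The convention of n ones followed by a zero matches the unary shapes shown above. Real implementations pack bits into bytes instead.

```python
def unary(n: int) -> str:
    """n ones followed by a terminating zero: 0 -> '0', 3 -> '1110'."""
    return "1" * n + "0"

def rice_encode(n: int, k: int) -> str:
    """Rice code with divisor 2**k: unary quotient, then a k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    remainder = format(r, f"0{k}b") if k else ""
    return unary(q) + remainder

def rice_decode(bits: str, k: int) -> int:
    """Inverse of rice_encode for a single code word."""
    q = bits.index("0")                       # number of leading ones
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) | r

for n in (0, 1, 5, 9, 40):
    print(n, rice_encode(n, 2))
```

Small values stay short, and the parameter k sets how quickly code words grow, which is why the choice of k should follow the expected size of the numbers.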

Archive Formats

A compression algorithm is not the whole file format. Archives also need structure.

flowchart LR
    A[Archive header] --> B[File metadata]
    B --> C[Compressed payload]
    C --> D[Error check]
    D --> E[Optional encryption metadata]

Important archive concerns:

Concern	Why it matters
Single-file vs multi-file	Affects metadata and file recovery.
Solid compression	Similar files compressed together can shrink more.
Random access	Solid archives may make one-file extraction slower.
Error detection	Detects corruption after storage or transmission.
Encryption	Protects confidentiality but makes data look random afterward.
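The solid-compression row is easy to demonstrate. The sketch below uses `zlib` as a stand-in for an archiver and two invented "files" that are nearly identical to each other but individually random.

```python
import random
import zlib

rng = random.Random(0)   # fixed seed so the run is repeatable
# Hypothetical files: incompressible alone, almost the same as each other.
file_a = bytes(rng.getrandbits(8) for _ in range(2000))
file_b = file_a[:1000] + b"edited" + file_a[1000:]

separate = len(zlib.compress(file_a, 9)) + len(zlib.compress(file_b, 9))
solid = len(zlib.compress(file_a + file_b, 9))
print("separate:", separate, "bytes")
print("solid:   ", solid, "bytes")
```

The solid stream is smaller because the second file is coded mostly as references into the first. The cost is the random-access row: extracting the second file means decoding the first.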

Error Detection

Method	What it catches
Parity	Simple odd/even bit errors, weak but cheap.
CRC-32	Common accidental corruption check.
Adler-32	Fast checksum used in some compression contexts.
Cryptographic hash	Strong integrity identity, designed against adversarial collisions.

Error detection is not compression, but production archives need it.
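For a feel of how the methods in the table differ, here is a small sketch using Python's standard `zlib` and `hashlib` modules. The payload and the flipped bit positions are arbitrary, and `flip_bits` and `parity` are helper names invented for this example.

```python
import hashlib
import zlib

def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of data with the given bit positions inverted."""
    out = bytearray(data)
    for pos in positions:
        out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)

def parity(data: bytes) -> int:
    """Single parity bit: 1 if the number of set bits is odd."""
    return sum(bin(b).count("1") for b in data) % 2

payload = b"compressed payload bytes"
one_flip = flip_bits(payload, [29])
two_flips = flip_bits(payload, [29, 70])

checks = {
    "parity": parity,
    "CRC-32": zlib.crc32,
    "Adler-32": zlib.adler32,
    "SHA-256": lambda d: hashlib.sha256(d).hexdigest(),
}
for name, check in checks.items():
    print(f"{name:8s} one flip caught: {check(payload) != check(one_flip)}, "
          f"two flips caught: {check(payload) != check(two_flips)}")
```

A single parity bit misses any even number of flipped bits, which is why it sits at the weak end of the table.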

Chapter 3 Takeaways

Takeaway	Why it matters
Coding maps probability to bits	It is the final packing step.
Huffman is simple but rounded	Whole-bit lengths are a real limitation.
Arithmetic coding is closer to ideal	It fits adaptive and binary models well.
Numeric codes encode structured integers	They are useful for lengths, offsets, runs, and errors.
Archives are systems	Metadata, checks, encryption, and extraction rules matter.

4. Modeling

The Hard Part

A model estimates what comes next. Once we have a good probability, coding is mechanical. The model decides whether compression is mediocre or excellent.

history:  "the quick brown "
model:    next symbol is likely "f", "d", "c", ...
coder:    short code for likely next symbol

Static vs adaptive:

Model type	How it works	Tradeoff
Static	Analyze data, send model, then coded data	Good if model cost is small and data is stable.
Adaptive	Update model as data is read	Avoids sending full model, tracks local changes.

Fixed-Order Models

An order n model predicts the next symbol from the previous n symbols.

Order 0: no context
P(next)

Order 1: one previous symbol
P(next | previous)

Order 3: three previous symbols
P(next | previous_3, previous_2, previous_1)

Example:

Context	Next-symbol table
`q`	`u` is very likely in English
`th`	`e`, `a`, `i`, `o` are plausible
`ing`	space, punctuation, or suffix continuation

Why fixed order breaks:

Order too low	Misses useful context.
Order too high	Most contexts are rare or unseen.
Result	Need smoothing, fallback, or variable order.
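The effect of context order can be measured without writing a coder: add up -log2(p) for each byte under an adaptive model. The sketch below does that for several orders. The smoothing constant `alpha` and the sample text are assumptions of this example, not values from the book.

```python
import math
from collections import defaultdict

def adaptive_cost_bits(data: bytes, order: int, alpha: float = 0.01) -> float:
    """Ideal coded size of data under an adaptive order-`order` byte model.

    Counts start empty and are updated after each byte, so a decoder could
    rebuild the same counts. `alpha` keeps unseen bytes from getting
    probability zero.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i, byte in enumerate(data):
        ctx = data[max(0, i - order):i]          # previous `order` bytes
        p = (counts[ctx][byte] + alpha) / (totals[ctx] + 256 * alpha)
        bits += -math.log2(p)                    # cost of this byte
        counts[ctx][byte] += 1                   # learn after coding
        totals[ctx] += 1
    return bits

text = b"the quick brown fox jumps over the lazy dog. " * 40
for order in (0, 1, 2, 3):
    bits = adaptive_cost_bits(text, order)
    print(f"order {order}: {bits / len(text):.3f} bits per character")
```

On this repetitive sample higher orders win easily. On short or varied data the many rarely seen high-order contexts start to cost more than they save, which is the brittleness described above.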

Bytewise, Bitwise, and Indirect Models

Model style	Unit	Useful when
Bytewise	Predict next byte	Text, simple binary data, byte-aligned formats.
Bitwise	Predict next bit	Arithmetic coding, mixed binary patterns, precise probability updates.
Indirect	Use hashed or transformed contexts	Large context spaces where full tables are too costly.

Variable-Order Models

Variable-order models keep statistics for multiple context lengths and choose or mix them.

flowchart TD
    A[Long context] --> B{Seen enough?}
    B -->|Yes| C[Use long-context prediction]
    B -->|No| D[Back off]
    D --> E[Medium context]
    E --> F{Seen enough?}
    F -->|Yes| G[Use medium-context prediction]
    F -->|No| H[Short context]

DMC

Dynamic Markov Coding predicts bits with a state machine that grows as it observes data. It can split states when histories diverge.

Visual:

state A --0--> state B
state A --1--> state C

if state A is too vague:
state A becomes A1 and A2

PPM

Prediction by Partial Matching uses byte contexts and backs off from longer to shorter contexts. It is a classic text-compression idea.

Try context "tion"
if unknown, try "ion"
if unknown, try "on"
if unknown, try "n"
if unknown, try order 0

The hard detail is handling symbols that have not appeared in a context, often called the escape or zero-frequency problem.

CTW

Context Tree Weighting mixes context tree predictions in a principled bitwise way. Instead of choosing only one context, it combines evidence across a tree.

Context Mixing

Context mixing uses many predictors at once.

flowchart LR
    A[Text model] --> M[Mixer]
    B[Match model] --> M
    C[Low-order model] --> M
    D[File-format model] --> M
    E[Image/audio heuristic] --> M
    M --> P[Final probability]
    P --> Coder[Arithmetic coder]

The mixer can learn which predictors are useful in each situation.

Component	Role
Linear evidence mixing	Combine model outputs with weighted evidence.
Logistic mixing	Mix in probability/logit space for better behavior.
SSE	Secondary Symbol Estimation adjusts predictions using past calibration.
ISSE	Indirect SSE uses contexts to select or adapt estimators.
Match model	If current data matches earlier data, predict continuation from the match.
PAQ models	High-compression family using context mixing.
ZPAQ	A more configurable/archive-oriented successor in the PAQ family.
Crinkler	Specialized compression/linking for executable code size competitions.
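The logistic mixing row can be sketched in a few lines: convert each predictor's probability to log-odds, take a weighted sum, convert back, then nudge the weights by the prediction error. The class name, learning rate, and the two toy predictors are all assumptions for illustration. Real PAQ-style mixers use fixed-point arithmetic and select weight sets by context.

```python
import math

def stretch(p: float) -> float:
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))

def squash(x: float) -> float:
    """Inverse of stretch: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    def __init__(self, n_inputs: int, learning_rate: float = 0.05):
        self.weights = [0.0] * n_inputs
        self.lr = learning_rate
        self.inputs = [0.0] * n_inputs
        self.p = 0.5

    def predict(self, probs):
        """Combine each model's P(next bit = 1) into one probability."""
        self.inputs = [stretch(min(max(p, 1e-6), 1 - 1e-6)) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.weights, self.inputs)))
        return self.p

    def update(self, bit: int):
        """Shift weight toward inputs that pointed at the actual bit."""
        err = bit - self.p
        self.weights = [w + self.lr * err * x
                        for w, x in zip(self.weights, self.inputs)]

mixer = LogisticMixer(2)
for bit in [1, 1, 0, 1] * 100:
    informed = 0.8 if bit else 0.2   # stand-in for a predictor that tracks the data
    constant = 0.7                   # stand-in for a predictor that ignores context
    mixer.predict([informed, constant])
    mixer.update(bit)
print("weights:", [round(w, 3) for w in mixer.weights])
```

The informed predictor ends up with the larger weight. The `informed` value peeks at the bit only to keep the demo short. In a real compressor each input comes from a model that sees only past data.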

Why Context Mixing Can Beat Single Models

Different predictors notice different kinds of structure:

Predictor	Notices
Low-order byte model	Local byte frequencies
Word model	Language-level repetition
Match model	Exact repeated substrings
Image model	Neighboring pixel relationships
Executable model	Instruction and address patterns
XML model	Tags, attributes, markup rhythm

Compression improves when the mixer learns which predictor is trustworthy right now.

Chapter 4 Takeaways

Takeaway	Why it matters
Modeling is prediction	The compressed file stores surprises.
Fixed order is simple but brittle	Context length must match the data.
Variable order handles sparse contexts	Backoff avoids overconfidence in rare histories.
Context mixing is powerful	Many weak specialized models can beat one general model.
Better modeling can look like understanding	Language, images, code, and formats all reward domain knowledge.

5. Transforms

What a Transform Does

A transform rewrites data so a simpler model can compress it.

flowchart LR
    A[Original data] --> B[Transform]
    B --> C[More model-friendly symbols]
    C --> D[Model]
    D --> E[Coder]
    E --> F[Compressed file]
    F --> G[Decoder]
    G --> H[Inverse transform]
    H --> I[Original data]

Ideal transform:

Property	Meaning
Reversible	Decompression gets the original back exactly.
Structure exposing	Repetition, locality, or predictable errors become obvious.
Cheap enough	Transform cost must be worth the compression gain.
Canonical when possible	Avoid arbitrary choices that add information burden.

Run Length Encoding

RLE replaces repeated symbols with a symbol plus a count.

AAAAAABBBBCCCCCCCC
becomes
(A,6) (B,4) (C,8)

Best for:

Good	Bad
Long repeated runs	Alternating symbols
Simple image masks	Natural text
Zero-filled data	Already transformed data with few runs
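A minimal run-length sketch using `itertools.groupby`, with the example string from above. The (symbol, count) pair representation is one of several possible RLE layouts.

```python
from itertools import groupby

def rle_encode(text: str) -> list:
    """Collapse each run of equal symbols into a (symbol, count) pair."""
    return [(sym, len(list(run))) for sym, run in groupby(text)]

def rle_decode(pairs) -> str:
    """Expand (symbol, count) pairs back into the original string."""
    return "".join(sym * count for sym, count in pairs)

print(rle_encode("AAAAAABBBBCCCCCCCC"))
print(rle_encode("ABABAB"))
```

The second call shows the "Bad" column: alternating symbols produce one pair per character, so the output is larger than the input.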

LZ77 and the Match Family

LZ77 replaces repeated strings with pointers to previous occurrences.

Input:
ABRACADABRA

Later "ABRA" can become:
(go back 7, copy 4)

Visual:

sliding history buffer        lookahead
[ABRACAD]                     [ABRA...]
 ^^^^   match reused by pointer

Why it is popular:

Strength	Explanation
Fast decompression	Decoder mostly copies bytes from earlier output.
General-purpose	Works on many repeated byte patterns.
Streaming-friendly	Can run with bounded windows.
Foundation format	Deflate, LZMA-like families, and many practical tools build on the idea.
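Here is a deliberately naive LZ77 sketch that reproduces the ABRACADABRA example. The token layout, window size, and minimum match length are assumptions of this example. Real encoders use hash chains or similar structures instead of scanning the whole window.

```python
def lz77_encode(data: bytes, window: int = 4096, min_match: int = 3):
    """Greedy LZ77: emit ('lit', byte) or ('copy', distance, length) tokens."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):      # every earlier start position
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("copy", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz77_decode(tokens) -> bytes:
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, dist, length = token
            for _ in range(length):   # byte by byte, so overlapping copies work
                out.append(out[-dist])
    return bytes(out)

tokens = lz77_encode(b"ABRACADABRA")
print(tokens)
print(lz77_decode(tokens))
```

The decoder is just a copy loop, which is why LZ-family formats decompress quickly.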

LZSS

LZSS improves practical LZ77 by only emitting pointers when they save space. Short non-saving matches remain literals.

Deflate

Deflate combines LZ77-style matches with Huffman coding. It powers common zip/gzip-style compression and survives because it is fast, widely implemented, and compatible.

flowchart LR
    A[Bytes] --> B[LZ77 literals and matches]
    B --> C[Huffman coding]
    C --> D[Deflate bitstream]

LZMA

LZMA pushes stronger modeling around LZ-style matches, often improving ratio at the cost of more CPU and memory.

LZX, ROLZ, LZP, Snappy, Deduplication

Method	Main idea	Design center
LZX	LZ-family compression used in Microsoft formats such as CAB and CHM	Practical binary/archive compression.
ROLZ	Restricts match search by recent contexts	Better match relevance.
LZP	Predicts repeated strings from context	Fast prediction of matches.
Snappy	Prioritizes very high speed	Low latency over maximum ratio.
Deduplication	Replaces repeated chunks across files or systems	Backup and storage efficiency.

LZW and Dictionary Encoding

Dictionary methods replace strings with dictionary references.

dictionary:
1 -> the
2 -> compression
3 -> model

text:
the compression model

encoded:
1 2 3

Dictionary types:

Type	Dictionary source
Fixed	Built into the format or algorithm.
Static	Learned from the file and stored with it.
Dynamic	Built by compressor and decompressor in lockstep.

LZW is a dynamic dictionary method historically associated with GIF and the Unix compress utility. Its broader lesson is that both sides can build the same dictionary without transmitting every entry.
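The lockstep idea is easiest to see in code. This is a textbook-style LZW sketch that emits a list of integer codes. It leaves out the variable-width code packing and dictionary resets that real formats add.

```python
def lzw_encode(data: bytes) -> list:
    """Emit dictionary codes; new entries are added as strings are seen."""
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                       # keep extending the current match
        else:
            out.append(table[w])
            table[wc] = len(table)       # learn the new string
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

def lzw_decode(codes) -> bytes:
    """Rebuild the same dictionary from the code stream alone."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = w + w[:1]            # code refers to the entry being built
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)

message = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_encode(message)
print(len(message), "bytes ->", len(codes), "codes")
print(lzw_decode(codes))
```

The decoder never receives the dictionary. It reconstructs each entry one step behind the encoder, which is the reason for the special case where a code points at the entry still under construction.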

Dictionary Encoding for Text

Text-specific dictionaries can model:

Feature	Compression opportunity
Words	Common words become compact tokens.
Capitalization	Store word identity separately from case pattern.
Newlines	Model paragraph and line structure.
Punctuation	Predict separators and syntax.
Word endings	Use morphology and repeated suffixes.

The stronger the text model, the more the compressor behaves like a small language-aware system.

Symbol Ranking and Move-to-Front

Move-to-front keeps a list of symbols ordered by recency. If the same few symbols keep appearing in a context, their ranks stay small.

alphabet list:
[A B C D E ...]

read C -> output rank 2, move C to front
[C A B D E ...]

read C again -> output rank 0

This works well after transforms like BWT, where local neighborhoods tend to reuse a small set of symbols.
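A short move-to-front sketch with zero-based ranks, matching the example above. Passing the alphabet as a string is a convenience of this example. Byte-oriented implementations usually start from all 256 byte values.

```python
def mtf_encode(text: str, alphabet: str) -> list:
    """Replace each symbol with its current rank, then move it to the front."""
    symbols = list(alphabet)
    ranks = []
    for ch in text:
        rank = symbols.index(ch)
        ranks.append(rank)
        symbols.insert(0, symbols.pop(rank))
    return ranks

def mtf_decode(ranks, alphabet: str) -> str:
    """Replay the same list updates to turn ranks back into symbols."""
    symbols = list(alphabet)
    out = []
    for rank in ranks:
        ch = symbols.pop(rank)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)

print(mtf_encode("CC", "ABCDE"))
print(mtf_encode("aaabbbaaacccc", "abc"))
```

Runs of a repeated symbol become runs of zeros, which later stages can code cheaply.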

Burrows-Wheeler Transform

BWT sorts rotations of a block so characters with similar right contexts cluster together. After BWT, a fast local model can often compress well.

Tiny example, conceptually:

Original block:
banana$

Sort rotations:
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba

Take last column:
annb$aa

The output is not obviously shorter, but it groups context-related symbols. It is usually followed by move-to-front, run-length coding, and entropy coding.

flowchart LR
    A[Input block] --> B[BWT context sort]
    B --> C[Move-to-front]
    C --> D[Run-length coding]
    D --> E[Huffman or arithmetic coding]
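The banana$ example can be reproduced with a naive sketch. It sorts full rotations, which is fine for a few characters but far too slow for real blocks, where suffix-array methods are used. It also assumes the block ends with a unique sentinel that sorts before every other character.

```python
def bwt(block: str) -> str:
    """Naive BWT: sort all rotations and read off the last column."""
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last)
    for _ in range(len(last)):
        # Prepend the last column to every row, then sort again.
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original block is the rotation that ends with the sentinel.
    return next(row for row in table if row.endswith(sentinel))

print(bwt("banana$"))
print(inverse_bwt("annb$aa"))
```

Nothing is lost in the transform: the last column plus the sentinel convention is enough to recover the block.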

Predictive Filtering

Numeric data often compresses better as prediction errors.

samples:
100, 103, 105, 106, 108

predict next as previous:
100, +3, +2, +1, +2

Small errors are easier to encode than raw values.

Filter	Used for
Delta coding	Signals, images, ordered numeric data
Color transform	Separating brightness from color differences
Linear filtering	Predicting from neighboring samples or pixels
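Delta coding, the simplest filter in the table, fits in a few lines. This sketch uses the sample values from above and keeps the first sample unchanged so the decoder has a starting point.

```python
from itertools import accumulate

def delta_encode(samples):
    """Keep the first sample, then store each sample minus its predecessor."""
    if not samples:
        return []
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(residuals):
    """A running sum restores the original samples."""
    return list(accumulate(residuals))

samples = [100, 103, 105, 106, 108]
residuals = delta_encode(samples)
print(residuals)
print(delta_decode(residuals))
```

The filter does not shrink anything by itself. It changes the distribution so that a later numeric code or entropy coder sees mostly small values.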

Specialized Transforms

Transform	Target	Idea
E8E9	x86 executable code	Normalize relative call/jump addresses so repeated code patterns match better.
Precomp	Already-compressed embedded data	Detect and temporarily expand compressed streams inside files so outer compression can work.
Huffman pre-coding	Context-mixing speed	Reduce input size before expensive modeling.

Chapter 5 Takeaways

Takeaway	Why it matters
Transforms do not finish compression	They prepare data for modeling and coding.
LZ methods exploit repeated strings	This explains many practical formats.
BWT exploits sorted contexts	It turns context into local symbol clustering.
Filters exploit smooth numeric data	Prediction errors are often small.
Specialized transforms exploit file knowledge	Better compression often comes from knowing the data's format.

6. Lossy Compression

The Big Shift

Lossless compression asks:

Can we reproduce the original bits exactly?

Lossy compression asks:

Can we reproduce something humans accept as the same?

That makes perception the model.

flowchart LR
    A[Original signal] --> B[Human perception model]
    B --> C[Discard hard-to-notice detail]
    C --> D[Quantize]
    D --> E[Lossless coding of remaining data]
    E --> F[Compressed media]

Images

Digital images are already approximations of continuous light. Lossy image compression removes detail that the visual system is less sensitive to.

Human visual fact	Compression use
Limited spatial resolution	Do not store invisible fine detail.
Limited brightness precision	Quantize small intensity differences.
Color sensitivity differs from brightness sensitivity	Store chroma at lower precision than luma.
Local smoothness is common	Predict pixels from neighbors or transform blocks.

Image Format Map

Format/topic	Role in the chapter
BMP	Mostly raw pixels, but still an approximation of continuous light.
GIF	Palette-based images and simple animation history.
PNG	Lossless image compression with filtering.
TIFF	Flexible container used in imaging workflows.
JPEG	Transform, quantization, and entropy coding for photos.
JPEG recompression	Attempts to compress JPEGs further without fully losing practical recoverability.

JPEG as a Visual Pipeline

flowchart LR
    A[RGB pixels] --> B[Color transform]
    B --> C[Chroma subsampling]
    C --> D[8x8 blocks]
    D --> E[DCT frequency transform]
    E --> F[Quantization]
    F --> G[Zigzag ordering]
    G --> H[Run-length and entropy coding]
    H --> I[JPEG file]

Mental picture:

top-left of block frequency table  = broad smooth shape
bottom-right of table              = fine detail

JPEG keeps more of the top-left and throws away more of the bottom-right.

Why artifacts happen:

Artifact	Cause
Blocking	Independent 8x8 block decisions become visible.
Ringing	Lost high-frequency detail near sharp edges.
Color bleeding	Reduced chroma detail.
Generational loss	Repeated decode/re-encode compounds quantization.
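The transform-then-quantize step can be shown in one dimension. JPEG applies a 2-D DCT to 8x8 blocks and uses a table of per-frequency step sizes. The sketch below simplifies that to a single row of eight samples and one step size, both invented for illustration.

```python
import math

N = 8

def dct(samples):
    """Orthonormal 1-D DCT-II of N samples."""
    out = []
    for k in range(N):
        scale = math.sqrt((1 if k == 0 else 2) / N)
        out.append(scale * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                               for n, x in enumerate(samples)))
    return out

def idct(coeffs):
    """Inverse of dct (DCT-III with matching scale factors)."""
    return [sum(math.sqrt((1 if k == 0 else 2) / N) * c *
                math.cos(math.pi * (n + 0.5) * k / N)
                for k, c in enumerate(coeffs))
            for n in range(N)]

row = [100, 102, 104, 106, 108, 110, 112, 114]   # a smooth ramp of pixel values
step = 10                                        # hypothetical quantization step
coeffs = dct(row)
quantized = [round(c / step) for c in coeffs]    # the lossy step
restored = idct([q * step for q in quantized])

print("quantized:", quantized)
print("restored: ", [round(v, 1) for v in restored])
```

Most quantized coefficients come out as zero, which is what the later run-length and entropy stages exploit. The restored row is close to the original but not identical.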

JPEG Recompression

JPEG files are already compressed. Recompressors look for remaining structure:

Strategy	What it tries to exploit
Better entropy coding	JPEG's stored coefficients may still be coded more compactly.
Coefficient modeling	Predict patterns in quantized DCT coefficients.
Metadata cleanup	Remove or compact non-image payload.
Specialized decoding knowledge	Preserve reconstructable JPEG details while storing them differently.

The chapter surveys historical approaches such as Stuffit, PAQ-based methods, WinZip behavior, and PackJPG. Treat the named results as historical context.

Video

Video compression adds time.

flowchart LR
    A[Frame 1] --> B[Predict frame 2 from frame 1]
    B --> C[Encode motion]
    C --> D[Encode residual error]
    D --> E[Repeat across frames]

Why video compresses:

Structure	Compression opportunity
Adjacent frames are similar	Store changes instead of full frames.
Objects move	Motion vectors describe block movement.
Human vision tolerates some error	Quantize residuals.
Scenes contain spatial redundancy	Use image-like compression inside frames.

NTSC and MPEG

Topic	Key idea
NTSC	Broadcast video is already shaped by human vision, refresh, interlacing, and color compromises.
MPEG	Modern-style video coding predicts frames from other frames, stores motion, quantizes transforms, and entropy-codes the result.

Frame types as a mental model:

I-frame: self-contained image
P-frame: predicted from previous frames
B-frame: predicted from past and future reference frames

Tradeoff:

More prediction	Better compression, more complexity, more latency
Less prediction	Easier seeking and editing, larger files

Audio

Audio compression uses psychoacoustics: what the ear can and cannot notice.

Hearing fact	Compression use
Limited frequency range	Do not store inaudible frequencies.
Sensitivity varies by frequency	Allocate bits where hearing is sharpest.
Loud sounds mask nearby quiet sounds	Remove masked components.
Perceived loudness is logarithmic	Quantization can follow perception.
Time masking exists	Sounds can hide nearby sounds in time.

Audio pipeline:

flowchart LR
    A[PCM samples] --> B[Frequency analysis]
    B --> C[Psychoacoustic masking model]
    C --> D[Quantization and bit allocation]
    D --> E[Entropy coding]
    E --> F[Compressed audio]

Chapter 6 Takeaways

Takeaway	Why it matters
Lossy compression discards information	The hard part is choosing information humans will not miss.
Media compression is perceptual modeling	Vision and hearing are part of the algorithm.
Transform plus quantization is central	Especially for JPEG-like and audio/video systems.
Recompression is difficult	Already-compressed data has little easy redundancy left.
Perfect lossy compression would require understanding	A movie could theoretically be summarized semantically, but practical systems are far from that.

The Grand Unifying Model

flowchart TD
    A[Data source] --> B{Lossless or lossy?}
    B -->|Lossless| C[Preserve every bit]
    B -->|Lossy| D[Preserve perceptual meaning]
    C --> E{Transform useful?}
    D --> F[Perception transform and quantization]
    F --> E
    E -->|Yes| G[Expose patterns]
    E -->|No| H[Predict next symbol or bit]
    G --> H
    H --> I[Code using probability]
    I --> J[Package with metadata, checks, maybe encryption]

Every concrete compressor can be placed in this frame:

Compressor family	Transform	Model	Coder
gzip/deflate	LZ77 matches	Literal/length and distance symbol frequencies	Huffman
bzip2-style	BWT, MTF, RLE	Local symbol frequencies	Huffman
PAQ-style	Often file-aware contexts	Context mixing	Arithmetic/range-like
PNG	Image filters	Deflate model	Huffman via deflate
JPEG	DCT and quantization	Coefficient/statistical coding	Entropy coding
MPEG-like video	Motion prediction and transforms	Residual and motion models	Entropy coding
MP3/AAC-like audio	Frequency analysis and masking	Psychoacoustic bit allocation	Entropy coding
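
To make the table a little more concrete, here is a rough sketch that runs the same bytes through three lossless families that ship with Python's standard library: `zlib` (deflate), `bz2` (BWT-based), and `lzma`. The sample text and the helper name `compare_codecs` are made up for illustration, and the sizes it prints depend entirely on the input.

```python
import bz2
import lzma
import zlib

def compare_codecs(data: bytes) -> dict:
    """Compress the same bytes with three stdlib codecs and report sizes."""
    codecs = {
        "zlib (deflate: LZ77 + Huffman)": (zlib.compress, zlib.decompress),
        "bz2 (BWT + MTF + Huffman)": (bz2.compress, bz2.decompress),
        "lzma (LZ-style matches + range coder)": (lzma.compress, lzma.decompress),
    }
    sizes = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        # Lossless means the round trip must return the exact input.
        assert decompress(packed) == data
        sizes[name] = len(packed)
    return sizes

# Made-up, highly repetitive sample; real data will behave differently.
sample = b"the quick brown fox jumps over the lazy dog. " * 200
sizes = compare_codecs(sample)
for name, size in sizes.items():
    print(f"{name}: {len(sample)} -> {size} bytes")
```

The ranking on one toy input says little about real workloads, which is the reason Chapter 2 spends so much time on benchmarks.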

Practical Reading Paths

If You Want to Build a Compressor

Understand Chapter 1 so you stop expecting impossible wins.
Pick a benchmark from Chapter 2 that matches your target data.
Implement a simple coder from Chapter 3 or reuse a known one.
Start with a simple adaptive model from Chapter 4.
Add one transform from Chapter 5 only when it exposes a pattern you can explain.
For media, study Chapter 6 before inventing quality knobs.

If You Want to Choose a Format

Need	Prefer
Speed and compatibility	Deflate/gzip/zip-style tools
Archival ratio	Stronger LZMA/context-mixing tools, if time is acceptable
Text research	PPM/context-mixing/modern language-model-aware approaches
Backups	Deduplication plus compression
Photos	JPEG-like formats or newer perceptual image codecs
Screenshots/graphics	PNG-like lossless image compression
Audio/video distribution	Perceptual audio/video codecs

If You Want to Understand AI Through Compression

Read Chapter 1 and Chapter 4 together. The essential loop is:

understand pattern -> predict better -> encode surprise only -> shorter file

Compression is not just storage optimization. It is a measurable way to ask how much structure a system has discovered.
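
A minimal sketch of "encode surprise only", assuming a very simple adaptive byte model: each byte costs `-log2(p)` bits, where `p` is what the model predicted before seeing it. The function name `ideal_bits`, the smoothing constant, and the sample text are illustrative choices, not something taken from the book, and no real coder is attached; the code only adds up the ideal cost.

```python
import math
from collections import defaultdict

ALPHA = 1 / 256  # small starting count for every byte value (an assumption)

def ideal_bits(data: bytes, order: int) -> float:
    """Sum of -log2(p) for each byte under an adaptive order-N byte model.

    Counts are updated only after a byte is scored, so a decoder that sees
    the same history could rebuild exactly the same predictions.
    """
    counts = defaultdict(lambda: [ALPHA] * 256)  # context -> per-byte counts
    total_bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]  # the previous `order` bytes
        table = counts[context]
        p = table[symbol] / sum(table)
        total_bits += -math.log2(p)  # the "surprise" of this byte
        table[symbol] += 1           # learn from what was just seen
    return total_bits

text = b"abracadabra " * 50
bits_order0 = ideal_bits(text, order=0)
bits_order2 = ideal_bits(text, order=2)
print(f"raw: {len(text) * 8} bits")
print(f"order-0 model: {bits_order0:.0f} bits")
print(f"order-2 model: {bits_order2:.0f} bits")
```

The model that uses two bytes of context has discovered more of the structure in this toy text, so it needs fewer bits. That is the whole loop in miniature.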

Cheat Sheets

Glossary

Term	Short definition
Lossless	Decompression recovers the exact original data.
Lossy	Decompression recovers an acceptable approximation.
Model	Probability estimator for upcoming symbols or bits.
Coder	Converts model probabilities into a bitstream.
Transform	Rewrites data to expose compressible structure.
Entropy	Expected information content under a probability model.
Context	Previously seen data used to predict the next symbol.
Adaptive model	Updates as data is processed.
Static model	Sent or fixed before coding the payload.
Solid archive	Compresses multiple files together to exploit shared structure.
BWT	Burrows-Wheeler transform: a reversible sort by context that groups together symbols occurring in similar contexts.
Quantization	Reducing precision, usually the irreversible part of lossy coding.
Psychoacoustics	Modeling what humans can hear.
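
The BWT entry is easier to picture with a toy example. This is the naive textbook construction (sort every rotation, keep the last column), assuming a `$` end marker that does not occur in the input. Practical implementations use suffix sorting instead, because this version is far too slow for real files.

```python
def bwt(text: str, end_marker: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep last column."""
    assert end_marker not in text, "end marker must not appear in the input"
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column: str, end_marker: str = "$") -> str:
    """Invert by repeatedly prepending the last column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]

transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

Notice how the repeated letters end up next to each other in the output. That clustering is what lets simple follow-up stages such as MTF and RLE do well.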

Algorithm Selection Sketch

Mostly repeated bytes?
-> LZ-style match coding
Mostly text?
-> PPM, context mixing, dictionary transforms, or modern language-aware modeling
Mostly smooth numeric samples?
-> predictive filtering plus entropy coding
Mostly photos?
-> perceptual image codec
Mostly backups or VM images?
-> deduplication plus compression
Already encrypted or compressed?
-> expect little or no gain

Red Flags When Evaluating Compression Claims

Claim	Skeptical question
Compresses every file	How does it avoid the counting argument?
Recompresses compressed data repeatedly	Where does the extra information go?
Beats all compressors	On which benchmark and rules?
No quality loss in lossy mode	What exact metric or human test supports that?
Tiny output with universal recovery	Is the decompressor/model included in the accounting?
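
The first two rows can be checked with a few lines of code. This sketch counts bit strings for the counting argument, then tries to compress random bytes twice with `zlib`. Here `os.urandom` is only a stand-in for encrypted or already-compressed data, and `count_strings` is an illustrative helper name.

```python
import os
import zlib

def count_strings(n_bits: int) -> tuple[int, int]:
    """Return (number of n-bit inputs, number of strictly shorter bit strings)."""
    inputs = 2 ** n_bits
    shorter_outputs = sum(2 ** k for k in range(n_bits))  # lengths 0..n-1
    return inputs, shorter_outputs

inputs, shorter = count_strings(16)
# There is always one fewer shorter string than there are inputs,
# so at least one input cannot shrink.
print(inputs, shorter)

random_data = os.urandom(100_000)  # stands in for encrypted data
once = zlib.compress(random_data, 9)
twice = zlib.compress(once, 9)
print(len(random_data), len(once), len(twice))
```

On random input the "compressed" sizes typically come out slightly larger than the original, because the format overhead has nothing to pay for it.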

Closing

The book's central message is practical and philosophical at the same time:

To compress data, find structure. To find structure, predict. To predict well, understand the source.

That is why the same field contains Huffman trees, probability intervals, dictionaries, suffix sorting, image transforms, psychoacoustics, benchmark politics, and AI. They are all different ways of answering one question:

What is the shortest description that still lets us recover what matters?

Source and Further Reading

Matt Mahoney, Data Compression Explained, last updated Apr. 15, 2013.
For current tool rankings, consult up-to-date benchmark leaderboards directly; the rankings in the source are historical.
For implementation practice, start with a simple RLE or Huffman coder, then build toward adaptive modeling, LZ-style matching, or arithmetic/range coding.

Iran-USA War, Told Through Tweets (2026)

Sat, 20 Jun 2026 00:00:00 GMT


June 20, 2026

📝

Updated June 20, 2026. This is a blog post, not a wire story: Wikipedia gives me the dated skeleton, and the embedded X posts show the public mood around that skeleton: official confidence, breaking clips, deal spin, Israeli anxiety, and the first argument over who won.

⚠️

Reading note: I am treating every tweet as an artifact, not as a final fact. The useful question is not only “was this post right?” but “what did this post make the war feel like while people were trying to understand it?”

Why I wanted to save the feed

A normal timeline makes the 2026 Iran-USA war look more orderly than it felt. It says the war began, escalated, reached Hormuz, hit a ceasefire, then moved into a memorandum. That is accurate enough for a reference page. It is not how the war arrived on my screen.

On X, the same week could look like victory, collapse, diplomacy, propaganda, and panic depending on which post landed first. Official accounts wrote in capital letters. Reporters posted fragments. Analysts tried to turn fragments into shape. Everyone else argued over whether the deal was peace, surrender, humiliation, or just a pause.

The timeline tells me what happened. The feed tells me what people were being asked to believe before the dust settled.

The first day was already a media war

Wikipedia places the opening U.S.-Israeli strikes on February 28, 2026, after the order for Operation Epic Fury. The feed had no patience for distance. The war was introduced as a command decision, then immediately as a moral claim, then as retaliation clips, then as a search for off-ramps.

Mentions tweet

The White House presents the opening of U.S. combat operations as a presidential statement.

Mentions tweet

Al Jazeera English captures Netanyahu framing the joint strikes as removing an existential threat.

Mentions tweet

Iranian retaliation appears in the feed as video from Bahrain, not as a dry line in a chronology.

Mentions tweet

Barak Ravid reports Trump already talking about possible off-ramps after the opening strikes.

That mix matters. The war was not only launched. It was narrated into existence. By the end of the first day, the public had already been given the three tones that would dominate the next months: resolve, retaliation, and “there is still a way out.”

Hormuz turned the war into everyone else’s problem

The Strait of Hormuz is why this never stayed as a distant foreign-policy story. Once shipping, insurance, oil, and fertilizer entered the conversation, the war moved from strategy pages into grocery bills and fuel prices. This is where the feed became oddly concrete: one chokepoint, one map, one global anxiety.

Mentions tweet

White House warning over the Strait of Hormuz during the pressure phase.

Mentions tweet

White House claim that the U.S. controlled the Strait of Hormuz.

But a strait is not controlled by a caption. It is controlled by ships, mines, drones, pilots, insurers, and whether captains believe tomorrow will be calmer than today. That is why the Hormuz posts are the most useful part of this archive: they show confidence running ahead of the actual settlement.

The ceasefire sounded cleaner than it was

By April, the word “ceasefire” started doing too much work. Pakistan pushed diplomacy. The U.S. and Iran talked through intermediaries. Israel and Lebanon sat awkwardly inside the same sentence without being fully contained by it. The feed looked like relief, but also like fine print.

Mentions tweet

Pakistan PM Shehbaz Sharif urging a two-week ceasefire window for diplomacy.

Mentions tweet

Pakistan announces that Iran, the United States, and allies have agreed to a ceasefire.

Mentions tweet

Iranian foreign minister Abbas Araghchi frames the ceasefire around the unresolved Israel/Lebanon track.

This is where the post stops being a simple U.S.-Iran story. If Lebanon keeps burning, the ceasefire is not a full stop. It is a bracket.

The deal became content before it became peace

The June memorandum had the rhythm of a product launch: leaks, denials, confirmation posts, victory captions, then screenshots and clause analysis. It is easy to mock that, but it is also how a lot of people first encountered the deal.

Mentions tweet

White House post saying both Israel and Iran were looking toward an immediate ceasefire.

Mentions tweet

Al Jazeera English on Trump saying a deal could be signed soon while Tehran urged caution.

Mentions tweet

White House declaring the deal with Iran complete.

Mentions tweet

Ro Khanna backing the ceasefire agreement and sovereignty language.

Mentions tweet

EU foreign policy chief Kaja Kallas welcomes the U.S.-Iran deal and Hormuz reopening.

Mentions tweet

Al Jazeera English notes U.S. officials framing the memorandum as not yet a full peace deal.

Mentions tweet

Barak Ravid reports that U.S. and Iran signed the MOU remotely and that it is in effect.

Mentions tweet

Follow-up detail on Trump personally signing the agreement.

The tension is visible in the posts. One account says complete. Another says interim. Another says signed. Another says not yet peace. Together they read less like contradiction and more like the actual shape of the moment: a war trying to turn itself into paperwork.

June 20 update: the argument moved to the aftermath

By June 20, the newest useful posts were not about the signing itself. They were about what the signing meant. That is usually the moment a war starts becoming history: not because everyone agrees it is over, but because everyone starts fighting over the interpretation.

Mentions tweet

June 20: Al Jazeera English thread on the interim peace deal becoming a political flashpoint inside Israel.

Mentions tweet

June 20: ISW argues Hormuz is reopening in a way that retains Iranian control rather than restoring the pre-war status quo.

Mentions tweet

June 20: ISW reads Iran as likely seeking to delay nuclear negotiations while keeping leverage.

Mentions tweet

Israeli political reading of the MOU’s Lebanon clause and why it alarms Israeli observers.

Mentions tweet

Foreign Affairs frames the postwar risk as Iran winning the war but possibly losing the peace.

Mentions tweet

Jake Sherman shares Senator Cassidy calling the outcome a major foreign-policy blunder.

These are the posts I would add if I were freezing the archive on June 20. They show the war shifting from “what happened?” to “who can live with the settlement?” Israel is angry. Iran is reading leverage into Hormuz. U.S. critics are calling the result a blunder. Analysts are already warning that the 60-day window is not a victory lap; it is a countdown.

What the feed still misses

Tweets are excellent at tension and terrible at scale. They can show a missile clip, a sentence from a leader, or a sharp argument over a clause. They do not naturally hold grief, repairs, debt, trauma, or the boring administrative work of making ports, hospitals, schools, and power grids usable again.

That is why I do not want this post to pretend the feed is the war. It is the surface of the war. A loud surface, sometimes useful, sometimes manipulative, sometimes ahead of official language, sometimes completely wrong.

The opening posts made the strikes feel decisive before the consequences were visible.
The Hormuz posts made a global economic problem easier to understand, but also easier to oversimplify.
The ceasefire posts showed how quickly diplomacy becomes branding.
The June 20 posts show the aftermath beginning before the war has emotionally ended.

My read as of June 20

The cleanest version is this: the war has moved from missiles to clauses. That is better than the reverse. But it is not peace in the deep sense. It is a document asking several angry systems to behave long enough for the next document to exist.

The feed will probably keep calling that victory or humiliation depending on the account. I am more interested in whether the strait stays open, whether Lebanon stops being the loophole, whether nuclear talks become real verification, and whether the human cost is still visible once the diplomatic theater moves on.

A ceasefire can stop the clock. It cannot, by itself, repair the time already lost.

Source spine

Wikipedia: 2026 Iran war
Wikipedia: Timeline of the 2026 Iran war
Al Jazeera June 20 post on Israeli political fallout
ISW June 20 post on Hormuz and Iranian control
All embedded X posts above are saved as public artifacts from the war narrative, not endorsements.

How LLMs Actually Work: A Friendly Map for Humans

Sat, 06 Jun 2026 00:00:00 GMT


June 6, 2026

🧭

LLMs are not magic brains. They are prediction machines built from a few repeatable parts: tokens, vectors, attention, memory-like feed-forward layers, and a loop that keeps choosing the next likely piece of text.

✍️

Source note: this is an original, beginner-friendly rewrite inspired by Kato's article How LLMs Actually Work, with extra examples, code, tables, and Notion-native structure.

The whole idea in one minute

An LLM, or large language model, takes your text, turns it into numbers, runs those numbers through many transformer layers, and predicts what text should come next.

That is the simple version. The useful version is this:

Your prompt is split into tokens, which are small text pieces.
Each token becomes a vector, which is a list of numbers that carries learned meaning.
The model adds information about order, because "dog bites man" and "man bites dog" do not mean the same thing.
Attention lets each token decide which earlier tokens matter.
A feed-forward network does deeper processing for each token.
Residual connections and normalization keep the many layers stable.
The model outputs scores for the next possible token.
One token is chosen, added to the text, and the loop repeats.

flowchart LR
    A["You type a prompt"] --> B["Tokenizer<br>text pieces"]
    B --> C["Embeddings<br>meaning as numbers"]
    C --> D["Position signal<br>word order"]
    D --> E["Attention<br>what should matter?"]
    E --> F["Feed-forward layer<br>deeper processing"]
    F --> G["Next-token scores"]
    G --> H["Pick one token"]
    H --> I["Add it to the text"]
    I --> C

💡

A good mental model: an LLM is like an autocomplete system that has read a massive library and learned incredibly subtle patterns about what usually follows what.

Part	Plain-English job	Why it matters
Tokens	Break text into pieces	The model cannot read raw words or letters directly.
Embeddings	Turn pieces into meaning-shaped numbers	Similar ideas can sit near each other in number-space.
Position	Tell the model where each piece appears	Order changes meaning.
Attention	Let tokens look at useful previous tokens	This is how context flows through the sentence.
Feed-forward network	Process each token more deeply	A lot of learned structure lives here.
Next-token prediction	Score likely continuations	This is the generation loop behind every answer.

1. Tokens: the model's alphabet is not your alphabet

Models do not see your sentence the way you do. You see words. The model sees token IDs.

A tokenizer might split a sentence like this:

Text:   "The sleepy robot writes poetry."
Tokens: ["The", " sleepy", " robot", " writes", " poetry", "."]
IDs:    [791, 47823, 11205, 13004, 24465, 13]

Those ID numbers are what enter the model. The specific numbers differ across model families, but the pattern is the same: text becomes a sequence of integers.

Why not just use whole words? Because language is messy. New names, typos, code, slang, and other languages would explode the vocabulary. Tokens sit between letters and words: flexible enough for rare text, efficient enough for common text.

Slightly technical: why the strawberry counting problem happens

When you ask a model how many letters are in a word, the model may not be looking at separate letters. It may see a word as one or a few tokens. That means character-level questions can be awkward unless the model deliberately reasons about spelling.

const vocabulary = {
  "The": 791,
  " sleepy": 47823,
  " robot": 11205,
  " writes": 13004,
  " poetry": 24465,
  ".": 13,
};

const prompt = ["The", " sleepy", " robot", " writes", " poetry", "."];
const tokenIds = prompt.map((piece) => vocabulary[piece]);

console.log(tokenIds);
// [791, 47823, 11205, 13004, 24465, 13]

2. Embeddings: IDs become meaning-shaped numbers

A token ID by itself is just a label. ID 11205 does not mean robot unless the model has a learned table that says what vector should represent that token.

That table is called the embedding matrix. Think of it as a huge spreadsheet:

Every token ID gets one row.
Every row contains many numbers.
Those numbers are learned during training.
The row becomes the token's starting representation.

If two tokens are used in similar situations, their vectors often end up close together. Words like doctor, nurse, and hospital tend to live near related medical concepts. This was not hand-labeled by a person; it emerges because those relationships help the model predict text.

🧠

Embeddings are not definitions. They are coordinates learned from usage. The model learns that concepts are related because they appear in related contexts.

Slightly technical: vector arithmetic

An embedding is a vector, meaning a list of numbers. With enough training, directions in vector space can behave like meaning shifts. That is why famous examples like king - man + woman ≈ queen can sometimes work. It is geometry, not a dictionary.
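
Here is a tiny sketch of that geometry. The four vectors are hand-made for illustration, with three made-up dimensions (royalty, maleness, femaleness). Real embeddings are learned, have hundreds or thousands of dimensions, and do not come with readable labels like these.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings": [royalty, maleness, femaleness].
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def nearest(vector):
    """Return the vocabulary word whose vector points most nearly the same way."""
    return max(embeddings, key=lambda word: cosine(embeddings[word], vector))

# king - man + woman, computed one dimension at a time.
shifted = [k - m + w for k, m, w in
           zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
print(nearest(shifted))  # queen
```

With learned embeddings the result is noisier, which is why the famous example only sometimes works.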

3. Position: the model needs word order

A bag of tokens is not enough. These two sentences contain almost the same pieces but mean very different things:

The dog chased the boy.

The boy chased the dog.

The model therefore needs a position signal. Older transformers added a position vector to each token embedding. Many modern LLMs use RoPE, short for Rotary Position Embeddings, where position is represented by rotating parts of the vector.

You do not need the math to understand the purpose: position makes the model aware that one token came before another, and roughly how far apart they are.
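
For readers who do want a peek at the math, this is a minimal sketch of the rotation idea behind RoPE, in plain Python. It assumes the interleaved pair layout and the commonly used base of 10000. Real implementations differ in layout details and apply this to query and key vectors inside attention.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    assert len(vector) % 2 == 0
    rotated = []
    for pair_index in range(len(vector) // 2):
        x, y = vector[2 * pair_index], vector[2 * pair_index + 1]
        # Earlier pairs rotate quickly with position, later pairs slowly.
        angle = position * base ** (-2 * pair_index / len(vector))
        rotated.append(x * math.cos(angle) - y * math.sin(angle))
        rotated.append(x * math.sin(angle) + y * math.cos(angle))
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors.
query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, just at different places in the text.
near_start = dot(rope(query, 5), rope(key, 2))
far_along = dot(rope(query, 105), rope(key, 102))
print(near_start, far_along)
```

The two scores match up to floating-point error: after rotation, the match between a query and a key depends on how far apart the tokens are, not on where they sit in absolute terms.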

📌

Practical takeaway: important context usually works best near the start or end of a long prompt. Many models are weaker at using information buried in the middle.

Slightly technical: why long context is still hard

Even if a model can accept a huge prompt, that does not mean it uses every part equally well. Attention has to compare many tokens, and retrieval quality can drop when the answer is hidden in the middle of a long context window.

4. Attention: tokens decide what to pay attention to

Attention is the heart of the transformer. It lets each token ask: which previous tokens should shape my current meaning?

For each token, the model creates three learned views:

Name	Question it answers	Everyday analogy
Query	What am I looking for?	A search request
Key	What do I match with?	A label on stored information
Value	What information should be passed along?	The content you copy after finding a match

Imagine the sentence:

The cat that I saw yesterday was sleeping.

When the model reaches "was", it needs to know what was sleeping. Attention can give more weight to "cat" than to "yesterday", because "cat" is more useful for understanding the verb.

import math

scores = {"cat": 3.0, "yesterday": 0.2, "saw": 0.7}

# Softmax turns raw scores into weights that add up to 1.
exp_scores = {word: math.exp(score) for word, score in scores.items()}
total = sum(exp_scores.values())
weights = {word: value / total for word, value in exp_scores.items()}

print(weights)
# cat gets most of the weight

🔒

GPT-style models use causal masking: while predicting the next token, they can look backward but not forward. Future text is hidden because it has not been generated yet.
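
Putting query, key, value, softmax, and the causal mask together, a single attention head can be sketched in plain Python like this. The vectors are made up, and the learned projections that would normally produce them are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Token i may only score itself and earlier tokens (positions 0..i).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Blend the visible value vectors using the attention weights.
        mixed = [sum(w * values[j][d] for j, w in enumerate(weights))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        # Pad with zeros so the masked (future) positions are explicit.
        all_weights.append(weights + [0.0] * (len(queries) - i - 1))
    return outputs, all_weights

# Made-up 2-dimensional vectors for three tokens.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = causal_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weights form a lower triangle: the first token can only attend to itself, and no token puts weight on anything that comes after it.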

5. Multi-head attention: many views at once

One attention pattern is not enough for language. A sentence can contain grammar, references, tone, code syntax, and long-range dependencies at the same time.

Multi-head attention runs several attention operations in parallel. One head might track subject-verb relationships. Another might follow quotation marks. Another might notice that a variable name in code was used earlier.

Slightly technical: heads are learned projections, not fixed slices

Each head learns its own projections from the full token vector into a smaller query/key/value space. So a head is not simply handed a pre-cut piece of the vector. It learns its own way to view the whole token representation.

The model then combines the outputs from all heads and sends the result onward.

Token representation
   ├─ attention head 1: grammar relationship
   ├─ attention head 2: nearby phrase structure
   ├─ attention head 3: repeated pattern
   └─ attention head 4: reference or pronoun link
        ↓
Combined into one updated token representation

A practical detail: during generation, the model stores old key and value vectors in a KV cache. That way it does not need to recompute the entire conversation every time it adds one new token.
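
A rough sketch of the KV cache idea, with one head and made-up vectors: each generation step appends one new key and value, then attends over everything stored so far. The class name `KVCache` is illustrative, and score scaling and learned projections are left out to keep it short.

```python
import math

def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * value[d] for e, value in zip(exps, values))
            for d in range(len(values[0]))]

class KVCache:
    """Keeps the key and value vectors of tokens that were already processed."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the new token's key and value are appended; old ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Made-up per-token (query, key, value) vectors. A real model computes these
# from token representations with learned projections.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [0.5, 0.5], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Recomputing from scratch for the last token gives the same answer.
all_keys = [k for _, k, _ in tokens]
all_values = [v for _, _, v in tokens]
recomputed_last = attend(tokens[-1][0], all_keys, all_values)
print(cached_outputs[-1], recomputed_last)
```

The trade is memory for time: the cache grows by one key and one value per token, per layer and per head, which is why techniques that shrink it matter for long conversations.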

6. Feed-forward networks: where a lot of learned structure lives

After attention mixes information between tokens, each token goes through a feed-forward network.

Attention is about tokens communicating. The feed-forward network is more like each token doing private thinking.

The rough pattern is:

Expand the vector into a larger space.
Apply a non-linear function.
Compress it back down.

The non-linear step matters because it lets the model learn richer patterns. Without it, the expand and compress steps would multiply out into one plain linear map, which is something much simpler.
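
The expand, non-linearity, compress pattern can be sketched with two small matrices. The weights are made up, and ReLU is used only because it is the simplest non-linear function to read; many current models use other activations.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(vector):
    return [max(0.0, x) for x in vector]

# Made-up weights: expand 2 -> 4 dimensions, then compress 4 -> 2.
W_up = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.5], [2.0, 1.0]]
W_down = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -1.0]]

def feed_forward(x, activation=relu):
    return matvec(W_down, activation(matvec(W_up, x)))

x = [1.0, 2.0]
with_relu = feed_forward(x)
without_activation = feed_forward(x, activation=lambda v: v)

# Without the non-linear step, the two layers equal one merged matrix.
merged = matmul(W_down, W_up)
collapsed = matvec(merged, x)

print(with_relu)           # [2.0, 0.5]
print(without_activation)  # [1.5, 0.25]
print(collapsed)           # [1.5, 0.25]
```

The last two lines match: remove the activation and the expand and compress steps are just one small matrix in disguise.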

🧱

A lot of model parameters live in feed-forward layers. This is one reason they are often discussed as the model's learned store of patterns, facts, and associations.

Slightly technical: dense models vs mixture of experts

In a dense transformer, every token uses the same feed-forward network in a layer. In a mixture-of-experts model, a small router chooses only a few expert networks for each token. This can increase total model capacity without making every token run through every parameter.

7. Residual stream and normalization: keeping deep models trainable

A modern LLM can have dozens or even hundreds of layers. If each layer simply replaced the previous representation, training would be fragile.

Residual connections solve part of that problem. Instead of replacing the vector, a block adds its output back to the existing vector.

new_vector = old_vector + block_output

This creates a running stream of information through the network. Each layer can add a refinement without destroying everything that came before.

Layer normalization keeps the numbers stable. Without it, values can grow too large or shrink too much as they pass through many layers.
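
A small sketch of the two ideas working together. It swaps in RMSNorm, a simpler relative of layer normalization, without the learned scale, and it assumes the "normalize first, then add back" ordering. `toy_block` is a stand-in for a real attention or feed-forward block.

```python
import math

def rms_norm(vector, eps=1e-6):
    """Rescale a vector so its root-mean-square is about 1 (no learned gain)."""
    rms = math.sqrt(sum(x * x for x in vector) / len(vector) + eps)
    return [x / rms for x in vector]

def toy_block(vector):
    """Stand-in for an attention or feed-forward block; doubles each value."""
    return [2.0 * x for x in vector]

def residual_step(stream, block):
    # Pre-norm layout (an assumption): normalize the block's input,
    # then add the block's output back onto the untouched stream.
    block_output = block(rms_norm(stream))
    return [s + o for s, o in zip(stream, block_output)]

stream = [3.0, -4.0, 12.0, 0.5]
for layer in range(4):
    stream = residual_step(stream, toy_block)
    print(layer, [round(x, 2) for x in stream])
```

Because each block sees a normalized input, its contribution stays at a similar size no matter how large the running stream has become, so the stream changes by steady steps instead of blowing up.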

🛠️

The boring-sounding parts matter. Residual connections and normalization are major reasons very deep transformer stacks can actually train.

8. Next-token prediction: the answer is built one piece at a time

At the end of the stack, the model turns the final vector into scores for possible next tokens. These raw scores are called logits. A softmax converts them into probabilities.

Then a decoding strategy chooses one token.

Setting	Plain-English effect	When useful
Temperature	Controls randomness	Lower for precise answers, higher for creative drafts
Top-k	Only considers the k most likely tokens	Prevents very unlikely choices
Top-p	Considers the smallest set of most likely tokens whose probabilities add up to at least p	Flexible sampling without fixed k
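
The three settings in the table can be sketched as small functions. The logits are made up, and the function names are illustrative rather than taken from any particular library. Real decoders run the same ideas over a vocabulary of many thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert a {token: logit} dict into probabilities at a given temperature."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    kept, mass = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}

def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Made-up logits for the token after "The capital of France is".
logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " banana": -3.0}
cold = softmax(logits, temperature=0.5)
hot = softmax(logits, temperature=2.0)
print(round(cold[" Paris"], 3), round(hot[" Paris"], 3))

rng = random.Random(0)
choice = sample(top_p_filter(softmax(logits, temperature=0.7), p=0.9), rng)
print(choice)
```

Lower temperature sharpens the distribution toward the favorite, while higher temperature flattens it and gives the long shots more of a chance.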

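# Pseudocode: tokenize, transformer, unembed, sample, detokenize, and
# END_OF_TEXT stand in for the real model components.
text = "The capital of France is"
done = False
while not done:
    token_ids = tokenize(text)
    vectors = transformer(token_ids)
    next_token_scores = unembed(vectors[-1])  # logits for the last position
    next_token = sample(next_token_scores, temperature=0.7)
    text += detokenize(next_token)
    done = next_token == END_OF_TEXT  # stop when the model emits its end token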

That loop is the machine behind the fluent paragraph. The model writes by repeatedly asking: given everything so far, what token should come next?

⚠️

This also explains hallucinations. The base training objective rewards plausible continuation, not guaranteed truth. Post-training, retrieval, tool use, and evaluation are added to make outputs more useful and reliable.

9. Architecture vs weights: why models feel different

Many modern LLMs share the same broad transformer-family shape. What makes them feel different is usually a combination of:

Training data: what they learned from.
Scale: how many layers, heads, parameters, and tokens were used.
Architecture choices: dense or mixture-of-experts, attention variants, context length, tokenizer.
Post-training: instruction tuning, preference training, safety behavior, tool use, and product-level rules.

So when people compare GPT, Claude, Gemini, Llama, Mistral, Qwen, or Gemma, they are often comparing siblings in a broad transformer family rather than completely unrelated species of model.

Slightly technical: modern transformer vocabulary

RoPE: position through vector rotation.

RMSNorm: a cheaper normalization variant used in many modern open models.

SwiGLU: a popular activation/feed-forward design.

GQA: grouped-query attention, which reduces KV-cache memory.

MoE: mixture of experts, where only selected expert networks run for each token.

10. GPT-2 and MoE: two useful milestones

Two research threads make the mechanics above feel more concrete. GPT-2 showed how far plain next-token prediction could go when scaled. Mixture of Experts shows how a model can grow more capable without forcing every token to use every parameter.

🧩

Plain-English mental model: GPT-2 is like one very large generalist team. MoE is like a building with specialist rooms, where a router sends each token to only the rooms that seem useful.

GPT-2: scaling the next-token game

OpenAI's 2019 paper Language Models are Unsupervised Multitask Learners made a simple bet famous: train a transformer to continue internet text, then test whether that same model can handle many tasks by phrasing them as text continuation.

It was autoregressive: it generated left to right, one token at a time.
It was dense: every token passed through the same model weights.
It helped popularize the idea that scale plus simple training can produce surprisingly general behavior.

# Simplified GPT-2-style objective
prompt = "Translate to French: hello"
target_next_token = " bon"

# Training nudges the model so this next token becomes more likely.
loss = cross_entropy(model(prompt), target_next_token)

MoE: not every token needs the whole building

A dense transformer usually runs every token through the same feed-forward network. In a Mixture-of-Experts model, a small router chooses only a few expert networks for each token. The model can have many more total parameters, while each token activates only a subset.

Concept	Dense LLM	MoE LLM
Work per token	Uses the same main blocks	Uses selected experts
Analogy	One big generalist team	Router plus specialist teams
Tradeoff	Simpler to train and serve	More capacity, more routing complexity
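
A toy sketch of the routing step, assuming top-2 routing with softmax gates over the selected experts. The "experts" here are trivial functions and the router scores are hard-coded. In a real model the experts are feed-forward networks and the scores come from a small learned layer.

```python
import math

def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into gates."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:k]
    peak = max(router_scores[i] for i in ranked)
    exps = {i: math.exp(router_scores[i] - peak) for i in ranked}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(token_vector, router_scores, experts, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    gates = route_top_k(router_scores, k)
    output = [0.0] * len(token_vector)
    for expert_index, gate in gates.items():
        expert_output = experts[expert_index](token_vector)
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return output, sorted(gates)

# Four toy "experts"; real experts are feed-forward networks.
experts = [
    lambda v: [x + 1.0 for x in v],
    lambda v: [x * 2.0 for x in v],
    lambda v: [-x for x in v],
    lambda v: [0.0 for _ in v],
]

# Hard-coded router scores; a real router computes these from the token.
output, used = moe_layer([1.0, 2.0], router_scores=[0.1, 2.0, 2.0, -1.0],
                         experts=experts)
print(used)    # [1, 2]
print(output)  # [0.5, 1.0]
```

Experts 0 and 3 never run for this token. That skipped work is where the savings come from, and keeping the load balanced across experts is where the extra routing complexity comes from.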

Slightly technical: where the MoE papers fit

Switch Transformers simplified MoE routing by sending each token to one expert. Efficient Large Scale Language Modeling with Mixtures of Experts studied autoregressive MoE language models at scale. Mixtral of Experts is a modern sparse MoE example where each token is routed to two feed-forward experts.

⚖️

Important nuance: MoE does not automatically mean smarter. Data quality, routing balance, training stability, inference hardware, and post-training still matter.

11. The AI ecosystem: MCP, tools, RAG, agents, and evals

The transformer is the engine, but real AI products usually add a stack around it. That stack gives the model fresh information, lets it take actions, checks its work, and keeps the system observable.

🗺️

Plain-English map: the LLM is the text brain, tools are the hands, RAG is the open-book notes, MCP is a standard plug for external systems, agents are the loop that decides what to do next, and evals are the tests that tell you if any of it works.

Term	Simple meaning	What it helps with	Watch out for
Prompt	Instructions and context	Steering behavior without changing weights	Vague prompts create vague answers
Tool calling	The model asks your app to run a function	Weather, search, payments, calendars, databases	Validate every argument before doing anything real
MCP	A shared protocol for connecting AI apps to tools/data	Reusable integrations across different hosts	Permissions, auth, and tool descriptions matter
RAG	Retrieve relevant documents before answering	Fresh facts and private knowledge	Bad retrieval creates confident wrong answers
Embeddings	Meaning as searchable vectors	Semantic search and clustering	Similar does not always mean correct
Agent	A model inside a task loop	Planning, tool use, retries, handoffs	Needs limits, logs, and stop conditions
Fine-tuning	Training on examples of desired behavior	Style, format, classification, repeated edge cases	Do evals first; do not use it as a fact database
Evals	Tests for model behavior	Comparing prompts, tools, models, and releases	Tiny demo tests miss real-world messiness

MCP: the USB-C idea for AI tools

MCP stands for Model Context Protocol. Instead of every AI app inventing a custom connector for every service, MCP defines a common client-server pattern. An AI app is the host. It creates an MCP client. That client connects to an MCP server, which exposes things like tools, resources, and prompts.

AI app / host
  └─ MCP client
       └─ MCP server
            ├─ tools: actions the model may request
            ├─ resources: files, docs, database records, logs
            └─ prompts: reusable instruction templates

The key idea is not that MCP makes the model smarter by itself. It makes integrations more standard. A coding agent can connect to GitHub, a support assistant can connect to tickets, and a research assistant can connect to document stores using the same basic pattern.

🔐

Security rule: treat tools like real permissions, not decorations. If a tool can send email, delete files, spend money, or publish content, the app should require clear approval, scoped access, logging, and argument validation.
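
One way to picture that rule is a small gate that every model-requested tool call has to pass before anything runs. The registry, tool names, and `validate_tool_call` helper below are hypothetical and not part of MCP or any SDK. They only show the shape of the checks.

```python
# Hypothetical tool registry: name -> (required argument types, needs approval).
ALLOWED_TOOLS = {
    "get_weather": ({"city": str}, False),
    "send_email": ({"to": str, "subject": str, "body": str}, True),
}

def validate_tool_call(name, arguments, approved=False):
    """Return (ok, reason) for a tool call requested by the model."""
    if name not in ALLOWED_TOOLS:
        return False, f"unknown tool: {name}"
    schema, needs_approval = ALLOWED_TOOLS[name]
    if set(arguments) != set(schema):
        return False, "unexpected or missing arguments"
    for key, expected_type in schema.items():
        if not isinstance(arguments[key], expected_type):
            return False, f"argument {key!r} has the wrong type"
    if needs_approval and not approved:
        return False, "user approval required"
    return True, "ok"

print(validate_tool_call("get_weather", {"city": "Oslo"}))
print(validate_tool_call("send_email",
                         {"to": "a@example.com", "subject": "Hi", "body": "Hello"}))
print(validate_tool_call("delete_everything", {}))
```

A production system would add scoped credentials and logging around this, but the ordering is the point: check first, act second.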

RAG: giving the model an open book

RAG means Retrieval-Augmented Generation. The model does not rely only on what it learned during training. Your app first searches a knowledge base, pulls the most relevant chunks into the prompt, and asks the model to answer using that context.

Split documents into chunks.
Turn each chunk into an embedding vector.
Store those vectors in a search index or vector database.
When the user asks something, search for similar chunks.
Put the best chunks into the model context and ask for a grounded answer.

const question = "What is our refund policy?";
const hits = await vectorSearch(embed(question), { topK: 5 });
const answer = await llm.generate({
  instructions: "Answer only from the provided policy snippets.",
  context: hits.map((hit) => hit.text),
  input: question,
});

Agents: the loop around the model

An agent is not a new kind of brain. It is usually an LLM plus an orchestration loop: read the goal, choose a next step, maybe call a tool, inspect the result, update the plan, and continue until done or stopped.

goal → think about next step → call tool → observe result → adjust plan → final answer
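
That loop fits in a few lines once the model call is stubbed out. In this sketch `scripted_policy` is a hypothetical stand-in for the LLM, the `lookup` tool and its data are made up, and `max_steps` is the kind of hard stop condition a real agent should always have.

```python
def run_agent(goal, choose_step, tools, max_steps=5):
    """Minimal agent loop: pick a step, maybe call a tool, observe, repeat."""
    history = []
    for _ in range(max_steps):
        step = choose_step(goal, history)  # stands in for an LLM call
        if step["action"] == "finish":
            return step["answer"], history
        tool = tools.get(step["action"])
        if tool is None:
            observation = f"error: unknown tool {step['action']}"
        else:
            observation = tool(**step["arguments"])
        history.append((step, observation))  # the log doubles as memory
    return "stopped: step limit reached", history

def scripted_policy(goal, history):
    """Hypothetical stand-in for the model: look something up, then answer."""
    if not history:
        return {"action": "lookup", "arguments": {"key": "refund_window_days"}}
    days = history[-1][1]
    return {"action": "finish", "answer": f"Refunds are accepted for {days} days."}

# Made-up tool backed by a tiny in-memory "database".
tools = {"lookup": lambda key: {"refund_window_days": 30}[key]}

answer, history = run_agent("What is the refund window?", scripted_policy, tools)
print(answer)
print(len(history), "tool call(s) logged")
```

Everything interesting in a real agent happens inside `choose_step`, but the limits, logs, and stop conditions live in this outer loop.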

🧪

Evals are what turn AI from a cool demo into an engineering system. Before shipping a new prompt, model, tool, or agent flow, test it on examples that represent real users, failure cases, and edge cases.

Slightly technical: how these pieces fit in one product

A production assistant might use MCP to discover tools, RAG to fetch private documents, tool calling to take controlled actions, structured outputs to return clean JSON, evals to measure quality, tracing to debug failures, and guardrails to block unsafe or unauthorized actions.

A friendly checklist for understanding any LLM answer

✅

Did the model receive the right information in the prompt?

✅

Was the important context near the beginning or end?

✅

Is the task asking for facts, reasoning, creativity, or formatting?

✅

Would retrieval or a tool make the answer more grounded?

✅

Should the output be checked against a source before trusting it?

✅

If you remember one thing, remember this: LLMs transform text into numbers, let those numbers exchange context through attention, and then predict the next token again and again until an answer appears.

The 2026 AI Landscape: A Hacker's Deep Dive 🤖

Thu, 04 Jun 2026 00:00:00 GMT

June 4, 2026

🤖

This deep dive is structured as a native Notion article: use the table of contents below to jump between architecture, agents, RAG, protocols, frameworks, prompting, vector databases, glossary, and code appendix.

Layer	Native Notion treatment
Navigation	Built-in table of contents plus semantic headings
Reference material	Native tables and collapsible glossary sections
Implementation detail	Language-aware code blocks and equation blocks
Long-form reading	Callouts, dividers, and structured sections

☕

Everything you need to understand the current AI moment — from transformers to agents, RAG pipelines to MCP, and every buzzword in between. Grab a coffee.

1. The Big Picture: Where AI Stands in 2026

The AI landscape of 2026 is defined by a single, sweeping shift: from chat to action. The previous era was dominated by raw model intelligence — who had the biggest, smartest LLM. The current era prioritizes orchestration layers that unify multiple models and tools to automate complex, end-to-end business workflows. We've moved from "AI that talks" to "AI that does."[1][2]

A few landmark data points paint the picture clearly:

700 million people use ChatGPT weekly as of mid-2025[3]
Anthropic now holds ~40% of enterprise LLM API spend; OpenAI has dropped to 27%, down from ~50% in 2023[4]
2026 is the year of autonomous AI agents — goal-driven systems combining reasoning, planning, and tool use, marking the biggest functional jump since GPT-3[1]
Open-source models (Meta's Llama 4 family with 10M token context windows) have narrowed the gap with proprietary models dramatically[4]

The defining theme is the Agentic Web — where AI agents serve as the primary gateway to the internet, navigating backends through APIs rather than humans switching between websites.[2]

2. The GPT Evolution — A Complete Timeline

The journey from GPT-1 to GPT-5.5 is arguably the fastest capability evolution in computing history. Between GPT-1 and GPT-3 alone, parameter counts grew from 117 million to 175 billion, a roughly 1,500× increase in about two years.[3]

Model	Date	Params	Key Leap
GPT-1	June 2018	117M	Proved unsupervised pre-training works[3]
GPT-2	Feb 2019	1.5B	Coherent long-form text; initially "too dangerous to release"[5]
GPT-3	May 2020	175B	First commercially viable model; in-context learning[5]
GPT-3.5 / InstructGPT	2022	~175B	RLHF introduced; gave us ChatGPT[5]
GPT-4	March 2023	Undisclosed	Multimodal (text + image); reasoning at scale[3]
GPT-4o	May 2024	Undisclosed	Omnimodal (text, image, audio natively); 2× speed[6]
GPT-4.5	Feb 2025	Undisclosed	Stronger world knowledge, fewer hallucinations[6]
GPT-5	Aug 2025	Undisclosed	94.6% on advanced math; 45% fewer hallucinations vs GPT-4o[3]
GPT-5.5	April 2026	Undisclosed	Native omnimodal; autonomous computer use; agentic coding[6]

The o-Series: Reasoning Models

Alongside the GPT series, OpenAI launched a separate reasoning-first model line: o1, o3, and o4-mini. These models still generate text token by token, but they are trained to "think before they speak," producing an extended chain of thought internally before the visible answer. o3 and o4-mini launched in April 2025 with significantly enhanced reasoning, particularly in STEM.[5][6]

3. How LLMs Actually Work: The Transformer Architecture

Every modern LLM is built on the Transformer, introduced in the landmark 2017 paper "Attention Is All You Need". Here's the architecture unwrapped:[7]

3.1 The Four Building Blocks

A transformer has four core building blocks. The first two are applied once at the input; the last two repeat in every layer:[8]

Token Embeddings — Convert words/subwords into numerical vectors in high-dimensional space
Positional Encoding — Inject information about the order of tokens (since attention has no built-in notion of sequence)
Attention — The magic: lets each token "look at" every other token
Feed-Forward Block — A pair of linear transformations applied position-wise
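To make the positional-encoding bullet concrete, here is a sketch of the fixed sinusoidal scheme from the original Transformer paper, in plain Python. The function name `sinusoidal_position` is made up for this example; many modern models use learned or rotary encodings instead.

```python
import math


def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Encoding for one position: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(d_model):
        # Each (even, odd) pair shares one frequency: 1 / 10000^(2k / d_model),
        # where 2k is the even index of the pair.
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

This vector is added to the token embedding at the same position, so two identical words at different positions end up with different inputs to attention.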

3.2 Self-Attention: The Core Insight

Self-attention answers the question: "Which other words should I focus on to understand my own meaning?" For every token, three vectors are computed:[9][10]

Q (Query) — "What am I looking for?"
K (Key) — "What do I contain?"
V (Value) — "What information do I carry?"

The attention score is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the key dimension. The $\sqrt{d_k}$ scaling prevents the dot products from growing too large and making softmax gradients vanish.[11]
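A quick numerical check of why the scaling helps: for random queries and keys whose components have unit variance, the raw dot product has variance close to the key dimension, while the scaled one stays near 1. The sketch below uses only the standard library; `dot_product_variance` is a helper invented for this demonstration.

```python
import random
import statistics


def dot_product_variance(d_k: int, scaled: bool,
                         trials: int = 2000, seed: int = 0) -> float:
    """Empirical variance of q.k for random unit-variance Gaussian vectors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        samples.append(dot / d_k ** 0.5 if scaled else dot)
    return statistics.pvariance(samples)


print(dot_product_variance(64, scaled=False))  # lands near d_k = 64
print(dot_product_variance(64, scaled=True))   # lands near 1
```

Scores with a spread of around 8 (the square root of 64) push softmax toward a near one-hot output, which is where its gradients become tiny.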

import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Core self-attention mechanism.
    Q, K, V: (batch_size, seq_len, d_k)
    """
    d_k = Q.size(-1)
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Apply optional mask (for decoder / causal attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights


# Example: 2 sentences, 4 tokens, 8-dim embeddings
batch_size, seq_len, d_model = 2, 4, 8
d_k = 8
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")        # (2, 4, 8)
print(f"Attention weights: {weights.shape}")  # (2, 4, 4)

3.3 Multi-Head Attention

Instead of computing attention once, transformers run multiple attention heads in parallel — GPT-3 uses 96 attention heads per block. Each head learns a different "relevance function." The outputs are concatenated and projected:[7]

import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        B, T, d = x.shape
        # (B, T, d) -> (B, num_heads, T, d_k)
        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x, mask=None):
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))
        # Reuses scaled_dot_product_attention from section 3.2
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads: (B, num_heads, T, d_k) -> (B, T, d_model)
        B, H, T, d_k = attn_out.shape
        attn_out = attn_out.transpose(1, 2).contiguous().view(B, T, H * d_k)
        return self.W_o(attn_out)

3.4 Modern Architecture Improvements

The vanilla transformer has been significantly optimized:[11]

Innovation	What It Does
FlashAttention	Reduces memory traffic 2–4× by optimizing GPU SRAM access patterns
RoPE (Rotary Position Embeddings)	Better position encoding enabling longer context windows
MoE (Mixture of Experts)	Activates only a subset of parameters per token — enables huge models at lower compute cost
GQA (Grouped Query Attention)	Multiple query heads share key/value heads, reducing KV-cache memory
Linear Attention	Reduces complexity from O(n²) to O(n) for long documents
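The GQA row is easy to quantify with back-of-the-envelope arithmetic: the KV cache stores one key and one value vector per layer, per KV head, per token. The model dimensions below are hypothetical round numbers chosen for the example, not the specs of any particular model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values of one sequence (factor 2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 32 query heads of size 128, 8K-token context, fp16.
full_mha = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=8192)
grouped = kv_cache_bytes(32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(full_mha / 2**30, grouped / 2**30)  # 4.0 1.0  (GiB per sequence)
```

Sharing each key/value head across four query heads cuts the cache to a quarter, which translates directly into longer contexts or more concurrent requests on the same GPU.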

4. The Agentic AI Revolution

The single biggest shift in 2026 is from generative AI (creates content) to agentic AI (autonomous systems that plan, decide, and execute). Where a generative model answers your question, an agentic AI accomplishes your goal.[12]

4.1 What Makes an AI Agent?

An agent has four capabilities that a plain chatbot lacks:[12]

Goal Understanding — Decompose a complex objective into sub-tasks
Multi-Step Planning — Create and revise a plan of action
Tool Use — Execute functions, call APIs, browse the web, write code
Self-Correction — Observe outcomes and adjust behavior in a loop

4.2 The ReAct Pattern

ReAct (Reasoning + Acting) is the foundational pattern for agents, introduced by Yao et al. The model interleaves reasoning traces with actions:[13]

Thought: I need to find the current population of Karachi.
Action: search("Karachi population 2026")
Observation: Karachi population is approximately 16.5 million.
Thought: Now I can answer the question.
Answer: Karachi has approximately 16.5 million people.

import json

from openai import OpenAI

client = OpenAI()

# Define tools the agent can use
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python code and return the result",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Python code to run"}
                },
                "required": ["code"]
            }
        }
    }
]


def run_react_agent(task: str, max_steps: int = 5):
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        # No tool call = final answer
        if not msg.tool_calls:
            return msg.content
        # Record the assistant turn, then execute each requested tool call
        messages.append(msg)
        for tool_call in msg.tool_calls:
            # execute_tool is not defined in this snippet: supply a function
            # that maps a tool name and an argument dict to a result.
            result = execute_tool(tool_call.function.name,
                                  json.loads(tool_call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })
    return "Max steps reached"
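The agent loop above calls `execute_tool` without defining it. One possible shape for it is a small registry that maps tool names to Python functions. Everything in this sketch (`TOOL_REGISTRY`, `register_tool`, the placeholder `web_search` body) is hypothetical; `run_python` is left out on purpose, since executing model-written code needs a real sandbox.

```python
from typing import Any, Callable

# Hypothetical registry mapping tool names to plain Python callables.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}


def register_tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


@register_tool("web_search")
def web_search(query: str) -> str:
    # Placeholder: a real implementation would call a search API here.
    return f"(no search backend configured) query={query!r}"


def execute_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Dispatch a model-requested tool call. Errors come back as text so the
    model can observe them and try something else."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return f"error: unknown tool {name!r}"
    try:
        return fn(**arguments)
    except TypeError as exc:  # wrong or missing arguments
        return f"error: bad arguments for {name!r}: {exc}"
```

Returning errors as observations, rather than raising, keeps the loop alive and gives the model a chance to correct its own call.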

4.3 Agentic Patterns

Four core patterns drive agent behavior:[14]

Pattern	Description	Example
Reflection	Agent critiques its own output and revises	Code reviewer that re-checks generated code
Planning	Decompose goal into ordered sub-tasks	Research agent building a structured outline
Tool Use	Call external APIs and functions	Weather agent calling a weather API
Multi-Agent	Spawn specialized sub-agents	Orchestrator delegates to coder + tester agents

4.4 Multi-Agent Systems (MAS)

By 2026, the field has moved beyond single-purpose agents to Multi-Agent Systems — AI "teams" where specialized agents collaborate to achieve a shared objective, mirroring microservice architecture in traditional software.[12]

                    ┌─────────────────┐
                    │  Orchestrator   │
                    │  (Planner LLM)  │
                    └────────┬────────┘
            ┌────────────────┼───────────────┐
            ▼                ▼               ▼
    ┌──────────────┐ ┌─────────────┐ ┌──────────────┐
    │  Researcher  │ │   Coder     │ │   Reviewer   │
    │   Agent      │ │   Agent     │ │   Agent      │
    └──────────────┘ └─────────────┘ └──────────────┘

5. RAG: Retrieval-Augmented Generation Deep Dive

RAG is how you give an LLM access to your private data without retraining it. In 2026, RAG is default infrastructure for enterprise LLM applications.[1]

5.1 Naive RAG Pipeline

The vanilla RAG flow is simple:[15]

User Query
    │
    ▼
Embed Query (vector)
    │
    ▼
Similarity Search → Vector DB → Top-K Documents
    │
    ▼
Inject Context into LLM Prompt
    │
    ▼
LLM generates grounded response

import numpy as np
from openai import OpenAI

client = OpenAI()


# Step 1: Embed documents at ingestion time
def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


# Step 2: Simple cosine similarity search
def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve(query: str, documents: list[dict], top_k: int = 3):
    query_vec = embed_text(query)
    scored = [
        (doc, cosine_similarity(query_vec, doc["embedding"]))
        for doc in documents
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]


# Step 3: Generate grounded answer
def rag_answer(query: str, docs: list[str]) -> str:
    context = "\n\n".join(docs)
    prompt = f"""Answer the question using ONLY the context below.
If the answer isn't in the context, say "I don't know."

Context:
{context}

Question: {query}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

5.2 Advanced RAG Techniques

Naive RAG breaks at scale. Production systems use:[16]

Technique	Problem It Solves
Chunking strategies	Semantic splitting preserves context better than fixed-size chunks
Hybrid search	BM25 (keyword) + vector search for better recall
Reranking	Cross-encoder reranks top-K results for precision
HyDE	Generate a hypothetical answer, embed it, then search — better for abstract queries
MMR (Maximal Marginal Relevance)	Diversify retrieved documents to avoid redundancy
Multi-vector retrieval	Store summary + detailed chunks separately
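Of the techniques in the table, MMR is small enough to sketch in full. Each pick maximizes `lam * relevance - (1 - lam) * redundancy`, where redundancy is the highest similarity to anything already selected. The function names and the toy 2-D vectors below are illustrative assumptions; real embeddings have hundreds of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: list[float], docs: list[list[float]], k: int, lam: float = 0.5) -> list[int]:
    """Return indices of k docs, trading relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


docs = [[1.0, 0.9], [1.0, 0.91], [0.2, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr([1.0, 1.0], docs, k=2, lam=0.5))  # [1, 2]  (skips the near-duplicate)
print(mmr([1.0, 1.0], docs, k=2, lam=1.0))  # [1, 0]  (pure relevance ranking)
```

With `lam=1.0` MMR degenerates to plain top-K; lowering it buys diversity at the cost of some relevance.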

5.3 Agentic RAG

Agentic RAG supercharges RAG by adding an agent layer that can iterate, re-retrieve, and validate before answering:[15]

import json


class AgenticRAGPipeline:
    """
    Agentic RAG: Agent decides WHEN and WHAT to retrieve,
    can iterate multiple retrieval rounds, and validates output.
    """

    def __init__(self, retriever, llm_client):
        self.retriever = retriever    # any object with search(query, top_k) -> list[str]
        self.client = llm_client      # an OpenAI-style chat client
        self.retrieved_docs = []

    def should_retrieve_more(self, current_answer: str, query: str) -> tuple[bool, str]:
        """Ask the LLM if it needs more context. Returns (needs_more, missing)."""
        check_prompt = f"""Query: {query}
Current draft answer: {current_answer}
Is this answer complete and well-supported?
Reply with JSON: {{"complete": true/false, "missing": "what's missing"}}"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        # A missing "complete" key is treated as "not complete" instead of crashing
        return not result.get("complete", False), result.get("missing", "")

    def run(self, query: str, max_iterations: int = 3) -> str:
        answer = ""
        missing = ""
        for i in range(max_iterations):
            # First round uses the raw query; later rounds target the reported gap
            search_query = query if i == 0 else f"{query} - focusing on: {missing or answer}"
            new_docs = self.retriever.search(search_query, top_k=5)
            self.retrieved_docs.extend(new_docs)
            # Generate answer with all accumulated context
            context = "\n---\n".join(self.retrieved_docs)
            answer = self._generate(query, context)
            # Check if we need more info
            needs_more, missing = self.should_retrieve_more(answer, query)
            if not needs_more:
                break
        return answer

    def _generate(self, query: str, context: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nAnswer: {query}"
            }]
        )
        return response.choices[0].message.content

6. MCP: The USB-C for AI Tools

Model Context Protocol (MCP) is an open standard introduced by Anthropic in November 2024 to standardize how AI systems integrate with external tools, data sources, and services. Think of it as the USB-C port for AI — one standard connector for everything.[17]

6.1 Why MCP Matters

Before MCP, every AI-tool integration was a custom one-off. MCP provides:[18]

Tools — Functions the AI can call (e.g., run_sql, read_file)
Resources — Data the AI can read (files, database records, API responses)
Prompts — Reusable prompt templates
Sampling — The server can ask the client to run an LLM query

┌─────────────────────────────────────────────────┐
│                 MCP Architecture                │
│                                                 │
│  ┌──────────────┐      MCP Protocol             │
│  │  AI Client   │◄──────────────────►┐          │
│  │ (Claude/GPT) │                    │          │
│  └──────────────┘         ┌──────────┴────────┐ │
│                           │    MCP Server     │ │
│                           │  (your tools)     │ │
│                           │                   │ │
│                           │  ┌─────────────┐  │ │
│                           │  │  Tools      │  │ │
│                           │  │  Resources  │  │ │
│                           │  │  Prompts    │  │ │
│                           │  └─────────────┘  │ │
│                           └───────────────────┘ │
└─────────────────────────────────────────────────┘

6.2 Building an MCP Server

# Minimal MCP server using the official Python SDK
import asyncio

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

app = Server("my-mcp-server")


@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="get_weather",
            description="Get the current weather for a city",
            inputSchema={
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["city"]
            }
        ),
        Tool(
            name="run_sql",
            description="Execute a read-only SQL query against our DB",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            }
        )
    ]


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "get_weather":
        city = arguments["city"]
        # Call your actual weather API here
        return [TextContent(type="text", text=f"Weather in {city}: 28°C, sunny")]
    elif name == "run_sql":
        query = arguments["query"]
        # execute_readonly_query is not defined in this snippet: supply your own
        # function that enforces read-only access before running the query.
        results = execute_readonly_query(query)
        return [TextContent(type="text", text=str(results))]
    # Fail loudly instead of silently returning None for an unknown tool
    raise ValueError(f"Unknown tool: {name}")


async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())


if __name__ == "__main__":
    asyncio.run(main())

6.3 MCP vs Traditional APIs

Dimension	REST API	MCP Server
Discovery	Manual (read docs)	Auto-discovery via `list_tools()`
Schema	OpenAPI/Swagger	JSON Schema, AI-readable
Context sharing	Per-request	Stateful sessions with context
AI integration	Custom glue code	Native, standardized
Multi-tool	N integrations	One MCP layer
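Under the hood, the discovery and invocation rows come down to JSON-RPC 2.0 messages. The sketch below builds the two request shapes by hand, using the `tools/list` and `tools/call` method names as I understand them from the MCP specification; in practice an SDK constructs these for you, and `jsonrpc_request` is a helper invented for this example.

```python
import itertools
import json
from typing import Optional

_ids = itertools.count(1)  # each request carries a unique id


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize one JSON-RPC 2.0 request, the envelope MCP messages use."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Discovery: ask the server which tools it offers.
print(jsonrpc_request("tools/list"))
# Invocation: call one tool by name with arguments that match its input schema.
print(jsonrpc_request("tools/call",
                      {"name": "get_weather", "arguments": {"city": "Karachi"}}))
```

Because every server answers the same two requests, a client can work with a tool it has never seen before, which is the point of the "auto-discovery" row.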

7. A2A: The Protocol for Agent Collaboration

While MCP handles agent-to-tool communication, A2A (Agent2Agent) — announced by Google in April 2025 — handles agent-to-agent communication.[19]

7.1 The Problem A2A Solves

Imagine a travel booking agent that needs to coordinate with a payment agent from a different company, a hotel API agent, and an airline agent — all built on different frameworks. A2A enables them to discover each other, understand capabilities, and coordinate tasks without sharing internals.[19]

MCP Layer (Vertical):          A2A Layer (Horizontal):
Agent ◄──tools──► APIs         Agent ◄──collaborate──► Agent
Agent ◄──tools──► Databases    Agent ◄──delegate─────► Agent
Agent ◄──tools──► Files        Agent ◄──coordinate───► Agent

7.2 Agent Cards: The Discovery Mechanism

Every A2A agent publishes an Agent Card — a JSON document at a well-known URL that describes the agent's capabilities:[20]

{
  "name": "payment-processor-agent",
  "version": "1.2.0",
  "description": "Handles payment processing for e-commerce workflows",
  "endpoint": "https://payments.example.com/a2a",
  "skills": [
    {
      "id": "process_payment",
      "name": "Process Payment",
      "description": "Charge a customer for a transaction",
      "inputModes": ["text", "json"],
      "outputModes": ["json"]
    },
    {
      "id": "refund",
      "name": "Issue Refund",
      "description": "Refund a previously processed payment"
    }
  ],
  "authentication": {
    "schemes": ["Bearer"]
  }
}
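On the consuming side, a client agent reads the card and decides whether the remote agent can help. The sketch below works against the field names used in the example card above (`skills`, `id`, `inputModes`), which may not match the published A2A schema exactly; `find_skill` and `accepts_input` are hypothetical helpers.

```python
import json
from typing import Optional


def find_skill(card_json: str, skill_id: str) -> Optional[dict]:
    """Return the skill entry with the given id, or None if the agent lacks it."""
    card = json.loads(card_json)
    for skill in card.get("skills", []):
        if skill.get("id") == skill_id:
            return skill
    return None


def accepts_input(skill: dict, mode: str) -> bool:
    """Skills that omit inputModes are treated here as text-only (an assumption)."""
    return mode in skill.get("inputModes", ["text"])


card = json.dumps({
    "name": "payment-processor-agent",
    "skills": [
        {"id": "process_payment", "inputModes": ["text", "json"]},
        {"id": "refund"},
    ],
})
print(accepts_input(find_skill(card, "process_payment"), "json"))  # True
print(accepts_input(find_skill(card, "refund"), "json"))           # False
```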

7.3 MCP + A2A: The Full Stack

Google positioned A2A as complementary to MCP, not competitive:[20]

┌──────────────────────────────────────────────────┐
│              Enterprise AI Architecture          │
│                                                  │
│    ┌─────────┐   A2A   ┌─────────┐               │
│    │ Agent A │◄───────►│ Agent B │               │
│    └────┬────┘         └────┬────┘               │
│         │ MCP               │ MCP                │
│    ┌────┴────┐         ┌────┴────┐               │
│    │ Tools & │         │ Tools & │               │
│    │  Data   │         │  Data   │               │
│    └─────────┘         └─────────┘               │
│                                                  │
│  MCP = vertical (agent ↔ tools)                  │
│  A2A = horizontal (agent ↔ agent)                │
└──────────────────────────────────────────────────┘

8. AI Agent Frameworks Compared

Choosing the wrong framework costs weeks. Here's the production-tested ranking for 2026:[21]

Framework	Best For	Architecture Style	MCP/A2A Support
LangGraph	Complex stateful production workflows	Graph-based, explicit state machines	✅ MCP
Claude Agent SDK	Anthropic-native production agents	Native Claude hooks + subagents	✅ MCP native
CrewAI	Fast multi-agent prototypes	Role-based crews	✅ MCP
AutoGen / AG2	Research-style conversational agents	Conversational multi-agent	✅ MCP
Semantic Kernel	Enterprise / .NET stacks	Plugin-based, Azure-first	✅
LlamaIndex	RAG-grounded agents	Data-layer first	✅
Pydantic AI	Type-safe Python agents	Pydantic validation throughout	✅

8.1 LangGraph: Production-Ready State Machines

import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, END

# vector_store and llm are not defined in this snippet: plug in your own
# retriever (with a search(query, k) method) and model client (with invoke()).


class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    query: str
    documents: list[str]
    answer: str
    needs_more_info: bool


def retrieve_node(state: AgentState) -> dict:
    """Retrieve relevant documents. Returns only the keys it updates."""
    docs = vector_store.search(state["query"], k=5)
    return {"documents": docs}


def generate_node(state: AgentState) -> dict:
    """Generate answer from retrieved docs."""
    context = "\n".join(state["documents"])
    answer = llm.invoke(f"Context: {context}\nQuestion: {state['query']}")
    return {"answer": answer, "needs_more_info": False}


def check_node(state: AgentState) -> str:
    """Route: done or need more retrieval?"""
    return "done" if not state["needs_more_info"] else "retrieve"


# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve_node)
workflow.add_node("generate", generate_node)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_conditional_edges("generate", check_node, {
    "done": END,
    "retrieve": "retrieve"  # Loop back if needed
})
app = workflow.compile()

# Run it
result = app.invoke({"query": "What is the capital of Punjab?", "messages": []})
print(result["answer"])

8.2 CrewAI: Role-Based Multi-Agent Teams

from crewai import Agent, Task, Crew, Process

# web_search_tool and wikipedia_tool are not defined in this snippet:
# create them with whichever tool library you use before building the agents.

# Define specialized agents
researcher = Agent(
    role="Senior Research Analyst",
    goal="Find accurate, up-to-date information on the given topic",
    backstory="Expert researcher with access to web search and databases",
    verbose=True,
    allow_delegation=False,
    tools=[web_search_tool, wikipedia_tool]
)
writer = Agent(
    role="Technical Writer",
    goal="Write clear, engaging blog posts from research findings",
    backstory="Experienced tech blogger who makes complex topics accessible",
    verbose=True,
    allow_delegation=False
)

# Define tasks
research_task = Task(
    description="Research the latest developments in {topic}. "
                "Find key facts, statistics, and expert opinions.",
    expected_output="A structured research brief with citations",
    agent=researcher
)
writing_task = Task(
    description="Write a 1000-word blog post based on the research brief. "
                "Make it engaging for a technical audience.",
    expected_output="A complete, publication-ready blog post in Markdown",
    agent=writer,
    context=[research_task]  # Uses output from research_task
)

# Assemble and run the crew
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
    verbose=True  # recent CrewAI releases expect a boolean here, not a level like 2
)
result = crew.kickoff(inputs={"topic": "MCP protocol for AI agents"})
print(result)

9. Prompt Engineering Playbook

Prompt engineering is the art of communicating precisely with LLMs. Here are the techniques every practitioner needs:[22]

9.1 Core Techniques

# ─── Zero-Shot ──────────────────────────────────────────────────────────────
# No examples. Rely on the model's training.
zero_shot = "Classify the sentiment of this review: 'The app crashes constantly.'"

# ─── Few-Shot ────────────────────────────────────────────────────────────────
# Provide examples to guide the model
few_shot = """Classify sentiment. Examples:
Input: "Love this product!" → Positive
Input: "Terrible experience." → Negative
Input: "It's okay, nothing special." → Neutral
Now classify: "The battery life is surprisingly good."
"""

# ─── Chain-of-Thought (CoT) ──────────────────────────────────────────────────
# Force step-by-step reasoning for complex tasks
cot = """Solve this step by step:
A store sells apples for Rs. 50 each. If Ali buys 12 apples with Rs. 700, how much change does he get?
Think through it step by step before giving the final answer."""

# ─── ReAct Pattern ───────────────────────────────────────────────────────────
# Alternate reasoning and action
react_system = """You are an agent. For each task:
1. Thought: reason about what to do next
2. Action: choose a tool [search | calculate | respond]
3. Observation: note what the tool returned
4. Repeat until you have the final answer."""

# ─── Self-Consistency ────────────────────────────────────────────────────────
# Sample multiple reasoning paths, pick the majority answer
from collections import Counter


def self_consistent_answer(question: str, client, n_samples: int = 5) -> str:
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"{question}\nThink step by step."
            }],
            temperature=0.7  # Some variability for diversity
        )
        # Use the last line of the response as the final answer
        text = response.choices[0].message.content
        answers.append(text.strip().split("\n")[-1])
    # Return most common answer
    return Counter(answers).most_common(1)[0][0]

9.2 System Prompt Architecture

SYSTEM_PROMPT_TEMPLATE = """
## Role
You are {role_name}, a {expertise_level} specialist in {domain}.

## Objective
{primary_objective}

## Constraints
- Always cite sources when making factual claims
- If uncertain, say "I'm not sure" rather than guessing
- Keep responses under {max_length} words unless asked for detail
- Output format: {output_format}

## Context
Today's date: {date}
User's technical level: {user_level}

## Examples
{few_shot_examples}
"""

10. Vector Databases: The Memory Layer

Vector databases store embeddings — dense numerical representations of meaning — enabling semantic search (search by meaning, not keywords).[23]

10.1 How Embeddings Work

import numpy as np
from openai import OpenAI

client = OpenAI()


def get_embedding(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding


# Semantic similarity demo
sentences = [
    "Karachi is the largest city in Pakistan",
    "The metropolitan area of Karachi has 16 million people",
    "I like to eat biryani",
    "Python is a programming language"
]
embeddings = [get_embedding(s) for s in sentences]


def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query_emb = get_embedding("What is the population of Karachi?")
for s, emb in zip(sentences, embeddings):
    score = cosine_sim(query_emb, emb)
    print(f"{score:.3f} | {s}")

# Example output (illustrative; exact scores vary by model and version):
# 0.812 | Karachi is the largest city in Pakistan  ← high
# 0.798 | The metropolitan area of Karachi...      ← high
# 0.312 | I like to eat biryani                    ← low
# 0.289 | Python is a programming language         ← low

10.2 Vector Database Comparison

DB	Best For	Hosting	Hybrid Search	Notes
Pinecone	Managed, production scale	Cloud-only	✅	Easiest setup
Weaviate	GraphQL + semantic queries	Self/Cloud	✅	MCP support in v3.0
Qdrant	High-performance, Rust core	Self/Cloud	✅	Best perf/$
Chroma	Local dev & testing	Self-host	Limited	Dead-simple Python API
Milvus	Massive scale (billions)	Self/Cloud	✅	GPU-accelerated
pgvector	Already using Postgres	Self-host	✅	No new infra needed

# Quick start: Chroma (local, perfect for prototyping)
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_KEY",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="knowledge_base",
    embedding_function=openai_ef
)

# Add documents
collection.add(
    documents=[
        "MCP is an open standard by Anthropic for AI tool integration",
        "RAG stands for Retrieval-Augmented Generation",
        "LangGraph is a framework for building stateful agent workflows"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["How do AI agents connect to external tools?"],
    n_results=2
)
print(results["documents"])
# [['MCP is an open standard by Anthropic...',
#   'LangGraph is a framework...']]

11. The Complete AI Dictionary

A comprehensive reference of every term you'll encounter, from beginner to deep-technical.

🔤 Foundational Concepts

Artificial Intelligence (AI) The broad field of building systems that perform tasks that typically require human intelligence — reasoning, learning, perception, language understanding.

Machine Learning (ML) A subfield of AI where systems learn from data rather than being explicitly programmed. The model improves with experience.

Deep Learning (DL) ML using neural networks with many layers ("deep"). Powers all modern LLMs, image models, and speech systems.

Neural Network A computational model loosely inspired by biological neurons. Consists of layers of mathematical functions that transform inputs into outputs.

Parameters / Weights The learnable numerical values in a neural network. GPT-3 has 175 billion parameters. More parameters ≠ always better, but generally more capability.[3]

Training The process of adjusting a model's parameters on a large dataset to minimize prediction error. Requires massive compute (GPU clusters).

Inference Running a trained model to generate outputs. What happens when you type a prompt into ChatGPT.

Token The basic unit of text for LLMs. A token is roughly 0.75 words in English. "Hello, world!" = 4 tokens. LLMs process and generate text as token sequences.[16]

Context Window The maximum number of tokens an LLM can "see" at once. GPT-4 launched with 8K and 32K windows and GPT-4 Turbo raised that to 128K tokens; Llama 4 Scout advertises 10 million tokens. A larger window fits more context but costs more per request.[4]

Embedding A dense numerical vector (array of floats) representing the semantic meaning of text, images, or other data. Similar meanings cluster together in embedding space.[23]

🧠 LLM Architecture Terms

Transformer The neural network architecture underlying all major LLMs, introduced in 2017. Key innovation: the attention mechanism replaces sequential processing with parallel processing.[7]

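Attention Mechanism The core innovation of transformers. Lets each token attend to (learn from) every other token in context, regardless of distance. Each token is projected into a query (Q), key (K), and value (V) vector; query-key similarity decides how much of each value flows into the output.[10]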
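A minimal sketch of scaled dot-product attention in plain Python may make the Q/K/V idea concrete. It assumes tiny hand-written vectors instead of learned projections, and the names `attention`, `Q`, `K`, and `V` are illustrative, not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy inputs (assumed for illustration): two tokens, two dimensions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each output row is a blend of the value vectors, tilted toward the token whose key best matches the query.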

Self-Attention Attention where the query, key, and value all come from the same sequence. Enables a model to understand words in context of each other.[10]

Multi-Head Attention Running multiple attention operations in parallel, each learning different relationships. GPT-3 uses 96 attention heads.[7]

Positional Encoding A mechanism to inject token position information into embeddings, since attention is position-agnostic by default.[11]

RoPE (Rotary Position Embeddings) A modern positional encoding scheme that encodes position through rotation matrices, enabling better generalization to longer contexts than the original model was trained on.[11]

MoE (Mixture of Experts) Architecture where only a subset of model parameters ("experts") activate per token, enabling models to have far more total parameters at similar inference cost.[11]

Encoder / Decoder

Encoder-only (e.g., BERT): Builds rich representations; best for classification, NER
Decoder-only (e.g., GPT): Generates text autoregressively; best for generation tasks
Encoder-Decoder (e.g., T5): Good for translation and summarization

Autoregressive Generation How LLMs generate text: one token at a time, each new token conditioned on all previous tokens. This is why they can't "edit" — they always predict left-to-right.

Temperature Controls randomness in generation. Temperature=0: always pick the most likely token (deterministic). Temperature=1: sample proportionally. Temperature>1: more random/creative.
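As a rough sketch of what temperature does to the next-token distribution, the function below divides the logits by the temperature before applying softmax. The logits are made-up numbers, and treating temperature 0 as a pure argmax is an assumption about how the greedy case is handled.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding (all mass on the argmax).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]          # hypothetical scores for three tokens
greedy = softmax_with_temperature(logits, 0)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

Raising the temperature flattens the distribution, so the top token loses probability mass to the others.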

Top-P (Nucleus Sampling) Restricts sampling to the smallest set of tokens whose cumulative probability exceeds P. More robust than Temperature alone for controlling output quality.
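Here is a small sketch of the nucleus filter, assuming the probabilities are already computed. `top_p_filter` is a hypothetical helper name, and it stops once the cumulative mass reaches P (some implementations use a strict "exceeds").

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest set reaching mass p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving tokens form a valid distribution.
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
```

With these toy numbers the lowest-probability token is dropped and sampling happens over the remaining three.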

KV Cache Stores computed key and value matrices for previously processed tokens so they don't need to be recomputed during autoregressive generation. Critical for inference efficiency.

🏋️ Training & Alignment

Pre-training Initial training of an LLM on massive unlabeled text data (web, books, code). Learns statistical patterns of language. Requires enormous compute.

Fine-tuning Further training on a smaller, task-specific dataset to specialize a pre-trained model. Cheaper than pre-training.

SFT (Supervised Fine-Tuning) Fine-tuning on human-curated input-output pairs. "Given this input, produce this output."

RLHF (Reinforcement Learning from Human Feedback) The technique that transformed GPT-3 into ChatGPT. Human evaluators rate outputs; a reward model is trained on those ratings; the LLM is fine-tuned to maximize the reward. Dramatically improves alignment and reduces harmful outputs.[24]

PEFT (Parameter-Efficient Fine-Tuning) Fine-tuning techniques that update only a small fraction of parameters (e.g., LoRA, QLoRA), making fine-tuning feasible on consumer hardware.[1]

LoRA (Low-Rank Adaptation) Popular PEFT method that adds small trainable rank-decomposition matrices to existing weight matrices. It often approaches full fine-tuning quality while training a small fraction of the parameters; the exact savings depend on the model, rank, and task.
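To see why so few parameters are trained, here is a sketch of the LoRA update `y = W x + (alpha / r) * B (A x)` with plain lists. The 4096 x 4096 layer size and rank 8 are assumptions picked for the arithmetic, not figures from any specific model.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Frozen weight W plus the scaled low-rank update B @ A."""
    rank = len(A)                      # A is rank x d_in, B is d_out x rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_out, d_in, rank):
    # Only A and B are trained; W stays frozen.
    return rank * (d_in + d_out)

# Hypothetical layer: 4096 x 4096 weights, rank-8 adapter.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
fraction = lora / full

# B starts at zero in LoRA, so the adapted layer initially matches the base layer.
y = lora_forward(W=[[1, 0], [0, 1]], A=[[1, 1]], B=[[0], [0]], x=[3, 4], alpha=16)
```

For this assumed layer the adapter trains 65,536 values against 16,777,216 in the full matrix, which is under half a percent.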

DPO (Direct Preference Optimization) A simpler alternative to RLHF that directly optimizes the model against human preference data without a separate reward model.

Constitutional AI Anthropic's technique for alignment: the model critiques its own responses against a set of principles and revises them.

Hallucination When an LLM generates confident, fluent, but factually wrong information. Contributing causes include noisy training data and the autoregressive generation process, which favors plausible continuations over verified facts. Major active research area.[25]

Alignment Ensuring AI systems behave according to human values and intentions. Includes safety, helpfulness, and harmlessness.

🤖 Agentic AI Terms

AI Agent An AI system that autonomously perceives its environment, makes decisions, executes actions, and pursues goals over multiple steps — not just responds to a single prompt.[12]

Agentic AI The paradigm where LLMs act as autonomous agents that plan, use tools, and complete multi-step tasks without constant human direction.[12]

Tool Use / Function Calling The ability of an LLM to call external functions, APIs, and services based on user requests. Core capability enabling agents to "do things" rather than just "say things."[26]

Orchestrator The "brain" of a multi-agent system — the component that plans tasks, delegates to sub-agents, and synthesizes results. Often a more powerful LLM.[2]

Sub-agent / Worker Agent Specialized agents that execute specific tasks delegated by an orchestrator. Examples: a web-search agent, a code-execution agent, a database agent.

Human-in-the-Loop (HITL) A system design where a human can intervene, approve, or redirect an agent at key decision points. Critical for high-stakes workflows.

ReAct (Reasoning + Acting) A foundational prompting/architecture pattern for agents where the model interleaves reasoning ("Thought:") and actions ("Action:") in a loop.[13]

Planning The agent's ability to decompose a goal into an ordered sequence of sub-tasks. Types include: Plan-then-Execute, ReAct (interleaved), and Tree-of-Thought (branching).

Reflection An agentic pattern where the agent reviews and critiques its own previous output to improve it — a form of self-correction.[14]

Memory (Agent Memory) How agents retain information:

In-context: Within the current prompt window (ephemeral)
External: Stored in vector DBs or traditional DBs (persistent)
Episodic: Records of past interactions
Semantic: General knowledge/facts

Multi-Agent System (MAS) A network of specialized agents collaborating to solve problems that exceed any single agent's capability.[12]

📡 Protocols & Infrastructure

MCP (Model Context Protocol) Open standard by Anthropic (Nov 2024) for connecting AI agents to external tools and data sources through a standardized, AI-readable interface. The "USB-C for AI."[17]

A2A (Agent2Agent Protocol) Open protocol by Google (April 2025) enabling AI agents from different vendors to discover, communicate, and collaborate with each other.[19]

Agent Card A JSON document published by an A2A agent that describes its capabilities, endpoint, and authentication requirements — enabling other agents to discover it.[20]

RAG (Retrieval-Augmented Generation) Grounding LLM outputs in relevant documents retrieved from a knowledge base at inference time, reducing hallucination and enabling access to private/current data.[15]

Agentic RAG RAG enhanced with agent capabilities — the agent can iteratively retrieve, evaluate, and re-retrieve context before generating the final answer.[15]

Vector Database A database optimized for storing and querying high-dimensional vector embeddings via similarity search (ANN algorithms).[23]

Semantic Search Search that finds results by meaning rather than keyword matching, using embedding similarity.[27]
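A toy sketch of semantic search using cosine similarity follows. The three-dimensional vectors are hand-written stand-ins for real embeddings, and the document ids are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, index, top_k=2):
    # Score every document, then keep the top_k most similar ids.
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Stand-in embeddings (assumed); a real system would get these from an embedding model.
index = {
    "doc_tools": [0.9, 0.1, 0.0],
    "doc_rag": [0.1, 0.9, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}
hits = semantic_search([0.8, 0.3, 0.0], index)
```

This brute-force scan is fine for a handful of vectors; vector databases swap it for an approximate index once the collection grows.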

Hybrid Search Combining vector similarity search (semantic) with keyword-based search (BM25) in a single query for better recall and precision.[27]
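One common way to merge the two result lists is reciprocal rank fusion (RRF). The sketch below is that technique, not necessarily what any particular database does internally; the document ids are invented and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical semantic results
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that rank well in both lists rise to the top, and the fusion needs only ranks, so the two scoring scales never have to be reconciled.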

HNSW (Hierarchical Navigable Small World) A graph-based indexing algorithm used in vector databases for fast approximate nearest neighbor (ANN) search.[23]

Chunking Breaking documents into smaller pieces before indexing in RAG systems. Semantic chunking (by meaning) often retrieves better than fixed-size chunking, though results depend on the corpus.[16]
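For reference, the fixed-size baseline looks roughly like this. It is a sketch that counts words rather than tokens, and `chunk_size` and `overlap` are illustrative parameter names.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks.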

Reranking A second-pass step in RAG pipelines where retrieved documents are rescored using a more accurate (but slower) cross-encoder model to improve precision.

📝 Prompting & Generation

Prompt Engineering The practice of designing input prompts to maximize LLM output quality. A rapidly evolving discipline with significant impact on model performance.[22]

Zero-Shot Prompting Instructing a model to perform a task with no examples — relying purely on the model's pre-trained knowledge.[22]

Few-Shot Prompting Providing a small number of input-output examples in the prompt to guide the model's behavior.[22]
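A few-shot prompt is just careful string assembly. The sketch below assumes a simple `Input:` / `Output:` layout, which is one convention among many, and the sentiment examples are made up.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    # Leave the final Output blank so the model completes it.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved it", "positive"), ("Total waste of time", "negative")],
    "Surprisingly good",
)
```

Passing an empty `examples` list turns the same helper into a zero-shot prompt.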

Chain-of-Thought (CoT) Prompting Instructing the model to show its reasoning step-by-step before giving a final answer. Dramatically improves performance on math, logic, and multi-step tasks.[22]

System Prompt A special prompt (hidden from the user) that sets the model's role, persona, constraints, and behavior for an entire conversation.

Temperature / Sampling Parameters controlling the randomness and diversity of LLM outputs.

Prompt Injection An attack where malicious content in the environment (e.g., a webpage an agent reads) attempts to override the agent's instructions.

Jailbreaking Attempts to bypass an LLM's safety guardrails through cleverly crafted prompts.

📊 Evaluation & Safety

Benchmark A standardized test for measuring model capability. Examples: MMLU (knowledge), HumanEval (coding), MATH (mathematics), GPQA (PhD-level science).

Hallucination LLM-generated content that is factually incorrect but stated with confidence.[25]

Groundedness The extent to which an LLM's outputs are supported by provided context (e.g., retrieved documents in RAG).

Red-teaming Adversarially probing an AI system to find safety vulnerabilities, jailbreaks, and failure modes before deployment.

AI Safety The field studying how to build AI systems that reliably do what humans intend and avoid unintended harmful behaviors.

Guardrails Programmatic constraints applied to LLM inputs and outputs to enforce safety, content policies, and format requirements.

🏢 Model Families (2026 Landscape)

Family	Company	Notable Models
GPT	OpenAI	GPT-5, GPT-5.5, o3, o4-mini[6]
Claude	Anthropic	Claude 3.5 Sonnet, Claude 4[28]
Gemini	Google	Gemini 2.0 Flash, Gemini Ultra 2[28]
Llama	Meta	Llama 4 (10M context, multimodal)[4]
Mistral	Mistral AI	Mistral Large, Codestral, Mixtral MoE
Phi	Microsoft	Phi-4 (small, surprisingly capable)[4]

12. Code Appendix: Build It Yourself

12.1 A Complete RAG + Agent System

"""
Full-stack RAG + Agent system.
Stack: OpenAI GPT-4o + ChromaDB + Function Calling
"""
import os
import json

import chromadb
from openai import OpenAI
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ─── 1. Setup Vector Store ─────────────────────────────────────────────────
ef = OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)
chroma = chromadb.Client()
kb = chroma.get_or_create_collection("knowledge", embedding_function=ef)


def ingest_documents(docs: list[dict]):
    """docs: [{"id": "...", "text": "...", "metadata": {...}}]"""
    kb.upsert(
        ids=[d["id"] for d in docs],
        documents=[d["text"] for d in docs],
        metadatas=[d.get("metadata", {}) for d in docs]
    )


def search_knowledge_base(query: str, n_results: int = 5) -> str:
    results = kb.query(query_texts=[query], n_results=n_results)
    docs = results["documents"][0]
    return "\n\n---\n\n".join(docs) if docs else "No relevant documents found."


# ─── 2. Define Agent Tools ─────────────────────────────────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search internal knowledge base for relevant information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "n_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Python math expression, e.g. '2 ** 10'"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]


def execute_tool(name: str, args: dict) -> str:
    if name == "search_knowledge_base":
        return search_knowledge_base(args["query"], args.get("n_results", 5))
    elif name == "calculate":
        try:
            # NOTE: eval with empty builtins is not a real sandbox.
            # Fine for a demo; do not expose it to untrusted input.
            return str(eval(args["expression"], {"__builtins__": {}}, {}))
        except Exception as e:
            return f"Error: {e}"
    return f"Unknown tool: {name}"


# ─── 3. ReAct Agent Loop ──────────────────────────────────────────────────
SYSTEM = """You are a helpful AI assistant with access to a knowledge base.
Use the search_knowledge_base tool to look up relevant information before answering.
Think through problems step by step. Always cite which documents informed your answer."""


def chat(user_message: str, history: list = None) -> str:
    if history is None:
        history = []
    messages = [{"role": "system", "content": SYSTEM}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)

        # No tool calls = final answer
        if not msg.tool_calls:
            return msg.content

        # Execute each tool call
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = execute_tool(tc.function.name, args)
            print(f"[Tool: {tc.function.name}] → {result[:100]}...")
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result
            })


# ─── Usage ─────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Ingest some documents
    ingest_documents([
        {"id": "1", "text": "MCP (Model Context Protocol) was released by Anthropic in November 2024."},
        {"id": "2", "text": "LangGraph is the #1 ranked AI agent framework for production stateful workflows in 2026."},
        {"id": "3", "text": "GPT-5 was launched on August 7, 2025, achieving 94.6% on advanced math benchmarks."},
    ])
    answer = chat("What agent framework should I use for a production workflow, and when was GPT-5 released?")
    print(f"\nAnswer:\n{answer}")

12.2 Minimal MCP Client

"""
Consuming an MCP server from a Python client.
Requires: pip install mcp anthropic
"""
import asyncio

import anthropic
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def run_with_mcp(user_query: str):
    # Connect to an MCP server (e.g., filesystem, database)
    server_params = StdioServerParameters(
        command="python",
        args=["my_mcp_server.py"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools_response = await session.list_tools()
            tools = [
                {
                    "name": t.name,
                    "description": t.description,
                    "input_schema": t.inputSchema
                }
                for t in tools_response.tools
            ]
            print(f"Available tools: {[t['name'] for t in tools]}")

            # Use Claude with MCP tools
            anthropic_client = anthropic.Anthropic()
            messages = [{"role": "user", "content": user_query}]
            while True:
                response = anthropic_client.messages.create(
                    model="claude-3-5-sonnet-20241022",
                    max_tokens=4096,
                    tools=tools,
                    messages=messages
                )

                # Anything other than a tool request ends the loop,
                # so a max_tokens stop cannot spin forever
                if response.stop_reason != "tool_use":
                    return "".join(
                        block.text for block in response.content
                        if block.type == "text"
                    )

                # Record the assistant turn once, then answer every
                # tool call it contains in a single user message
                messages.append({
                    "role": "assistant",
                    "content": response.content
                })
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        # Execute tool via MCP
                        result = await session.call_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": str(result.content)
                        })
                messages.append({"role": "user", "content": tool_results})


if __name__ == "__main__":
    print(asyncio.run(run_with_mcp(
        "List all files in the current directory and summarize their contents"
    )))

12.3 A2A Agent Discovery & Collaboration

"""
Simplified A2A protocol implementation.
Real A2A uses JSON-RPC 2.0 over HTTP with SSE for streaming.
"""
import asyncio
from dataclasses import dataclass

import httpx


@dataclass
class AgentCard:
    name: str
    endpoint: str
    skills: list[dict]
    version: str = "1.0"


class A2AClient:
    """Discovers and calls remote A2A agents."""

    async def discover_agent(self, agent_url: str) -> AgentCard:
        """Fetch agent card from a known URL."""
        async with httpx.AsyncClient() as client:
            response = await client.get(f"{agent_url}/.well-known/agent.json")
            response.raise_for_status()  # fail loudly on 4xx/5xx
            data = response.json()
            return AgentCard(
                name=data["name"],
                endpoint=data["endpoint"],
                skills=data["skills"],
                version=data.get("version", "1.0")
            )

    async def send_task(self, agent: AgentCard, skill_id: str,
                        message: str) -> str:
        """Send a task to a remote agent and get the result."""
        payload = {
            "jsonrpc": "2.0",
            "method": "tasks/send",
            "id": "req-1",
            "params": {
                "skill": skill_id,
                "message": {"role": "user", "parts": [{"text": message}]}
            }
        }
        async with httpx.AsyncClient() as client:
            response = await client.post(
                agent.endpoint,
                json=payload,
                headers={"Authorization": "Bearer my-token"}
            )
            response.raise_for_status()
            result = response.json()
            return result["result"]["output"]["message"]["parts"][0]["text"]


class OrchestratorAgent:
    """
    An orchestrator that discovers and delegates to specialist agents.
    This is the heart of a multi-agent A2A system.
    """

    def __init__(self):
        self.a2a = A2AClient()
        self.registry: dict[str, AgentCard] = {}

    async def register_agent(self, url: str):
        card = await self.a2a.discover_agent(url)
        self.registry[card.name] = card
        print(f"Registered agent: {card.name} with skills: {[s['id'] for s in card.skills]}")

    async def handle_request(self, user_request: str) -> str:
        # In a real system, an LLM would decide which agent to use
        # Here we do simple keyword routing
        text = user_request.lower()
        if "payment" in text:
            agent_name, skill_id = "payment-agent", "process_payment"
        elif "weather" in text:
            agent_name, skill_id = "weather-agent", "get_forecast"
        else:
            return "I don't have a specialist agent for this request."

        # registry.get returns None if that agent was never registered
        agent = self.registry.get(agent_name)
        if agent is None:
            return f"Agent '{agent_name}' is not registered."
        return await self.a2a.send_task(agent, skill_id, user_request)


async def main():
    orchestrator = OrchestratorAgent()
    # Discover available agents (in production: from a registry service)
    await orchestrator.register_agent("https://payments.example.com")
    await orchestrator.register_agent("https://weather.example.com")
    result = await orchestrator.handle_request(
        "Process a $50 payment for order #1234"
    )
    print(f"Result: {result}")


if __name__ == "__main__":
    asyncio.run(main())

Built with care for the Rawalpindi/Islamabad dev community — and every hacker reading this on a Thursday afternoon. Go build something.

API pagination: cursor vs offset

Thu, 04 Jun 2026 00:00:00 GMT

June 4, 2026

Cursor pagination (recommended)

Stable under inserts/deletes
Uses an opaque cursor (e.g., last seen id + sort key)
Easy to cache and resume

SELECT *
FROM items
WHERE (created_at, id) < (:created_at, :id)
ORDER BY created_at DESC, id DESC
LIMIT 50;
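A runnable sketch of the same keyset query follows, using Python's built-in `sqlite3` and an in-memory table. The schema, the sample rows, and the base64-encoded JSON cursor format are all assumptions made for the demo; production cursors are often signed or encrypted.

```python
import base64
import json
import sqlite3

def encode_cursor(created_at, item_id):
    # Opaque cursor: the sort key of the last row the client saw.
    raw = json.dumps([created_at, item_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor):
    created_at, item_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return created_at, item_id

def fetch_page(conn, cursor=None, limit=2):
    if cursor is None:
        rows = conn.execute(
            "SELECT id, created_at FROM items "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        created_at, item_id = decode_cursor(cursor)
        rows = conn.execute(
            "SELECT id, created_at FROM items "
            "WHERE (created_at, id) < (?, ?) "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (created_at, item_id, limit),
        ).fetchall()
    next_cursor = encode_cursor(rows[-1][1], rows[-1][0]) if rows else None
    return rows, next_cursor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "2026-06-01"), (2, "2026-06-02"), (3, "2026-06-02"), (4, "2026-06-03")],
)
page1, cursor = fetch_page(conn)
# A new row arrives between requests; the second page is unaffected.
conn.execute("INSERT INTO items VALUES (5, '2026-06-04')")
page2, _ = fetch_page(conn, cursor)
```

Including `id` as a tiebreaker matters here: rows 2 and 3 share a `created_at`, and without the second key one of them could be skipped.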

Offset pagination (avoid at scale)

Can skip/duplicate rows when data changes
Gets slower as offset grows
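The duplicate-row problem is easy to reproduce with a plain list standing in for the table. This is a toy simulation with integer ids sorted newest first, not a database query.

```python
def offset_page(rows, offset, limit):
    ordered = sorted(rows, reverse=True)  # newest first
    return ordered[offset:offset + limit]

rows = [1, 2, 3, 4]
page1 = offset_page(rows, offset=0, limit=2)
rows.append(5)  # a new row arrives between the two requests
page2 = offset_page(rows, offset=2, limit=2)
duplicates = set(page1) & set(page2)
```

The insert pushes every row down one slot, so the second page starts with an item the client already saw on the first page.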

If you need “page numbers”, store cursors per page server-side.

TLS in plain English

Thu, 04 Jun 2026 00:00:00 GMT

June 4, 2026

The handshake

Client says: “Here are the cipher suites I support”
Server picks a cipher suite and replies with its certificate (which carries its public key)
Client verifies the certificate chain
They agree on session keys (usually via ECDHE)
After that: traffic is encrypted + authenticated
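From application code, the whole handshake is usually one call. The sketch below uses Python's standard `ssl` module; `make_client_context` and `describe_connection` are illustrative helper names, and the TLS 1.2 floor is an assumed policy choice, not a requirement.

```python
import socket
import ssl

def make_client_context():
    # Default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy
    return context

def describe_connection(host, port=443, timeout=5.0):
    """Run the handshake and report what was negotiated (needs network access)."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return {
                "protocol": tls.version(),
                "cipher": tls.cipher()[0],
                "subject": tls.getpeercert().get("subject"),
            }

context = make_client_context()
```

Calling `describe_connection("example.com")` on a machine with network access returns the negotiated protocol version and cipher suite; a bad certificate raises `ssl.SSLCertVerificationError` during `wrap_socket`.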

What you get

Confidentiality (encryption)
Integrity (tamper detection)
Authenticity (you’re talking to the right server)

Common gotchas

Expired certs
Wrong hostname (SAN mismatch)
Missing intermediate certs
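Expired certs are the easiest of these to catch ahead of time. Here is a small sketch that turns a certificate's `notAfter` string (the format returned by `getpeercert()`) into days remaining; the date used is just a sample value and `days_until_expiry` is a hypothetical helper.

```python
import ssl

def days_until_expiry(not_after, now):
    """Whole days between `now` (epoch seconds) and the cert's notAfter time."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // 86400)

# Sample notAfter value in the format the ssl module reports.
example_not_after = "Jan  5 09:34:43 2018 GMT"
expiry_ts = ssl.cert_time_to_seconds(example_not_after)
remaining = days_until_expiry(example_not_after, now=expiry_ts - 10 * 86400)
overdue = days_until_expiry(example_not_after, now=expiry_ts + 86400)
```

In a monitoring job you would pass `time.time()` as `now` and alert when the result drops below a threshold; a negative value means the cert has already expired.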

Docker layers: 6 rules for smaller images

Thu, 04 Jun 2026 00:00:00 GMT

June 4, 2026

6 rules

Use a small base (alpine/distroless when possible)
Copy only what you need (use .dockerignore)
Install deps before copying app source
Combine commands to reduce layers
Use multi-stage builds
Pin versions so rebuilds stay reproducible

FROM node:24-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM gcr.io/distroless/nodejs24-debian12
COPY --from=build /app/dist /app
CMD ["/app/index.js"]