| Property | Value |
| Source | Data Compression Explained |
| Original author | Matt Mahoney |
| Source last update | Apr. 15, 2013 |
| Draft type | Visual Notion blog post / study guide |
| Audience | Developers, technical writers, ML engineers, compression-curious readers |
| Core idea | Compression is prediction plus coding, with transforms and perception models doing the heavy lifting. |
Source note: This post is an original visual study guide based on Matt Mahoney's book. It paraphrases and organizes the ideas for blog reading. It is not a redistributed copy of the book. Historical benchmark numbers and tool rankings should be read in the context of the source's 2013 update.
The Whole Book in One Picture
flowchart LR
A[Raw data] --> B{Can we expose structure?}
B -->|Yes| C[Transform]
B -->|No| D[Model]
C --> D[Model predicts what comes next]
D --> E[Coder maps probability to bits]
E --> F[Archive or stream]
F --> G[Decoder]
G --> H[Inverse transform]
H --> I[Original data or acceptable approximation]
J[Benchmarks] -. measure .-> C
J -. measure .-> D
J -. measure .-> E
K[Human perception] -. lossy path .-> BCompression looks like file shrinkage, but the book frames it as a deeper engineering problem:
| Layer | Question | Main chapters |
| Theory | What is compressible at all? | 1 |
| Measurement | How do we compare compressors fairly? | 2 |
| Coding | Given probabilities, how many bits are needed? | 3 |
| Modeling | Where do good probabilities come from? | 4 |
| Transforms | How do we rearrange data so simple models work? | 5 |
| Lossy compression | What can we throw away without humans noticing? | 6 |
The shortest honest summary:
Compression is the search for shorter descriptions. Coding is mostly solved. Modeling is the hard part. Transforms make modeling easier. Lossy compression adds a model of human perception.
Fast Mental Models
1. Bits Measure Surprise
If an event has probability p, the ideal code length is:
ideal bits = log2(1 / p) = -log2(p)| Probability | Surprise | Intuition |
1/2 | 1 bit | A fair yes/no question |
1/4 | 2 bits | One outcome among four equal choices |
1/256 | 8 bits | One byte under a uniform byte model |
| Near 1 | Near 0 bits | Almost expected |
| Near 0 | Many bits | Very surprising |
Visual rule:
common symbol -> short code
rare symbol -> long code
unknown pattern -> expensive code
understood pattern -> tiny description2. Compression Is Prediction
flowchart TD
H[History already decoded] --> M[Model]
M --> P[Probability for next bit or symbol]
P --> C[Coder]
C --> O[Compressed output]
O --> D[Decoder repeats same model]
D --> H2[Recovered next bit or symbol]The compressor and decompressor must make the same predictions from the same history. The compressed file mainly stores the information that the model could not predict.
3. Lossy Compression Is Perception-Aware Prediction
Lossless: recover exactly the same bits.
Lossy: recover something humans judge close enough.That one change moves the problem from pure information theory into psychology, vision, hearing, language, and AI.
Chapter Map
| Chapter | Visual handle | What it teaches |
| 1. Information Theory | Limits map | Why random data cannot be compressed and why modeling matters more than coding. |
| 2. Benchmarks | Tradeoff dashboard | How size, speed, memory, data set choice, and rules change compressor rankings. |
| 3. Coding | Probability-to-bits machine | Huffman, arithmetic coding, asymmetric coding, numeric codes, archives, checksums, encryption. |
| 4. Modeling | Prediction engine | Fixed-order models, variable-order models, context mixing, PAQ, ZPAQ, and why modeling is hard. |
| 5. Transforms | Pattern-exposure tools | RLE, LZ77, LZW, BWT, filters, executable transforms, precompression. |
| 6. Lossy Compression | Human sensor model | Images, video, audio, JPEG, MPEG, psychoacoustics, and recompression. |
1. Information Theory
Compression Starts with a Count
There are 2^n different binary strings of length n. There are fewer than 2^n shorter binary strings. Therefore, no lossless compressor can make every n-bit input shorter while still allowing perfect decompression.
All n-bit inputs:
[000...000] [000...001] [000...010] ... [111...111]
count = 2^n
Possible shorter outputs:
[] [0] [1] [00] [01] ... [length < n]
count = 2^n - 1
One-to-one decoding cannot map more inputs into fewer outputs.The key result:
| Claim | Meaning |
| No universal compression | A compressor that shrinks every file cannot exist. |
| Some files must expand | If a compressor shrinks some inputs, it must make other inputs longer or refuse them. |
| Random-looking data is usually incompressible | Most possible strings have no shorter description. |
| Useful data is often compressible | Human-created data usually has patterns, constraints, formats, repetition, and meaning. |
Why Meaningful Data Compresses
Most possible strings are random. Most strings people store are not:
| Data | Why it has structure |
| English text | Grammar, vocabulary, topic, repeated words, spelling patterns |
| Source code | Keywords, syntax, indentation, identifiers, libraries |
| Images | Neighboring pixels are correlated |
| Audio | Samples are correlated over time and filtered by human hearing |
| Executables | Instructions, addresses, headers, imported symbols |
| Backups | Files repeat across versions and machines |
Compression works because our data is not drawn uniformly from all possible bit strings. It comes from processes with structure.
Coding Is Bounded
If a model says a symbol has probability p, the best possible code length is approximately -log2(p) bits. You can choose a bad coder and waste bits, but no coder can beat the model's information content for all data drawn from that model.
flowchart LR
A[Model says symbol is likely] --> B[Short code]
C[Model says symbol is rare] --> D[Long code]
E[Model is wrong] --> F[Compressed size penalty]The lesson is subtle:
| Part | Status |
| Turning probabilities into bits | Efficient, well-understood |
| Finding the right probabilities | Hard, open-ended, data-dependent |
Modeling Is Not Computable
A better model can turn a long string into a tiny description. For example, a million digits of pi can be treated as random-looking decimal digits, or as "compute the first million digits of pi." The second description is dramatically shorter, but it requires recognizing the source.
weak model:
314159265358979323846...
-> "digits look independent"
-> many bits
strong model:
314159265358979323846...
-> "this is pi"
-> short program or descriptionThe book connects this to Kolmogorov complexity: the shortest program that outputs a string is an ideal compressed representation, but there is no general algorithm that can always find it.
Compression and AI
Prediction is a sign of understanding:
| If a system understands... | It can predict... |
| English | likely next words |
| Images | likely neighboring pixels |
| Audio | likely future samples |
| Code | likely syntax and instruction patterns |
| File formats | likely headers, fields, and constraints |
This is why compression and AI meet. A compressor that understands a data source can describe it more compactly. A perfect general compressor would need a very broad kind of understanding.
Chapter 1 Takeaways
| Takeaway | Why it matters |
| Random data cannot be compressed | Do not expect magic from encrypted, already-compressed, or random data. |
| Compression is model plus coder | Separate probability estimation from bit representation. |
| Coding has mathematical limits | Better coding helps, but only up to the model's quality. |
| Modeling is the hard problem | Better compression usually comes from better prediction. |
| Understanding creates compression | The more structure you can exploit, the shorter the description. |
2. Benchmarks
What Benchmarks Actually Measure
Compression benchmarks compare compressors on a chosen data set under chosen rules.
flowchart TD
A[Benchmark] --> B[Data set]
A --> C[Rules]
A --> D[Metrics]
D --> E[Compressed size]
D --> F[Compression speed]
D --> G[Decompression speed]
D --> H[Memory use]
C --> I[Can include decompressor?]
C --> J[Can tune to files?]
C --> K[Single file or archive?]The big triangle:
smaller output
/\
/ \
/ \
/ \
/ \
less memory -------- faster speedYou usually cannot optimize all three at once. Maximum compression tools are often slow and memory-hungry. Practical formats often give up ratio to win speed, streaming, random access, or compatibility.
Bits Per Character
The book often uses bpc, or bits per character, for byte-oriented corpora.
| bpc | Meaning |
| 8.0 | No compression for byte data |
| 6.0 | 25 percent smaller than original |
| 4.0 | Half the original size |
| 2.0 | One quarter of original size |
| Lower | Better compression, assuming the same input data |
Benchmark Landscape
| Benchmark | What it emphasizes | Why it matters |
| Calgary Corpus | Classic mixed small files | Historical baseline for text compression research. |
| Large Text Compression Benchmark | Large Wikipedia XML text | Natural language modeling and long-range structure. |
| Hutter Prize | Compression as AI research | Rewards improvements on a fixed text corpus with decompressor included. |
| Maximum Compression | Maximum ratio on mixed files | Encourages aggressive tuning for size. |
| Generic Compression Benchmark | Untuned universal prediction | Tests generality rather than file-type tricks. |
| Compression Ratings | Size and speed scoring | Makes tradeoffs adjustable by user preference. |
| Other public benchmarks | Multiple corpora and rule sets | Shows that rankings depend on test design. |
| File system studies | Real-world storage mix | Reveals what data actually exists on machines. |
Why Rankings Shift
Two compressors can trade places when any of these changes:
| Variable | Effect |
| Data type | Text, images, executables, backups, logs, and audio favor different methods. |
| File size | Small files make headers and model startup costs visible. |
| Archive rules | Solid archives can exploit similarity across files. |
| Decompressor inclusion | Including source or executable rewards simpler decoders. |
| Memory limit | Large models can dominate if memory is unrestricted. |
| Speed limit | Slow context mixers may lose to practical LZ-family tools. |
| Tuning policy | Per-file options can inflate benchmark-specific performance. |
Visual: Benchmark as a Dashboard
+--------------------------------------------------+
| Compressor: example |
+-------------------+------------------------------+
| Size | 1.95 bpc |
| Compression time | slow |
| Decompression | medium |
| Memory | high |
| Decoder included | yes |
| Data set | text-heavy |
| Good use case | archival / research |
+-------------------+------------------------------+Chapter 2 Takeaways
| Takeaway | Why it matters |
| Benchmarks are not neutral | They encode assumptions about data and priorities. |
| Size is only one metric | Real systems care about speed, memory, streaming, and compatibility. |
| Historical leaderboards age | Use the book's rankings as context, not current product advice. |
| Data set choice dominates | A compressor can look brilliant on one corpus and ordinary on another. |
| A benchmark is a contract | Read the rules before interpreting the chart. |
3. Coding
Coder Job Description
A coder receives probabilities from a model and emits bits close to the theoretical ideal.
flowchart LR
A[Symbol] --> B[Model probability]
B --> C[Coder]
C --> D[Bitstream]
D --> E[Decoder]
E --> F[Same symbol]The coder must be:
| Requirement | Reason |
| Decodable | The original symbols must be recoverable. |
| Efficient | Common symbols should use fewer bits. |
| Synchronized | Decoder must reproduce the same boundaries and model states. |
| Practical | Real files need headers, error checks, and sometimes encryption. |
Huffman Coding
Huffman coding builds a prefix tree. Frequent symbols sit near the root. Rare symbols sit deeper.
root
/ \
common *
/ \
medium rareCore idea:
| Symbol probability | Huffman effect |
| High | Shorter integer number of bits |
| Low | Longer integer number of bits |
| Exact powers of 1/2 | Very efficient |
| Awkward probabilities | Wastes some space due to whole-bit code lengths |
Strengths:
- Simple.
- Fast.
- Widely used.
- Good with static or block models.
Limits:
- Code lengths are whole numbers of bits.
- Binary alphabets cannot be compressed by basic Huffman alone.
- A full table or canonical description may need to be stored.
Arithmetic Coding
Arithmetic coding represents an entire message as a subinterval inside [0, 1).
Initial interval:
[0 ------------------------------------------------ 1)
After likely symbol:
[0 -------- 0.7)
After next symbol:
[0.28 --- 0.42)
After more symbols:
[0.314159 ----------------)
Output a binary number inside the final interval.Why it matters:
| Feature | Benefit |
| Fractional bit efficiency | Avoids Huffman's whole-bit rounding. |
| Works well for binary prediction | Ideal for bitwise models. |
| Adapts naturally | Model can update after every symbol. |
| Near-Shannon performance | Strong practical coding method. |
Asymmetric Binary Coding
Asymmetric binary coding is another way to code predicted bits efficiently. It uses a single integer-like state rather than arithmetic coding's interval endpoints. It matters because the same theory can be implemented with different machine-level tradeoffs.
| Coding family | Mental model |
| Huffman | Tree of prefix codes |
| Arithmetic/range | Shrinking probability interval |
| Asymmetric binary | State machine that packs bits according to probability |
Numeric Codes
Some values are not arbitrary symbols. They are counts, offsets, lengths, or prediction errors. Numeric codes exploit common number distributions.
| Code | Good for | Visual shape |
| Unary | Very small positive integers | 0, 10, 110, 1110 |
| Rice | Geometric-like distributions with power-of-two parameter | quotient plus remainder |
| Golomb | Geometric-like distributions with flexible parameter | quotient plus bounded remainder |
| Extra-bit codes | Ranges with extra low bits | length classes and offset details |
These appear in systems where the model says "small numbers are common, large numbers are rare."
Archive Formats
A compression algorithm is not the whole file format. Archives also need structure.
flowchart LR
A[Archive header] --> B[File metadata]
B --> C[Compressed payload]
C --> D[Error check]
D --> E[Optional encryption metadata]Important archive concerns:
| Concern | Why it matters |
| Single-file vs multi-file | Affects metadata and file recovery. |
| Solid compression | Similar files compressed together can shrink more. |
| Random access | Solid archives may make one-file extraction slower. |
| Error detection | Detects corruption after storage or transmission. |
| Encryption | Protects confidentiality but makes data look random afterward. |
Error Detection
| Method | What it catches |
| Parity | Simple odd/even bit errors, weak but cheap. |
| CRC-32 | Common accidental corruption check. |
| Adler-32 | Fast checksum used in some compression contexts. |
| Cryptographic hash | Strong integrity identity, designed against adversarial collisions. |
Error detection is not compression, but production archives need it.
Chapter 3 Takeaways
| Takeaway | Why it matters |
| Coding maps probability to bits | It is the final packing step. |
| Huffman is simple but rounded | Whole-bit lengths are a real limitation. |
| Arithmetic coding is closer to ideal | It fits adaptive and binary models well. |
| Numeric codes encode structured integers | They are useful for lengths, offsets, runs, and errors. |
| Archives are systems | Metadata, checks, encryption, and extraction rules matter. |
4. Modeling
The Hard Part
A model estimates what comes next. Once we have a good probability, coding is mechanical. The model decides whether compression is mediocre or excellent.
history: "the quick brown "
model: next symbol is likely "f", "d", "c", ...
coder: short code for likely next symbolStatic vs adaptive:
| Model type | How it works | Tradeoff |
| Static | Analyze data, send model, then coded data | Good if model cost is small and data is stable. |
| Adaptive | Update model as data is read | Avoids sending full model, tracks local changes. |
Fixed-Order Models
An order n model predicts the next symbol from the previous n symbols.
Order 0: no context
P(next)
Order 1: one previous symbol
P(next | previous)
Order 3: three previous symbols
P(next | previous_3, previous_2, previous_1)Example:
| Context | Next-symbol table |
q | u is very likely in English |
th | e, a, i, o are plausible |
ing | space, punctuation, or suffix continuation |
Why fixed order breaks:
| Order too low | Misses useful context. |
| Order too high | Most contexts are rare or unseen. |
| Result | Need smoothing, fallback, or variable order. |
Bytewise, Bitwise, and Indirect Models
| Model style | Unit | Useful when |
| Bytewise | Predict next byte | Text, simple binary data, byte-aligned formats. |
| Bitwise | Predict next bit | Arithmetic coding, mixed binary patterns, precise probability updates. |
| Indirect | Use hashed or transformed contexts | Large context spaces where full tables are too costly. |
Variable-Order Models
Variable-order models keep statistics for multiple context lengths and choose or mix them.
flowchart TD
A[Long context] --> B{Seen enough?}
B -->|Yes| C[Use long-context prediction]
B -->|No| D[Back off]
D --> E[Medium context]
E --> F{Seen enough?}
F -->|Yes| G[Use medium-context prediction]
F -->|No| H[Short context]DMC
Dynamic Markov Coding predicts bits with a state machine that grows as it observes data. It can split states when histories diverge.
Visual:
state A --0--> state B
state A --1--> state C
if state A is too vague:
state A becomes A1 and A2PPM
Prediction by Partial Matching uses byte contexts and backs off from longer to shorter contexts. It is a classic text-compression idea.
Try context "tion"
if unknown, try "ion"
if unknown, try "on"
if unknown, try "n"
if unknown, try order 0The hard detail is handling symbols that have not appeared in a context, often called the escape or zero-frequency problem.
CTW
Context Tree Weighting mixes context tree predictions in a principled bitwise way. Instead of choosing only one context, it combines evidence across a tree.
Context Mixing
Context mixing uses many predictors at once.
flowchart LR
A[Text model] --> M[Mixer]
B[Match model] --> M
C[Low-order model] --> M
D[File-format model] --> M
E[Image/audio heuristic] --> M
M --> P[Final probability]
P --> Coder[Arithmetic coder]The mixer can learn which predictors are useful in each situation.
| Component | Role |
| Linear evidence mixing | Combine model outputs with weighted evidence. |
| Logistic mixing | Mix in probability/logit space for better behavior. |
| SSE | Secondary Symbol Estimation adjusts predictions using past calibration. |
| ISSE | Indirect SSE uses contexts to select or adapt estimators. |
| Match model | If current data matches earlier data, predict continuation from the match. |
| PAQ models | High-compression family using context mixing. |
| ZPAQ | A more configurable/archive-oriented successor in the PAQ family. |
| Crinkler | Specialized compression/linking for executable code size competitions. |
Why Context Mixing Can Beat Single Models
Different predictors notice different kinds of structure:
| Predictor | Notices |
| Low-order byte model | Local byte frequencies |
| Word model | Language-level repetition |
| Match model | Exact repeated substrings |
| Image model | Neighboring pixel relationships |
| Executable model | Instruction and address patterns |
| XML model | Tags, attributes, markup rhythm |
Compression improves when the mixer learns which predictor is trustworthy right now.
Chapter 4 Takeaways
| Takeaway | Why it matters |
| Modeling is prediction | The compressed file stores surprises. |
| Fixed order is simple but brittle | Context length must match the data. |
| Variable order handles sparse contexts | Backoff avoids overconfidence in rare histories. |
| Context mixing is powerful | Many weak specialized models can beat one general model. |
| Better modeling can look like understanding | Language, images, code, and formats all reward domain knowledge. |
5. Transforms
What a Transform Does
A transform rewrites data so a simpler model can compress it.
flowchart LR
A[Original data] --> B[Transform]
B --> C[More model-friendly symbols]
C --> D[Model]
D --> E[Coder]
E --> F[Compressed file]
F --> G[Decoder]
G --> H[Inverse transform]
H --> I[Original data]Ideal transform:
| Property | Meaning |
| Reversible | Decompression gets the original back exactly. |
| Structure exposing | Repetition, locality, or predictable errors become obvious. |
| Cheap enough | Transform cost must be worth the compression gain. |
| Canonical when possible | Avoid arbitrary choices that add information burden. |
Run Length Encoding
RLE replaces repeated symbols with a symbol plus a count.
AAAAAABBBBCCCCCCCC
becomes
(A,6) (B,4) (C,8)Best for:
| Good | Bad |
| Long repeated runs | Alternating symbols |
| Simple image masks | Natural text |
| Zero-filled data | Already transformed data with few runs |
LZ77 and the Match Family
LZ77 replaces repeated strings with pointers to previous occurrences.
Input:
ABRACADABRA
Later "ABRA" can become:
(go back 7, copy 4)Visual:
sliding history buffer lookahead
[ABRACAD] [ABRA...]
^^^^^
match reused by pointerWhy it is popular:
| Strength | Explanation |
| Fast decompression | Decoder mostly copies bytes from earlier output. |
| General-purpose | Works on many repeated byte patterns. |
| Streaming-friendly | Can run with bounded windows. |
| Foundation format | Deflate, LZMA-like families, and many practical tools build on the idea. |
LZSS
LZSS improves practical LZ77 by only emitting pointers when they save space. Short non-saving matches remain literals.
Deflate
Deflate combines LZ77-style matches with Huffman coding. It powers common zip/gzip-style compression and survives because it is fast, widely implemented, and compatible.
flowchart LR
A[Bytes] --> B[LZ77 literals and matches]
B --> C[Huffman coding]
C --> D[Deflate bitstream]LZMA
LZMA pushes stronger modeling around LZ-style matches, often improving ratio at the cost of more CPU and memory.
LZX, ROLZ, LZP, Snappy, Deduplication
| Method | Main idea | Design center |
| LZX | LZ-family compression used in Microsoft contexts | Practical binary/archive compression. |
| ROLZ | Restricts match search by recent contexts | Better match relevance. |
| LZP | Predicts repeated strings from context | Fast prediction of matches. |
| Snappy | Prioritizes very high speed | Low latency over maximum ratio. |
| Deduplication | Replaces repeated chunks across files or systems | Backup and storage efficiency. |
LZW and Dictionary Encoding
Dictionary methods replace strings with dictionary references.
dictionary:
1 -> the
2 -> compression
3 -> model
text:
the compression model
encoded:
1 2 3Dictionary types:
| Type | Dictionary source |
| Fixed | Built into the format or algorithm. |
| Static | Learned from the file and stored with it. |
| Dynamic | Built by compressor and decompressor in lockstep. |
LZW is a dynamic dictionary method historically associated with formats like GIF-era compression. Its broader lesson is that both sides can build the same dictionary without transmitting every entry.
Dictionary Encoding for Text
Text-specific dictionaries can model:
| Feature | Compression opportunity |
| Words | Common words become compact tokens. |
| Capitalization | Store word identity separately from case pattern. |
| Newlines | Model paragraph and line structure. |
| Punctuation | Predict separators and syntax. |
| Word endings | Use morphology and repeated suffixes. |
The stronger the text model, the more the compressor behaves like a small language-aware system.
Symbol Ranking and Move-to-Front
Move-to-front keeps a list of symbols ordered by recency. If the same few symbols keep appearing in a context, their ranks stay small.
alphabet list:
[A B C D E ...]
read C -> output rank 2, move C to front
[C A B D E ...]
read C again -> output rank 0This works well after transforms like BWT, where local neighborhoods tend to reuse a small set of symbols.
Burrows-Wheeler Transform
BWT sorts rotations of a block so characters with similar right contexts cluster together. After BWT, a fast local model can often compress well.
Tiny example, conceptually:
Original block:
banana$
Sort rotations:
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
Take last column:
annb$aaThe output is not obviously shorter, but it groups context-related symbols. It is usually followed by move-to-front, run-length coding, and entropy coding.
flowchart LR
A[Input block] --> B[BWT context sort]
B --> C[Move-to-front]
C --> D[Run-length coding]
D --> E[Huffman or arithmetic coding]Predictive Filtering
Numeric data often compresses better as prediction errors.
samples:
100, 103, 105, 106, 108
predict next as previous:
100, +3, +2, +1, +2Small errors are easier to encode than raw values.
| Filter | Used for |
| Delta coding | Signals, images, ordered numeric data |
| Color transform | Separating brightness from color differences |
| Linear filtering | Predicting from neighboring samples or pixels |
Specialized Transforms
| Transform | Target | Idea |
| E8E9 | x86 executable code | Normalize relative call/jump addresses so repeated code patterns match better. |
| Precomp | Already-compressed embedded data | Detect and temporarily expand compressed streams inside files so outer compression can work. |
| Huffman pre-coding | Context-mixing speed | Reduce input size before expensive modeling. |
Chapter 5 Takeaways
| Takeaway | Why it matters |
| Transforms do not finish compression | They prepare data for modeling and coding. |
| LZ methods exploit repeated strings | This explains many practical formats. |
| BWT exploits sorted contexts | It turns context into local symbol clustering. |
| Filters exploit smooth numeric data | Prediction errors are often small. |
| Specialized transforms exploit file knowledge | Better compression often comes from knowing the data's format. |
6. Lossy Compression
The Big Shift
Lossless compression asks:
Can we reproduce the original bits exactly?Lossy compression asks:
Can we reproduce something humans accept as the same?That makes perception the model.
flowchart LR
A[Original signal] --> B[Human perception model]
B --> C[Discard hard-to-notice detail]
C --> D[Quantize]
D --> E[Lossless coding of remaining data]
E --> F[Compressed media]Images
Digital images are already approximations of continuous light. Lossy image compression removes detail that the visual system is less sensitive to.
| Human visual fact | Compression use |
| Limited spatial resolution | Do not store invisible fine detail. |
| Limited brightness precision | Quantize small intensity differences. |
| Color sensitivity differs from brightness sensitivity | Store chroma at lower precision than luma. |
| Local smoothness is common | Predict pixels from neighbors or transform blocks. |
Image Format Map
| Format/topic | Role in the chapter |
| BMP | Mostly raw pixels, but still an approximation of continuous light. |
| GIF | Palette-based images and simple animation history. |
| PNG | Lossless image compression with filtering. |
| TIFF | Flexible container used in imaging workflows. |
| JPEG | Transform, quantization, and entropy coding for photos. |
| JPEG recompression | Attempts to compress JPEGs further without fully losing practical recoverability. |
JPEG as a Visual Pipeline
flowchart LR
A[RGB pixels] --> B[Color transform]
B --> C[Chroma subsampling]
C --> D[8x8 blocks]
D --> E[DCT frequency transform]
E --> F[Quantization]
F --> G[Zigzag ordering]
G --> H[Run-length and entropy coding]
H --> I[JPEG file]Mental picture:
left side of block frequency table = broad smooth shape
right side of table = fine detail
JPEG keeps more of the left side and throws away more of the right side.Why artifacts happen:
| Artifact | Cause |
| Blocking | Independent 8x8 block decisions become visible. |
| Ringing | Lost high-frequency detail near sharp edges. |
| Color bleeding | Reduced chroma detail. |
| Generational loss | Repeated decode/re-encode compounds quantization. |
JPEG Recompression
JPEG files are already compressed. Recompressors look for remaining structure:
| Strategy | What it tries to exploit |
| Better entropy coding | JPEG's stored coefficients may still be coded more compactly. |
| Coefficient modeling | Predict patterns in quantized DCT coefficients. |
| Metadata cleanup | Remove or compact non-image payload. |
| Specialized decoding knowledge | Preserve reconstructable JPEG details while storing them differently. |
The chapter surveys historical approaches such as Stuffit, PAQ-based methods, WinZip behavior, and PackJPG. Treat the named results as historical context.
Video
Video compression adds time.
flowchart LR
A[Frame 1] --> B[Predict frame 2 from frame 1]
B --> C[Encode motion]
C --> D[Encode residual error]
D --> E[Repeat across frames]Why video compresses:
| Structure | Compression opportunity |
| Adjacent frames are similar | Store changes instead of full frames. |
| Objects move | Motion vectors describe block movement. |
| Human vision tolerates some error | Quantize residuals. |
| Scenes contain spatial redundancy | Use image-like compression inside frames. |
NTSC and MPEG
| Topic | Key idea |
| NTSC | Broadcast video is already shaped by human vision, refresh, interlacing, and color compromises. |
| MPEG | Modern-style video coding predicts frames from other frames, stores motion, quantizes transforms, and entropy-codes the result. |
Frame types as a mental model:
I-frame: self-contained image
P-frame: predicted from previous frames
B-frame: predicted from past and future reference framesTradeoff:
| More prediction | Better compression, more complexity, more latency |
| Less prediction | Easier seeking and editing, larger files |
Audio
Audio compression uses psychoacoustics: what the ear can and cannot notice.
| Hearing fact | Compression use |
| Limited frequency range | Do not store inaudible frequencies. |
| Sensitivity varies by frequency | Allocate bits where hearing is sharpest. |
| Loud sounds mask nearby quiet sounds | Remove masked components. |
| Perceived loudness is logarithmic | Quantization can follow perception. |
| Time masking exists | Sounds can hide nearby sounds in time. |
Audio pipeline:
flowchart LR
A[PCM samples] --> B[Frequency analysis]
B --> C[Psychoacoustic masking model]
C --> D[Quantization and bit allocation]
D --> E[Entropy coding]
E --> F[Compressed audio]Chapter 6 Takeaways
| Takeaway | Why it matters |
| Lossy compression discards information | The hard part is choosing information humans will not miss. |
| Media compression is perceptual modeling | Vision and hearing are part of the algorithm. |
| Transform plus quantization is central | Especially for JPEG-like and audio/video systems. |
| Recompression is difficult | Already-compressed data has little easy redundancy left. |
| Perfect lossy compression would require understanding | A movie could theoretically be summarized semantically, but practical systems are far from that. |
The Grand Unifying Model
flowchart TD
A[Data source] --> B{Lossless or lossy?}
B -->|Lossless| C[Preserve every bit]
B -->|Lossy| D[Preserve perceptual meaning]
C --> E{Transform useful?}
D --> F[Perception transform and quantization]
F --> E
E -->|Yes| G[Expose patterns]
E -->|No| H[Model directly]
G --> H[Predict next symbol or bit]
H --> I[Code using probability]
I --> J[Package with metadata, checks, maybe encryption]Every concrete compressor can be placed in this frame:
| Compressor family | Transform | Model | Coder |
| gzip/deflate | LZ77 matches | Huffman-coded literals/lengths | Huffman |
| bzip2-style | BWT, MTF, RLE | Local symbol frequencies | Huffman |
| PAQ-style | Often file-aware contexts | Context mixing | Arithmetic/range-like |
| PNG | Image filters | Deflate model | Huffman via deflate |
| JPEG | DCT and quantization | Coefficient/statistical coding | Entropy coding |
| MPEG-like video | Motion prediction and transforms | Residual and motion models | Entropy coding |
| MP3/AAC-like audio | Frequency analysis and masking | Psychoacoustic bit allocation | Entropy coding |
Practical Reading Paths
If You Want to Build a Compressor
- Understand Chapter 1 so you stop expecting impossible wins.
- Pick a benchmark from Chapter 2 that matches your target data.
- Implement a simple coder from Chapter 3 or reuse a known one.
- Start with a simple adaptive model from Chapter 4.
- Add one transform from Chapter 5 only when it exposes a pattern you can explain.
- For media, study Chapter 6 before inventing quality knobs.
If You Want to Choose a Format
| Need | Prefer |
| Speed and compatibility | Deflate/gzip/zip-style tools |
| Archival ratio | Stronger LZMA/context-mixing tools, if time is acceptable |
| Text research | PPM/context-mixing/modern language-model-aware approaches |
| Backups | Deduplication plus compression |
| Photos | JPEG-like formats or newer perceptual image codecs |
| Screenshots/graphics | PNG-like lossless image compression |
| Audio/video distribution | Perceptual audio/video codecs |
If You Want to Understand AI Through Compression
Read Chapter 1 and Chapter 4 together. The essential loop is:
understand pattern -> predict better -> encode surprise only -> shorter fileCompression is not just storage optimization. It is a measurable way to ask how much structure a system has discovered.
Cheat Sheets
Glossary
| Term | Short definition |
| Lossless | Decompression recovers the exact original data. |
| Lossy | Decompression recovers an acceptable approximation. |
| Model | Probability estimator for upcoming symbols or bits. |
| Coder | Converts model probabilities into a bitstream. |
| Transform | Rewrites data to expose compressible structure. |
| Entropy | Expected information content under a probability model. |
| Context | Previously seen data used to predict the next symbol. |
| Adaptive model | Updates as data is processed. |
| Static model | Sent or fixed before coding the payload. |
| Solid archive | Compresses multiple files together to exploit shared structure. |
| BWT | Context-sorting transform that clusters similar-symbol contexts. |
| Quantization | Reducing precision, usually the irreversible part of lossy coding. |
| Psychoacoustics | Modeling what humans can hear. |
Algorithm Selection Sketch
Mostly repeated bytes?
-> LZ-style match coding
Mostly text?
-> PPM, context mixing, dictionary transforms, or modern language-aware modeling
Mostly smooth numeric samples?
-> predictive filtering plus entropy coding
Mostly photos?
-> perceptual image codec
Mostly backups or VM images?
-> deduplication plus compression
Already encrypted or compressed?
-> expect little or no gainRed Flags When Evaluating Compression Claims
| Claim | Skeptical question |
| Compresses every file | How does it avoid the counting argument? |
| Recompresses compressed data repeatedly | Where does the extra information go? |
| Beats all compressors | On which benchmark and rules? |
| No quality loss in lossy mode | What exact metric or human test supports that? |
| Tiny output with universal recovery | Is the decompressor/model included in the accounting? |
Closing
The book's central message is practical and philosophical at the same time:
To compress data, find structure. To find structure, predict. To predict well, understand the source.
That is why the same field contains Huffman trees, probability intervals, dictionaries, suffix sorting, image transforms, psychoacoustics, benchmark politics, and AI. They are all different ways of answering one question:
What is the shortest description that still lets us recover what matters?Source and Further Reading
- Matt Mahoney, Data Compression Explained, last updated Apr. 15, 2013.
- For current tool rankings, consult up-to-date benchmark leaderboards directly; the rankings in the source are historical.
- For implementation practice, start with a simple RLE or Huffman coder, then build toward adaptive modeling, LZ-style matching, or arithmetic/range coding.