---
title: "Data Compression Explained: A Visual Guide to the Whole Book"
slug: data-compression-explained-visual-guide
canonical_url: https://oreoro.github.io/posts/data-compression-explained-visual-guide/
collection: "Personal Notes"
published_at: 2026-06-20T00:00:00.000Z
updated_at: 2026-06-20T00:00:00.000Z
tags: 
  - Information
  - Guide
  - Webtrotion
excerpt: "An illustrated whole-book study guide to Matt Mahoney's Data Compression Explained, covering information theory, benchmarks, coding, modeling, transforms, and lossy media compression."
author: NMC
---

## Navigation Context

- Canonical URL: https://oreoro.github.io/posts/data-compression-explained-visual-guide/
- You are here: Home > Posts > Personal Notes > Data Compression Explained: A Visual Guide to the Whole Book

### Useful Next Links
- [Home](https://oreoro.github.io/)
- [Contact](https://oreoro.github.io/contact/)
- [Donate](https://oreoro.github.io/donate/)
- [Personal Notes](https://oreoro.github.io/collections/personal-notes/)

<table class="no-column-header"><thead></thead><tbody><tr><td>Property</td><td>Value</td></tr><tr><td>Source</td><td><a href="https://mattmahoney.net/dc/dce.html" class="site-page-link text-link decoration-solid" target="_blank" rel="noopener noreferrer">Data Compression Explained</a></td></tr><tr><td>Original author</td><td>Matt Mahoney</td></tr><tr><td>Source last update</td><td>Apr. 15, 2013</td></tr><tr><td>Draft type</td><td>Visual Notion blog post / study guide</td></tr><tr><td>Audience</td><td>Developers, technical writers, ML engineers, compression-curious readers</td></tr><tr><td>Core idea</td><td>Compression is prediction plus coding, with transforms and perception models doing the heavy lifting.</td></tr></tbody></table>

> Source note: This post is an original visual study guide based on Matt Mahoney's book. It paraphrases and organizes the ideas for blog reading. It is not a redistributed copy of the book. Historical benchmark numbers and tool rankings should be read in the context of the source's 2013 update.

### The Whole Book in One Picture

flowchart LR
    A\[Raw data\] --> B{Can we expose structure?}
    B -->|Yes| C\[Transform\]
    B -->|No| D\[Model\]
    C --> D\[Model predicts what comes next\]
    D --> E\[Coder maps probability to bits\]
    E --> F\[Archive or stream\]
    F --> G\[Decoder\]
    G --> H\[Inverse transform\]
    H --> I\[Original data or acceptable approximation\]

    J\[Benchmarks\] -. measure .-> C
    J -. measure .-> D
    J -. measure .-> E
    K\[Human perception\] -. lossy path .-> B

Compression looks like file shrinkage, but the book frames it as a deeper engineering problem:

<table class="no-column-header"><thead></thead><tbody><tr><td>Layer</td><td>Question</td><td>Main chapters</td></tr><tr><td>Theory</td><td>What is compressible at all?</td><td>1</td></tr><tr><td>Measurement</td><td>How do we compare compressors fairly?</td><td>2</td></tr><tr><td>Coding</td><td>Given probabilities, how many bits are needed?</td><td>3</td></tr><tr><td>Modeling</td><td>Where do good probabilities come from?</td><td>4</td></tr><tr><td>Transforms</td><td>How do we rearrange data so simple models work?</td><td>5</td></tr><tr><td>Lossy compression</td><td>What can we throw away without humans noticing?</td><td>6</td></tr></tbody></table>

The shortest honest summary:

> Compression is the search for shorter descriptions. Coding is mostly solved. Modeling is the hard part. Transforms make modeling easier. Lossy compression adds a model of human perception.

### Fast Mental Models

#### 1\. Bits Measure Surprise

If an event has probability `p`, the ideal code length is:

```
ideal bits = log2(1 / p) = -log2(p)
```

<table class="no-column-header"><thead></thead><tbody><tr><td>Probability</td><td>Surprise</td><td>Intuition</td></tr><tr><td><code class="annotation-code bg-default text-default">1/2</code></td><td>1 bit</td><td>A fair yes/no question</td></tr><tr><td><code class="annotation-code bg-default text-default">1/4</code></td><td>2 bits</td><td>One outcome among four equal choices</td></tr><tr><td><code class="annotation-code bg-default text-default">1/256</code></td><td>8 bits</td><td>One byte under a uniform byte model</td></tr><tr><td>Near 1</td><td>Near 0 bits</td><td>Almost expected</td></tr><tr><td>Near 0</td><td>Many bits</td><td>Very surprising</td></tr></tbody></table>

Visual rule:

```
common symbol     -> short code
rare symbol       -> long code
unknown pattern   -> expensive code
understood pattern -> tiny description
```

#### 2\. Compression Is Prediction

flowchart TD
    H\[History already decoded\] --> M\[Model\]
    M --> P\[Probability for next bit or symbol\]
    P --> C\[Coder\]
    C --> O\[Compressed output\]
    O --> D\[Decoder repeats same model\]
    D --> H2\[Recovered next bit or symbol\]

The compressor and decompressor must make the same predictions from the same history. The compressed file mainly stores the information that the model could not predict.

#### 3\. Lossy Compression Is Perception-Aware Prediction

```
Lossless: recover exactly the same bits.
Lossy: recover something humans judge close enough.
```

That one change moves the problem from pure information theory into psychology, vision, hearing, language, and AI.

### Chapter Map

<table class="no-column-header"><thead></thead><tbody><tr><td>Chapter</td><td>Visual handle</td><td>What it teaches</td></tr><tr><td>1. Information Theory</td><td>Limits map</td><td>Why random data cannot be compressed and why modeling matters more than coding.</td></tr><tr><td>2. Benchmarks</td><td>Tradeoff dashboard</td><td>How size, speed, memory, data set choice, and rules change compressor rankings.</td></tr><tr><td>3. Coding</td><td>Probability-to-bits machine</td><td>Huffman, arithmetic coding, asymmetric coding, numeric codes, archives, checksums, encryption.</td></tr><tr><td>4. Modeling</td><td>Prediction engine</td><td>Fixed-order models, variable-order models, context mixing, PAQ, ZPAQ, and why modeling is hard.</td></tr><tr><td>5. Transforms</td><td>Pattern-exposure tools</td><td>RLE, LZ77, LZW, BWT, filters, executable transforms, precompression.</td></tr><tr><td>6. Lossy Compression</td><td>Human sensor model</td><td>Images, video, audio, JPEG, MPEG, psychoacoustics, and recompression.</td></tr></tbody></table>

* * *

## 1\. Information Theory

### Compression Starts with a Count

There are `2^n` different binary strings of length `n`. There are fewer than `2^n` shorter binary strings. Therefore, no lossless compressor can make every `n`\-bit input shorter while still allowing perfect decompression.

```
All n-bit inputs:

[000...000] [000...001] [000...010] ... [111...111]
       count = 2^n

Possible shorter outputs:

[] [0] [1] [00] [01] ... [length < n]
       count = 2^n - 1

One-to-one decoding cannot map more inputs into fewer outputs.
```

The key result:

<table class="no-column-header"><thead></thead><tbody><tr><td>Claim</td><td>Meaning</td></tr><tr><td>No universal compression</td><td>A compressor that shrinks every file cannot exist.</td></tr><tr><td>Some files must expand</td><td>If a compressor shrinks some inputs, it must make other inputs longer or refuse them.</td></tr><tr><td>Random-looking data is usually incompressible</td><td>Most possible strings have no shorter description.</td></tr><tr><td>Useful data is often compressible</td><td>Human-created data usually has patterns, constraints, formats, repetition, and meaning.</td></tr></tbody></table>

### Why Meaningful Data Compresses

Most possible strings are random. Most strings people store are not:

<table class="no-column-header"><thead></thead><tbody><tr><td>Data</td><td>Why it has structure</td></tr><tr><td>English text</td><td>Grammar, vocabulary, topic, repeated words, spelling patterns</td></tr><tr><td>Source code</td><td>Keywords, syntax, indentation, identifiers, libraries</td></tr><tr><td>Images</td><td>Neighboring pixels are correlated</td></tr><tr><td>Audio</td><td>Samples are correlated over time and filtered by human hearing</td></tr><tr><td>Executables</td><td>Instructions, addresses, headers, imported symbols</td></tr><tr><td>Backups</td><td>Files repeat across versions and machines</td></tr></tbody></table>

Compression works because our data is not drawn uniformly from all possible bit strings. It comes from processes with structure.

### Coding Is Bounded

If a model says a symbol has probability `p`, the best possible code length is approximately `-log2(p)` bits. You can choose a bad coder and waste bits, but no coder can beat the model's information content for all data drawn from that model.

flowchart LR
    A\[Model says symbol is likely\] --> B\[Short code\]
    C\[Model says symbol is rare\] --> D\[Long code\]
    E\[Model is wrong\] --> F\[Compressed size penalty\]

The lesson is subtle:

<table class="no-column-header"><thead></thead><tbody><tr><td>Part</td><td>Status</td></tr><tr><td>Turning probabilities into bits</td><td>Efficient, well-understood</td></tr><tr><td>Finding the right probabilities</td><td>Hard, open-ended, data-dependent</td></tr></tbody></table>

### Modeling Is Not Computable

A better model can turn a long string into a tiny description. For example, a million digits of `pi` can be treated as random-looking decimal digits, or as "compute the first million digits of pi." The second description is dramatically shorter, but it requires recognizing the source.

```
weak model:
314159265358979323846...
-> "digits look independent"
-> many bits

strong model:
314159265358979323846...
-> "this is pi"
-> short program or description
```

The book connects this to Kolmogorov complexity: the shortest program that outputs a string is an ideal compressed representation, but there is no general algorithm that can always find it.

### Compression and AI

Prediction is a sign of understanding:

<table class="no-column-header"><thead></thead><tbody><tr><td>If a system understands...</td><td>It can predict...</td></tr><tr><td>English</td><td>likely next words</td></tr><tr><td>Images</td><td>likely neighboring pixels</td></tr><tr><td>Audio</td><td>likely future samples</td></tr><tr><td>Code</td><td>likely syntax and instruction patterns</td></tr><tr><td>File formats</td><td>likely headers, fields, and constraints</td></tr></tbody></table>

This is why compression and AI meet. A compressor that understands a data source can describe it more compactly. A perfect general compressor would need a very broad kind of understanding.

### Chapter 1 Takeaways

<table class="no-column-header"><thead></thead><tbody><tr><td>Takeaway</td><td>Why it matters</td></tr><tr><td>Random data cannot be compressed</td><td>Do not expect magic from encrypted, already-compressed, or random data.</td></tr><tr><td>Compression is model plus coder</td><td>Separate probability estimation from bit representation.</td></tr><tr><td>Coding has mathematical limits</td><td>Better coding helps, but only up to the model&amp;#x27;s quality.</td></tr><tr><td>Modeling is the hard problem</td><td>Better compression usually comes from better prediction.</td></tr><tr><td>Understanding creates compression</td><td>The more structure you can exploit, the shorter the description.</td></tr></tbody></table>

* * *

## 2\. Benchmarks

### What Benchmarks Actually Measure

Compression benchmarks compare compressors on a chosen data set under chosen rules.

flowchart TD
    A\[Benchmark\] --> B\[Data set\]
    A --> C\[Rules\]
    A --> D\[Metrics\]
    D --> E\[Compressed size\]
    D --> F\[Compression speed\]
    D --> G\[Decompression speed\]
    D --> H\[Memory use\]
    C --> I\[Can include decompressor?\]
    C --> J\[Can tune to files?\]
    C --> K\[Single file or archive?\]

The big triangle:

```
         smaller output
               /\
              /  \
             /    \
            /      \
           /        \
less memory -------- faster speed
```

You usually cannot optimize all three at once. Maximum compression tools are often slow and memory-hungry. Practical formats often give up ratio to win speed, streaming, random access, or compatibility.

### Bits Per Character

The book often uses `bpc`, or bits per character, for byte-oriented corpora.

<table class="no-column-header"><thead></thead><tbody><tr><td>bpc</td><td>Meaning</td></tr><tr><td>8.0</td><td>No compression for byte data</td></tr><tr><td>6.0</td><td>25 percent smaller than original</td></tr><tr><td>4.0</td><td>Half the original size</td></tr><tr><td>2.0</td><td>One quarter of original size</td></tr><tr><td>Lower</td><td>Better compression, assuming the same input data</td></tr></tbody></table>

### Benchmark Landscape

<table class="no-column-header"><thead></thead><tbody><tr><td>Benchmark</td><td>What it emphasizes</td><td>Why it matters</td></tr><tr><td>Calgary Corpus</td><td>Classic mixed small files</td><td>Historical baseline for text compression research.</td></tr><tr><td>Large Text Compression Benchmark</td><td>Large Wikipedia XML text</td><td>Natural language modeling and long-range structure.</td></tr><tr><td>Hutter Prize</td><td>Compression as AI research</td><td>Rewards improvements on a fixed text corpus with decompressor included.</td></tr><tr><td>Maximum Compression</td><td>Maximum ratio on mixed files</td><td>Encourages aggressive tuning for size.</td></tr><tr><td>Generic Compression Benchmark</td><td>Untuned universal prediction</td><td>Tests generality rather than file-type tricks.</td></tr><tr><td>Compression Ratings</td><td>Size and speed scoring</td><td>Makes tradeoffs adjustable by user preference.</td></tr><tr><td>Other public benchmarks</td><td>Multiple corpora and rule sets</td><td>Shows that rankings depend on test design.</td></tr><tr><td>File system studies</td><td>Real-world storage mix</td><td>Reveals what data actually exists on machines.</td></tr></tbody></table>

### Why Rankings Shift

Two compressors can trade places when any of these changes:

<table class="no-column-header"><thead></thead><tbody><tr><td>Variable</td><td>Effect</td></tr><tr><td>Data type</td><td>Text, images, executables, backups, logs, and audio favor different methods.</td></tr><tr><td>File size</td><td>Small files make headers and model startup costs visible.</td></tr><tr><td>Archive rules</td><td>Solid archives can exploit similarity across files.</td></tr><tr><td>Decompressor inclusion</td><td>Including source or executable rewards simpler decoders.</td></tr><tr><td>Memory limit</td><td>Large models can dominate if memory is unrestricted.</td></tr><tr><td>Speed limit</td><td>Slow context mixers may lose to practical LZ-family tools.</td></tr><tr><td>Tuning policy</td><td>Per-file options can inflate benchmark-specific performance.</td></tr></tbody></table>

### Visual: Benchmark as a Dashboard

```
+--------------------------------------------------+
| Compressor: example                              |
+-------------------+------------------------------+
| Size              | 1.95 bpc                     |
| Compression time  | slow                         |
| Decompression     | medium                       |
| Memory            | high                         |
| Decoder included  | yes                          |
| Data set          | text-heavy                   |
| Good use case     | archival / research          |
+-------------------+------------------------------+
```

### Chapter 2 Takeaways

<table class="no-column-header"><thead></thead><tbody><tr><td>Takeaway</td><td>Why it matters</td></tr><tr><td>Benchmarks are not neutral</td><td>They encode assumptions about data and priorities.</td></tr><tr><td>Size is only one metric</td><td>Real systems care about speed, memory, streaming, and compatibility.</td></tr><tr><td>Historical leaderboards age</td><td>Use the book&amp;#x27;s rankings as context, not current product advice.</td></tr><tr><td>Data set choice dominates</td><td>A compressor can look brilliant on one corpus and ordinary on another.</td></tr><tr><td>A benchmark is a contract</td><td>Read the rules before interpreting the chart.</td></tr></tbody></table>

* * *

## 3\. Coding

### Coder Job Description

A coder receives probabilities from a model and emits bits close to the theoretical ideal.

flowchart LR
    A\[Symbol\] --> B\[Model probability\]
    B --> C\[Coder\]
    C --> D\[Bitstream\]
    D --> E\[Decoder\]
    E --> F\[Same symbol\]

The coder must be:

<table class="no-column-header"><thead></thead><tbody><tr><td>Requirement</td><td>Reason</td></tr><tr><td>Decodable</td><td>The original symbols must be recoverable.</td></tr><tr><td>Efficient</td><td>Common symbols should use fewer bits.</td></tr><tr><td>Synchronized</td><td>Decoder must reproduce the same boundaries and model states.</td></tr><tr><td>Practical</td><td>Real files need headers, error checks, and sometimes encryption.</td></tr></tbody></table>

### Huffman Coding

Huffman coding builds a prefix tree. Frequent symbols sit near the root. Rare symbols sit deeper.

```
    root
   /    \
common   *
        / \
    medium rare
```

Core idea:

<table class="no-column-header"><thead></thead><tbody><tr><td>Symbol probability</td><td>Huffman effect</td></tr><tr><td>High</td><td>Shorter integer number of bits</td></tr><tr><td>Low</td><td>Longer integer number of bits</td></tr><tr><td>Exact powers of 1/2</td><td>Very efficient</td></tr><tr><td>Awkward probabilities</td><td>Wastes some space due to whole-bit code lengths</td></tr></tbody></table>

Strengths:

-   Simple.
-   Fast.
-   Widely used.
-   Good with static or block models.

Limits:

-   Code lengths are whole numbers of bits.
-   Binary alphabets cannot be compressed by basic Huffman alone.
-   A full table or canonical description may need to be stored.

### Arithmetic Coding

Arithmetic coding represents an entire message as a subinterval inside `[0, 1)`.

```
Initial interval:
[0 ------------------------------------------------ 1)

After likely symbol:
[0 -------- 0.7)

After next symbol:
[0.28 --- 0.42)

After more symbols:
[0.314159 ----------------)

Output a binary number inside the final interval.
```

Why it matters:

<table class="no-column-header"><thead></thead><tbody><tr><td>Feature</td><td>Benefit</td></tr><tr><td>Fractional bit efficiency</td><td>Avoids Huffman&amp;#x27;s whole-bit rounding.</td></tr><tr><td>Works well for binary prediction</td><td>Ideal for bitwise models.</td></tr><tr><td>Adapts naturally</td><td>Model can update after every symbol.</td></tr><tr><td>Near-Shannon performance</td><td>Strong practical coding method.</td></tr></tbody></table>

### Asymmetric Binary Coding

Asymmetric binary coding is another way to code predicted bits efficiently. It uses a single integer-like state rather than arithmetic coding's interval endpoints. It matters because the same theory can be implemented with different machine-level tradeoffs.

<table class="no-column-header"><thead></thead><tbody><tr><td>Coding family</td><td>Mental model</td></tr><tr><td>Huffman</td><td>Tree of prefix codes</td></tr><tr><td>Arithmetic/range</td><td>Shrinking probability interval</td></tr><tr><td>Asymmetric binary</td><td>State machine that packs bits according to probability</td></tr></tbody></table>

### Numeric Codes

Some values are not arbitrary symbols. They are counts, offsets, lengths, or prediction errors. Numeric codes exploit common number distributions.

<table class="no-column-header"><thead></thead><tbody><tr><td>Code</td><td>Good for</td><td>Visual shape</td></tr><tr><td>Unary</td><td>Very small positive integers</td><td><code class="annotation-code bg-default text-default">0</code>, <code class="annotation-code bg-default text-default">10</code>, <code class="annotation-code bg-default text-default">110</code>, <code class="annotation-code bg-default text-default">1110</code></td></tr><tr><td>Rice</td><td>Geometric-like distributions with power-of-two parameter</td><td>quotient plus remainder</td></tr><tr><td>Golomb</td><td>Geometric-like distributions with flexible parameter</td><td>quotient plus bounded remainder</td></tr><tr><td>Extra-bit codes</td><td>Ranges with extra low bits</td><td>length classes and offset details</td></tr></tbody></table>

These appear in systems where the model says "small numbers are common, large numbers are rare."

### Archive Formats

A compression algorithm is not the whole file format. Archives also need structure.

flowchart LR
    A\[Archive header\] --> B\[File metadata\]
    B --> C\[Compressed payload\]
    C --> D\[Error check\]
    D --> E\[Optional encryption metadata\]

Important archive concerns:

<table class="no-column-header"><thead></thead><tbody><tr><td>Concern</td><td>Why it matters</td></tr><tr><td>Single-file vs multi-file</td><td>Affects metadata and file recovery.</td></tr><tr><td>Solid compression</td><td>Similar files compressed together can shrink more.</td></tr><tr><td>Random access</td><td>Solid archives may make one-file extraction slower.</td></tr><tr><td>Error detection</td><td>Detects corruption after storage or transmission.</td></tr><tr><td>Encryption</td><td>Protects confidentiality but makes data look random afterward.</td></tr></tbody></table>

### Error Detection

<table class="no-column-header"><thead></thead><tbody><tr><td>Method</td><td>What it catches</td></tr><tr><td>Parity</td><td>Simple odd/even bit errors, weak but cheap.</td></tr><tr><td>CRC-32</td><td>Common accidental corruption check.</td></tr><tr><td>Adler-32</td><td>Fast checksum used in some compression contexts.</td></tr><tr><td>Cryptographic hash</td><td>Strong integrity identity, designed against adversarial collisions.</td></tr></tbody></table>

Error detection is not compression, but production archives need it.

### Chapter 3 Takeaways

<table class="no-column-header"><thead></thead><tbody><tr><td>Takeaway</td><td>Why it matters</td></tr><tr><td>Coding maps probability to bits</td><td>It is the final packing step.</td></tr><tr><td>Huffman is simple but rounded</td><td>Whole-bit lengths are a real limitation.</td></tr><tr><td>Arithmetic coding is closer to ideal</td><td>It fits adaptive and binary models well.</td></tr><tr><td>Numeric codes encode structured integers</td><td>They are useful for lengths, offsets, runs, and errors.</td></tr><tr><td>Archives are systems</td><td>Metadata, checks, encryption, and extraction rules matter.</td></tr></tbody></table>

* * *

## 4\. Modeling

### The Hard Part

A model estimates what comes next. Once we have a good probability, coding is mechanical. The model decides whether compression is mediocre or excellent.

```
history:  "the quick brown "
model:    next symbol is likely "f", "d", "c", ...
coder:    short code for likely next symbol
```

Static vs adaptive:

<table class="no-column-header"><thead></thead><tbody><tr><td>Model type</td><td>How it works</td><td>Tradeoff</td></tr><tr><td>Static</td><td>Analyze data, send model, then coded data</td><td>Good if model cost is small and data is stable.</td></tr><tr><td>Adaptive</td><td>Update model as data is read</td><td>Avoids sending full model, tracks local changes.</td></tr></tbody></table>

### Fixed-Order Models

An order `n` model predicts the next symbol from the previous `n` symbols.

```
Order 0: no context
P(next)

Order 1: one previous symbol
P(next | previous)

Order 3: three previous symbols
P(next | previous_3, previous_2, previous_1)
```

Example:

<table class="no-column-header"><thead></thead><tbody><tr><td>Context</td><td>Next-symbol table</td></tr><tr><td><code class="annotation-code bg-default text-default">q</code></td><td><code class="annotation-code bg-default text-default">u</code> is very likely in English</td></tr><tr><td><code class="annotation-code bg-default text-default">th</code></td><td><code class="annotation-code bg-default text-default">e</code>, <code class="annotation-code bg-default text-default">a</code>, <code class="annotation-code bg-default text-default">i</code>, <code class="annotation-code bg-default text-default">o</code> are plausible</td></tr><tr><td><code class="annotation-code bg-default text-default">ing</code></td><td>space, punctuation, or suffix continuation</td></tr></tbody></table>

Why fixed order breaks:

<table class="no-column-header"><thead></thead><tbody><tr><td>Order too low</td><td>Misses useful context.</td></tr><tr><td>Order too high</td><td>Most contexts are rare or unseen.</td></tr><tr><td>Result</td><td>Need smoothing, fallback, or variable order.</td></tr></tbody></table>

### Bytewise, Bitwise, and Indirect Models

<table class="no-column-header"><thead></thead><tbody><tr><td>Model style</td><td>Unit</td><td>Useful when</td></tr><tr><td>Bytewise</td><td>Predict next byte</td><td>Text, simple binary data, byte-aligned formats.</td></tr><tr><td>Bitwise</td><td>Predict next bit</td><td>Arithmetic coding, mixed binary patterns, precise probability updates.</td></tr><tr><td>Indirect</td><td>Use hashed or transformed contexts</td><td>Large context spaces where full tables are too costly.</td></tr></tbody></table>

### Variable-Order Models

Variable-order models keep statistics for multiple context lengths and choose or mix them.

flowchart TD
    A\[Long context\] --> B{Seen enough?}
    B -->|Yes| C\[Use long-context prediction\]
    B -->|No| D\[Back off\]
    D --> E\[Medium context\]
    E --> F{Seen enough?}
    F -->|Yes| G\[Use medium-context prediction\]
    F -->|No| H\[Short context\]

#### DMC

Dynamic Markov Coding predicts bits with a state machine that grows as it observes data. It can split states when histories diverge.

Visual:

```
state A --0--> state B
state A --1--> state C

if state A is too vague:
state A becomes A1 and A2
```

#### PPM

Prediction by Partial Matching uses byte contexts and backs off from longer to shorter contexts. It is a classic text-compression idea.

```
Try context "tion"
if unknown, try "ion"
if unknown, try "on"
if unknown, try "n"
if unknown, try order 0
```

The hard detail is handling symbols that have not appeared in a context, often called the escape or zero-frequency problem.

#### CTW

Context Tree Weighting mixes context tree predictions in a principled bitwise way. Instead of choosing only one context, it combines evidence across a tree.

### Context Mixing

Context mixing uses many predictors at once.

flowchart LR
    A\[Text model\] --> M\[Mixer\]
    B\[Match model\] --> M
    C\[Low-order model\] --> M
    D\[File-format model\] --> M
    E\[Image/audio heuristic\] --> M
    M --> P\[Final probability\]
    P --> Coder\[Arithmetic coder\]

The mixer can learn which predictors are useful in each situation.

<table class="no-column-header"><thead></thead><tbody><tr><td>Component</td><td>Role</td></tr><tr><td>Linear evidence mixing</td><td>Combine model outputs with weighted evidence.</td></tr><tr><td>Logistic mixing</td><td>Mix in probability/logit space for better behavior.</td></tr><tr><td>SSE</td><td>Secondary Symbol Estimation adjusts predictions using past calibration.</td></tr><tr><td>ISSE</td><td>Indirect SSE uses contexts to select or adapt estimators.</td></tr><tr><td>Match model</td><td>If current data matches earlier data, predict continuation from the match.</td></tr><tr><td>PAQ models</td><td>High-compression family using context mixing.</td></tr><tr><td>ZPAQ</td><td>A more configurable/archive-oriented successor in the PAQ family.</td></tr><tr><td>Crinkler</td><td>Specialized compression/linking for executable code size competitions.</td></tr></tbody></table>

### Why Context Mixing Can Beat Single Models

Different predictors notice different kinds of structure:

<table class="no-column-header"><thead></thead><tbody><tr><td>Predictor</td><td>Notices</td></tr><tr><td>Low-order byte model</td><td>Local byte frequencies</td></tr><tr><td>Word model</td><td>Language-level repetition</td></tr><tr><td>Match model</td><td>Exact repeated substrings</td></tr><tr><td>Image model</td><td>Neighboring pixel relationships</td></tr><tr><td>Executable model</td><td>Instruction and address patterns</td></tr><tr><td>XML model</td><td>Tags, attributes, markup rhythm</td></tr></tbody></table>

Compression improves when the mixer learns which predictor is trustworthy right now.

### Chapter 4 Takeaways

<table class="no-column-header"><thead></thead><tbody><tr><td>Takeaway</td><td>Why it matters</td></tr><tr><td>Modeling is prediction</td><td>The compressed file stores surprises.</td></tr><tr><td>Fixed order is simple but brittle</td><td>Context length must match the data.</td></tr><tr><td>Variable order handles sparse contexts</td><td>Backoff avoids overconfidence in rare histories.</td></tr><tr><td>Context mixing is powerful</td><td>Many weak specialized models can beat one general model.</td></tr><tr><td>Better modeling can look like understanding</td><td>Language, images, code, and formats all reward domain knowledge.</td></tr></tbody></table>

* * *

## 5\. Transforms

### What a Transform Does

A transform rewrites data so a simpler model can compress it.

flowchart LR
    A\[Original data\] --> B\[Transform\]
    B --> C\[More model-friendly symbols\]
    C --> D\[Model\]
    D --> E\[Coder\]
    E --> F\[Compressed file\]
    F --> G\[Decoder\]
    G --> H\[Inverse transform\]
    H --> I\[Original data\]

Ideal transform:

<table class="no-column-header"><thead></thead><tbody><tr><td>Property</td><td>Meaning</td></tr><tr><td>Reversible</td><td>Decompression gets the original back exactly.</td></tr><tr><td>Structure exposing</td><td>Repetition, locality, or predictable errors become obvious.</td></tr><tr><td>Cheap enough</td><td>Transform cost must be worth the compression gain.</td></tr><tr><td>Canonical when possible</td><td>Avoid arbitrary choices that add information burden.</td></tr></tbody></table>

### Run Length Encoding

RLE replaces repeated symbols with a symbol plus a count.

```
AAAAAABBBBCCCCCCCC

becomes

(A,6) (B,4) (C,8)
```

Best for:

<table class="no-column-header"><thead></thead><tbody><tr><td>Good</td><td>Bad</td></tr><tr><td>Long repeated runs</td><td>Alternating symbols</td></tr><tr><td>Simple image masks</td><td>Natural text</td></tr><tr><td>Zero-filled data</td><td>Already transformed data with few runs</td></tr></tbody></table>

### LZ77 and the Match Family

LZ77 replaces repeated strings with pointers to previous occurrences.

```
Input:
ABRACADABRA

Later "ABRA" can become:
(go back 7, copy 4)
```

Visual:

```
sliding history buffer        lookahead
[ABRACAD]                     [ABRA...]
   ^^^^^
   match reused by pointer
```

Why it is popular:

<table class="no-column-header"><thead></thead><tbody><tr><td>Strength</td><td>Explanation</td></tr><tr><td>Fast decompression</td><td>Decoder mostly copies bytes from earlier output.</td></tr><tr><td>General-purpose</td><td>Works on many repeated byte patterns.</td></tr><tr><td>Streaming-friendly</td><td>Can run with bounded windows.</td></tr><tr><td>Foundation format</td><td>Deflate, LZMA-like families, and many practical tools build on the idea.</td></tr></tbody></table>

#### LZSS

LZSS improves practical LZ77 by only emitting pointers when they save space. Short non-saving matches remain literals.

#### Deflate

Deflate combines LZ77-style matches with Huffman coding. It powers common zip/gzip-style compression and survives because it is fast, widely implemented, and compatible.

flowchart LR
    A\[Bytes\] --> B\[LZ77 literals and matches\]
    B --> C\[Huffman coding\]
    C --> D\[Deflate bitstream\]

#### LZMA

LZMA pushes stronger modeling around LZ-style matches, often improving ratio at the cost of more CPU and memory.

#### LZX, ROLZ, LZP, Snappy, Deduplication

<table class="no-column-header"><thead></thead><tbody><tr><td>Method</td><td>Main idea</td><td>Design center</td></tr><tr><td>LZX</td><td>LZ-family compression used in Microsoft contexts</td><td>Practical binary/archive compression.</td></tr><tr><td>ROLZ</td><td>Restricts match search by recent contexts</td><td>Better match relevance.</td></tr><tr><td>LZP</td><td>Predicts repeated strings from context</td><td>Fast prediction of matches.</td></tr><tr><td>Snappy</td><td>Prioritizes very high speed</td><td>Low latency over maximum ratio.</td></tr><tr><td>Deduplication</td><td>Replaces repeated chunks across files or systems</td><td>Backup and storage efficiency.</td></tr></tbody></table>

### LZW and Dictionary Encoding

Dictionary methods replace strings with dictionary references.

```
dictionary:
1 -> the
2 -> compression
3 -> model

text:
the compression model

encoded:
1 2 3
```

Dictionary types:

<table class="no-column-header"><thead></thead><tbody><tr><td>Type</td><td>Dictionary source</td></tr><tr><td>Fixed</td><td>Built into the format or algorithm.</td></tr><tr><td>Static</td><td>Learned from the file and stored with it.</td></tr><tr><td>Dynamic</td><td>Built by compressor and decompressor in lockstep.</td></tr></tbody></table>

LZW is a dynamic dictionary method historically associated with formats like GIF-era compression. Its broader lesson is that both sides can build the same dictionary without transmitting every entry.

### Dictionary Encoding for Text

Text-specific dictionaries can model:

<table class="no-column-header"><thead></thead><tbody><tr><td>Feature</td><td>Compression opportunity</td></tr><tr><td>Words</td><td>Common words become compact tokens.</td></tr><tr><td>Capitalization</td><td>Store word identity separately from case pattern.</td></tr><tr><td>Newlines</td><td>Model paragraph and line structure.</td></tr><tr><td>Punctuation</td><td>Predict separators and syntax.</td></tr><tr><td>Word endings</td><td>Use morphology and repeated suffixes.</td></tr></tbody></table>

The stronger the text model, the more the compressor behaves like a small language-aware system.

### Symbol Ranking and Move-to-Front

Move-to-front keeps a list of symbols ordered by recency. If the same few symbols keep appearing in a context, their ranks stay small.

```
alphabet list:
[A B C D E ...]

read C -> output rank 2, move C to front
[C A B D E ...]

read C again -> output rank 0
```

This works well after transforms like BWT, where local neighborhoods tend to reuse a small set of symbols.

### Burrows-Wheeler Transform

BWT sorts rotations of a block so characters with similar right contexts cluster together. After BWT, a fast local model can often compress well.

Tiny example, conceptually:

```
Original block:
banana$

Sort rotations:
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba

Take last column:
annb$aa
```

The output is not obviously shorter, but it groups context-related symbols. It is usually followed by move-to-front, run-length coding, and entropy coding.

flowchart LR
    A\[Input block\] --> B\[BWT context sort\]
    B --> C\[Move-to-front\]
    C --> D\[Run-length coding\]
    D --> E\[Huffman or arithmetic coding\]

### Predictive Filtering

Numeric data often compresses better as prediction errors.

```
samples:
100, 103, 105, 106, 108

predict next as previous:
100, +3, +2, +1, +2
```

Small errors are easier to encode than raw values.

<table class="no-column-header"><thead></thead><tbody><tr><td>Filter</td><td>Used for</td></tr><tr><td>Delta coding</td><td>Signals, images, ordered numeric data</td></tr><tr><td>Color transform</td><td>Separating brightness from color differences</td></tr><tr><td>Linear filtering</td><td>Predicting from neighboring samples or pixels</td></tr></tbody></table>

### Specialized Transforms

<table class="no-column-header"><thead></thead><tbody><tr><td>Transform</td><td>Target</td><td>Idea</td></tr><tr><td>E8E9</td><td>x86 executable code</td><td>Normalize relative call/jump addresses so repeated code patterns match better.</td></tr><tr><td>Precomp</td><td>Already-compressed embedded data</td><td>Detect and temporarily expand compressed streams inside files so outer compression can work.</td></tr><tr><td>Huffman pre-coding</td><td>Context-mixing speed</td><td>Reduce input size before expensive modeling.</td></tr></tbody></table>

### Chapter 5 Takeaways

<table class="no-column-header"><thead></thead><tbody><tr><td>Takeaway</td><td>Why it matters</td></tr><tr><td>Transforms do not finish compression</td><td>They prepare data for modeling and coding.</td></tr><tr><td>LZ methods exploit repeated strings</td><td>This explains many practical formats.</td></tr><tr><td>BWT exploits sorted contexts</td><td>It turns context into local symbol clustering.</td></tr><tr><td>Filters exploit smooth numeric data</td><td>Prediction errors are often small.</td></tr><tr><td>Specialized transforms exploit file knowledge</td><td>Better compression often comes from knowing the data&amp;#x27;s format.</td></tr></tbody></table>

* * *

## 6\. Lossy Compression

### The Big Shift

Lossless compression asks:

```
Can we reproduce the original bits exactly?
```

Lossy compression asks:

```
Can we reproduce something humans accept as the same?
```

That makes perception the model.

flowchart LR
    A\[Original signal\] --> B\[Human perception model\]
    B --> C\[Discard hard-to-notice detail\]
    C --> D\[Quantize\]
    D --> E\[Lossless coding of remaining data\]
    E --> F\[Compressed media\]

### Images

Digital images are already approximations of continuous light. Lossy image compression removes detail that the visual system is less sensitive to.

<table class="no-column-header"><thead></thead><tbody><tr><td>Human visual fact</td><td>Compression use</td></tr><tr><td>Limited spatial resolution</td><td>Do not store invisible fine detail.</td></tr><tr><td>Limited brightness precision</td><td>Quantize small intensity differences.</td></tr><tr><td>Color sensitivity differs from brightness sensitivity</td><td>Store chroma at lower precision than luma.</td></tr><tr><td>Local smoothness is common</td><td>Predict pixels from neighbors or transform blocks.</td></tr></tbody></table>

### Image Format Map

<table class="no-column-header"><thead></thead><tbody><tr><td>Format/topic</td><td>Role in the chapter</td></tr><tr><td>BMP</td><td>Mostly raw pixels, but still an approximation of continuous light.</td></tr><tr><td>GIF</td><td>Palette-based images and simple animation history.</td></tr><tr><td>PNG</td><td>Lossless image compression with filtering.</td></tr><tr><td>TIFF</td><td>Flexible container used in imaging workflows.</td></tr><tr><td>JPEG</td><td>Transform, quantization, and entropy coding for photos.</td></tr><tr><td>JPEG recompression</td><td>Attempts to compress JPEGs further without fully losing practical recoverability.</td></tr></tbody></table>

### JPEG as a Visual Pipeline

flowchart LR
    A\[RGB pixels\] --> B\[Color transform\]
    B --> C\[Chroma subsampling\]
    C --> D\[8x8 blocks\]
    D --> E\[DCT frequency transform\]
    E --> F\[Quantization\]
    F --> G\[Zigzag ordering\]
    G --> H\[Run-length and entropy coding\]
    H --> I\[JPEG file\]

Mental picture:

```
left side of block frequency table  = broad smooth shape
right side of table                 = fine detail

JPEG keeps more of the left side and throws away more of the right side.
```

Why artifacts happen:

<table class="no-column-header"><thead></thead><tbody><tr><td>Artifact</td><td>Cause</td></tr><tr><td>Blocking</td><td>Independent 8x8 block decisions become visible.</td></tr><tr><td>Ringing</td><td>Lost high-frequency detail near sharp edges.</td></tr><tr><td>Color bleeding</td><td>Reduced chroma detail.</td></tr><tr><td>Generational loss</td><td>Repeated decode/re-encode compounds quantization.</td></tr></tbody></table>

### JPEG Recompression

JPEG files are already compressed. Recompressors look for remaining structure:

<table class="no-column-header"><thead></thead><tbody><tr><td>Strategy</td><td>What it tries to exploit</td></tr><tr><td>Better entropy coding</td><td>JPEG&amp;#x27;s stored coefficients may still be coded more compactly.</td></tr><tr><td>Coefficient modeling</td><td>Predict patterns in quantized DCT coefficients.</td></tr><tr><td>Metadata cleanup</td><td>Remove or compact non-image payload.</td></tr><tr><td>Specialized decoding knowledge</td><td>Preserve reconstructable JPEG details while storing them differently.</td></tr></tbody></table>

The chapter surveys historical approaches such as Stuffit, PAQ-based methods, WinZip behavior, and PackJPG. Treat the named results as historical context.

### Video

Video compression adds time.

flowchart LR
    A\[Frame 1\] --> B\[Predict frame 2 from frame 1\]
    B --> C\[Encode motion\]
    C --> D\[Encode residual error\]
    D --> E\[Repeat across frames\]

Why video compresses:

<table class="no-column-header"><thead></thead><tbody><tr><td>Structure</td><td>Compression opportunity</td></tr><tr><td>Adjacent frames are similar</td><td>Store changes instead of full frames.</td></tr><tr><td>Objects move</td><td>Motion vectors describe block movement.</td></tr><tr><td>Human vision tolerates some error</td><td>Quantize residuals.</td></tr><tr><td>Scenes contain spatial redundancy</td><td>Use image-like compression inside frames.</td></tr></tbody></table>

### NTSC and MPEG

<table class="no-column-header"><thead></thead><tbody><tr><td>Topic</td><td>Key idea</td></tr><tr><td>NTSC</td><td>Broadcast video is already shaped by human vision, refresh, interlacing, and color compromises.</td></tr><tr><td>MPEG</td><td>Modern-style video coding predicts frames from other frames, stores motion, quantizes transforms, and entropy-codes the result.</td></tr></tbody></table>

Frame types as a mental model:

```
I-frame: self-contained image
P-frame: predicted from previous frames
B-frame: predicted from past and future reference frames
```

Tradeoff:

<table class="no-column-header"><thead></thead><tbody><tr><td>More prediction</td><td>Better compression, more complexity, more latency</td></tr><tr><td>Less prediction</td><td>Easier seeking and editing, larger files</td></tr></tbody></table>

### Audio

Audio compression uses psychoacoustics: what the ear can and cannot notice.

<table class="no-column-header"><thead></thead><tbody><tr><td>Hearing fact</td><td>Compression use</td></tr><tr><td>Limited frequency range</td><td>Do not store inaudible frequencies.</td></tr><tr><td>Sensitivity varies by frequency</td><td>Allocate bits where hearing is sharpest.</td></tr><tr><td>Loud sounds mask nearby quiet sounds</td><td>Remove masked components.</td></tr><tr><td>Perceived loudness is logarithmic</td><td>Quantization can follow perception.</td></tr><tr><td>Time masking exists</td><td>Sounds can hide nearby sounds in time.</td></tr></tbody></table>

Audio pipeline:

flowchart LR
    A\[PCM samples\] --> B\[Frequency analysis\]
    B --> C\[Psychoacoustic masking model\]
    C --> D\[Quantization and bit allocation\]
    D --> E\[Entropy coding\]
    E --> F\[Compressed audio\]

### Chapter 6 Takeaways

<table class="no-column-header"><thead></thead><tbody><tr><td>Takeaway</td><td>Why it matters</td></tr><tr><td>Lossy compression discards information</td><td>The hard part is choosing information humans will not miss.</td></tr><tr><td>Media compression is perceptual modeling</td><td>Vision and hearing are part of the algorithm.</td></tr><tr><td>Transform plus quantization is central</td><td>Especially for JPEG-like and audio/video systems.</td></tr><tr><td>Recompression is difficult</td><td>Already-compressed data has little easy redundancy left.</td></tr><tr><td>Perfect lossy compression would require understanding</td><td>A movie could theoretically be summarized semantically, but practical systems are far from that.</td></tr></tbody></table>

* * *

## The Grand Unifying Model

flowchart TD
    A\[Data source\] --> B{Lossless or lossy?}
    B -->|Lossless| C\[Preserve every bit\]
    B -->|Lossy| D\[Preserve perceptual meaning\]
    C --> E{Transform useful?}
    D --> F\[Perception transform and quantization\]
    F --> E
    E -->|Yes| G\[Expose patterns\]
    E -->|No| H\[Model directly\]
    G --> H\[Predict next symbol or bit\]
    H --> I\[Code using probability\]
    I --> J\[Package with metadata, checks, maybe encryption\]

Every concrete compressor can be placed in this frame:

<table class="no-column-header"><thead></thead><tbody><tr><td>Compressor family</td><td>Transform</td><td>Model</td><td>Coder</td></tr><tr><td>gzip/deflate</td><td>LZ77 matches</td><td>Huffman-coded literals/lengths</td><td>Huffman</td></tr><tr><td>bzip2-style</td><td>BWT, MTF, RLE</td><td>Local symbol frequencies</td><td>Huffman</td></tr><tr><td>PAQ-style</td><td>Often file-aware contexts</td><td>Context mixing</td><td>Arithmetic/range-like</td></tr><tr><td>PNG</td><td>Image filters</td><td>Deflate model</td><td>Huffman via deflate</td></tr><tr><td>JPEG</td><td>DCT and quantization</td><td>Coefficient/statistical coding</td><td>Entropy coding</td></tr><tr><td>MPEG-like video</td><td>Motion prediction and transforms</td><td>Residual and motion models</td><td>Entropy coding</td></tr><tr><td>MP3/AAC-like audio</td><td>Frequency analysis and masking</td><td>Psychoacoustic bit allocation</td><td>Entropy coding</td></tr></tbody></table>

## Practical Reading Paths

### If You Want to Build a Compressor

1.  Understand Chapter 1 so you stop expecting impossible wins.
2.  Pick a benchmark from Chapter 2 that matches your target data.
3.  Implement a simple coder from Chapter 3 or reuse a known one.
4.  Start with a simple adaptive model from Chapter 4.
5.  Add one transform from Chapter 5 only when it exposes a pattern you can explain.
6.  For media, study Chapter 6 before inventing quality knobs.

### If You Want to Choose a Format

<table class="no-column-header"><thead></thead><tbody><tr><td>Need</td><td>Prefer</td></tr><tr><td>Speed and compatibility</td><td>Deflate/gzip/zip-style tools</td></tr><tr><td>Archival ratio</td><td>Stronger LZMA/context-mixing tools, if time is acceptable</td></tr><tr><td>Text research</td><td>PPM/context-mixing/modern language-model-aware approaches</td></tr><tr><td>Backups</td><td>Deduplication plus compression</td></tr><tr><td>Photos</td><td>JPEG-like formats or newer perceptual image codecs</td></tr><tr><td>Screenshots/graphics</td><td>PNG-like lossless image compression</td></tr><tr><td>Audio/video distribution</td><td>Perceptual audio/video codecs</td></tr></tbody></table>

### If You Want to Understand AI Through Compression

Read Chapter 1 and Chapter 4 together. The essential loop is:

```
understand pattern -> predict better -> encode surprise only -> shorter file
```

Compression is not just storage optimization. It is a measurable way to ask how much structure a system has discovered.

## Cheat Sheets

### Glossary

<table class="no-column-header"><thead></thead><tbody><tr><td>Term</td><td>Short definition</td></tr><tr><td>Lossless</td><td>Decompression recovers the exact original data.</td></tr><tr><td>Lossy</td><td>Decompression recovers an acceptable approximation.</td></tr><tr><td>Model</td><td>Probability estimator for upcoming symbols or bits.</td></tr><tr><td>Coder</td><td>Converts model probabilities into a bitstream.</td></tr><tr><td>Transform</td><td>Rewrites data to expose compressible structure.</td></tr><tr><td>Entropy</td><td>Expected information content under a probability model.</td></tr><tr><td>Context</td><td>Previously seen data used to predict the next symbol.</td></tr><tr><td>Adaptive model</td><td>Updates as data is processed.</td></tr><tr><td>Static model</td><td>Sent or fixed before coding the payload.</td></tr><tr><td>Solid archive</td><td>Compresses multiple files together to exploit shared structure.</td></tr><tr><td>BWT</td><td>Context-sorting transform that clusters similar-symbol contexts.</td></tr><tr><td>Quantization</td><td>Reducing precision, usually the irreversible part of lossy coding.</td></tr><tr><td>Psychoacoustics</td><td>Modeling what humans can hear.</td></tr></tbody></table>

### Algorithm Selection Sketch

```
Mostly repeated bytes?
-> LZ-style match coding

Mostly text?
-> PPM, context mixing, dictionary transforms, or modern language-aware modeling

Mostly smooth numeric samples?
-> predictive filtering plus entropy coding

Mostly photos?
-> perceptual image codec

Mostly backups or VM images?
-> deduplication plus compression

Already encrypted or compressed?
-> expect little or no gain
```

### Red Flags When Evaluating Compression Claims

<table class="no-column-header"><thead></thead><tbody><tr><td>Claim</td><td>Skeptical question</td></tr><tr><td>Compresses every file</td><td>How does it avoid the counting argument?</td></tr><tr><td>Recompresses compressed data repeatedly</td><td>Where does the extra information go?</td></tr><tr><td>Beats all compressors</td><td>On which benchmark and rules?</td></tr><tr><td>No quality loss in lossy mode</td><td>What exact metric or human test supports that?</td></tr><tr><td>Tiny output with universal recovery</td><td>Is the decompressor/model included in the accounting?</td></tr></tbody></table>

## Closing

The book's central message is practical and philosophical at the same time:

> To compress data, find structure. To find structure, predict. To predict well, understand the source.

That is why the same field contains Huffman trees, probability intervals, dictionaries, suffix sorting, image transforms, psychoacoustics, benchmark politics, and AI. They are all different ways of answering one question:

```
What is the shortest description that still lets us recover what matters?
```

## Source and Further Reading

-   Matt Mahoney, [Data Compression Explained](https://mattmahoney.net/dc/dce.html), last updated Apr. 15, 2013.
-   For current tool rankings, consult up-to-date benchmark leaderboards directly; the rankings in the source are historical.
-   For implementation practice, start with a simple RLE or Huffman coder, then build toward adaptive modeling, LZ-style matching, or arithmetic/range coding.