Cutting a MySQL table in half with one line
We had a wide MySQL table — think analytics events: one row per visit, lots of text columns for URLs, referrers, UTM parameters, click identifiers. Around 8 million rows, 8 GB on disk, most of it sitting in indexes. We were about to roughly 10× the data by backfilling historical years, and the storage math was looking ugly.
The instinct was to normalise: move repeated strings into dimension tables, reference them by integer foreign keys. Classic relational hygiene. But before writing a single migration, we spent twenty minutes on the numbers — and they pointed somewhere different.
Look at the data before you design the schema
We pulled two numbers for every text column: cardinality (how many distinct values) and average length. That gave us a repetition ratio per column.
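Both numbers come straight out of SQL. A sketch, one query per text column of interest — the table name is the one from the migration below, the column name (`utm_source`) is illustrative:

```sql
-- Cardinality, average length, and repetition ratio for one text column.
-- LENGTH() counts bytes; use CHAR_LENGTH() if you want characters instead.
SELECT
    COUNT(*)                              AS total_rows,
    COUNT(DISTINCT utm_source)            AS distinct_values,
    AVG(LENGTH(utm_source))               AS avg_len_bytes,
    COUNT(*) / COUNT(DISTINCT utm_source) AS repetition_ratio
FROM live_entries;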
A few patterns jumped out:
- Low cardinality, high repetition. UTM source, UTM medium, customer type, country: tens of distinct values spread across 8M rows. Great normalisation candidates.
- Medium cardinality, heavy repetition. Landing page paths, referrers, campaign names: thousands of distinct values. Also good candidates, with bigger per-column savings.
- High cardinality, nearly unique. Click identifiers from ad platforms (the long alphanumeric strings). These averaged 80 characters and repeated less than twice on average across the whole table. Normalising them into a dim table would barely help — you'd trade one fat string for a foreign key plus the same fat string sitting in the dim table.
That last category was the biggest single column by raw bytes. And it was the one where dimension tables would add complexity without fixing the problem.
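The back-of-envelope maths makes the point concrete. With hypothetical numbers matching the shapes above (8M rows, a 4-byte integer foreign key), the bytes saved by normalising a column are roughly `rows × avg_len − (rows × fk_bytes + distinct × avg_len)`:

```sql
-- Hypothetical back-of-envelope: bytes saved by moving a column into a dim table.
-- saved = rows * avg_len - (rows * fk_bytes + distinct * avg_len)
SELECT
    -- utm_source-like: 8M rows, 50 distinct, 10-byte values
    -- -> ~48 MB saved, about 60% of the column
    8000000 * 10 - (8000000 * 4 + 50 * 10)      AS low_cardinality_saved,
    -- click-ID-like: 8M rows, 5M distinct, 80-byte values
    -- -> 208 MB of 640 MB saved, while the dim table still holds 400 MB of strings
    8000000 * 80 - (8000000 * 4 + 5000000 * 80) AS nearly_unique_saved;
```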
Compression vs normalisation
The sales pitch for normalisation is "store each string once". The sales pitch for InnoDB's `ROW_FORMAT=COMPRESSED` is "the database does it for you with zlib". For columns that are mostly unique, zlib finds patterns within the strings themselves — and on sorted index pages, where adjacent keys share long common prefixes, it is especially effective on things like URLs and click IDs.
The practical trade-offs:
| | Normalisation | Page compression |
|---|---|---|
| Schema changes | Many tables, FKs, relationships | One ALTER TABLE |
| Application code | Every reader needs joins/eager loading; every writer needs upsert-lookup | No code changes |
| Covers unique-ish columns? | No — you just move the bytes | Yes — zlib finds within-string patterns |
| Reversible? | Painful (schema change + data migration) | One ALTER TABLE back |
| Wins on indexes? | Only if you index the FK | Yes — indexes are compressed too |
For our workload — bulk inserts from a nightly ETL, lots of read-heavy dashboards, no hot update path — compression was the cheap, reversible, data-compatible option. Normalisation wasn't wrong, it was just premature optimisation before a problem existed.
The one-line migration
On modern MySQL (5.7+) with `innodb_file_per_table` on, compression is a per-table attribute. There is no server-wide config to flip.
```sql
ALTER TABLE live_entries
  ROW_FORMAT=COMPRESSED,
  KEY_BLOCK_SIZE=8,
  ALGORITHM=INPLACE,
  LOCK=NONE;
```

`KEY_BLOCK_SIZE=8` compresses 16 KB pages into 8 KB on disk. `ALGORITHM=INPLACE, LOCK=NONE` keeps writes online during the rebuild — you pay for the rebuild time (disk I/O, a shadow copy the size of the table) but the application keeps running.
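A sanity check worth running after the rebuild (and periodically afterwards): MySQL exposes per-page-size compression counters in `information_schema.INNODB_CMP`. A growing failure count means pages routinely don't fit the chosen `KEY_BLOCK_SIZE` and are being split and recompressed:

```sql
-- Compression attempts vs successes, per compressed page size.
SELECT
    page_size,
    compress_ops,
    compress_ops_ok,
    compress_ops - compress_ops_ok AS compress_failures
FROM information_schema.INNODB_CMP;
```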
The result
On the production table — roughly 8 million rows, 8.2 GB on disk:
| | Before | After | Change |
|---|---|---|---|
| Data | 1997 MB | 1109 MB | −44 % |
| Index | 6185 MB | 2524 MB | −59 % |
| Total | 8.18 GB | 3.63 GB | −55 % |
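Before-and-after numbers like these come straight out of `information_schema`, along these lines (note the figures are estimates; `ANALYZE TABLE` refreshes them):

```sql
-- Data and index footprint for one table, in MB.
SELECT
    ROUND(data_length  / 1024 / 1024)                 AS data_mb,
    ROUND(index_length / 1024 / 1024)                 AS index_mb,
    ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name   = 'live_entries';
```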
The index side compressed best. That's where the wide string columns are duplicated into secondary indexes, and where zlib's pattern matching found the most to strip out.
"But won't reads get slower?"
The classic objection to any form of compression: "I'm trading CPU for disk." In practice, for an analytics workload with a modest buffer pool and a table much larger than RAM, the trade usually goes the right way:
- Fewer disk I/Os. Half-sized pages on disk mean half the physical reads for the same query.
- Better buffer pool hit rate. More logical rows fit in the same RAM. A scan that used to thrash against a cold cache now touches fewer cold pages.
- CPU cost of decompression is small. zlib on a modern CPU is a rounding error next to random disk I/O.
The one shape where compression can hurt is high-frequency random updates, because recompressing a modified page occasionally fails to fit and triggers a page split. Bulk-insert workloads with few in-place updates — which analytics events fundamentally are — sit in the sweet spot.
We benchmarked the queries we care about: indexed point lookups stayed at low-millisecond response times, and full scans were CPU/I/O bound the same way as before. No regression.
When to normalise anyway
Compression is a first move, not the last. If, after compressing, you still see:
- A specific column driving obvious query slowness (e.g. group-by aggregations on a high-cardinality string column),
- A need to enforce referential integrity or validate values (enums, lookup tables),
- An ergonomic win for the application (managing a dozen channel sources as a small lookup vs free-text),
…then normalise that specific column. But do it because a real query is slow or a real validation is missing, not because "normalised feels cleaner".
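If you do get there, keep the scope surgical: one column, one lookup table. A hypothetical sketch for a `utm_source` column — names and types are illustrative, not from the original schema:

```sql
-- Hypothetical: normalise a single low-cardinality column.
CREATE TABLE utm_sources (
    id   SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE
);

INSERT INTO utm_sources (name)
SELECT DISTINCT utm_source FROM live_entries
WHERE utm_source IS NOT NULL;

ALTER TABLE live_entries
    ADD COLUMN utm_source_id SMALLINT UNSIGNED,
    ADD CONSTRAINT fk_utm_source
        FOREIGN KEY (utm_source_id) REFERENCES utm_sources (id);

-- Backfill, then drop the old column once readers have switched over.
UPDATE live_entries e
JOIN utm_sources s ON s.name = e.utm_source
SET e.utm_source_id = s.id;
```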
Takeaways
- Measure cardinality and length before designing a normalisation. A column that's nearly unique is a bad normalisation candidate regardless of how fat it looks.
- Compression is the cheapest win available for wide, read-heavy tables. One `ALTER TABLE`, zero application code changes, trivially reversible.
- Indexes often dominate the byte budget on wide tables. They also compress the best. Focus the savings story on the index side, not just the data.
- Pre-commit to the measurement step. Twenty minutes of `COUNT(DISTINCT …)` and `AVG(LENGTH(…))` before writing any migration is the best-value time you will spend this week.