This is a Python implementation of LZW (Lempel-Ziv-Welch) compression with different dictionary management strategies (what to do when the dictionary is full). It includes four base strategies (Freeze, Reset, LRU, LFU), two encoder-decoder synchronization approaches (bitstream signaling vs. symmetric/deferred addition), and an optional cascade deletion optimization.
- How LZW Works
- The Core Problem & Strategies
- The Encoder-Decoder Sync Problem
- Complexity Analysis
- Cascade Deletion
- Benchmarking
- Usage
- Repository Structure
LZW replaces repeated byte sequences with short codes. The encoder builds a dictionary of seen patterns, outputting a code whenever a match breaks. The decoder reconstructs the same dictionary by watching the pattern of codes. Both sides use variable-width codes (starting at min_bits, growing up to max_bits) to keep output compact.
Quick example with input "ababab" and alphabet {a:0, b:1}:
| Step | Phrase | Output | New Entry |
|---|---|---|---|
| 1-2 | "ab" not found | 0 (a) | 2:"ab" |
| 3 | "ba" not found | 1 (b) | 3:"ba" |
| 4-5 | "ab" found, "aba" not found | 2 (ab) | 4:"aba" |
| 6-EOF | "ab" found, end | 2 (ab) | (none) |
Output: [0, 1, 2, 2], which is just 4 codes instead of 6 characters. Compression!
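A minimal sketch of that loop in Python (simplified: no variable-width bit packing or EOF handling; `fwd` maps strings to codes, as in the implementations):

```python
def lzw_encode(data: str, alphabet: dict[str, int]) -> list[int]:
    """Minimal LZW encoder sketch: greedy longest match, plain code list output."""
    fwd = dict(alphabet)              # string -> code
    next_code = len(alphabet)
    output = []
    current = ""
    for char in data:
        combined = current + char
        if combined in fwd:
            current = combined        # keep extending the match
        else:
            output.append(fwd[current])   # match broke: emit the code
            fwd[combined] = next_code     # learn the new phrase
            next_code += 1
            current = char
    if current:
        output.append(fwd[current])       # flush the final match
    return output

# "ababab" with alphabet {a:0, b:1} -> [0, 1, 2, 2], as in the table above
print(lzw_encode("ababab", {"a": 0, "b": 1}))
```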
Once the dictionary hits 2^max_bits entries, you can't add more. What you do next determines compression quality on the rest of the file.
Freeze. Stop adding entries and compress the rest of the file with the existing dictionary. The simplest possible approach.
Strengths: Fastest, lowest memory, excellent on repetitive/uniform data where early patterns cover everything. Weaknesses: Cannot adapt after the dictionary fills. Compression degrades on long files with shifting content. Gets worse at mid-range max_bits (see Freeze Mid-Range Catastrophe).
Reset. When the dictionary fills, send a RESET_CODE signal, clear everything, reinitialize with just the alphabet, and start learning from scratch.
Strengths: Adapts to context shifts (archives, multi-section files). Nearly as fast as Freeze. Weaknesses: Loses all learned patterns on each reset. Bad on files with globally-common patterns.
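A sketch of the Reset policy at the point where a new entry would be added (the helper name and argument shapes are illustrative, not the repo's exact code; reset_code is N+1 per the bitstream format below):

```python
def add_or_reset(fwd, combined, next_code, alphabet, max_bits, output, reset_code):
    """Reset policy sketch: on a full dictionary, signal the decoder and relearn."""
    if next_code >= 2 ** max_bits:      # dictionary full
        output.append(reset_code)       # tell the decoder to clear as well
        fwd.clear()
        fwd.update(alphabet)            # back to just the alphabet
        next_code = len(alphabet) + 2   # skip EOF_CODE (N) and RESET_CODE (N+1)
    fwd[combined] = next_code
    return next_code + 1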
LRU. Evict the least recently used entry and reuse its code slot. Tracks recency with a doubly-linked list + hashmap for $\mathcal O(1)$ use, add, and evict operations.
Strengths: Smoothly adapts to shifting contexts without discarding the entire dictionary. Keeps recent patterns alive. Weaknesses: More complex, slower than Freeze/Reset, requires solving the encoder-decoder sync problem.
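For illustration, Python's OrderedDict gives the same $\mathcal O(1)$ behavior as the explicit doubly-linked list + hashmap pair described above (a stand-in sketch, not the repo's structure):

```python
from collections import OrderedDict

class LRUTracker:
    """O(1) recency tracking; a stand-in for the linked-list + hashmap pair."""
    def __init__(self):
        self.order = OrderedDict()    # code -> None, least recently used first

    def use(self, code):
        self.order.move_to_end(code)  # touched codes move to the MRU end

    def add(self, code):
        self.order[code] = None       # new entries start at the MRU position

    def evict(self):
        code, _ = self.order.popitem(last=False)  # pop from the LRU end
        return code                   # this code slot gets reused
```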
LFU. Evict the entry used the fewest times. Uses frequency buckets (a hashmap from frequency to a doubly-linked list) with LRU tie-breaking within each bucket.
Strengths: Preserves globally-common patterns. Weaknesses: Slowest of the four (heavier bookkeeping per use() call, roughly 20-25 operations vs ~8 for LRU); also requires solving the encoder-decoder sync problem, and has a structural disadvantage in LZW's dictionary-building pattern that LRU does not share.
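A sketch of the bucket structure (again a stand-in using standard-library containers; the evict-then-add comment reflects how LZW uses the tracker, where every eviction is immediately followed by an add):

```python
from collections import defaultdict, OrderedDict

class LFUTracker:
    """O(1) LFU sketch: frequency buckets with LRU tie-breaking inside each."""
    def __init__(self):
        self.freq_of = {}                        # code -> frequency
        self.buckets = defaultdict(OrderedDict)  # frequency -> codes (LRU order)
        self.min_freq = 1

    def add(self, code):
        self.freq_of[code] = 1
        self.buckets[1][code] = None
        self.min_freq = 1                        # new entries are instantly evictable

    def use(self, code):
        f = self.freq_of[code]
        del self.buckets[f][code]                # move code up one bucket
        if not self.buckets[f]:
            del self.buckets[f]
            if self.min_freq == f:
                self.min_freq = f + 1
        self.freq_of[code] = f + 1
        self.buckets[f + 1][code] = None

    def evict(self):
        # In LZW an add() always follows an evict(), which resets min_freq.
        code, _ = self.buckets[self.min_freq].popitem(last=False)
        del self.freq_of[code]
        if not self.buckets[self.min_freq]:
            del self.buckets[self.min_freq]
        return code
```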
Note: both LRU and LFU share the orphan problem (evicting a parent makes its children unreachable). That affects all eviction strategies equally and is addressed by cascade deletion. The issues below are specific to LFU's eviction policy.
LFU has two structural problems that interact badly with LZW:
Stale entry hoarding. An entry used 200 times in the first half of a file sits at freq=200, and while not literally unevictable, it is effectively stuck for a very long time. You would need to evict 200+ lower-frequency entries before this one becomes the victim. In practice, on most file sizes, high-frequency entries from early in the file occupy slots long after they stop being useful. LRU does not have this problem because unused entries naturally drift to the tail regardless of their history.
Chain killing. LZW builds entries incrementally: "ab" (freq=1) gets matched, creating "abc" (freq=1), which gets matched, creating "abcd" (freq=1). Each new entry starts at freq=1 and is immediately among the lowest-frequency entries in the dictionary. In a full dictionary, these new entries are prime eviction candidates before they ever get a chance to be matched. LRU avoids this because new entries start at the MRU position and get a full cycle through the entire list before becoming evictable. This gives them time to actually be matched and extended.
These are inherent properties of LFU as a caching policy, sometimes called "cache pollution." This is well known in caching systems generally (Redis, for example, uses LFU with time-decay on frequencies to work around it). Vanilla LFU will always have these issues. A potential fix would be adding frequency decay (entries lose frequency over time when not used), but this breaks the $\mathcal O(1)$ bucket structure; see the ideas at the end.
In benchmarks, LRU beats LFU on most files. LFU only wins on data with a truly stable vocabulary from start to finish (e.g., synthetic English-like text with a fixed word list).
LRU and LFU share a fundamental challenge: the encoder and decoder are one step out of sync. The encoder adds an entry as soon as it sees the character that breaks the current match, while the decoder can only reconstruct that entry one code later, after the next output arrives. With Freeze or Reset this timing gap is harmless, but once evictions reuse code slots, the two sides can disagree about what a code means.
This repo implements two independent solutions:
When I first started implementing LRU eviction, bitstream signaling was the only approach I could think of. The encoder tracks evicted codes in a dictionary. When eviction happens, the code slot is reassigned to a new entry and the old-to-new mapping is recorded. Later, if the encoder needs to output that code (now pointing to a different entry than the decoder expects), it first sends an EVICT_SIGNAL in the bitstream telling the decoder "code X now means Y", then outputs the code normally. The decoder has no LRU/LFU tracker of its own and just follows instructions.
The initial implementation sent the full dictionary entry at every eviction. This was a proof-of-concept only; files expanded massively. From there, I optimized in stages:
Evict-Then-Use Detection. Not all evictions need a signal. The decoder only breaks if the encoder outputs a code whose slot was reassigned. If a code is evicted but never output again, the decoder never notices. This is surprisingly rare (~10-30% of evictions trigger an output of the evicted code). Result: 70-90% reduction in signals.
Output History + Offset/Suffix. Compress the signal itself. The evicted entry's prefix is almost always in recent output history. Instead of sending the full string, send a 1-byte offset into a circular buffer of the last 255 outputs plus a 1-byte suffix character. The old format used ~123 bits while the new format used ~34 bits, which is a 72% reduction per signal. Fallback to full entry if prefix is not in history (never observed in practice).
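A sketch of that offset/suffix encoding (names are illustrative; `history` holds the last 255 emitted strings, newest last). This shows the linear-scan version that the HashMap Lookup stage below replaces:

```python
def build_evict_signal(evicted_entry: str, history: list[str]):
    """Offset/suffix signal sketch: describe an evicted entry via recent output.

    Returns (offset, suffix) for the compact ~2-byte form, or None when the
    prefix is not in history (fall back to sending the full entry).
    """
    prefix, suffix = evicted_entry[:-1], evicted_entry[-1]
    for offset in range(1, len(history) + 1):   # 1 = most recent output
        if history[-offset] == prefix:
            return offset, suffix
    return None
```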
HashMap Lookup. The output history prefix lookup was originally a linear scan over the circular buffer; replacing it with a hashmap from each string to its most recent offset makes signal construction $\mathcal O(1)$.
Only the final optimized version is included in the repository. The earlier versions are trivial to implement from these descriptions.
Pros: Fast decompression (decoder is simple). Cons: EVICT_SIGNAL adds overhead to the bitstream. Requires output history tracking on the encoder side.
Both encoder and decoder run identical dictionary code. The encoder "defers" its entry addition by one step to match the decoder's natural timing. Both sides add the same entry at the same time, perform the same LRU/LFU operations in the same order, and therefore always agree on evictions.
The key insight is that the entry the encoder would normally add at step N (current_match + next_char) is the same string the decoder computes at step N+1 (prev_output + current_output[0]). By making the encoder defer one step and use the decoder's formula, both sides stay perfectly synchronized.
Pros: Zero bitstream overhead (no EVICT_SIGNAL, no EVICT_SIGNAL code reservation). Simpler encoder (no output history, no evicted_codes tracking). Better compression. Cons: Slower decompression (decoder must run its own LRU/LFU tracker). Entries become available one step later (negligible impact in practice).
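For concreteness, a sketch of one deferred-addition step, run identically by encoder and decoder (`fwd`, `rev`, and the tracker API follow the sketches above and are assumptions, not the repo's exact code):

```python
def symmetric_add(fwd, rev, tracker, prev_output, current_output, next_code, max_bits):
    """One deferred-addition step, executed identically on both sides.

    prev_output / current_output are the strings behind the previous and
    current codes; the new entry uses the decoder's formula, so encoder and
    decoder add (and evict) in lockstep. Sketch only: the special LZW case
    where a code refers to the entry being defined is not shown.
    """
    if prev_output is None:
        return next_code                      # first code of the stream
    entry = prev_output + current_output[0]   # decoder's formula, used by both sides
    if entry in fwd:
        return next_code                      # already present: nothing to add
    if next_code >= 2 ** max_bits:            # full: both sides evict the same victim
        code = tracker.evict()
        del fwd[rev[code]]                    # free the old string for this slot
    else:
        code = next_code
        next_code += 1
    fwd[entry] = code
    rev[code] = entry
    tracker.add(code)
    return next_code
```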
Across all benchmarks, symmetric beats bitstream on compression ratio in 105 out of 160 tests (LRU) and 99 out of 160 (LFU). Bitstream LRU never wins a single test outright. Bitstream LFU wins exactly 3 tests, all by margins under 2.5 percentage points. The EVICT_SIGNAL overhead is simply too costly.
Symmetric is also ~15% faster to compress than bitstream (no output history management), though ~1.8x slower to decompress (decoder runs its own tracker). If decompression speed is critical, bitstream's dumb-decoder approach has an edge, but the compression ratio gap is hard to justify.
The deferred/symmetric approach could also be applied to the Reset strategy, eliminating the RESET_CODE signal. Both sides would independently detect "dictionary full" and reset simultaneously. However, Reset's RESET_CODE costs exactly one codeword (trivial overhead), and the sync logic is already dead simple. The complexity savings from eliminating it are near-zero, unlike LRU/LFU where EVICT_SIGNAL is frequent and expensive. Not worth the effort.
All implementations here use Python's string-keyed dict for the LZW dictionary. The greedy matching loop does:

```python
combined = current + char   # O(L) string concat
if combined in fwd:         # O(L) hash
    current = combined
```

A match of length $L$ therefore pays $\mathcal O(L)$ at every extension step, $\mathcal O(L^2)$ overall, where a trie would pay $\mathcal O(1)$ per byte. Alternatives I considered:
- Trie with generation counters. Store the dictionary as parent-child edges, one edge traversal per byte. Generation counters invalidate stale edges after eviction. I tested this but will not include it in the final repo (pretty straightforward to implement, and it doesn't really offer any advantages).
- Trie with cascade deletion. Same structure, but recursively evict all descendants when a node is evicted. Amortized $\mathcal O(n)$.
- Array-based trie. Replace the hashmap at each node with a fixed 256-pointer array indexed by byte value. No hashing, just array indexing. Memory-hungry but cache-friendly.
- Double-array trie. Compact trie using two parallel arrays (base + check). Memory-efficient, cache-friendly, common in production systems. More complex to implement with eviction.
In pure Python, none of these wins: the string-keyed dict pushes all the work into CPython's optimized C internals, and that constant-factor advantage dominates at practical file sizes.
The crossover point where the trie's better asymptotics would pay off never appeared at the file sizes I tested.
Benchmark of symmetric LRU (trie vs string dict, 200KB repetitive text, max_bits=16, avg of 5 runs):
| Version | Compress | Decompress | Compressed Size (bytes) |
|---|---|---|---|
| String dict | 0.128s | 0.073s | 44,502 |
| Trie | 0.194s | 0.108s | 44,504 |
When evicting an entry like "ab" from the dictionary, any entries that extend it ("abc", "abd", etc.) become unreachable, because LZW's greedy matching builds strings incrementally. The encoder constructs "a" + "b" = "ab" before it can ever try "abc". If "ab" is gone, the lookup fails and the encoder never even constructs "abc" to look it up, even though "abc" still exists in the dictionary.
These orphaned entries waste dictionary slots until they naturally age out as LRU/LFU victims.
Cascade deletion fixes the orphan problem: when evicting an entry, also evict all its unreachable descendants, freeing their slots for immediate reuse.
Two extra data structures: children_of (maps each entry to the set of its child entries) and free_codes (freed code slots available for reuse). When evicting entry X, iteratively evict all children of X, then all children of those children, etc. Use freed slots for new entries before allocating new code numbers.
The time complexity is amortized $\mathcal O(1)$ per eviction ($\mathcal O(n)$ total): each entry can be cascade-evicted at most once, so the total cascade work is bounded by the number of entries ever created.
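A sketch of the cascade itself (assuming `children_of` maps a code to the set of codes whose entries extend it by one character, and that the tracker supports an $\mathcal O(1)$ remove; names follow the description above):

```python
def cascade_evict(victim, fwd, rev, tracker, children_of, free_codes):
    """Evict a code and every entry that extends it, freeing all their slots."""
    stack = [victim]
    while stack:
        code = stack.pop()
        stack.extend(children_of.pop(code, ()))  # descendants are unreachable too
        entry = rev.pop(code)
        del fwd[entry]
        tracker.remove(code)                     # assumed O(1) delete from LRU/LFU
        free_codes.append(code)                  # slot reusable before allocating new codes
```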
Benchmark of symmetric LRU (string-keyed dict, 200KB repetitive text, avg of 5 runs):
| max_bits | No Cascade (bytes) | Cascade (bytes) | Size Delta | Time Delta |
|---|---|---|---|---|
| 10 | 62,791 | 58,456 | -6.9% | +22% |
| 12 | 55,919 | 52,433 | -6.2% | +27% |
| 16 | 44,502 | 44,502 | 0% | +17% |
On binary-like data (200KB, max_bits=10), the improvement is even larger: -12.1% size for +18% time.
At max_bits=16, the dictionary never fills at these file sizes, so cascade does nothing but add overhead: identical output, 15-35% slower. Cascade is worth it when the dictionary is small and under eviction pressure; at typical max_bits=16 with moderate file sizes, it is pure overhead.
The included cascade implementation is built on the symmetric LRU approach (LZW-Cascade(LRU-Symmetric).py). The same pattern (add children_of tracking, free_codes list, and a cascade_evict function) applies to any eviction-based strategy, such as bitstream LRU, symmetric LFU, bitstream LFU, &c. Only one example is included to avoid redundant code.
Full results from testing 9 implementations across 20 files at 8 max_bits settings (9 through 16).
Symmetric-LRU beats bitstream-LRU in 105/160 tests, averaging 84.0% vs 117.0% compression ratio. Bitstream-LRU never wins a single test outright. The EVICT_SIGNAL overhead is simply too large, especially on high-entropy data with frequent evictions.
Symmetric also beats bitstream on LFU (99 wins vs 47), though the gap is smaller (~8% average), likely because LFU's eviction pattern triggers fewer evict-then-use events (frequency-based eviction tends to evict entries that are less likely to be immediately reused).
| max_bits | Freeze wins | Reset wins | Symmetric wins | Bitstream wins |
|---|---|---|---|---|
| 9 | 20% | 5% | 65% | 10% |
| 10 | 20% | 45% | 30% | 5% |
| 12 | 30% | 35% | 35% | 0% |
| 14 | 50% | 0% | 50% | 0% |
| 16 | 65% | 5% | 30% | 0% |
At small dictionaries (max_bits 9-11), eviction strategies dominate because the dictionary fills quickly and recycling entries matters. At large dictionaries (max_bits 14-16), Freeze dominates because the dictionary rarely fills and eviction overhead has no benefit.
Reset peaks at max_bits 10-11 (45% wins), occupying a narrow niche where the dictionary fills occasionally and complete resets are cheaper than per-entry eviction.
| File Size | Dominant Strategy |
|---|---|
| Under 10 KB | Freeze (88%): dictionary never fills |
| 10-100 KB | Freeze (70%) |
| 100 KB-1 MB | Three-way tie: Freeze, Reset, Symmetric-LFU (~27% each) |
| Over 1 MB | Symmetric-LFU (38%), Reset (31%), Symmetric-LRU (28%) |
Freeze never wins on files over 1 MB. On large files the dictionary always fills, and Freeze's inability to adapt is fatal.
Freeze gets worse as you increase max_bits from 9 to ~13, then recovers:
| max_bits | bed.jpg (Freeze) |
|---|---|
| 9 | 111.7% |
| 11 | 133.0% |
| 13 | 146.4% (worst) |
| 16 | 125.3% |
This pattern also appears on nine other high-entropy files. At small max_bits, the dictionary fills fast and codewords don't take many bits. At mid-range max_bits, the dictionary takes longer to fill and codewords use more bits. At large max_bits, the dictionary is big enough to be genuinely useful. The mid-range is the worst of both worlds.
The largest test file showcases strategy differences clearly:
| Strategy | max_bits 9 | max_bits 12 | max_bits 16 |
|---|---|---|---|
| Symmetric-LRU | 60.2% | 58.1% | 57.1% |
| Symmetric-LFU | 85.4% | 78.5% | 73.7% |
| Reset | 65.2% | 58.9% | 58.5% |
| Freeze | 110.3% | 138.4% | 78.5% |
| Bitstream-LRU | 96.3% | 114.3% | 127.1% |
Symmetric-LRU wins at every dictionary size. Bitstream-LRU gets worse as dictionary size grows (signal overhead per eviction increases). Freeze is terrible at small dictionaries but catches up at max_bits=16.
For compression speed (avg of 5 runs):
| Strategy | Relative to Freeze |
|---|---|
| Reset | 1.03x |
| Symmetric-LRU | 1.65x |
| Symmetric-LFU | 1.76x |
| Bitstream-LFU | 1.89x |
| Bitstream-LRU | 1.92x |
Symmetric is ~15% faster to compress than bitstream since there is no output history management.
For decompression speed (avg of 5 runs):
| Strategy | Relative to Freeze |
|---|---|
| Reset | 1.17x |
| Bitstream-LFU | 1.14x |
| Bitstream-LRU | 1.29x |
| Symmetric-LRU | 2.51x |
| Symmetric-LFU | 2.91x |
The big gap is between bitstream and symmetric. Bitstream decoders are only ~15-30% slower than Freeze (they just do dict lookups + handle the occasional signal). Symmetric decoders are 2.5-3x slower because they run a full LRU/LFU tracker on every single codeword to stay in sync with the encoder.
Symmetric-LRU beats symmetric-LFU on most files. LFU only wins on data with a stable vocabulary from start to finish. See Why LFU Underperforms LRU for the structural explanation.
| Scenario | Recommendation |
|---|---|
| Repetitive/uniform data | Freeze |
| Archives, multi-section files | Reset or Symmetric-LRU |
| Large text files with stable vocabulary | Symmetric-LFU |
| Context-shifting data | Symmetric-LRU |
| Small dictionary with eviction pressure | Symmetric-LRU + cascade |
| Decompression speed critical | Freeze or bitstream |
| Simplicity | Freeze or Reset |
In practice, Freeze and Reset cover most use cases well. The eviction strategies (LRU/LFU) shine on large files and small dictionaries, but add nontrivial complexity.
```
python [strategy].py compress --alphabet [alphabet] --min-bits [num] --max-bits [num] [input] [output]
python [strategy].py decompress [input] [output]
```

Testing is generally done by round-trip difference checking:

```
# Compress
python [strategy].py compress --alphabet [alphabet] --min-bits [num] --max-bits [num] [input] [output]

# Decompress
python [strategy].py decompress [input] [output (restored)]

# Verify
diff [input] [output (restored)]  # Should be identical
```

Minimal max-bits settings let the eviction policies actually do their job so we can observe their behavior. This of course relies on the eviction strategies and the internal compress/decompress logic being implemented fully correctly, which hopefully they are. I did a lot of debugging, but I'm not perfect, so there may still be some errors. I hope not though.
| Name | Size | Description |
|---|---|---|
| `ascii` | 128 | Standard ASCII (0-127) |
| `extendedascii` | 256 | Extended ASCII (0-255) |
| `ab` | 2 | Binary alphabet (testing) |
Add custom alphabets in the ALPHABETS dict at the top of each file.
For real files: --alphabet extendedascii --min-bits 9 --max-bits 16
For testing eviction behavior: use small --max-bits to force the dictionary to fill quickly, for example, --alphabet ab --min-bits 3 --max-bits 3
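For example, adding a hypothetical hex alphabet (this assumes ALPHABETS maps a name to a list of single-character symbols; check the actual shape at the top of each file):

```python
ALPHABETS = {
    "ascii": [chr(i) for i in range(128)],
    "extendedascii": [chr(i) for i in range(256)],
    "ab": ["a", "b"],
    "hex": list("0123456789abcdef"),  # hypothetical custom alphabet, size 16
}
```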
```
LZW-Freeze.py
LZW-Reset.py
LZW-Cascade(LRU-Symmetric).py    -- Cascade deletion example
BitstreamEncoding/
    LZW-LRU-Bitstream.py         -- EVICT_SIGNAL approach (hashmap)
    LZW-LFU-Bitstream.py         -- Based on LRU Bitstream
DeferredSymmetric/
    LZW-LRU-Symmetric.py         -- Deferred addition, no signals
    LZW-LFU-Symmetric.py         -- Based on LRU Symmetric
```
```
Header: [min_bits: 8b] [max_bits: 8b] [alphabet_size: 16b] [alphabet: 8b each]
Body:   Variable-width codewords (min_bits to max_bits)
Footer: [EOF_CODE at current bit width]
```
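A sketch of writing that header with `struct` (field order from the layout above; big-endian packing is an assumption, so check the actual files):

```python
import struct

def write_header(out, min_bits: int, max_bits: int, alphabet: list[str]) -> None:
    # [min_bits: 8b] [max_bits: 8b] [alphabet_size: 16b] [alphabet: 8b each]
    out.write(struct.pack(">BBH", min_bits, max_bits, len(alphabet)))
    out.write(bytes(ord(c) for c in alphabet))
```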
Special codes vary by strategy:
| Code | Value | Used By |
|---|---|---|
| Alphabet | 0 to N-1 | All |
| EOF_CODE | N | All |
| RESET_CODE | N+1 | Reset only |
| EVICT_SIGNAL | 2^max_bits - 1 | Bitstream LRU/LFU only |
Symmetric strategies use no special codes beyond EOF_CODE. All code slots are available for dictionary entries.
- Frequency-decay LFU: instead of raw counts, decay frequencies over time to fix the stale-hoarding problem. Turns LFU into a recency-weighted frequency tracker (similar to what Redis does). Breaks the $\mathcal O(1)$ bucket structure; likely needs a heap for $\mathcal O(\log n)$ eviction.
- Cascade deletion for all strategies, since the included cascade example is on symmetric LRU. The same approach (track `children_of`, `free_codes`, and a `cascade_evict` function) applies to any eviction-based strategy.
- There are lots of other cache management strategies, and any of them is a potential eviction policy for the dictionary (e.g. FIFO, RR, TTL, even MRU, &c).
- Hybrid strategies, like adaptive switching between LFU and LRU, or between pruning unused entries (not LFU) and a full Reset. Stuff like this could definitely work better than a single unchanging strategy.
- Adaptive switching in this case could be done by monitoring compression ratios: if the ratio gets bad, switch or reset.
- You could run all strategies and take the lowest compression ratio? idk
- This is a stupid project that took up literally a week of my time for nothing, i fucking hate compression LZW can go fuck itself. 🖕🖕🥱🥱🥱🥱🥱
- Hi wintermute