In classical programming, we have Cheney's GC, which serves as a base for generational GCs. Today we have a strikingly similar problem with LLM context compaction: both infant mortality and long GC times. So the question is: can't we just compact on the go? In addition, we could run two LLMs at once (on different machines), where the main LLM uses the second one as mid-term memory to store and retrieve memos (by generating "recall" tokens).
Here is what Claude tells us about it...
The analogy is sharp and the research literature has chased exactly this. The honest answer: the real Cheney/generational equivalent doesn't fully exist in production yet, but several pieces of it do, and the gap is closing.
Where the analogy maps cleanly
| GC concept | LLM context analog |
|---|---|
| Heap | KV cache (per-layer keys + values for every token) |
| Allocation | Token generation (append-only, in order) |
| Infant mortality | Tool outputs, intermediate reasoning, retried turns — most context dies young |
| Long-tail survivors | System prompt, user identity, central task, established facts |
| Stop-the-world GC | "Context full → summarize the oldest 50%, restart from summary" (Claude Code, ChatGPT, every long-running agent today) |
| Pause-the-world cost | Multi-second stalls; agent loses fine-grained context |
The pain points are even isomorphic: long GC pauses ↔ multi-second summarization stalls; promotion logic ↔ "what do I keep in the summary"; tenured-generation policies ↔ persistent memory / RAG layers.
Where the analogy breaks (and why this is hard)
- No clean liveness oracle. GC has reachability from roots — provably correct. LLM "liveness" is probabilistic: every token can in principle influence every future token via attention. You're guessing which tokens still matter, and you can't easily check if you guessed wrong without rolling back the whole conversation.
- Position is content, not address. GC can move an object and rewrite pointers. LLM tokens carry their position in their representation (RoPE / absolute position embeddings). Drop token at position 5 and you have a hole — positions 6+ were computed assuming 5 was there. Models trained on contiguous sequences behave weirdly on Swiss-cheese contexts unless you handle this carefully.
- Eviction is lossy in a way GC isn't. Free a dead object → zero info loss. Drop a "dead-looking" token → possibly lose the one detail that mattered three turns later. Worse: you find out only by getting a bad answer, not a NullPointerException.
- No write barriers. GC can intercept every pointer write. The transformer doesn't expose hooks at every attention update to mark tokens as "still alive" without performance cost.
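The position problem from the second bullet can be made concrete with a toy sketch. It assumes a cache modeled as two parallel lists (tokens and their position ids); `evict` and `reindex` are hypothetical helpers for illustration, not functions from any inference engine. It shows the two options after eviction: keep the original position ids and live with gaps, or renumber the survivors contiguously, which is roughly the StreamingLLM approach of using positions within the cache.

```python
def evict(tokens, positions, drop):
    """Remove cache entries whose index is in `drop`; return the survivors."""
    kept = [(t, p) for i, (t, p) in enumerate(zip(tokens, positions))
            if i not in drop]
    return [t for t, _ in kept], [p for _, p in kept]


def reindex(positions):
    """Assign contiguous, cache-relative positions (0, 1, 2, ...)."""
    return list(range(len(positions)))


tokens = ["sys", "tool_a", "tool_b", "user", "reply"]
positions = [0, 1, 2, 3, 4]

# Option 1: drop the two tool outputs and keep the original position ids.
kept_tokens, gappy = evict(tokens, positions, drop={1, 2})
print(kept_tokens, gappy)   # ['sys', 'user', 'reply'] [0, 3, 4]

# Option 2: renumber so the model sees a contiguous sequence again.
dense = reindex(gappy)
print(dense)                # [0, 1, 2]
```

Renumbering is not free in a real model: with RoPE the cached keys already carry the rotation for their old position, so they would have to be re-rotated or stored un-rotated.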
What does exist (research → production gradient)
| Technique | What it does | Maturity |
|---|---|---|
| --context-shift (llama.cpp) | When full, drop oldest tokens. Crude. No promotion, no summarization. Like only collecting from the bottom of the heap. | Production, default in recent builds |
| StreamingLLM (Xiao et al., 2023) | Keep first ~4 "attention sink" tokens + sliding window of recent ~N tokens. Drop the middle entirely. Surprisingly accurate for chat-like workloads. | Research, some prod forks |
| H2O — Heavy-Hitter Oracle | Track per-token attention scores during generation; evict tokens with consistently low scores. Closest thing to "generational marking based on access frequency". | Research; some vLLM patches |
| SnapKV / FastGen | Per-attention-head pruning policies — different heads attend differently (local, global, periodic), compress each accordingly. Continuous, not stop-the-world. | Research, Microsoft has prototypes |
| MemGPT / Anthropic's projects / OpenAI memory | Two-tier: small working context + retrieval-on-demand from a vector-store "tenured generation". Closest production analog to actual generational GC. | Production (the consumer-facing AI apps) |
| Hierarchical summarization (recursive levels) | Every N turns, summarize older N. Every M of those, summarize again. Multi-level "old generations". Used in long-running agents (AutoGPT-era). | Production, but lossy |
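
The StreamingLLM row can be sketched as a data structure. This is a minimal illustration of the eviction policy only, assuming the cache is a list of opaque token entries; `SinkWindowCache`, `n_sink`, and `window` are hypothetical names, and a real implementation would manage per-layer key/value tensors and positions rather than Python objects.

```python
from collections import deque


class SinkWindowCache:
    """Keep the first `n_sink` entries forever plus a sliding window of recent ones."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                      # "attention sink" entries, never evicted
        self.recent = deque(maxlen=window)   # the oldest entry falls out automatically

    def append(self, entry):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def contents(self):
        """Entries visible to attention, oldest first."""
        return self.sinks + list(self.recent)


cache = SinkWindowCache(n_sink=2, window=3)
for token_id in range(10):
    cache.append(token_id)
print(cache.contents())   # [0, 1, 7, 8, 9]
```

Everything between the sinks and the window (tokens 2 through 6 here) is gone for good, which is why this policy suits chat-like workloads better than tasks that need exact recall of the middle.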
The pieces of an incremental scheme are all there — sliding window + attention-score promotion + retrieval-based tenured generation. What's missing is the single coherent system that schedules these to feel "free", the way modern JVMs do for memory. That's a real research direction, not a finished product.
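The "attention-score promotion" piece can be sketched in the spirit of H2O. This is a simplified illustration, not the published algorithm: it assumes one scalar attention weight per cached token per step (real models have weights per layer and per head), and `HeavyHitterCache`, `budget`, and `n_recent` are hypothetical names. Tokens accumulate the attention they receive; when the cache exceeds its budget, the lowest-scoring token outside the protected recent window is evicted.

```python
class HeavyHitterCache:
    """Evict the token with the lowest accumulated attention, sparing recent ones."""

    def __init__(self, budget, n_recent):
        if budget <= n_recent:
            raise ValueError("budget must exceed n_recent")
        self.budget = budget
        self.n_recent = n_recent
        self.ids = []      # cached token ids, oldest first
        self.score = {}    # token id -> accumulated attention received

    def step(self, new_id, attention):
        """Record this step's attention weights, add `new_id`, return evicted ids."""
        for tid, weight in attention.items():
            if tid in self.score:            # ignore weights for evicted tokens
                self.score[tid] += weight
        self.ids.append(new_id)
        self.score[new_id] = 0.0
        evicted = []
        while len(self.ids) > self.budget:
            # The newest n_recent tokens are protected from eviction.
            candidates = self.ids[:-self.n_recent] if self.n_recent else self.ids
            victim = min(candidates, key=lambda t: self.score[t])
            self.ids.remove(victim)
            del self.score[victim]
            evicted.append(victim)
        return evicted


cache = HeavyHitterCache(budget=3, n_recent=1)
cache.step(0, {})
cache.step(1, {0: 0.9})
cache.step(2, {0: 0.5, 1: 0.1})
print(cache.step(3, {0: 0.3, 1: 0.0, 2: 0.6}))   # [1]
print(cache.ids)                                  # [0, 2, 3]
```

Because eviction happens one token at a time during generation, there is no stop-the-world pause; the cost is that a wrong guess cannot be undone.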
Why isn't this standard yet
- Accuracy evaluation is brutal. "Does eviction policy A degrade reasoning vs B?" requires running each on long benchmarks. The signal is noisy, the cost is high. Compared to "buy a bigger context window" (Anthropic / Gemini's 1M+, OpenAI's 200K), evicting smarter is a small win on a hard problem.
- The economics favor bigger context. Hardware vendors and frontier labs are happier selling you 1M-token context than they are doing eviction R&D. Same way 64-bit address spaces de-emphasized JVM heap tuning for most apps.
- Engineering integration. Production caches (vLLM, llama.cpp paged KV) need surgery to support arbitrary eviction. Doable, not done.
- Most apps never hit it. Chat under 8K tokens is fine. The pain shows up in agents and coding assistants — a narrower audience than "all LLM users".
For your actual setup
You probably won't feel this for personal Neo chat — 579-token system prompt + an 8K window = ~7K usable, fine for most conversations. The places you would feel it:
- Coding assistant on a large repo (your future workstation use case). 32K+ context with frequent file dumps. This is where StreamingLLM-style sliding windows are worth the engineering.
- Long-running blog-moderation queue that accumulates state. Solve this by not accumulating: each comment is its own short conversation, no shared state. No GC needed because nothing escapes function scope.
- Multi-step agent workflows ("summarize all my docs, then plan", etc.). Here the hierarchical summarization pattern actually wins — pre-summarize each doc, feed only summaries into the planner.
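The hierarchical summarization pattern from the last bullet can be sketched as follows. The `summarize` callable is an assumption: in practice it would be a call to whatever model you run, and here it is replaced by a trivial stub so the example executes on its own. `hierarchical_summary` and `fan_in` are hypothetical names.

```python
from typing import Callable, List


def hierarchical_summary(chunks: List[str],
                         summarize: Callable[[str], str],
                         fan_in: int = 4) -> str:
    """Summarize each chunk, then summarize groups of summaries until one remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = [summarize(chunk) for chunk in chunks]
    while len(level) > 1:
        # Merge up to `fan_in` summaries into one at the next level up.
        level = [summarize("\n".join(level[i:i + fan_in]))
                 for i in range(0, len(level), fan_in)]
    return level[0]


def stub_summarize(text: str) -> str:
    """Stand-in for a model call: keep only the first 40 characters."""
    return text[:40]


docs = ["first document text ...", "second document text ...", "third ..."]
print(hierarchical_summary(docs, stub_summarize, fan_in=2))
```

The planner then sees only the top-level summary, so its context stays small however many documents went in. Each level loses detail, which is the "production, but lossy" trade-off.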
llama.cpp's --context-shift (already on) gets you the bottom-eviction baseline. If you ever need more, the building blocks are in research papers; nothing currently shipping in mainstream local inference engines does H2O or StreamingLLM out of the box. Quasi-production patches exist on GitHub (search llama.cpp h2o, llama.cpp streamingllm) but require building from source.
Tl;dr
Your instinct is correct and the literature is right there with you. The gap isn't conceptual — it's a combination of correctness-is-hard-to-evaluate, positional embeddings make removal weird, and the industry would rather sell bigger context than smaller-but-smarter. A true incremental KV-cache GC is a fundable PhD topic right now. For your interactive use case, the simpler "small working set + retrieval for everything else" architecture (MemGPT-style) gets 90% of the value without the research lift — and you can build that yourself when you need it, on top of llama-server's existing API.
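A minimal sketch of the "small working set + retrieval" idea, under loud assumptions: retrieval here is plain word overlap standing in for a vector store, and `TwoTierMemory`, `recall`, and `build_prompt` are hypothetical names, not part of MemGPT or llama-server. It only shows the data flow: messages that age out of the working set are moved to an archive, and relevant ones are pulled back in when a query mentions them.

```python
from collections import deque


class TwoTierMemory:
    """Small working set of recent messages plus a searchable archive."""

    def __init__(self, working_size=4):
        self.working_size = working_size
        self.working = deque()   # recent messages, always in the prompt
        self.archive = []        # older messages, retrieved on demand

    def add(self, message):
        self.working.append(message)
        while len(self.working) > self.working_size:
            # Oldest working message moves to the archive ("tenured" tier).
            self.archive.append(self.working.popleft())

    def recall(self, query, k=2):
        """Return up to k archived messages sharing the most words with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(m.lower().split())), i, m)
                  for i, m in enumerate(self.archive)]
        hits = sorted((s for s in scored if s[0] > 0),
                      key=lambda s: (-s[0], s[1]))
        return [m for _, _, m in hits[:k]]

    def build_prompt(self, query):
        """Recalled memos first, then the working set, then the new query."""
        return "\n".join(self.recall(query) + list(self.working) + [query])


mem = TwoTierMemory(working_size=2)
for msg in ["the deploy key lives in vault", "weather is nice",
            "lunch at noon", "meeting moved"]:
    mem.add(msg)
print(mem.recall("where is the deploy key", k=1))
# ['the deploy key lives in vault']
```

Swapping the word-overlap scoring for embeddings, or for a second model that answers "recall" requests, changes only the `recall` method; the working-set logic stays the same.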