Lesson 06 · Memory

Context filling up? Learn to compact.

"The agent can forget strategically and keep working forever." Strategic forgetting is an engineering capability.

⏱ ~12 min · 📝 3 interactive widgets · 🧑‍💻 Based on shareAI-lab · s06_context_compact.py

Why compact at all?

As an agent runs, messages[] balloons: each read_file returns thousands of tokens, each bash hundreds, plus the model's reasoning text every turn. After 50 turns, context can easily reach 100K+. Two consequences:

  • Hitting the model limit: you crash at the context window, or every API call costs linearly more.
  • Attention dilution: the current task drowns in irrelevant tool_results from 30 turns ago and the model starts drifting.

s06's approach: let the agent proactively forget unimportant content while preserving critical state. Three layers, lightest to heaviest.

Layer 1 · micro_compact (runs silently every turn)

The cheapest layer. Runs before every LLM call, replacing tool_results older than the most recent 3 with a placeholder:

# From turn 10 onward, most tool_results become:
{
  "type": "tool_result",
  "tool_use_id": "toolu_01A",
  "content": "[Previous: used bash]"   # shrunk from thousands to tens of chars
}

One exception: read_file results are never compressed. Why? Because read output is reference material - compressing it forces the model to re-read the file, which costs more than keeping it.

PRESERVE_RESULT_TOOLS = {"read_file"}  # never compressed

Watch micro_compact age old results turn by turn

Step through 10 simulated turns, running micro_compact before each one. Watch old tool_results become [Previous: ...] while the most recent 3 stay intact.

Layer 2 · auto_compact (triggered at a threshold)

Even with micro running continuously, accumulated context will eventually blow up. s06 sets a threshold (default 50,000 tokens):

  1. Estimate token count: len(str(messages)) // 4 (rough but good enough).
  2. Over threshold? Write the full transcript to .transcripts/transcript_TIMESTAMP.jsonl (for recovery).
  3. Ask the LLM to summarize the entire conversation.
  4. Replace the entire messages list with a single "[compressed] SUMMARY..." entry.

The trade-off is obvious - you lose specific tool outputs and conversational tone, retaining only an outline. But the agent can keep going, which is the core benefit.

Layer 3 · the model calls the compact tool itself

auto_compact is triggered by the harness without the model knowing. Layer 3 flips this: give the model a compact tool and let it actively request compression - for instance when it decides the earlier exploration is no longer useful and a new phase is beginning.

The model calls:

tool_use("compact", focus="keep the API design decisions")

This triggers the same process as auto_compact, but can carry a focus parameter telling the summary what to prioritize. Extremely useful in practice - the model knows which sub-tasks are "finished", making it a better judge than the harness heuristic.

Which layer fits? Judgment calls

Given the scenarios below, decide which is the most appropriate trigger: micro / auto / manual.

Interactive

Widget 1 · Micro Compact · watch tool_results age across turns

Click Step to advance. Old tool_results become [Previous: used X] while the most recent 3 remain intact. read_file results are never compressed (highlighted green).

Turn: 0 · Tokens: ~0
Interactive

Widget 2 · Threshold Simulator · which layer activates as tokens rise

Drag the slider to change the token estimate and see which of the three layers becomes active.

3000
Interactive

Widget 3 · Which layer fits · 6 scenario judgment calls

For each scenario, choose micro / auto / manual. Think about when each layer is most appropriate.

Correct: 0 / 6