Context filling up? Learn to compact.
"The agent can forget strategically and keep working forever." Strategic forgetting is an engineering capability.
Why compact at all?
As an agent runs, messages[] balloons: each read_file returns thousands of tokens, each bash hundreds, plus the model's reasoning text every turn. After 50 turns, context can easily reach 100K+. Two consequences:
- Hitting the model limit: you crash at the context window, or every API call costs linearly more.
- Attention dilution: the current task drowns in irrelevant tool_results from 30 turns ago and the model starts drifting.
s06's approach: let the agent proactively forget unimportant content while preserving critical state. Three layers, lightest to heaviest.
Layer 1 · micro_compact (runs silently every turn)
The cheapest layer. Runs before every LLM call, replacing tool_results older than the most recent 3 with a placeholder:
# From turn 10 onward, most tool_results become: { "type": "tool_result", "tool_use_id": "toolu_01A", "content": "[Previous: used bash]" # shrunk from thousands to tens of chars }
One exception: read_file results are never compressed. Why? Because read output is reference material - compressing it forces the model to re-read the file, which costs more than keeping it.
PRESERVE_RESULT_TOOLS = {"read_file"} # never compressed
Watch micro_compact age old results turn by turn
Step through 10 simulated turns, running micro_compact before each one. Watch old tool_results become [Previous: ...] while the most recent 3 stay intact.
Layer 2 · auto_compact (triggered at a threshold)
Even with micro running continuously, accumulated context will eventually blow up. s06 sets a threshold (default 50,000 tokens):
- Estimate token count:
len(str(messages)) // 4(rough but good enough). - Over threshold? Write the full transcript to
.transcripts/transcript_TIMESTAMP.jsonl(for recovery). - Ask the LLM to summarize the entire conversation.
- Replace the entire
messageslist with a single"[compressed] SUMMARY..."entry.
The trade-off is obvious - you lose specific tool outputs and conversational tone, retaining only an outline. But the agent can keep going, which is the core benefit.
Layer 3 · the model calls the compact tool itself
auto_compact is triggered by the harness without the model knowing. Layer 3 flips this: give the model a compact tool and let it actively request compression - for instance when it decides the earlier exploration is no longer useful and a new phase is beginning.
The model calls:
tool_use("compact", focus="keep the API design decisions")
This triggers the same process as auto_compact, but can carry a focus parameter telling the summary what to prioritize. Extremely useful in practice - the model knows which sub-tasks are "finished", making it a better judge than the harness heuristic.
Which layer fits? Judgment calls
Given the scenarios below, decide which is the most appropriate trigger: micro / auto / manual.