Lesson 05 · Planning

Domain knowledge loaded on demand

"Don't put everything in the system prompt. Load on demand."

⏱ ~10 min · 📝 3 interactive widgets · 🧑‍💻 Based on shareAI-lab · s05_skill_loading.py

The "stuff everything in the system prompt" trap

You have 20 skills, each pretty detailed: pdf-processing (how to read PDFs), code-review (a review checklist), git-workflow (common git patterns)... The intuitive move: concatenate all of them into the system prompt so the model can look them up at any time.

The result:

  • Every call burns 15-30K input tokens, even when the task uses none of the skills.
  • Model attention is diluted - compliance with rules buried in a long system prompt degrades.
  • Editing any one skill changes the system prompt, which invalidates the prompt cache for every conversation that shares it.

s05's approach: split skill knowledge into two layers.

The two-layer architecture

Layer 1 - cheap: The system prompt holds only the skill name and a one-sentence description (~100 tokens each). 20 skills = 2K tokens. Acceptable.

# Skill registry in the system prompt
Skills available:
  - pdf: Process PDF files. Extract text, tables, metadata.
  - code-review: Systematic code review checklist.
  - git-workflow: Common git branching and rebase patterns.
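
Building that registry can be sketched in a few lines. This is a minimal illustration, not the s05 implementation: the names SKILL_SUMMARIES, render_registry and build_system_prompt are hypothetical, and the summaries are hard-coded here where a real agent would read them from skill files.

```python
# Hypothetical (name, description) pairs; in practice these come from skill files.
SKILL_SUMMARIES = [
    ("pdf", "Process PDF files. Extract text, tables, metadata."),
    ("code-review", "Systematic code review checklist."),
    ("git-workflow", "Common git branching and rebase patterns."),
]


def render_registry(summaries):
    """Layer 1: one short line per skill, with no skill bodies."""
    lines = ["Skills available:"]
    lines.extend(f"  - {name}: {description}" for name, description in summaries)
    return "\n".join(lines)


def build_system_prompt(base_prompt, summaries):
    """Append the registry to whatever base instructions the agent already has."""
    return f"{base_prompt}\n\n{render_registry(summaries)}"


print(build_system_prompt("You are a coding agent.", SKILL_SUMMARIES))
```

Because only names and descriptions are rendered, the prompt grows by one line per skill no matter how long each skill body is.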

Layer 2 - on demand: When the model wants to use a skill, it calls load_skill(name="pdf") and the full skill body (potentially 5-10K tokens) is injected via tool_result. Unused skills cost zero tokens.

# tool_result contains the full skill body
<skill name="pdf">
  Step 1: Use pdfplumber for extraction...
  Step 2: Handle OCR fallback when needed...
  Step 3: Structure output as Markdown table...
</skill>
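
A handler for load_skill might look roughly like the sketch below. It assumes skill bodies are already held in a dictionary (SKILL_BODIES is a hypothetical name), and the error message for unknown names is an illustrative choice rather than something taken from s05.

```python
# Hypothetical skill bodies keyed by name; a real agent would read these from disk.
SKILL_BODIES = {
    "pdf": (
        "Step 1: Use pdfplumber for extraction...\n"
        "Step 2: Handle OCR fallback when needed...\n"
        "Step 3: Structure output as Markdown table..."
    ),
}


def load_skill(name: str) -> str:
    """Return the text to place in the tool_result for a load_skill call."""
    body = SKILL_BODIES.get(name)
    if body is None:
        # List the valid names so the model can retry with a correct one.
        available = ", ".join(sorted(SKILL_BODIES)) or "(none)"
        return f"Error: unknown skill '{name}'. Available skills: {available}"
    # Wrap the body in a <skill> tag, mirroring the example above.
    return f'<skill name="{name}">\n{body}\n</skill>'


print(load_skill("pdf"))
```

Returning an error string instead of raising keeps the agent loop running: the model sees the message as an ordinary tool_result and can pick a different skill.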

Token cost comparison

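Consider a concrete scenario: 20 skills, each body averaging 3000 tokens, and a user question that probably needs no skills at all (e.g. "fix the login endpoint bug"). With everything in the system prompt, that call still carries about 60,000 tokens of skill text (20 x 3000); with the two-layer design it carries only the roughly 2,000-token registry (20 x ~100).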
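
The scenario above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes one skill body is loaded every fifth conversation (LOAD_EVERY is an assumed rate), counts the system prompt once per conversation, and ignores the conversation's own messages and any prompt caching. All function names are hypothetical.

```python
SKILL_COUNT = 20
BODY_TOKENS = 3000         # average skill body size, from the scenario above
DESCRIPTION_TOKENS = 100   # approximate size of one Layer 1 entry
LOAD_EVERY = 5             # assumption: one skill body is loaded every 5th conversation


def flat_cost(conversations: int) -> int:
    """Skill tokens spent when every body lives in the system prompt."""
    return conversations * SKILL_COUNT * BODY_TOKENS


def two_layer_cost(conversations: int) -> int:
    """Skill tokens spent with a registry plus on-demand skill bodies."""
    registry = conversations * SKILL_COUNT * DESCRIPTION_TOKENS
    loads = conversations // LOAD_EVERY
    return registry + loads * BODY_TOKENS


def savings_percent(conversations: int) -> float:
    """Share of skill tokens avoided by the two-layer design."""
    flat = flat_cost(conversations)
    return 100.0 * (flat - two_layer_cost(conversations)) / flat


for n in (1, 5, 20):
    print(n, flat_cost(n), two_layer_cost(n), round(savings_percent(n), 1))
# 1 60000 2000 96.7
# 5 300000 13000 95.7
# 20 1200000 52000 95.7
```

Under these assumptions the saving stays above 95% even when skills are loaded regularly, because each load pays for one body, not all twenty.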

The SKILL.md format

Skill files use YAML frontmatter + body:

---
name: pdf
description: Process PDF files. Extract text, tables, metadata.
tags: document,parsing
---

Step 1: Use pdfplumber for extraction. Handle multi-column layouts...
Step 2: For scanned PDFs, fall back to OCR via tesseract...

The frontmatter feeds Layer 1 (name/description/tags); the body feeds Layer 2. This style is borrowed from static site generators (Jekyll, Hugo) - anyone familiar with them will recognize it immediately.
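
Splitting a skill file along that boundary can be sketched as follows. This is not the s05 parser: it is a minimal stand-in that handles only flat "key: value" lines, not full YAML, and the names parse_skill_file and SAMPLE are hypothetical.

```python
import re

# Frontmatter is the text between a leading '---' line and the next '---' line.
FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n?(.*)\Z", re.DOTALL)

SAMPLE = """---
name: pdf
description: Process PDF files. Extract text, tables, metadata.
tags: document,parsing
---

Step 1: Use pdfplumber for extraction. Handle multi-column layouts...
Step 2: For scanned PDFs, fall back to OCR via tesseract...
"""


def parse_skill_file(text: str) -> tuple[dict[str, str], str]:
    """Split SKILL.md text into (metadata for Layer 1, body for Layer 2)."""
    match = FRONTMATTER_RE.match(text)
    if match is None:
        # No frontmatter found: treat the whole file as body.
        return {}, text.strip()
    meta = {}
    for line in match.group(1).splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")  # split on the first colon only
        meta[key.strip()] = value.strip()
    return meta, match.group(2).strip()


meta, body = parse_skill_file(SAMPLE)
print(meta["name"], "|", meta["description"])
print(meta["tags"].split(","))
print(body.splitlines()[0])
```

For anything beyond flat string values (nested keys, lists, quoting), a real YAML parser would be the safer choice.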

Interactive

Widget 1 · Token Economy · two architectures side by side

Left: everything in the system prompt. Right: two-layer architecture. Drag the slider to see cumulative token count over N conversations.

Everything in system prompt
  System prompt: 60000 tokens (20 skills x 3000 tokens each, all loaded)
  x conversations: 1
  Total: 60000 tokens

Two-layer architecture
  System prompt: 2000 tokens (20 descriptions x ~100 tokens each)
  + on-demand skill bodies: 0 tokens (loaded every 5 conversations)
  Total: 2000 tokens

Interactive

Widget 2 · Frontmatter Parser · extracting skill metadata

Edit the SKILL.md content and watch the s05 YAML frontmatter parsing logic split it into Layer 1 and Layer 2.

Interactive

Widget 3 · Discoverability · good descriptions let the model find skills

The Layer 1 description is the model's only signal for picking a skill. For each of the three pairs of descriptions, select the better one. Some phrasings make a skill permanently invisible to the model.
