You Spec'ed It. Now Engineer It.
The spec gets you the blueprint. These are the engineering decisions that determine whether it actually holds up.
Earlier this week, I published a framework for speccing AI skills — in short, prepare a product system spec to direct how agents execute, don’t prompt and hope for the best.
The response told me people are ready to move past prompts. The most common follow-up wasn’t about the framework. It was: “Okay, I specced it. Now what?”
Now you engineer it.
A note on scope: Everything I’ve written about building — and everything in this post — is within the Anthropic ecosystem: Claude, Claude Code, CoWork, MCP, skills, and plugins. The principles apply broadly, but the specifics are grounded in this platform because that’s where I’m doing the work.
An Anthropic employee, Thariq, recently published lessons from building hundreds of skills inside Claude Code. What struck me wasn’t the tips themselves — many are solid — but how clearly they map to a gap I’ve recognized while building my own: the distance between knowing what a skill needs and knowing how to deliver it at runtime. That gap is where the engineering decisions live, and in March 2026, many of them are still being worked out in real time.
This post picks up where the spec framework left off — and it's where the personal operating system I've been writing about starts to take shape. You can't build one without speccing well, and you can't spec well without making engineering decisions about how your skills actually run. None of it matters if those skills aren't interoperable — they need to meet you where you are, not the other way around.
Your skill is a folder, not a file
The spec framework doesn’t cover this. It should.
I’m planning a spring break road trip with my daughter, and I didn’t want to build a one-off itinerary — I wanted a skill I could reuse for any similar trip. So I specced a road trip planner: a multi-stage skill that researches a set of destinations and builds a paced, day-by-day itinerary with lodging, flights, rental cars, activities, road stops, dining, a photo shot list, and a cost summary — delivered as a spreadsheet with a KML file for Google Maps.
The SKILL.md was just the entry point. The real skill lived in its folder: reference files for activity research, templates for the output spreadsheet, example itineraries that showed the model what “good” looks like, and a KML generation script. The markdown told Claude what to do. The folder gave it the materials to do it well.
Anthropic calls this progressive disclosure — the file system as a form of context engineering. You tell the model what files exist, and it reads them when it needs them. Not all at once, not upfront, but on demand.
This matters because context windows are finite. Stuffing everything into one SKILL.md is the skill equivalent of the prompt trap I wrote about last time — burning context on information the model may never need for this particular run. Instead, structure it:
SKILL.md — activation logic, workflow, guardrails, and pointers to everything else
references/ — domain knowledge and lookup tables, the model can pull as needed
templates/ — output formats and starter files the model should copy and fill
scripts/ — helper code the model can call instead of building from scratch
examples/ — sample outputs that anchor quality and style
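That layout is easy to scaffold. A minimal Python sketch, using the folder names from the list above (the names are my convention, not a platform requirement):

```python
from pathlib import Path

# Subfolders mirroring the structure described above.
SUBDIRS = ["references", "templates", "scripts", "examples"]

def scaffold_skill(root: str, name: str) -> Path:
    """Create a skill folder with a SKILL.md entry point and
    empty subdirectories for progressive disclosure."""
    skill = Path(root) / name
    skill.mkdir(parents=True, exist_ok=True)
    for sub in SUBDIRS:
        (skill / sub).mkdir(exist_ok=True)
    entry = skill / "SKILL.md"
    if not entry.exists():
        entry.write_text(f"# {name}\n\n<!-- activation, workflow, guardrails, pointers -->\n")
    return skill

skill = scaffold_skill("skills", "road-trip-planner")
print(sorted(p.name for p in skill.iterdir()))
# prints: ['SKILL.md', 'examples', 'references', 'scripts', 'templates']
```

The point isn’t the script; it’s that the structure exists before you start writing, so content lands in the right place instead of piling into SKILL.md.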
If this sounds like RAG (retrieval-augmented generation), it is. Folder structure instead of a vector database. No infrastructure required, and for most personal use cases, you won’t need any. When your reference data outgrows flat files — too large, too dynamic, or needing real queries — that’s when you graduate to a database or vector store. But start with files. Many personal skills won’t outgrow them.
My spec framework has three layers: Activation, Execution, and Resilience. The file system is the fourth — the delivery mechanism for everything in the other three.
Instruction order is weight
Where you place an instruction in the skill file determines how much weight the model assigns to it. I picked this up through iteration; the Anthropic article confirms it.
Models process sequentially. What comes first carries disproportionate influence. If your most critical guardrail is buried below a paragraph about output formatting, it will get less attention than it deserves.
The pattern I’ve settled on:
Context and setup — what the skill does, tools it needs, what it should read first
Workflow — the ordered steps, with checkpoints
Guardrails and constraints — hard boundaries, permission scoping
Output contract — what “done” looks like
Gotchas — the failure modes you’ve already hit
Gotchas go last, not because they’re least important — they’re arguably the most important — but because by that point the model has full context on what it’s trying to do. “Don’t do X” means more when you already understand the workflow in which X might occur.
And avoid starting with what not to do. Leading with prohibitions puts the model on defense. Lead with what the skill does and why. Then scope the boundaries.
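If you want to keep that ordering honest as a skill grows, a small check helps. This sketch assumes the five sections appear as `##` headings in your SKILL.md, which is my convention, not a platform rule:

```python
import re

# Expected section order, matching the pattern described above.
EXPECTED_ORDER = ["Context", "Workflow", "Guardrails", "Output contract", "Gotchas"]

def section_order_ok(skill_md: str) -> bool:
    """Return True if the ## headings that match an expected section
    appear in the expected order (unrelated headings are ignored)."""
    headings = re.findall(r"^##\s+(.+)$", skill_md, flags=re.MULTILINE)
    positions = [i for h in headings
                 for i, name in enumerate(EXPECTED_ORDER) if h.startswith(name)]
    return positions == sorted(positions)
```

Run it over a skill file before you commit changes; it catches the slow drift where a gotcha gets pasted above the workflow because that’s where you happened to be editing.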
Guardrails you can enforce, not just describe
In the spec post, I argued that guardrails are permissions design. I stand by that. But there’s a meaningful difference between a guardrail that’s written and one that’s enforced.
A written guardrail says: “Do not delete calendar events you didn’t create.” The model reads it, generally follows it, and occasionally doesn’t.
An enforced guardrail uses a hook — a rule that runs automatically and blocks the action before it happens. You can set hooks that only activate when a specific skill is running, scoped to the task, not always on. A hook that prevents the agent from deleting files or overwriting data. Another that restricts edits to a specific folder so the agent can’t wander into files it shouldn’t touch.
The spec defines the policy. The hook enforces the mechanism.
Rule of thumb: If the failure mode is recoverable — wrong output format, missed a step — a written instruction is fine. If it’s destructive — data loss, irreversible actions — enforce it with a hook.
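Here is roughly what an enforced guardrail looks like as a hook script. Claude Code’s PreToolUse hooks pass tool-call details as JSON on stdin, and a blocking exit code stops the call before it runs; verify the payload field names and exit conventions against the current hooks documentation before relying on this sketch:

```python
import json
import re
import sys

# Patterns treated as destructive in this sketch; tune to your skill.
DESTRUCTIVE = re.compile(r"\brm\b|\brmdir\b|--force|drop\s+table", re.IGNORECASE)

def should_block(tool_name: str, tool_input: dict) -> bool:
    """Block Bash calls whose command matches a destructive pattern."""
    if tool_name != "Bash":
        return False
    return bool(DESTRUCTIVE.search(tool_input.get("command", "")))

def main(stream=sys.stdin) -> int:
    """Entry point: wire this file up as the hook's command and
    end the script with sys.exit(main())."""
    payload = json.load(stream)
    if should_block(payload.get("tool_name", ""), payload.get("tool_input", {})):
        print("Blocked by skill guardrail: destructive command", file=sys.stderr)
        return 2  # non-zero "block" exit code
    return 0
```

The point is that the check runs outside the model. A written instruction can be ignored; an exit code can’t.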
The examples I’ve shared in this series — generating spreadsheets, reconciling records, building itineraries — are intentionally accessible. They’re good starting points because the risk profile is low: they output files and surface information for you to act on. But if you’re building skills that handle financial transactions, modify production systems, or touch data you can’t recover, hooks aren’t optional.
Memory is what turns a skill into infrastructure
When I built the vinyl collection skill, the master spreadsheet wasn’t just an output — it was the skill’s memory. Every subsequent run reads the previous state, compares it against new email data, and produces a delta. The skill doesn’t start from zero. It picks up where it left off.
Practical tip: Retain copies of your output at each stage. I kept versioned snapshots as I expanded scope — one month, one quarter, one year — so nothing got silently overwritten and I could verify each iteration against the last. It builds trust in the output, especially when you’re still learning where the skill drifts.
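A minimal version of that snapshot habit, assuming the output is a single file:

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot(output: str, archive_dir: str = "snapshots") -> Path:
    """Copy the current output file to a timestamped name so each
    run's result is preserved instead of silently overwritten."""
    src = Path(output)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)
    return dest
```

Call it at the end of each run (or have the skill call it as a final step) and you get a free audit trail to diff against when something drifts.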
Where does the memory live? It can be as simple as a log file the skill appends to after each run, a JSON config it reads on startup, or the output itself — like my vinyl spreadsheet, where the deliverable is the memory. Yesterday, Anthropic released a stable storage path ${CLAUDE_PLUGIN_DATA} that survives skill upgrades — use it, because data stored inside the skill's own directory gets wiped when the skill updates.
The design principle: data integrity over autonomy. When accumulated memory conflicts with new input, flag the conflict for review rather than silently overwriting.
Keep in mind that you can’t rely on the model to remember everything during a long session. If your skill runs a multi-step workflow, have it write progress to a file as it goes. What’s written down stays accurate. What the model is holding in its head may not.
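The progress file can be as simple as an append-only log the skill writes after each step. A sketch, with hypothetical step names:

```python
import json
from pathlib import Path

PROGRESS = Path("run-progress.jsonl")

def checkpoint(step: str, detail: dict) -> None:
    """Append one completed step to the progress log so a long
    workflow can resume from the file, not the model's memory."""
    with PROGRESS.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"step": step, **detail}) + "\n")

def completed_steps() -> list[str]:
    """Read back which steps already ran, in order."""
    if not PROGRESS.exists():
        return []
    return [json.loads(line)["step"]
            for line in PROGRESS.read_text().splitlines() if line]
```

Instruct the skill to call the checkpoint after each workflow stage and to read the log before starting; an interrupted run then resumes at the first step not in the file.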
One big skill or three small ones?
This is the architecture decision I keep coming back to, and there’s no universal answer — which is exactly why it requires product judgment.
My vinyl skill is self-contained. It does one thing, uses a consistent set of tools, and the entire workflow belongs together. Breaking it apart would add complexity without meaningful reuse.
The family road trip planner is a different story. When I got to the lodging step, I realized that finding an Airbnb was its own problem — with its own preferences, its own search logic, and its own set of things I care about.
The road trip skill knows it needs to find lodging — and the family travel profile tells it when an Airbnb makes sense versus a hotel. But everything specific to Airbnb — how to search, what to filter on, how to evaluate a listing — that's its own domain, with preferences that change more often and apply well beyond family trips: work travel, personal trips, friends visiting Seattle.
So I pulled it out. I specced the road trip skill so that when it identifies the need to find an Airbnb, it calls the Airbnb finder skill — which knows how to look: the search methodology, the evaluation logic, the browser workflow. It’s a capability that any workflow can use, not just this one.
The Airbnb finder has its own reference file — airbnb-profile.md — with my accommodation preferences. The family travel profile has the structural requirements for a family trip. And the road trip skill has the trip-level context: dates, destination, and number of guests. Three layers, each with a clear job: what to look for (my reference file), how to look (my Airbnb skill), when and why (my road trip skill).
If I change my Airbnb standards, I update one file, and every skill that searches for Airbnbs picks it up. Clean boundaries. Single source of truth.
Keep it together when the workflow is tightly coupled, the trigger is clear, or splitting would mean one skill needs to hand off detailed context to another — which you could build yourself, but there’s no simple built-in way to do it in the platform yet.
Split it apart when a capability serves multiple contexts, the “how” is complex enough to be its own domain, or you’re about to duplicate logic across two skills.
Skills can call other skills today — my road trip planner calls the Airbnb finder, and it just works. There's no built-in way to declare that dependency yet, so list yours clearly and define what happens when one is missing.
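Until dependency declarations exist, a preflight check inside the calling skill is a reasonable stopgap. A sketch, assuming skills live as sibling folders under a common root (the skill name here is mine, not a platform convention):

```python
from pathlib import Path

# Skills this skill calls at runtime; listed explicitly because the
# platform has no built-in way to declare the dependency yet.
REQUIRED_SKILLS = ["airbnb-finder"]

def missing_skills(skills_root: str = "skills") -> list[str]:
    """Return the names of required skills that are not installed,
    using the presence of SKILL.md as the installation signal."""
    root = Path(skills_root)
    return [name for name in REQUIRED_SKILLS
            if not (root / name / "SKILL.md").exists()]
```

Run the check first and, per the spec’s resilience layer, define the fallback: degrade gracefully (skip the lodging step and flag it) rather than fail mid-run.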
Your skill isn’t done when it works once
Your skill triggered correctly. The output looks good. Done, right?
You’re where I was with the vinyl skill after the first clean run on one month of data. Which is to say: the beginning.
Two things to watch: is the skill activating when it should, and is the output right when it does? If it's not firing, it's almost always a description problem. If the output is off, the fix is iteration: start narrow, verify manually, expand scope, feed corrections back.
The gotchas section is where this learning lives. Every failure mode, every edge case, every false assumption. Anthropic calls it the highest-signal content in any skill. I agree — a skill without a gotchas section that grows over time isn’t being maintained. It’s decaying. I’ll go deeper on measurement and evaluation tooling in a future post.
The ground is moving
I’m writing this in March 2026. Several of the decisions above sit on assumptions that may not hold in six months.
Context windows are getting larger. Progressive disclosure won’t become irrelevant, but the threshold will shift. Design for today’s constraints; don’t over-engineer for them.
Skills are becoming shareable. Marketplaces and plugin systems mean other people can find and install your skills, which makes your description field more important, and your dependencies need to be explicit.
MCP and browser integrations are expanding. The surface area of what a skill can reach — email, calendars, Slack, APIs, the browser — is growing fast. Skills can do more, which means the permissions question gets harder. Scope your tool access per-skill, not globally.
Models are getting better at ambiguity. Some guardrails I wrote months ago are now unnecessary — the model handles those edge cases natively. Revisit your skills periodically and prune what the model no longer needs. Over-speccing reduces flexibility without adding safety.
The principle that holds regardless: design your skills to be modular and portable enough that pieces can be swapped, upgraded, or retired without rebuilding the whole system.
The real differentiator hasn’t changed
A model can generate a SKILL.md. It can’t decide whether to put domain knowledge in a reference file or inline it. It can’t feel that a skill is doing too much. That’s still product judgment — applied one layer deeper.
Every skill you build teaches you something about the next one — and the skills themselves work in concert, building on each other, sharing context, making the whole system more capable than any single piece. That’s what makes it an operating system, not a collection of prompts.
Start engineering.
This is a follow-up to “Most People Prompt AI. I Spec It.” and is part of the personal operating system series, which includes “Building My Personal Operating System” and “What My Vinyl Collection Taught Me About Designing Agentic Systems.” For more details from Thariq’s perspective, find the article published on X.



