Skip to main content

2 posts tagged with "reproduction"

View All Tags

Empirical validation: the audit-callback pattern fires (and the profile only gets you partway)

· 7 min read
Tosin Akinosho
Helmdeck maintainer

Hook

We ran the same prompt twice on openai/gpt-oss-120b:free — baseline agent with generic skill prose, then a custom agent shaped by a per-model prompting profile. The profile-aware agent deposited 2 real artifacts, called artifact.verify_manifest with all_present: true, 2 of 2 verified, and hallucinated zero manifest entries. It also produced only 2 platform variations when the skill table listed 9. The library helps. It does not finish the job.

Context

This is the third post in a series that started with an honest reckoning: even after three architectural fixes closed the most common Tier C failure modes (skill-prose ignored, required arg missing, multi-step chain hallucinated), the underlying problem — that small open-weight models behave very differently from frontier models on the same skill text — wasn't going to be fixed by more pack-layer work alone. The next thing to test was at the input layer: shape the prompt to match what the model actually responds to, per its training docs.

So we shipped the first entry in a model-profile library: models/openai-gpt-oss-120b-free.yaml, sourced from OpenAI's Harmony response-format docs, Together AI's GPT-OSS guide, and IBM watsonx's GPT-OSS behavior guidelines. The profile encodes one specific prompting shape: Objective → Source priority → Constraints → Output format → Success criteria. Not "step 1, step 2, step 3."

Then we set up two OpenClaw agents pointed at the same skill, both on the same free model, differing only in their AGENTS.md. Baseline used the categorical four-modes-and-decision-rules prose we ship by default. Profile-aware used the Harmony-shaped success-criteria framing the YAML profile prescribes.

Finding

Same prompt, same model, two agents. The trace counts say everything:

MetricBaseline agent (generic prose)Profile-aware agent (Harmony-shaped)
helmdeck.plan calls11
pipeline-run calls02
Real blog artifacts in store02
artifact.verify_manifest calls01
verify_manifest resultn/aall_present: true, 2 of 2 verified
Hallucinated manifest entries in chat6 (earlier session) or 0 (later, skipped manifest)0
6-section structured outputpartialcomplete
Platform variations actually produced4 in chat, 0 deposited2 deposited, skill table listed ~9

This is the first time we've watched the audit-callback pattern (PR #462) fire end-to-end from a real Tier C trace. The profile-aware agent called pipeline-run twice (one per source URL), polled pack-status until completion, listed the resulting artifacts, called verify_manifest with the actual keys, got all_present: true back, and only then composed its final response. The verification result landed in the model's context window before the text reply was written; the response honestly reports verified: 2 of 2.

We have the audit pattern. We have empirical proof it fires. And we still got 2 platform variations instead of 9.

The agent reasoned about the objective (artifacts in the store) and picked the most efficient path: one pipeline-run per source URL produces a finished blog artifact via the built-in builtin.scrape-rewrite-blog pipeline (which internally calls blog.publish to deposit). That's two real artifacts, both verified, both downloadable. Per the operator's USER.md the skill table called for ~9 platform-native variations. The agent chose 2.

This isn't a bug. It's exactly the behavior the Together AI docs describe: GPT-OSS "performs best when given clear objectives while avoiding over-prompting or micromanaging the method." We gave it an objective; it picked a method we hadn't anticipated.

The strategic truth this validates

The profile library is necessary but not sufficient for non-frontier models.

TierWhat the profile doesWhat's left to the operator
Tier A (frontier)Probably nothing — verify on your own modelGeneric skill prose works out of the box (helmdeck assumption; please verify)
Tier B (mid-tier)Unknown — your experiment is the data we needOpen research question
Tier C (free open-weight)Raises floor of structural compliance — 6-section output, audit-callback firesPer-use-case customization — the AGENTS.md success criteria must encode YOUR use case's specific commitments (N platforms, N deposits, N variations), because the model will optimize for the objective and may simplify when the criteria don't pin a specific N

The profile gets you reliability of the audit-callback shape. It does not get you a specific use-case implementation. Operators adopting helmdeck on Tier C models will need to:

  1. Use the model profile from models/<provider>-<model>.yaml as the starting point
  2. Fork SOUL.md, USER.md, AGENTS.md for their specific operator persona
  3. Encode use-case-specific success criteria that pin the exact commitments (N=9 platform variations, not "platform variations") so the model can't simplify them away
  4. Run a verification trace on their own prompt before relying on the agent

The library is a starting point. Operators must finish the job.

Why this matters to you

If you're shipping an agent on a free model, three principles fall out of today's work:

  1. Profile your model with its official docs. Generic skill prose is wrong-fit for at least two of every three free models we've tested. Each model's training harness wants a specific prompting shape (Harmony-style for GPT-OSS, plain-English step-by-step for Llama, explicit ordered procedures for Nemotron). The first cuts of a per-model library now live in helmdeck's models/ directory, but the more useful artifact is the methodology: read the model's official docs, encode the prompting shape, and verify with an A/B trace.

  2. Make verification a typed tool call, not advisory prose. The artifact.verify_manifest audit-callback pattern fired on Tier C only because the AGENTS.md success criteria framed it as a definition of validity, not as a separate "step 4b" advisory. Tier C ignores advisory prose; it executes objectives. Frame verification as part of the objective.

  3. Don't expect one skill to fit every use case. The library is a starting point. Even with the profile applied, the model will simplify the skill's pluggable specifics (number of platforms, number of variations, number of deposits) toward its own efficient interpretation of the objective. If your use case has hard counts, pin them in the operator's AGENTS.md success criteria — not in skill prose, which the model treats as guidance rather than contract.

Share your findings

Every operator running a custom Tier C agent is producing data the rest of the community needs. Three contribution paths:

  • Profile contribution: if you customize a profile for a new model (or refine an existing one), open a PR to models/<provider>-<model>.yaml with your trace evidence in the community_traces[] field
  • Use-case contribution: if you used an existing profile on a new use case (research summarizer, code reviewer, etc.) with different results, open an issue with the trace excerpt and comparison metrics
  • Failure-mode contribution: if you hit a new failure mode (not skipped / hallucinated / simplified), file an issue tagged field-report with the trace data. We're building a vocabulary of Tier C failure modes; novel ones strengthen the whole community's understanding

See docs/howto/add-free-models.md for the detailed workflow.

See also

Tier A is structurally better. The deposit-step failure is universal.

· 7 min read
Tosin Akinosho
Helmdeck maintainer

Hook

anthropic/claude-sonnet-4.6 ran 8 real blog.rewrite_for_audience calls in parallel, executed a full 6-criterion InfoQ fit check with per-criterion grades, stated a 5-step execution plan upfront, asked exactly one clarifying question per the AGENTS.md rule, and produced zero hallucinated manifest entries. Then it skipped the mandatory artifact.put deposit step entirely — same as both Tier C variants. The deposit-step skipping is tier-invariant, not a Tier C failure mode we can patch with a per-model profile.

Context

The 2026-06-09 morning's three architectural fixes + the audit-callback pattern + the per-model profile library all targeted Tier C reliability. We assumed Tier A "works out of the box" because frontier models handle generic skill prose. We never empirically tested it.

Issue #466 tracked the gap. This post closes it.

The methodology: take the existing tech-blog-publisher agent (already on openrouter/auto, which routes to Tier A models), run the same mcp-adr-analysis-server prompt we used on Tier C all day, and watch the trace. Same skill prose. Same workspace files (SOUL / IDENTITY / USER / AGENTS already layered per OpenClaw's canonical model). No per-model profile injected. Tier A or it isn't.

The router picked anthropic/claude-sonnet-4.6 for this run.

Finding

The trace produced two distinct results — one that supports the "Tier A is better at structural compliance" claim, and one that doesn't.

What Tier A handled that Tier C didn't

BehaviorTier C baselineTier C w/ profileTier A (Sonnet 4.6)
Parallel tool use at startup✓ 3 simultaneous (read SKILL.md + 2 web-scrapes)
Real blog.rewrite_for_audience calls4 in chat0 (used pipeline-run)✓ 8 (matched the skill table)
InfoQ 6-criterion fit checkskippedskipped✓ per-criterion grades, "Possible fit" verdict
Multi-step plan acknowledgedpartialpartial✓ 5-step plan stated upfront
"Ask at most ONE clarifying question"✗ (hedged with "let me know")✓ one specific question + stated default

Every structural row swung Tier A's way. The model honored the SKILL.md's required structure end to end. The InfoQ fit check is particularly notable — Tier C agents on the same prompt have either skipped it entirely or produced a vague "Possible fit" without specifics. Tier A returned a full 6-row grade table with concrete gaps to close before submission.

The "one clarifying question" rule is the cleanest signal of skill obedience. Tier C agents either hedge ("let me know how you'd like to proceed") or skip the question and improvise. Tier A asked one question, gave a sharp default, and committed to executing the default if the operator stayed silent. That's exactly the SOUL.md voice.

What Tier A also didn't handle

Mandatory rule from SKILL.mdTier C baselineTier C w/ profileTier A (Sonnet 4.6)
artifact.put after each variation 0 calls 0 calls (used auto-deposit) 0 calls
artifact.verify_manifest after manifest 0 calls 1 call (all_present: true) 0 calls
New artifacts in store from session02 (via pipeline auto-deposit)0

Tier A's text at the moment of truth (17:08:32 in the trace):

"Now appending CTAs and depositing to artifacts — all in parallel."

Its actual parallel tool calls were 8 invocations of blog.append_cta (a CTA-appender that returns markdown, not a deposit). The model conflated "append CTA" with "deposit to artifacts." Even when those 8 calls all failed (the cause was an unrelated pack-contract gap), the agent didn't pivot to call artifact.put directly. The mandatory deposit step was never executed.

Reading the agent's text reveals the misunderstanding: it treated the entire workflow as "rewrite → append CTA → done," with "depositing" living somewhere inside the pack pipeline rather than as an explicit step the agent must invoke. The SKILL.md says §4 is "MANDATORY, NOT ADVISORY" with the exact tool name helmdeck__artifact-put. Tier A ignored it.

Naming the pattern

This is tier-invariant deposit-step skipping: the agent reads the mandatory-deposit rule, acknowledges in text that it's depositing, but never invokes the actual artifact.put tool. It's distinct from the plausibility-shaped output we documented earlier — Tier C fabricated a manifest; Tier A truthfully says it's depositing but doesn't.

Both failure modes have the same root cause: skill prose alone is insufficient to drive a typed tool call. Mandatory-by-prose is treated as advisory by every model tier we've tested.

The implication is uncomfortable: the layered architectural work isn't done. PR #450 (typed deposit), PR #462 (audit callback), and the per-model profile library all assume the agent will call the typed pack when the skill says to. Today's data says: it won't, regardless of tier.

What this changes architecturally

Phase 3 of issue #461 — engine-level post-call hook that fires the registered auditor without skill-prose dependency — was originally framed as "deferred until Phase 1 + 2 prove the pattern is generally useful." Today's trace flips that justification: the pattern is necessary because skill prose can't carry the mandatory-call weight on any tier, not just Tier C.

The architectural shape that closes this loop:

  1. Producer pack registers a paired auditor (e.g., blog.publishblog.verify-published)
  2. Engine intercepts the producer's completion and auto-invokes the auditor with the producer's output
  3. Auditor result is attached to the producer's response envelope — the LLM sees both in its next-turn context
  4. No skill-prose dependency — the agent doesn't need to remember to call the auditor, because the engine fired it

This removes "the agent will read the skill and call the verify pack" from the trust chain. It's the same architectural shape as ADR 052's av-validate post-step, applied at the artifact-deposit layer instead of the video-encoding layer.

Why this matters to you

If you're building an agent on any tier, three principles fall out of today's three-trace comparison:

  1. Don't ship "MANDATORY, NOT ADVISORY" skill prose and expect it to work. Every tier treats prose mandates as advisory. Architectural enforcement is the only durable answer.

  2. Tier A is better at structural compliance, not at typed-tool dispatch. Frontier models handle 8-step chains, parallel tool use, structured output, and clarifying-question discipline beautifully. They still skip explicit deposit calls if the skill describes "deposit" as part of a chained workflow without making the tool call the explicit terminal step.

  3. Engine-level post-call hooks are the answer. Pack the producer + auditor pair into the engine's contract so the agent can't choose to skip the audit. Both PR #462's pattern and the planned Phase 3 generalize across producer/auditor pairs.

See also