Pricing
Rewrites

Anatomy of a testable rewrite

June 12, 2026 · 5 min read
Rewrites
Evals

"Improve this prompt" is easy to say and hard to verify. Let's walk a real one end to end, so the improvement is something you can point at rather than something you assert.

The weak instruction

Write a summary of the support ticket.

It reads fine. It is also untestable. How long is a summary? For whom? What counts as done? Two reasonable people will produce wildly different outputs and both will claim they followed it.

The rubric dimensions it fails

  • Objective clarity — "a summary" names a format, not an outcome. Summary for what decision?
  • Output specification — no length, no structure, no required fields.
  • Bounded behavior — nothing says what to leave out, or what to do when the ticket is empty.

The rewrite

Summarize the support ticket for an on-call engineer deciding whether to escalate. Output three labelled lines: Issue (one sentence), Impact (who/how many are affected), Next step (the single most useful action). If the ticket lacks enough detail to fill a line, write "unknown" for that line. Do not include the customer's name.

Every clause maps to a dimension the original failed: an outcome (escalation decision), a precise shape (three labelled lines), and an edge rule (handle missing detail; omit PII).

The eval that proves it

A rewrite is a claim. The eval is the evidence. For this one, the check is concrete: feed a ticket with no impact information and assert the Impact line reads "unknown" and the customer's name never appears. The old instruction has no reason to pass that check; the new one is built to.

Run it in CI with rubrkit test and the bar travels with the instruction. When someone "tidies up" the prompt six weeks from now and quietly drops the PII rule, the build goes red — not the customer's trust.