Proof, not vibes
Stop guessing whether your prompt got better.
Rubrkit grades your instruction against a rubric, rewrites it, and runs the eval that proves the new version passes what the old one failed. Score delta, diff, evidence — not vibes.
Specimen RBR-114 (example): 38 before, 86 after (+48)
What a rewrite cannot give you.
One specimen, four marks: a structural finding with a line number, the diff, the eval case that flips from fail to pass, and the score delta on the dimension that moved. Shown here as an example.
1 · Finding
L1 No success criteria — “good launch post” has no testable definition of done.
2 · Diff
− Write a good launch post for my product.
+ Write a launch post for [AUDIENCE] introducing [PRODUCT]… end with a single [CTA].
3 · Eval case
“A first-time reader can name the product, problem, outcome, and CTA in under 30 seconds.”
4 · Score delta
38 → 86
Output specification +4 → pass
Deterministic findings, not opinions.
An inline "make it better" hands back a new paragraph and a vibe. Rubrkit grades against a rubric, so the weaknesses it surfaces are checks you can re-run — and the proof is a behaviour change, not a nicer sentence.
Structural gaps
A missing output contract, no success criteria, an unbounded step — each flagged with the line it occurs on, not described in the abstract.
Scored dimensions
Objective clarity, output specification, evaluation criteria, and more — each on a 0–5 scale with the evidence behind the mark.
A behaviour change
The eval case the old version fails and the new one passes. Improvement you can watch happen rather than assert.
Reproducible and versioned
Run the same grade and eval on the same version and get the same result. The delta is anchored to a specific artifact, not a moment.
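As a rough illustration of what "deterministic" and "anchored to a specific artifact" can mean in practice, here is a minimal Python sketch. It is not Rubrkit's implementation or API: the `Finding` type, the `find_gaps` and `artifact_id` functions, and the two regular expressions are all hypothetical, and the check is deliberately simplistic. It flags a line that uses a vague quality word without any testable criterion, reports the line number, and ties the result to a content hash of the exact text that was graded.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    line: int      # 1-based line number in the instruction text
    check: str     # name of the check that fired
    message: str   # human-readable description of the gap


# Hypothetical patterns, chosen only for this illustration.
VAGUE = re.compile(r"\b(good|great|nice|better)\b", re.IGNORECASE)
CRITERIA = re.compile(
    r"\b(must|should|under \d+|at most|exactly|end with)\b", re.IGNORECASE
)


def artifact_id(text: str) -> str:
    """Short content hash, so a grade refers to one exact version of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def find_gaps(text: str) -> list[Finding]:
    """Flag lines with a vague quality word and no testable criterion."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if VAGUE.search(line) and not CRITERIA.search(line):
            findings.append(
                Finding(
                    line=number,
                    check="no-success-criteria",
                    message=f"L{number}: vague quality word, no testable criterion",
                )
            )
    return findings


before = "Write a good launch post for my product."
after = "Write a launch post for [AUDIENCE] introducing [PRODUCT]… end with a single [CTA]."

for label, text in (("before", before), ("after", after)):
    print(label, artifact_id(text), find_gaps(text))
```

Run on the specimen's two versions, the sketch reports one finding on line 1 of the original and none on the rewrite. Because the check is a pure function of the text, running it again on the same version yields the same findings, and the hash changes whenever the text does.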
Answers before you start.
A rewrite gives you a new version with no account of what changed or whether it is actually better. Rubrkit grades the original against a rubric, names the weak dimensions with evidence, and runs an eval case the old version fails and the new one passes. You see the score delta and the diff, not just a different paragraph.
Know which instructions are ready to run.
Grade an instruction
Follow the review loop as it ships.
Notes on AI artifact testing, rubr_flow conversion, evals, and proof reports.