The most expensive moment to discover a weak instruction is after the agent acts on it. A 2,000-word system prompt that almost says what you meant will burn a long run, half-finish a migration, or quietly drop the one constraint that mattered — and you find out at the end, from the output.

The fix is to move the review earlier. Before you kick off the run, grade the instruction the agent is about to follow. The Rubrkit MCP server lets you do that without leaving the client the agent already lives in — Claude, Cursor, Codex, any MCP-compatible host.

Why "before it starts" is the right checkpoint

A long prompt or multi-step workflow has the same failure mode as untested code: it looks fine until something exercises the path you didn't think about. The longer the prompt, the more places for an unstated objective or an unbounded edge case to hide.

Grading after a run tells you it failed. Grading before the run tells you where it will fail — while fixing it is still free.

Clear objective. Bounded behavior. Testable result.

The loop, over MCP

The server exposes the same tools the web app and REST API run on, so an agent can audit its own instructions in-context. The shape of the loop:

Stage the instruction. Create a bundle with rubrkit_create_artifact_bundle, then push the prompt, agent, or skill file with rubrkit_upload_artifact_files. The thing you're about to run is now an artifact with a version, not a paste.
Grade it. Start an audit with rubrkit_start_audit, poll with rubrkit_poll_job, and read the verdict with rubrkit_read_audit. You get a score per rubric dimension — objective clarity, bounded behavior, output specification — instead of a gut feeling.
Harden the weak dimensions. The audit names what's soft. Write the rewrite back with rubrkit_update_artifact_file, so the improvement is a new version with a diff and a reason, not an overwrite.
Prove it before you trust it. Run rubrkit_run_evals to confirm the rewrite actually closes the gap the audit found, and rubrkit_export_proof_report when you want the evidence to travel with the change.
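
The first two steps of that loop can be sketched as a pre-run gate. This is a minimal sketch, not Rubrkit's client: only the tool names come from the text above. The `call_tool` callable stands in for however your MCP host invokes a tool. Every argument key and result field (`bundle_id`, `job_id`, `status`, `scores`), along with the 0-to-1 score scale and the 0.8 threshold, is an assumption made for illustration.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical signature: (tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def weak_dimensions(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the rubric dimensions scoring below the threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score < threshold)


def audit_before_run(
    call_tool: ToolCaller,
    filename: str,
    instruction: str,
    threshold: float = 0.8,
    max_polls: int = 30,
    poll_interval_s: float = 1.0,
) -> List[str]:
    """Stage an instruction, audit it, and return the dimensions to harden.

    An empty list means every dimension cleared the threshold.
    All argument keys and result fields below are assumed, not documented.
    """
    # Stage: create a bundle, then upload the instruction as a file in it.
    bundle = call_tool("rubrkit_create_artifact_bundle", {"name": filename})
    bundle_id = bundle["bundle_id"]
    call_tool(
        "rubrkit_upload_artifact_files",
        {"bundle_id": bundle_id,
         "files": [{"path": filename, "content": instruction}]},
    )

    # Grade: start the audit and poll until the job reports completion.
    job = call_tool("rubrkit_start_audit", {"bundle_id": bundle_id})
    for _ in range(max_polls):
        status = call_tool("rubrkit_poll_job", {"job_id": job["job_id"]})
        if status["status"] == "done":
            break
        time.sleep(poll_interval_s)
    else:
        raise TimeoutError("audit did not finish within max_polls")

    audit = call_tool("rubrkit_read_audit", {"job_id": job["job_id"]})
    return weak_dimensions(audit["scores"], threshold)
```

A caller would hold the run while the returned list is non-empty, write the rewrite back with rubrkit_update_artifact_file, and audit again. Passing the tool caller in as a parameter keeps the sketch independent of any particular MCP client library.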

Because every tool enforces the same scopes and credits as the REST API, the key you hand your agent can only do what you granted it.

Long prompts that are really workflows

Some "prompts" are workflows wearing a prompt's clothes: do this, then that, branch if X, don't forget Y. Those are the ones most likely to drift mid-run.

For those, rubrkit_convert_to_rubr_flow turns a sprawling instruction into a bounded flow — explicit steps and edges instead of a wall of prose the model has to hold in its head. You audit the flow the same way, before it executes. A workflow you can read as steps is a workflow you can grade as steps.
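
To make "steps and edges" concrete, here is a minimal sketch of a workflow held as a graph, with a structural check you could run before anything executes. The `Flow` class and `flow_problems` function are hypothetical and say nothing about the format rubrkit_convert_to_rubr_flow actually produces. Treating any cycle as "not bounded" is likewise one assumed reading of the term.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Flow:
    """A workflow as named steps plus directed edges (illustrative only)."""
    steps: Dict[str, str]  # step id -> instruction for that step
    edges: Dict[str, List[str]] = field(default_factory=dict)  # id -> next ids


def flow_problems(flow: Flow, start: str) -> List[str]:
    """List structural problems: undefined steps, cycles, unreachable steps."""
    if start not in flow.steps:
        return [f"start step {start} is not defined"]

    problems: List[str] = []
    for src, targets in flow.edges.items():
        if src not in flow.steps:
            problems.append(f"edge source {src} is not a defined step")
        for dst in targets:
            if dst not in flow.steps:
                problems.append(f"edge {src} -> {dst} targets an undefined step")

    # Depth-first walk from the start; "active" holds steps on the current path.
    visited: Set[str] = set()
    active: Set[str] = set()
    has_cycle = False

    def visit(step: str) -> None:
        nonlocal has_cycle
        visited.add(step)
        active.add(step)
        for nxt in flow.edges.get(step, []):
            if nxt not in flow.steps:
                continue  # already reported above
            if nxt in active:
                has_cycle = True  # edge back onto the current path
            elif nxt not in visited:
                visit(nxt)
        active.discard(step)

    visit(start)
    if has_cycle:
        problems.append("flow contains a cycle, so it is not bounded")
    for step in flow.steps:
        if step not in visited:
            problems.append(f"step {step} is unreachable from {start}")
    return problems
```

A wall of prose cannot be checked this way. Once the branches are edges, a forgotten step shows up as unreachable and an open-ended retry shows up as a cycle.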

What this buys you

The payoff is boring in the best way: fewer runs that fail on something you could have seen in the instruction. You stop paying for the discovery in tokens and wall-clock time, and start paying for it in a thirty-second audit before the run.

An instruction you wouldn't ship to production without reading is an instruction worth grading before you run it. The MCP server just puts the grader where the agent already is.
