
How we grade our skills with CI

June 28, 2026 · 4 min read

We just published an open Skill Library: a small, curated set of development skills you can browse, copy, or install with one command. Open-sourcing instructions is easy. Keeping them good is the part most collections skip — a folder of SKILL.md files drifts into stale advice the moment no one is grading it.

So we did the obvious thing for a company whose whole product is grading instructions: we pointed Rubrkit at our own skills, and it now grades them in CI on every push.

The gate runs per skill

Each skill in the public Rubrkit/development-skills repo is a SKILL.md file: an instruction an agent will actually follow. The repo's GitHub Actions workflow grades them as a matrix, with one job per skill and each skill scored as its own artifact bundle:

npx rubrkit test skills/<skill> --remote --fail-under 75 --ci

Every skill is scored on its own against the same 10-dimension rubric the product uses, and its job exits non-zero the moment that skill drops below 75. Grading each skill separately means a regression in one turns the build red on its own — it can't hide behind the others' scores. The "graded by Rubrkit" badge at the top of the README isn't decoration — it's the live verdict on the current commit.

If we wouldn't ship an instruction that scores below our own bar, we shouldn't publish one for other people to install either.
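To see why the gate runs per skill rather than on a combined score, here is a minimal Python sketch. The skill names and scores are invented for illustration and are not taken from the repo; only the threshold of 75 comes from the command above.

```python
from statistics import mean

THRESHOLD = 75  # the bar used by the per-skill command above

# Hypothetical scores, made up for this example only.
scores = {"code-review": 91, "write-tests": 88, "debugging": 62}


def aggregate_gate(scores, threshold):
    """Pass when the average score across all skills clears the threshold."""
    return mean(scores.values()) >= threshold


def per_skill_gate(scores, threshold):
    """Return the names of skills scoring below the threshold, sorted."""
    return sorted(name for name, score in scores.items() if score < threshold)


# The average is 241 / 3, about 80.3, so a combined gate would pass.
print(aggregate_gate(scores, THRESHOLD))  # True
# Graded one at a time, the weak skill is named and would fail its own job.
print(per_skill_gate(scores, THRESHOLD))  # ['debugging']
```

With these made-up numbers, two strong skills lift the average above the bar and mask the weak one; grading each skill on its own surfaces it by name.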

Why a score instead of a glance

"This skill reads fine" is exactly how a skill collection rots. A glance can't be repeated, can't be compared across two versions, and can't fail a pipeline. A rubric replaces the vibe with named dimensions you can gate on — objective clarity, bounded behavior, output specification, and the rest — and a single number you can put a threshold on.

A threshold of 75 is a deliberate "good, not precious" bar. It's high enough to catch a skill that has gone vague or contradictory, and low enough that an honest, well-scoped instruction clears it. When a contributor sharpens a skill, the score moves; when someone quietly waters one down, the build catches it before anyone installs the weaker version.

Run the same gate on your instructions

None of this is special to our repo. The same quality gate is one line in any pipeline:

npx rubrkit test "your/skills/**" --remote --fail-under 80 --ci

You get stable exit codes, JSON and JUnit output, and a threshold you choose. Point it at the prompts, agents, and skills your team actually reuses, and the instructions that run your product get held to a bar, the same way the rest of your code already is.
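If your pipeline is driven from a script rather than a CI config, the exit code is enough to act on. The sketch below is one possible wrapper, not part of Rubrkit. The helper names are hypothetical, the flags are copied from the command above, and it assumes the gate exits with status 0 on a pass (the post only states that a failing skill exits non-zero).

```python
import subprocess
import sys


def gate_passed(command):
    """Return True when the command exits with status 0, False otherwise."""
    return subprocess.run(command, check=False).returncode == 0


def failing_targets(targets, build_command):
    """Run one gate per target and collect the targets whose gate failed."""
    return [t for t in targets if not gate_passed(build_command(t))]


def rubrkit_command(target, threshold=80):
    """Build the gate command for one target (hypothetical helper).

    Only the flags shown in the post are used; nothing else about the
    CLI is assumed.
    """
    return ["npx", "rubrkit", "test", target,
            "--remote", "--fail-under", str(threshold), "--ci"]


def main(targets):
    """Print each failing target and return a process exit status."""
    failed = failing_targets(targets, rubrkit_command)
    for target in failed:
        print(f"below threshold: {target}", file=sys.stderr)
    return 1 if failed else 0
```

Calling `sys.exit(main(["your/skills/**"]))` from a script would propagate the result to whatever runs it. Passing the command builder as an argument keeps the loop testable with a stand-in command that needs no network access.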