What Is agent-skills-eval?
agent-skills-eval is an open-source AI agent evaluation tool for developers shipping Agent Skills. Built by darkrishabh, it runs each eval twice, once as `with_skill` and once as `without_skill`, then uses a judge model to score both outputs against assertions, so you can measure whether a `SKILL.md` actually improves results. It targets engineers using Anthropic's Agent Skills spec, but it also works with any OpenAI-compatible backend or local server that speaks the chat API.
Quick Overview
| Attribute | Details |
|---|---|
| Type | AI Agent Evaluation |
| Best For | Developers shipping Agent Skills |
| Language/Stack | TypeScript, Node.js, OpenAI-compatible chat APIs, JSON/JSONL artifacts |
| License | MIT |
| GitHub Stars | N/A |
| Pricing | Open-Source |
| Last Release | N/A |
Who Should Use agent-skills-eval?
- Agent Skills authors validating a `SKILL.md` before they merge it into a production workflow. agent-skills-eval is built for the exact question, "does this skill improve the model or just add prompt bloat?"
- Platform and QA teams that need repeatable, artifact-backed checks in CI. The workspace layout, judge outputs, and `benchmark.json` make it easy to fail builds on regression instead of eyeballing transcripts.
- Indie hackers shipping AI assistants who want a cheap but disciplined eval loop. agent-skills-eval gives you a baseline run, a skill-enabled run, and a report without forcing you into a hosted observability suite.
- Teams running OpenAI-compatible or local models that need the same evaluator across providers. If your target is OpenAI, Anthropic via a compat layer, Groq, Together, or a local Llama endpoint, agent-skills-eval can still drive the comparison.
Not ideal for:
- Teams that only want raw prompt logging and no evaluation logic.
- Workflows with no judge model available, since agent-skills-eval depends on a scorer for pass/fail decisions.
- Purely deterministic unit tests where LLM judgment adds noise instead of signal.
Key Features of agent-skills-eval
- Baseline-vs-skill A/B runs: Every eval is executed twice with the same prompt, once with the skill loaded and once with the skill stripped out. That makes the lift attributable to the skill, not to prompt variance or model luck.
- Judge-graded scoring: The judge model sees the eval's `expected_output` and assertions, then grades each arm independently. This gives you pass/fail results with evidence instead of a single subjective score.
- OpenAI-compatible provider layer: agent-skills-eval can talk to any backend that exposes the OpenAI chat shape. That includes OpenAI, Anthropic through compat layers, Groq, Together, and local Llama servers without special-casing the evaluator.
- TypeScript SDK plus CLI: You can run a one-liner in CI with `npx`, or embed the evaluator in a custom TypeScript pipeline with `evaluateSkills()`. The SDK is the path for dashboards, multi-skill rollups, and custom reporters.
- Portable artifacts: The workspace outputs JSON, JSONL, and static HTML. That means you can diff `iteration-N` results, archive them in CI, or ship the report to any static host without standing up a database.
- Tool-call assertions: agent-skills-eval is not limited to text similarity. It can validate whether an agent called the right tool, which matters for workflows where function calling is the actual product behavior.
- Spec-compliant file layout: The evaluator follows the agentskills.io spec, including `SKILL.md` validation, `evals/evals.json`, `iteration-N` artifact structure, and frontmatter rules. That lowers the chance of passing local checks while failing in another runtime.
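To make the tool-call assertion idea concrete, here is a minimal sketch of what such a check could look like. The `ToolCall` shape follows the OpenAI chat format, where `function.arguments` is a JSON-encoded string. The `ToolCallAssertion` format and the `checkToolCall` helper are hypothetical names chosen for illustration and are not taken from agent-skills-eval itself.

```typescript
// Tool call as it appears in an OpenAI-style assistant message.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

// Hypothetical assertion format: the tool that must be called and,
// optionally, argument values that must match.
interface ToolCallAssertion {
  tool: string;
  args?: Record<string, unknown>;
}

interface AssertionResult {
  passed: boolean;
  evidence: string;
}

// Checks the first call to the expected tool; later calls to the same tool are ignored.
function checkToolCall(calls: ToolCall[], expected: ToolCallAssertion): AssertionResult {
  const match = calls.find((c) => c.function.name === expected.tool);
  if (!match) {
    const seen = calls.map((c) => c.function.name).join(", ");
    return { passed: false, evidence: `no call to ${expected.tool}; saw [${seen}]` };
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(match.function.arguments);
  } catch {
    return { passed: false, evidence: `arguments for ${expected.tool} are not valid JSON` };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { passed: false, evidence: `arguments for ${expected.tool} are not a JSON object` };
  }
  const actual = parsed as Record<string, unknown>;

  // Compare only the keys named in the assertion; extra arguments are allowed.
  for (const [key, want] of Object.entries(expected.args ?? {})) {
    if (JSON.stringify(actual[key]) !== JSON.stringify(want)) {
      return {
        passed: false,
        evidence: `${key}: expected ${JSON.stringify(want)}, got ${JSON.stringify(actual[key])}`,
      };
    }
  }
  return { passed: true, evidence: `called ${expected.tool} with matching arguments` };
}
```

Returning an `evidence` string alongside the boolean mirrors the "pass/fail results with evidence" idea described above, so a failing check explains itself in the report.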
agent-skills-eval vs Alternatives
| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| agent-skills-eval | Validating Anthropic-style Agent Skills | Dual-run with_skill vs without_skill comparison with judge grading | Open-Source |
| promptfoo | Broad prompt and model regression testing | Wider matrix testing across prompts, providers, and assertions | Freemium / Open-Source |
| OpenAI Evals | OpenAI-centered eval pipelines | Tight fit with OpenAI workflows and model evaluation conventions | Open-Source |
| LangSmith Evaluations | Tracing-centric AI QA and dataset management | Strong observability and dataset workflows around LLM apps | Paid / Freemium |
Pick agent-skills-eval when the unit of value is a skill package and you need to know whether that package changes behavior. Pick promptfoo when you want a more general-purpose regression harness across lots of prompts and providers, even if the skill concept is not central.
Pick OpenAI Evals when your stack is already centered on OpenAI and you want a familiar evaluation surface. Pick LangSmith Evaluations when tracing, datasets, and app-level observability matter more than the specific SKILL.md baseline test.
If you need trace-level debugging while you tune an eval, pair agent-skills-eval with OpenTrace. If the skill lives inside a multi-agent pipeline, OpenSwarm handles orchestration while agent-skills-eval handles verification.
How agent-skills-eval Works
agent-skills-eval treats a skill directory as a benchmark package, not as a loose collection of prompts. It validates SKILL.md, reads the eval definitions, and expands each case into a workspace with iteration-N artifacts so every run is reproducible and diffable. The core data model is a two-arm comparison: the same prompt goes through the target model with the skill in context, then through the same model without the skill as the baseline.
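The two-arm data model can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's internals: it assumes the skill text is injected as a system message, and the names `buildArms`, `ArmResult`, and `skillLift` are hypothetical.

```typescript
type Arm = "with_skill" | "without_skill";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Same user prompt in both arms; only the skill context differs.
// Assumption: the skill is passed as a system message.
function buildArms(skillText: string, prompt: string): Record<Arm, ChatMessage[]> {
  const user: ChatMessage = { role: "user", content: prompt };
  return {
    with_skill: [{ role: "system", content: skillText }, user],
    without_skill: [user],
  };
}

// Judge outcome for one arm: how many assertions passed out of how many.
interface ArmResult {
  passed: number;
  total: number;
}

function passRate(r: ArmResult): number {
  return r.total === 0 ? 0 : r.passed / r.total;
}

// Positive lift means the skill helped; zero or negative means it did not.
function skillLift(withSkill: ArmResult, withoutSkill: ArmResult): number {
  return passRate(withSkill) - passRate(withoutSkill);
}
```

Keeping the user message identical across arms is the point of the design: any difference in the judge's verdict can then be traced to the skill rather than to a prompt change.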
The evaluator is provider-driven rather than model-specific. A Provider implementation wraps anything that can answer an OpenAI-style chat request, which is why the same runner can work with hosted APIs, compat gateways, or local inference servers. The judge model then scores both arms against the same assertions, so the result is based on criteria you defined instead of a raw completion length or a vibes-based review.
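A provider in this sense can be quite small. The sketch below shows one plausible shape, assuming the standard OpenAI-compatible `POST {baseUrl}/chat/completions` request and `choices[0].message.content` response. The `Provider` interface here, along with `buildChatRequest`, `extractText`, `makeProvider`, and the injected `PostJson` transport, are illustrative names and may not match the SDK's actual types.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical provider contract: anything that turns messages into text.
interface Provider {
  complete(messages: ChatMessage[]): Promise<string>;
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds an OpenAI-style chat completion request for any compatible base URL.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): ChatRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model, messages }),
  };
}

// Pulls the assistant text out of an OpenAI-style response body.
function extractText(response: unknown): string {
  const content = (response as { choices?: { message?: { content?: unknown } }[] } | null)
    ?.choices?.[0]?.message?.content;
  if (typeof content !== "string") throw new Error("unexpected chat completion shape");
  return content;
}

// The transport is injected so the same provider works with fetch, a mock, or a proxy.
type PostJson = (req: ChatRequest) => Promise<unknown>;

function makeProvider(baseUrl: string, apiKey: string, model: string, post: PostJson): Provider {
  return {
    async complete(messages) {
      return extractText(await post(buildChatRequest(baseUrl, apiKey, model, messages)));
    },
  };
}
```

Swapping a hosted API for a local server then only changes `baseUrl` and `model`; the eval content and the judge criteria stay untouched.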
```bash
npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict
```
The command above runs the skill folder as an eval suite, enables the baseline comparison, and forces strict validation so bad metadata or malformed eval files fail early. Expect a workspace folder with meta.json, benchmark.json, per-eval subfolders, and a static report you can open directly in a browser or publish to GitHub Pages.
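One way to turn those artifacts into a CI gate is to scan the benchmark for evals where the skill arm did not beat the baseline. The article does not document the `benchmark.json` schema, so the `Benchmark` and `BenchmarkEntry` fields below are assumptions made purely for illustration; adapt them to the real file before relying on this.

```typescript
// Hypothetical per-eval summary; the real benchmark.json fields may differ.
interface BenchmarkEntry {
  evalId: string;
  withSkillPassRate: number; // 0..1
  withoutSkillPassRate: number; // 0..1
}

interface Benchmark {
  evals: BenchmarkEntry[];
}

// Returns every eval whose lift (with_skill minus without_skill) is below minLift.
function findRegressions(benchmark: Benchmark, minLift = 0): BenchmarkEntry[] {
  return benchmark.evals.filter(
    (e) => e.withSkillPassRate - e.withoutSkillPassRate < minLift,
  );
}

// Formats a short failure message suitable for a CI log.
function describeRegressions(regressions: BenchmarkEntry[]): string {
  return regressions
    .map((e) => `${e.evalId}: with=${e.withSkillPassRate} without=${e.withoutSkillPassRate}`)
    .join("\n");
}
```

In a pipeline you would parse the file with `JSON.parse`, call `findRegressions`, and exit non-zero when the returned list is not empty. With the default `minLift` of 0, a skill that merely ties the baseline still passes; raise the threshold if you want the skill to earn its place.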
Pros and Cons of agent-skills-eval
Pros:
- Direct skill attribution: The `with_skill` and `without_skill` split makes it clear whether the skill changed behavior.
- CI-friendly artifacts: JSON and JSONL outputs fit build pipelines, diff tools, and custom dashboards without scraping HTML.
- Fast provider swapping: OpenAI-compatible support means you can move between cloud models and local inference without rewriting the evaluator.
- Good fit for tool-use agents: Tool-call assertions catch failures that text-only evals miss.
- Low operational overhead: Static HTML reports and file-based artifacts mean no database, no queue, and no hosted backend.
Cons:
- Judge quality matters: If your judge model is sloppy, the eval result will be sloppy too. agent-skills-eval does not fix weak scoring criteria.
- Needs disciplined eval authoring: A bad `SKILL.md` or vague assertions produce noisy signals and weak conclusions.
- Not a full observability suite: If you need tracing, lineage, and long-term telemetry, agent-skills-eval should sit beside OpenTrace, not replace it.
- Baseline runs cost extra tokens: The `--baseline` mode doubles model execution for each eval, which matters on expensive models.
- Strict mode can be unforgiving: `--strict` is useful in CI, but it will surface schema and layout mistakes that casual local runs might ignore.
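The baseline cost is easy to estimate up front. The sketch below counts model calls for a run, assuming one target call per arm and a configurable number of judge calls per arm. That judge-call count is an assumption, since the article only says each arm is graded independently, and `RunPlan` and `estimateCalls` are hypothetical names.

```typescript
interface RunPlan {
  evalCount: number;
  baseline: boolean; // true when the without_skill arm is also run
  judgeCallsPerArm: number; // assumption: typically 1
}

interface CallEstimate {
  targetCalls: number;
  judgeCalls: number;
  total: number;
}

// Counts model invocations; multiply by your average tokens per call to get cost.
function estimateCalls(plan: RunPlan): CallEstimate {
  const arms = plan.baseline ? 2 : 1;
  const targetCalls = plan.evalCount * arms;
  const judgeCalls = targetCalls * plan.judgeCallsPerArm;
  return { targetCalls, judgeCalls, total: targetCalls + judgeCalls };
}
```

Under these assumptions, 20 evals with the baseline enabled and one judge call per arm come to 40 target calls and 40 judge calls, twice the 20 and 20 of a skill-only run.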
Getting Started with agent-skills-eval
```bash
npm install agent-skills-eval
OPENAI_API_KEY=... npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict
```
After the run, agent-skills-eval writes a workspace with the raw outputs, judge decisions, and a static report under the current iteration folder. If your backend is not OpenAI, configure the provider settings in YAML or the SDK so the evaluator can reach your OpenAI-compatible endpoint without changing the eval content.
Verdict
agent-skills-eval is the strongest option for validating Agent Skills when you need a baseline-vs-skill comparison instead of a single raw score. Its main strength is evidence-backed regression testing; its caveat is that judge quality and assertion quality still control the result. Use it if you want a repeatable answer, not a demo.



