Frequently Asked Questions

Is agent-skills-eval free to use?

Yes. agent-skills-eval is free to use because it is released under the MIT license and published as an open-source npm package. You can install it locally, run it in CI, and adapt the SDK without paying a vendor fee.

How does agent-skills-eval compare to promptfoo?

agent-skills-eval is narrower and more opinionated than promptfoo because it is built around the Agent Skills workflow and the `with_skill` versus `without_skill` comparison. promptfoo is better when you need broad prompt matrix testing across many models and assertions. agent-skills-eval wins when your goal is to prove that a specific skill package changes agent behavior.

Does agent-skills-eval support OpenAI-compatible models?

Yes. agent-skills-eval supports any backend that speaks the OpenAI chat API shape, including hosted APIs and local inference gateways. That makes it practical for OpenAI, Anthropic through a compat layer, Groq, Together, and self-hosted Llama endpoints.
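
To make the "OpenAI chat API shape" concrete, here is a minimal TypeScript sketch of the request an OpenAI-compatible backend expects. The `EndpointSketch` type, the `prepareChatCall` helper, the localhost URL, and the model name are illustrative assumptions and not part of agent-skills-eval; only the `/chat/completions` path and the `model` plus `messages` body follow the OpenAI chat convention.

```typescript
// Hypothetical endpoint settings; agent-skills-eval's own config keys may differ.
interface EndpointSketch {
  baseUrl: string; // e.g. "https://api.openai.com/v1" or a local gateway
  apiKey?: string; // local servers often need none
  model: string;
}

interface PreparedChatCall {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds (but does not send) a chat completion request for any compatible backend.
function prepareChatCall(endpoint: EndpointSketch, userPrompt: string): PreparedChatCall {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (endpoint.apiKey) {
    headers["Authorization"] = `Bearer ${endpoint.apiKey}`;
  }
  return {
    // Strip trailing slashes so the joined path has exactly one separator.
    url: `${endpoint.baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        model: endpoint.model,
        messages: [{ role: "user", content: userPrompt }],
      }),
    },
  };
}

// Assumed local gateway and model name, for illustration only.
const localCall = prepareChatCall(
  { baseUrl: "http://localhost:8080/v1/", model: "llama-3-8b-instruct" },
  "Say hi",
);
// localCall.url → "http://localhost:8080/v1/chat/completions"
```

Under these assumptions, swapping providers only changes `baseUrl`, `apiKey`, and `model`; the prepared call can be handed to `fetch(localCall.url, localCall.init)`.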

Can agent-skills-eval test tool-calling agents?

Yes. agent-skills-eval includes tool-call assertions, so it can verify that an agent invoked the correct tool instead of only checking the final text. That matters when the real bug is bad function selection, not wording.
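
As a rough illustration of what a tool-call assertion checks, the sketch below inspects an OpenAI-style chat response for a call to a named function. The `calledTool` helper and the `Sketch*` types are hypothetical and are not agent-skills-eval's API; the `tool_calls[].function.name` and JSON-encoded `arguments` fields follow the OpenAI chat response convention, with other fields such as `id` omitted.

```typescript
// Minimal slice of an OpenAI-style chat completion response.
// Only the fields this check needs are modeled.
interface SketchToolCall {
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON-encoded string
}

interface SketchChatResponse {
  choices: Array<{
    message: { content: string | null; tool_calls?: SketchToolCall[] };
  }>;
}

// Hypothetical helper: true when the first choice called `toolName`
// and every expected argument matches exactly.
function calledTool(
  response: SketchChatResponse,
  toolName: string,
  expectedArgs: Record<string, unknown> = {},
): boolean {
  const calls = response.choices[0]?.message.tool_calls ?? [];
  return calls.some((call) => {
    if (call.function.name !== toolName) return false;
    let args: unknown;
    try {
      args = JSON.parse(call.function.arguments);
    } catch {
      return false; // malformed arguments count as a failed assertion
    }
    if (typeof args !== "object" || args === null) return false;
    const actual = args as Record<string, unknown>;
    return Object.keys(expectedArgs).every(
      (key) => JSON.stringify(actual[key]) === JSON.stringify(expectedArgs[key]),
    );
  });
}

// Example response in which the agent picked a weather tool.
const sampleResponse: SketchChatResponse = {
  choices: [
    {
      message: {
        content: null,
        tool_calls: [
          {
            type: "function",
            function: { name: "get_weather", arguments: '{"city":"Paris"}' },
          },
        ],
      },
    },
  ],
};
```

Here `calledTool(sampleResponse, "get_weather", { city: "Paris" })` passes, while asking for a different tool name or city fails even if the final text would have looked plausible.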

What files does agent-skills-eval need?

agent-skills-eval expects a skill folder with `SKILL.md` and eval definitions such as `evals/evals.json` when you follow the Agent Skills spec. It can then validate the layout, run the evals, and write reproducible artifacts into the workspace.
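
The kind of layout check described above can be sketched for the `SKILL.md` frontmatter alone. This is a deliberately minimal illustration, assuming the frontmatter must carry non-empty `name` and `description` fields; the helper names are hypothetical, the parser handles only flat `key: value` lines, and the tool's own validation is likely stricter.

```typescript
// Extracts top-level `key: value` pairs from a leading `---` frontmatter block.
// Deliberately minimal: no nested YAML, no multi-line values.
function readFrontmatter(markdown: string): Record<string, string> | null {
  const lines = markdown.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return null;
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return null; // opening fence without a closing one
  const fields: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const match = /^([A-Za-z0-9_-]+):\s*(.*)$/.exec(line);
    if (match) {
      // Drop one pair of surrounding quotes, if present.
      fields[match[1]] = match[2].trim().replace(/^["']|["']$/g, "");
    }
  }
  return fields;
}

// Hypothetical check: reports missing fields instead of throwing.
function checkSkillFrontmatter(markdown: string): string[] {
  const fields = readFrontmatter(markdown);
  if (fields === null) return ["missing frontmatter block"];
  const problems: string[] = [];
  for (const key of ["name", "description"]) {
    if (!fields[key]) problems.push(`frontmatter is missing "${key}"`);
  }
  return problems;
}

// Illustrative skill file content.
const sampleSkill = [
  "---",
  "name: pdf-forms",
  "description: Fill and validate PDF forms.",
  "---",
  "# Instructions",
].join("\n");
```

Calling `checkSkillFrontmatter(sampleSkill)` returns an empty list, while a file without the `---` block returns a single "missing frontmatter block" problem.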

When should I use baseline mode in agent-skills-eval?

Use baseline mode in agent-skills-eval whenever you want to measure skill lift instead of just generating a completion. The baseline run strips the skill from context, so you can compare the same prompt against the same model and see whether the skill actually improved the result.
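
Skill lift can be read as the difference in pass rate between the two arms. The sketch below shows that arithmetic under an assumed `ArmVerdict` shape; the field names are hypothetical and do not mirror agent-skills-eval's artifact schema.

```typescript
// Hypothetical per-eval verdicts; the real artifact schema may differ.
interface ArmVerdict {
  evalId: string;
  withSkillPassed: boolean;
  withoutSkillPassed: boolean;
}

interface LiftSummary {
  withSkillRate: number;
  withoutSkillRate: number;
  lift: number; // with_skill pass rate minus without_skill pass rate
}

function summarizeLift(verdicts: ArmVerdict[]): LiftSummary {
  if (verdicts.length === 0) {
    throw new Error("need at least one eval to compute lift");
  }
  // Fraction of `true` entries in a list of pass flags.
  const rate = (passed: boolean[]): number =>
    passed.filter(Boolean).length / passed.length;
  const withSkillRate = rate(verdicts.map((v) => v.withSkillPassed));
  const withoutSkillRate = rate(verdicts.map((v) => v.withoutSkillPassed));
  return { withSkillRate, withoutSkillRate, lift: withSkillRate - withoutSkillRate };
}

// Four illustrative evals: the skill arm passes 3, the baseline arm passes 1.
const sampleVerdicts: ArmVerdict[] = [
  { evalId: "a", withSkillPassed: true, withoutSkillPassed: false },
  { evalId: "b", withSkillPassed: true, withoutSkillPassed: true },
  { evalId: "c", withSkillPassed: true, withoutSkillPassed: false },
  { evalId: "d", withSkillPassed: false, withoutSkillPassed: false },
];
const sampleLift = summarizeLift(sampleVerdicts);
// sampleLift → { withSkillRate: 0.75, withoutSkillRate: 0.25, lift: 0.5 }
```

A lift near zero suggests the skill adds context without changing outcomes; a negative lift suggests it is hurting.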

agent-skills-eval: Best AI Agent Evaluation for Devs in 2026

agent-skills-eval proves whether a `SKILL.md` improves model output by running a baseline-vs-skill A/B eval, grading both with a judge model, and writing machine-readable artifacts for CI.

What Is agent-skills-eval?

agent-skills-eval is one of the best AI agent evaluation tools for developers shipping Agent Skills. Built by darkrishabh, it runs each eval twice, once `with_skill` and once `without_skill`, then uses a judge model to score both outputs against assertions, so you can measure whether a `SKILL.md` actually improves results. It targets engineers using Anthropic's Agent Skills spec, but it also works with any OpenAI-compatible backend or local server that speaks the chat API.

Quick Overview

Attribute	Details
Type	AI Agent Evaluation
Best For	developers shipping Agent Skills
Language/Stack	TypeScript, Node.js, OpenAI-compatible chat APIs, JSON/JSONL artifacts
License	MIT
GitHub Stars	N/A
Pricing	Open-Source
Last Release	N/A

Who Should Use agent-skills-eval?

Agent Skills authors validating a SKILL.md before they merge it into a production workflow. agent-skills-eval is built for the exact question, "does this skill improve the model or just add prompt bloat?"
Platform and QA teams that need repeatable, artifact-backed checks in CI. The workspace layout, judge outputs, and benchmark.json make it easy to fail builds on regression instead of eyeballing transcripts.
Indie hackers shipping AI assistants who want a cheap but disciplined eval loop. agent-skills-eval gives you a baseline run, a skill-enabled run, and a report without forcing you into a hosted observability suite.
Teams running OpenAI-compatible or local models that need the same evaluator across providers. If your target is OpenAI, Anthropic via a compat layer, Groq, Together, or a local Llama endpoint, agent-skills-eval can still drive the comparison.

Not ideal for:

Teams that only want raw prompt logging and no evaluation logic.
Workflows with no judge model available, since agent-skills-eval depends on a scorer for pass/fail decisions.
Purely deterministic unit tests where LLM judgment adds noise instead of signal.

Key Features of agent-skills-eval

Baseline-vs-skill A/B runs — Every eval is executed twice with the same prompt, once with the skill loaded and once with the skill stripped out. That makes the lift attributable to the skill, not to prompt variance or model luck.
Judge-graded scoring — The judge model sees the eval's expected_output and assertions, then grades each arm independently. This gives you pass/fail results with evidence instead of a single subjective score.
OpenAI-compatible provider layer — agent-skills-eval can talk to any backend that exposes the OpenAI chat shape. That includes OpenAI, Anthropic through compat layers, Groq, Together, and local Llama servers without special casing the evaluator.
TypeScript SDK plus CLI — You can run a one-liner in CI with npx, or embed the evaluator in a custom TypeScript pipeline with evaluateSkills(). The SDK is the path for dashboards, multi-skill rollups, and custom reporters.
Portable artifacts — The workspace outputs JSON, JSONL, and static HTML. That means you can diff iteration-N results, archive them in CI, or ship the report to any static host without standing up a database.
Tool-call assertions — agent-skills-eval is not limited to text similarity. It can validate whether an agent called the right tool, which matters for workflows where function calling is the actual product behavior.
Spec-compliant file layout — The evaluator follows the agentskills.io spec, including SKILL.md validation, evals/evals.json, iteration-N artifact structure, and frontmatter rules. That lowers the chance of passing local checks while failing in another runtime.
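
To illustrate the judge-graded scoring feature listed above, here is a hedged sketch of how a grading prompt might be assembled and a reply parsed into per-assertion verdicts. The prompt wording, the JSON reply contract, and every name below are assumptions for illustration; the tool's actual judge prompt and output format are not documented here.

```typescript
// Hypothetical verdict shape: one entry per assertion, with supporting evidence.
interface JudgeVerdictSketch {
  assertion: string;
  passed: boolean;
  evidence: string;
}

// Builds a grading prompt for one arm. Wording and JSON contract are illustrative.
function buildJudgePrompt(
  output: string,
  expectedOutput: string,
  assertions: string[],
): string {
  return [
    "Grade the candidate output against each assertion.",
    `Expected output: ${expectedOutput}`,
    `Candidate output: ${output}`,
    "Assertions:",
    ...assertions.map((a, i) => `${i + 1}. ${a}`),
    'Reply with a JSON array of {"assertion", "passed", "evidence"} objects.',
  ].join("\n");
}

// Parses the judge reply defensively: malformed replies yield an empty list,
// and entries with the wrong field types are dropped.
function parseJudgeReply(reply: string): JudgeVerdictSketch[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(reply);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (v): v is JudgeVerdictSketch =>
      typeof v === "object" &&
      v !== null &&
      typeof v.assertion === "string" &&
      typeof v.passed === "boolean" &&
      typeof v.evidence === "string",
  );
}
```

Running the same `buildJudgePrompt` call once per arm keeps the criteria identical for `with_skill` and `without_skill`, which is what makes the two verdict lists comparable.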

agent-skills-eval vs Alternatives

Tool	Best For	Key Differentiator	Pricing
agent-skills-eval	Validating Anthropic-style Agent Skills	Dual-run `with_skill` vs `without_skill` comparison with judge grading	Open-Source
promptfoo	Broad prompt and model regression testing	Wider matrix testing across prompts, providers, and assertions	Freemium / Open-Source
OpenAI Evals	OpenAI-centered eval pipelines	Tight fit with OpenAI workflows and model evaluation conventions	Open-Source
LangSmith Evaluations	Tracing-centric AI QA and dataset management	Strong observability and dataset workflows around LLM apps	Paid / Freemium

Pick agent-skills-eval when the unit of value is a skill package and you need to know whether that package changes behavior. Pick promptfoo when you want a more general-purpose regression harness across lots of prompts and providers, even if the skill concept is not central.

Pick OpenAI Evals when your stack is already centered on OpenAI and you want a familiar evaluation surface. Pick LangSmith Evaluations when tracing, datasets, and app-level observability matter more than the specific SKILL.md baseline test.

If you need trace-level debugging while you tune an eval, pair agent-skills-eval with OpenTrace. If the skill lives inside a multi-agent pipeline, OpenSwarm handles orchestration while agent-skills-eval handles verification.

How agent-skills-eval Works

agent-skills-eval treats a skill directory as a benchmark package, not as a loose collection of prompts. It validates SKILL.md, reads the eval definitions, and expands each case into a workspace with iteration-N artifacts so every run is reproducible and diffable. The core data model is a two-arm comparison: the same prompt goes through the target model with the skill in context, then through the same model without the skill as the baseline.

The evaluator is provider-driven rather than model-specific. A Provider implementation wraps anything that can answer an OpenAI-style chat request, which is why the same runner works with hosted APIs, compat gateways, or local inference servers. The judge model then scores both arms against the same assertions, so the result rests on criteria you defined rather than on raw completion length or a vibes-based review.
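
The two-arm model and the provider abstraction can be sketched in a few lines of TypeScript. The `SketchProvider` interface, the `runBothArms` function, and the choice to inject the skill as a system message are assumptions made for illustration; the library's real Provider contract and injection strategy may differ.

```typescript
interface SketchMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical provider contract: anything that can answer a chat request.
interface SketchProvider {
  complete(model: string, messages: SketchMessage[]): Promise<string>;
}

interface ArmOutputs {
  with_skill: string;
  without_skill: string;
}

// Runs the same prompt twice against the same model: once with the skill text
// in a system message (an assumed injection strategy), once without it.
async function runBothArms(
  provider: SketchProvider,
  model: string,
  skillText: string,
  prompt: string,
): Promise<ArmOutputs> {
  const user: SketchMessage = { role: "user", content: prompt };
  const [withSkill, withoutSkill] = await Promise.all([
    provider.complete(model, [{ role: "system", content: skillText }, user]),
    provider.complete(model, [user]),
  ]);
  return { with_skill: withSkill, without_skill: withoutSkill };
}

// Stub provider for local experimentation: its answer only reflects
// whether a system message was present, so no network call is made.
const stubProvider: SketchProvider = {
  async complete(_model, messages) {
    return messages.some((m) => m.role === "system")
      ? "skill-aware answer"
      : "plain answer";
  },
};
```

Because the runner depends only on the `complete` method, a hosted API client, a compat gateway, or a local server wrapper could be substituted for `stubProvider` without touching the comparison logic.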

npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict

The command above runs the skill folder as an eval suite, enables the baseline comparison, and forces strict validation so bad metadata or malformed eval files fail early. Expect a workspace folder with meta.json, benchmark.json, per-eval subfolders, and a static report you can open directly in a browser or publish to GitHub Pages.
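
For CI, the artifacts can feed a simple regression gate. The sketch below assumes a summary with per-arm pass rates has already been extracted from the workspace; the `IterationSummarySketch` fields are hypothetical, since the actual `benchmark.json` schema is not described here.

```typescript
// Hypothetical summary of one iteration; the real benchmark.json schema may differ.
interface IterationSummarySketch {
  iteration: number;
  withSkillPassRate: number; // fraction of evals passed with the skill, 0..1
  withoutSkillPassRate: number; // fraction passed by the baseline arm, 0..1
}

interface GateResultSketch {
  ok: boolean;
  reasons: string[];
}

// Fails the gate when the skill adds no lift or regresses against the previous iteration.
function gateIteration(
  current: IterationSummarySketch,
  previous: IterationSummarySketch | null,
  tolerance = 0.02, // allowed drop between iterations, to absorb judge noise
): GateResultSketch {
  const reasons: string[] = [];
  if (current.withSkillPassRate <= current.withoutSkillPassRate) {
    reasons.push("skill shows no lift over the baseline arm");
  }
  if (previous && current.withSkillPassRate < previous.withSkillPassRate - tolerance) {
    reasons.push(`with_skill pass rate dropped since iteration ${previous.iteration}`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

A CI script could print `reasons` and exit non-zero whenever `ok` is false, which turns the comparison into a build-breaking check rather than a report someone has to remember to read.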

Pros and Cons of agent-skills-eval

Pros:

Direct skill attribution — The with_skill and without_skill split makes it clear whether the skill changed behavior.
CI-friendly artifacts — JSON and JSONL outputs fit build pipelines, diff tools, and custom dashboards without scraping HTML.
Fast provider swapping — OpenAI-compatible support means you can move between cloud models and local inference without rewriting the evaluator.
Good fit for tool-use agents — Tool-call assertions catch failures that text-only evals miss.
Low operational overhead — Static HTML reports and file-based artifacts mean no database, no queue, and no hosted backend.

Cons:

Judge quality matters — If your judge model is sloppy, the eval result will be sloppy too. agent-skills-eval does not fix weak scoring criteria.
Needs disciplined eval authoring — A bad SKILL.md or vague assertions produce noisy signals and weak conclusions.
Not a full observability suite — If you need tracing, lineage, and long-term telemetry, agent-skills-eval should sit beside OpenTrace, not replace it.
Baseline runs cost extra tokens — The --baseline mode doubles model execution for each eval, which matters on expensive models.
Strict mode can be unforgiving — --strict is useful in CI, but it will surface schema and layout mistakes that casual local runs might ignore.

Getting Started with agent-skills-eval

npm install agent-skills-eval
OPENAI_API_KEY=... npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict

After the run, agent-skills-eval writes a workspace with the raw outputs, judge decisions, and a static report under the current iteration folder. If your backend is not OpenAI, configure the provider settings in YAML or the SDK so the evaluator can reach your OpenAI-compatible endpoint without changing the eval content.

Verdict

agent-skills-eval is the strongest option for validating Agent Skills when you need a baseline-vs-skill comparison instead of a single raw score. Its main strength is evidence-backed regression testing; its caveat is that judge quality and assertion quality still control the result. Use it if you want a repeatable answer, not a demo.
