
Meta-Harness: Best AI Agent Frameworks for ML Engineers in 2026


Searches harness logic around a fixed base model, so you can optimize memory, retrieval, and context display without touching weights.

Pricing: Open-Source
Tech Stack: Python, uv, Claude Code wrapper scripts
Target: ML engineers and AI research teams
Category: AI Agent Frameworks

What Is Meta-Harness?

Meta-Harness is a framework from Stanford IRIS Lab for automated search over task-specific model harnesses: the control code around a fixed base model that decides what to store, retrieve, and show while the model works. It is aimed at ML engineers and AI research teams. The repository ships the framework plus two reference experiments from the 2026 paper, which makes it useful for teams that want to tune orchestration logic instead of retraining model weights.

The paper, "Meta-Harness: End-to-End Optimization of Model Harnesses", is available on arXiv, and the repo includes an onboarding flow plus domain-specific examples for text classification and Terminal-Bench 2.0. If your bottleneck is context management, memory policy, or scaffold design, Meta-Harness is aimed at that layer.

Quick Overview

| Attribute | Details |
|---|---|
| Type | AI Agent Frameworks |
| Best For | ML engineers and AI research teams |
| Language/Stack | Python, uv, Claude Code wrapper scripts |
| License | N/A |
| GitHub Stars | N/A |
| Pricing | Open-Source |
| Last Release | N/A |

Who Should Use Meta-Harness?

  • Research teams evaluating how far a fixed base model can go when the harness is optimized instead of the weights. Meta-Harness fits benchmark-driven experiments where you care about measurable deltas from retrieval, memory, or scaffold changes.
  • ML engineers shipping domain assistants that need custom context selection, state tracking, or evaluation loops. The framework is a better fit than prompt tinkering when the system behavior depends on persistent control logic.
  • Infra and platform teams building repeatable experiment pipelines around one base model. Meta-Harness gives you a place to standardize proposer logging, candidate evaluation, and domain specs.
  • Indie hackers who want to explore a domain-specific assistant without committing to fine-tuning infrastructure. The repo’s text classification example and Terminal-Bench 2.0 scaffold example make the runtime structure easy to grasp quickly.

Not ideal for:

  • Teams that want a turn-key SaaS with dashboards, hosted evals, and opinionated workflow management.
  • Projects that need a fully supported production platform instead of a paper artifact that has only been verified to run.
  • Users who want to fine-tune the base model itself rather than search the harness around it.

Key Features of Meta-Harness

  • Harness search over control logic — Meta-Harness treats the harness as the optimization target, not the model. That means you can explore what to store, retrieve, and show as separate decisions instead of hiding them inside a prompt blob.
  • Onboarding flow for new domains — The repo points you to ONBOARDING.md, then expects a conversation that produces domain_spec.md. That file becomes the concrete contract for implementing the framework in a new domain.
  • Reference experiments for two real tasks — The shipped examples cover reference_examples/text_classification/ for memory-system search and reference_examples/terminal_bench_2/ for scaffold evolution. Those are useful because they show both NLP-style and terminal-agent style harnesses.
  • Proposer-agent abstraction — The examples assume Claude Code as the proposer agent, but the repo explicitly says you can swap it by adapting claude_wrapper.py. The main requirement is clean logging of proposer interactions so the search loop remains auditable.
  • Reproducible uv-based runs — The quick start uses uv sync and uv run, which keeps dependency resolution close to the repo instead of relying on ambient Python state. That reduces setup drift across machines and CI runs.
  • Benchmark-first workflow — The framework is tied to smoke tasks and full evaluation commands, especially for Terminal-Bench 2.0. This makes Meta-Harness useful when you need a measurable signal for candidate harness variants.
  • Paper-aligned artifact structure — The repository is a cleaned-up version of the code used for the paper. That matters because the directory layout and example scripts mirror the experimental workflow rather than a generic library template.
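
To make the store, retrieve, and show split from the list above concrete, here is a minimal sketch of a harness as three separate policies. The class name `KeywordHarness`, its methods, and the `max_items` knob are all hypothetical illustrations, not part of the Meta-Harness codebase; the word-overlap ranking is a stand-in for whatever retrieval rule a real harness would use.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordHarness:
    """Toy harness (hypothetical; not the Meta-Harness API)."""

    max_items: int = 2  # how many stored notes the model gets to see
    memory: list[str] = field(default_factory=list)

    def store(self, observation: str) -> None:
        # Storage policy: keep every non-empty observation.
        if observation.strip():
            self.memory.append(observation)

    def retrieve(self, query: str) -> list[str]:
        # Retrieval policy: rank notes by the number of words shared with the query.
        words = set(query.lower().split())
        scored = [(len(words & set(m.lower().split())), m) for m in self.memory]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [m for score, m in ranked if score > 0][: self.max_items]

    def show(self, query: str) -> str:
        # Display policy: assemble the context the fixed base model would receive.
        notes = "\n".join(f"- {m}" for m in self.retrieve(query))
        return f"Relevant notes:\n{notes}\n\nTask: {query}"


harness = KeywordHarness(max_items=1)
harness.store("refund requests go to billing")
harness.store("password resets go to support")
prompt = harness.show("where do refund requests go")
print(prompt)
```

Because each policy is its own method, a search procedure can vary one of them (for example, the ranking rule or `max_items`) while holding the base model and the other two fixed.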

Meta-Harness vs Alternatives

| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| Meta-Harness | Search over task-specific harnesses | Optimizes the control code around a fixed base model | Open-Source |
| DSPy | Prompt and program optimization | Compiles higher-level programs and prompt strategies | Open-Source |
| LangGraph | Stateful agent workflows | Orchestrates nodes, state transitions, and branching logic | Open-Source |
| OpenSwarm | Multi-agent coordination | Coordinates multiple agents at runtime instead of searching a harness | Open-Source |

Pick DSPy if you want a more general prompt-program optimization layer and you are comfortable expressing the task as a declarative program. Pick LangGraph when the hard problem is stateful orchestration and branching execution, not benchmark search.

Pick OpenSwarm when the requirement is coordinating many agents across a workflow. If you already have trace data and need to inspect failures rather than optimize scaffolds, OpenTrace is the better adjacent tool. If the workflow is still mostly interactive coding with a model, Claude Code Canvas is closer to a human-in-the-loop editor than a search system.

How Meta-Harness Works

Meta-Harness works by framing the harness as a search space around a fixed base model. The search space includes memory policy, retrieval rules, displayed context, scaffold code, and the proposer-agent behavior that generates candidate harnesses.

The design choice is simple: keep the base model stable, then optimize the runtime system that feeds it information. That is a better fit than weight updates when the failure mode is bad context selection, bad ordering, or poor task-specific scaffolding. In practice, the system uses a domain spec and a proposer wrapper, then runs iterations that create, evaluate, and log candidate harness variants.

cd reference_examples/text_classification
uv sync
uv run python meta_harness.py --iterations 1

That command runs the text-classification example through one search iteration. The output is meant to validate the harness loop, not to produce a production-ready artifact, so expect logs, candidate generation, and evaluation results rather than a polished UI.

For the terminal benchmark path, the repo uses a similar pattern but swaps in an agent harness script and an evaluation shell command. That split makes Meta-Harness useful for both lightweight smoke tests and heavier benchmark runs, as long as the domain-specific evaluator is defined clearly.
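
The create, evaluate, and log cycle described in this section can be sketched as a small loop. This is a greedy hill-climbing stand-in under stated assumptions, not the search procedure from the paper: `propose`, `evaluate`, the `top_k` knob, and the idea that three retrieved items is ideal are all invented for illustration. In the real repo the proposer is a coding agent and the evaluator is a domain benchmark.

```python
import random


def propose(parent: dict, rng: random.Random) -> dict:
    # Stand-in proposer: nudge one knob of the parent harness config.
    child = dict(parent)
    child["top_k"] = max(1, parent["top_k"] + rng.choice([-1, 1]))
    return child


def evaluate(candidate: dict) -> float:
    # Stand-in evaluator: pretend retrieving 3 items is ideal for this task.
    return 1.0 - 0.25 * abs(candidate["top_k"] - 3)


def search(iterations: int, seed: int = 0) -> tuple[dict, list[dict]]:
    rng = random.Random(seed)
    best = {"top_k": 1}
    best_score = evaluate(best)
    log = []
    for step in range(iterations):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        accepted = score > best_score
        # Log every candidate, accepted or not, so the run stays auditable.
        log.append({"step": step, "candidate": candidate,
                    "score": score, "accepted": accepted})
        if accepted:
            best, best_score = candidate, score
    return best, log


best, log = search(iterations=20)
print(best, len(log))
```

The base model never appears in the loop: only the harness configuration changes between iterations, which is the point of searching the harness rather than the weights.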

Pros and Cons of Meta-Harness

Pros:

  • Optimizes the right layer — It targets harness logic, which is where many agent failures actually happen.
  • Supports new domains through onboarding: ONBOARDING.md and domain_spec.md create a repeatable path for adaptation.
  • Ships with two concrete examples — Text classification and Terminal-Bench 2.0 show how the framework behaves in different task shapes.
  • Works with custom proposer agents — The wrapper abstraction makes it possible to swap Claude Code for another proposer if logging stays clean.
  • Reproducible command flow: uv commands reduce environment drift and make local reproduction easier.
  • Paper-linked artifact — The repo maps closely to the published paper, which helps when you want to align implementation with the research claim.

Cons:

  • Not production-hardened — The release note says it has only been checked to run, so expect rough edges.
  • Requires domain engineering — You need to define the evaluation target, propose-good-candidate loop, and logging behavior yourself.
  • Assumes a proposer workflow — The shipped examples are built around Claude Code, so alternate agents need adapter work.
  • No hosted control plane — There is no SaaS layer for experiment management, artifact storage, or team collaboration.
  • Narrow scope by design — If you need model training, deployment, and tracing in one product, Meta-Harness is only one piece of that stack.
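
The adapter work and logging requirement noted above can be illustrated with a thin wrapper that records every proposer interaction. This is a sketch only: `LoggedProposer` and `echo_proposer` are hypothetical names, the JSON Lines log format is an assumption, and none of it reflects the actual interface of claude_wrapper.py.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class LoggedProposer:
    """Hypothetical adapter around any prompt-to-text proposer function."""

    def __init__(self, propose_fn: Callable[[str], str], log_path: Path):
        self.propose_fn = propose_fn
        self.log_path = log_path

    def __call__(self, prompt: str) -> str:
        response = self.propose_fn(prompt)
        record = {"prompt": prompt, "response": response}
        # Append one JSON object per line so runs can be replayed or diffed.
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return response


def echo_proposer(prompt: str) -> str:
    # Stand-in for a real agent call.
    return f"candidate for: {prompt}"


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "proposer_log.jsonl"
    proposer = LoggedProposer(echo_proposer, log_file)
    proposer("improve retrieval ordering")
    proposer("shorten displayed context")
    lines = log_file.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines]

print(len(records), records[0]["response"])
```

Swapping in a different agent then means replacing `echo_proposer` while the logging path stays unchanged.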

Getting Started with Meta-Harness

Clone the repository, enter a reference example, install dependencies with uv, and run a single iteration of the search loop.

git clone https://github.com/stanford-iris-lab/meta-harness
cd meta-harness/reference_examples/text_classification
uv sync
uv run python meta_harness.py --iterations 1

After that run, you should see the harness search cycle execute once for the text-classification example. If you want the Terminal-Bench 2.0 smoke task instead, switch into reference_examples/terminal_bench_2/ and run the run_eval.sh command documented in that subdirectory's README. For a new domain, the first thing to configure is the domain_spec.md file produced by working through ONBOARDING.md.
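
If you want to trigger the same single-iteration run from a script (for example, as a CI smoke check), one possible wrapper looks like this. The helper names `smoke_command` and `run_smoke` are hypothetical; the command itself is the one from the quick start, and the example path assumes the repository was cloned into the current directory.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def smoke_command(iterations: int = 1) -> list[str]:
    # Mirrors the quick-start command; only the iteration count varies.
    return ["uv", "run", "python", "meta_harness.py",
            "--iterations", str(iterations)]


def run_smoke(example_dir: Path, iterations: int = 1) -> Optional[int]:
    # Skip quietly when uv or the example directory is missing.
    if shutil.which("uv") is None or not example_dir.is_dir():
        return None
    completed = subprocess.run(smoke_command(iterations), cwd=example_dir)
    return completed.returncode


exit_code = run_smoke(Path("meta-harness/reference_examples/text_classification"))
print("skipped" if exit_code is None else f"exit code {exit_code}")
```

A non-zero exit code here only tells you the loop failed to run; it says nothing about the quality of the candidate harnesses.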

Verdict

Meta-Harness is a strong option for harness-search research when you want to optimize the control code around a fixed base model instead of swapping models. Its main strength is the combination of domain onboarding and an evaluation loop; the caveat is that it expects engineering effort and clean benchmark definitions. Use Meta-Harness if repeatable harness optimization is the goal.
