ChainReason: Best AI Benchmarks for DeFi LLM Teams in 2026

ChainReason benchmarks whether an LLM can reason over Ethereum mechanics, Solidity vulnerabilities, transaction intent, and AMM math instead of just generating code.

Pricing: Open-Source
Tech Stack: Python 3.9+, PyTorch, Transformers, OpenAI API, Ethereum and DeFi
Target: DeFi LLM teams and Ethereum researchers
Category: AI Benchmarks

What Is ChainReason?

ChainReason is a lightweight AI benchmark built by Joshua Yamamoto for evaluating LLM reasoning on Ethereum and DeFi tasks. Its seed suite covers 64 curated examples across five tasks: protocol_qa, vuln_detect, contract_class, tx_intent, and slippage_pred. The point is not scale; the point is to separate symbolic reasoning, code understanding, structural pattern matching, and AMM math.

Quick Overview

Attribute | Details
Type | AI Benchmarks
Best For | DeFi LLM teams and Ethereum researchers
Language/Stack | Python 3.9+, PyTorch, Transformers, OpenAI API, Ethereum and DeFi
License | MIT
GitHub Stars | N/A
Pricing | Open-Source
Last Release | N/A

ChainReason runs as a small Python package with task-specific prompts, parsers, and scorers. It is built for quick regression checks on on-chain reasoning, not for broad general-purpose NLP scoring. If your workflow already inspects transaction traces with OpenTrace or compares reasoning-centric models with Open R1, ChainReason fits as the domain-specific scoring layer.

Who Should Use ChainReason?

  • LLM engineers benchmarking model versions on Ethereum-specific reasoning before a release.
  • Security researchers testing whether a model can classify Solidity vulnerabilities and identify contract types from ABI summaries.
  • DeFi protocol teams evaluating support copilots that need to explain swaps, pool state, and protocol behavior without hallucinating.
  • Indie hackers building crypto tools who need a compact sanity check before wiring a model into production.

Not ideal for:

  • Leaderboard hunters who need thousands of examples and broad academic coverage rather than a focused reasoning set.
  • Teams that only want code generation, because ChainReason evaluates models rather than synthesizing Solidity from scratch.
  • Non-Ethereum projects where the benchmark would add little signal compared with domain-specific test sets.

Key Features of ChainReason

  • Five-task coverage — ChainReason spans protocol QA, vulnerability detection, contract classification, transaction-intent inference, and slippage prediction. That mix matters because each task stresses a different failure mode, from symbolic reasoning to closed-form numeric math.
  • Task-specific metrics: protocol_qa uses accuracy, vuln_detect and contract_class add macro-F1, tx_intent checks label accuracy, and slippage_pred uses tiered relative error. That gives you a more honest picture than a single aggregate score.
  • Curated seed dataset — the repository ships with 64 small, hand-written examples that are easy to run in under a minute. The design is intentional: ChainReason is meant to validate model behavior, not to simulate a giant Etherscan crawl.
  • Local and API model support — ChainReason works with OpenAI models through the API client and with local HuggingFace models via torch, transformers, and accelerate. That makes it usable for both closed-model smoke tests and offline open-weight runs.
  • Configurable data loading — the benchmark accepts custom data via --data-path, so you can extend the seed set with your own protocol snippets, traces, and AMM scenarios. For teams with proprietary flows, that is the difference between a toy benchmark and a real internal gate.
  • Programmatic runner — the package exposes get_task, run_eval, and model adapters, which makes it easy to slot into CI or a research notebook. You can score a single task or sweep the whole suite without writing glue code.
  • Results aggregation: aggregate_results.py turns per-run outputs into a summary markdown file. That is useful when you want a human-readable report for model comparisons, not just JSON blobs.
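
To make the metric names above concrete, here is a minimal sketch of macro-F1 and a tiered relative-error score in plain Python. It is illustrative only and is not taken from the ChainReason source: the function names are hypothetical, and the tier thresholds (1%, 5%, 10%) and partial-credit values are assumptions chosen to show the idea, not the benchmark's actual cutoffs.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# ASSUMPTION: hypothetical tiers, (max relative error, credit awarded).
DEFAULT_TIERS = ((0.01, 1.0), (0.05, 0.5), (0.10, 0.25))


def tiered_relative_error_score(predicted, expected, tiers=DEFAULT_TIERS):
    """Award partial credit based on how close a numeric prediction is."""
    if expected == 0:
        return 1.0 if predicted == 0 else 0.0
    rel_err = abs(predicted - expected) / abs(expected)
    for threshold, credit in tiers:
        if rel_err <= threshold:
            return credit
    return 0.0
```

Macro-F1 weights every class equally, so a rare vulnerability type that the model always misses drags the score down even when overall accuracy looks fine. The tiered score plays a similar role for numeric answers: a prediction that is close but not exact still earns some credit instead of counting as a flat miss.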

ChainReason vs Alternatives

Tool | Best For | Key Differentiator | Pricing
ChainReason | Ethereum and DeFi reasoning checks | Five task families cover protocol mechanics, transaction intent, vulnerability detection, and AMM math | Open-Source
lm-eval-harness | Broad LLM benchmarking across many datasets | Huge benchmark catalog and standardized evaluation plumbing | Open-Source
OpenAI Evals | API-first model regression tests | Tight integration with OpenAI workflows and simple eval iteration | Open-Source
Open R1 | Reasoning model research and reproducible experiments | Focus on open reasoning-model development rather than domain-specific DeFi scoring | Open-Source

Pick ChainReason when the question is whether a model understands on-chain behavior, not whether it can answer generic trivia. Pick OpenTrace alongside it when you need to inspect decoded transaction sequences before scoring them. Pick lm-eval-harness when you want one harness for many unrelated benchmarks, and pick OpenAI Evals when your workflow is already centered on API-backed model regression.

How ChainReason Works

ChainReason uses a small Task abstraction as the core unit of evaluation. Each task defines how examples load, how prompts are built, how responses are parsed, and how predictions are scored. The runner then feeds those prompts into a model adapter, collects completions, and computes task-specific metrics against the target labels or numeric outputs.

The design is intentionally narrow. protocol_qa asks multiple-choice questions about protocol mechanics, vuln_detect classifies Solidity snippets by vulnerability type, contract_class infers contract category from ABI summaries, tx_intent reasons over decoded actions, and slippage_pred computes AMM swap output from pool state. That structure matters because ChainReason is testing different reasoning paths, not just one generalized answer format.
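
As an illustration of the closed-form math a slippage_pred answer has to get right, the sketch below computes swap output for a constant-product pool. It assumes a Uniswap-v2-style pool (x * y = k) with a 0.3% fee taken from the input amount; the article does not say which AMM designs or fee levels the benchmark's examples use, and the function names here are hypothetical.

```python
def constant_product_out(amount_in, reserve_in, reserve_out, fee_bps=30):
    """Output amount for a swap against an x * y = k pool.

    ASSUMPTION: the fee is charged on the input, in basis points
    (30 bps = 0.3%), and amounts are integers in base units.
    """
    if amount_in <= 0 or reserve_in <= 0 or reserve_out <= 0:
        raise ValueError("amounts and reserves must be positive")
    amount_in_after_fee = amount_in * (10_000 - fee_bps)
    numerator = amount_in_after_fee * reserve_out
    denominator = reserve_in * 10_000 + amount_in_after_fee
    return numerator // denominator  # integer division, rounds down


def shortfall_vs_spot(amount_in, reserve_in, reserve_out, fee_bps=30):
    """Fraction lost relative to the pre-trade spot price (fee plus price impact)."""
    out = constant_product_out(amount_in, reserve_in, reserve_out, fee_bps)
    spot_out = amount_in * reserve_out / reserve_in
    return 1 - out / spot_out


# Swapping 1,000 units into a balanced 100,000 / 100,000 pool.
print(constant_product_out(1_000, 100_000, 100_000))  # 987
```

A model that answers 1,000 here has ignored both the fee and the price impact, which is the kind of error a numeric task surfaces and a multiple-choice task can hide.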

The architecture is easy to inspect because the code path is small: load examples, build a prompt, call the model, parse the answer, score it, and write results. If you need to extend the benchmark with a new DeFi workflow, you implement the Task interface, register it in TASK_REGISTRY, and run the same evaluation loop against your custom data.
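
The load, prompt, parse, score loop described above can be sketched roughly as follows. This is a stand-in written for illustration, not the ChainReason code: the method names, the register decorator, the run_task helper, and the gas_token_qa task are all hypothetical, and TASK_REGISTRY is redefined locally here, so the real interface may look different.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class Task(ABC):
    """Hypothetical shape of a task: load, prompt, parse, score."""

    name: str

    @abstractmethod
    def load_examples(self) -> List[dict]: ...

    @abstractmethod
    def build_prompt(self, example: dict) -> str: ...

    @abstractmethod
    def parse_response(self, response: str) -> str: ...

    @abstractmethod
    def score(self, predictions: List[str], examples: List[dict]) -> Dict[str, float]: ...


# Local stand-in for the registry the article mentions.
TASK_REGISTRY: Dict[str, Type[Task]] = {}


def register(cls: Type[Task]) -> Type[Task]:
    TASK_REGISTRY[cls.name] = cls
    return cls


@register
class GasTokenQA(Task):
    """Toy multiple-choice task with one hard-coded example."""

    name = "gas_token_qa"

    def load_examples(self):
        return [{
            "question": "Which asset pays gas on Ethereum mainnet?",
            "choices": {"A": "ETH", "B": "USDC"},
            "answer": "A",
        }]

    def build_prompt(self, example):
        options = "\n".join(f"{k}. {v}" for k, v in example["choices"].items())
        return f"{example['question']}\n{options}\nAnswer with a single letter."

    def parse_response(self, response):
        # Keep only the first non-space character, upper-cased.
        return response.strip().upper()[:1]

    def score(self, predictions, examples):
        correct = sum(p == e["answer"] for p, e in zip(predictions, examples))
        return {"accuracy": correct / len(examples)}


def run_task(task_name: str, model: Callable[[str], str]) -> Dict[str, float]:
    """Run every example of one task through a prompt -> text callable."""
    task = TASK_REGISTRY[task_name]()
    examples = task.load_examples()
    predictions = [task.parse_response(model(task.build_prompt(ex))) for ex in examples]
    return task.score(predictions, examples)
```

Treating the model as a plain prompt-to-text callable keeps the loop identical whether the completion comes from a hosted API or a local checkpoint, for example run_task("gas_token_qa", lambda prompt: "A") as a smoke test with a fake model.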

python scripts/run_eval.py --task slippage_pred --client openai --model gpt-4o-mini --limit 5
python scripts/aggregate_results.py results/full -o results/full/SUMMARY.md

The first command runs a small evaluation pass on one task and writes per-example outputs. The second command rolls those outputs into a summary file so you can compare runs, track regressions, or paste the result into a review doc. If you are testing local checkpoints, swap the client layer for the HuggingFace path and keep the rest of the workflow unchanged.

Pros and Cons of ChainReason

Pros:

  • Domain-specific signal — ChainReason tests Ethereum and DeFi reasoning directly, which is more useful than generic language scores for on-chain products.
  • Multiple reasoning modes — the benchmark covers textual QA, code classification, trace interpretation, and numeric AMM calculation in one suite.
  • Small and fast — the seed set is tiny enough to run quickly, which makes it practical for CI or pre-merge checks.
  • Extensible interface — the Task abstraction and registry make new datasets straightforward to add.
  • Works with open and closed models — you can compare OpenAI API models against local HuggingFace models without changing the benchmark structure.
  • Readable outputs — the aggregation script generates a markdown summary that is easy to share with a team.

Cons:

  • Limited sample size — 64 seed examples are useful for regression checks but too small for final model selection.
  • Narrow domain — ChainReason is specific to Ethereum and DeFi, so it will not replace a general LLM benchmark stack.
  • Manual curation bias — hand-written examples are high quality, but they do not cover every real-world edge case.
  • API cost for hosted models — if you run the suite with OpenAI models, you still pay inference costs.
  • No giant leaderboard — the repository is about evaluation mechanics and signal quality, not public ranking theater.

Getting Started with ChainReason

git clone https://github.com/joshawome/chainreason
cd chainreason
pip install -e .
pip install torch transformers accelerate
export OPENAI_API_KEY=...
python scripts/run_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5

After that run, ChainReason writes results for the selected task and model so you can inspect per-example predictions and metrics. If you want a full sweep, point scripts/run_eval.py at a YAML config and then aggregate the run into a summary file. If you are extending the benchmark, use --data-path to point at your own curated examples and keep the same task-scoring flow.

Verdict

ChainReason is a strong option for Ethereum-focused LLM evaluation when you need a compact benchmark that tests protocol knowledge, vulnerability detection, transaction intent, and AMM math in one pass. Its main strength is breadth across reasoning modes; its main caveat is the small seed set of 64 examples, so treat it as a regression suite, not a final scorecard. Use it if your models touch DeFi; skip it if you need broad general-purpose evaluation.
