OpenMythos: Best LLM Research Frameworks for ML Engineers in 2026

OpenMythos recreates a looped RDT-style transformer in PyTorch, letting you test recurrent-depth reasoning, MLA/GQA attention, and sparse MoE routing without building the architecture from scratch.

Pricing: Open-Source
Tech Stack: Python, PyTorch, CUDA, FlashAttention 2
Target: ML researchers, AI engineers, and CTOs
Category: LLM Research Frameworks

What Is OpenMythos?

OpenMythos is an open-source LLM research framework, written in Python and PyTorch by Kye Gomez and community contributors, that models a theoretical Claude Mythos-style Recurrent-Depth Transformer (RDT). It is aimed at ML researchers, AI engineers, and CTOs. It simulates a three-stage pipeline with a Prelude, a looped Recurrent Block, and a Coda, and it ships preconfigured scales from 1B to 1T parameters so teams can study depth-variable reasoning without writing a custom model stack. The repository is explicit that it is an independent reconstruction, not an Anthropic release.

What makes OpenMythos worth reading is that it translates a research hypothesis into code instead of treating the architecture as a slide deck. The implementation exposes attention modes, MoE routing, recurrence depth, and stability checks as first-class knobs, which makes it useful for ablation work, architecture comparisons, and model behavior experiments.

Quick Overview

Attribute | Details
Type | LLM Research Frameworks
Best For | ML researchers, AI engineers, and CTOs
Language/Stack | Python, PyTorch, CUDA, FlashAttention 2
License | N/A (not stated in the scraped page)
GitHub Stars | N/A as of Feb 2026
Pricing | Open-Source
Last Release | N/A

Who Should Use OpenMythos?

  • Research engineers comparing standard decoder-only transformers against recurrent-depth designs, because OpenMythos exposes the loop count, injection path, and attention backend directly in config.
  • ML platform teams that need a PyTorch-native reference implementation for MoE and attention experiments, especially if they already run torchrun, DDP, and sharded datasets.
  • Indie AI founders who want to prototype adaptive-compute reasoning ideas without waiting for a managed vendor API or a closed model release.
  • Architecture tinkerers validating whether MLA, GQA, and sparse expert routing change memory use or output quality under controlled settings.

Not ideal for:

  • Teams that want a drop-in production model with pretrained weights, support SLAs, and turnkey hosting.
  • Teams building apps that need only a simple inference wrapper and have no interest in architecture research or training code.
  • GPU-constrained users who cannot satisfy CUDA and build-tool requirements for the optional Flash Attention path.

Key Features of OpenMythos

  • Three-stage Recurrent-Depth Transformer layout — OpenMythos splits computation into a Prelude, a looped Recurrent Block, and a Coda. That structure is the core experiment: the same hidden state can be refined across multiple passes instead of stacking a large number of unique layers.
  • Configurable loop depth — The recurrent block runs for n_loops up to max_loop_iters, which lets you test shallow and deep reasoning trajectories with the same parameter set. That is useful when you want to correlate loop count with output quality, latency, and memory use.
  • Switchable attention backends — attn_type toggles between mla and gqa, so the same model family can be evaluated under different attention layouts. The gqa path is paired with Flash Attention 2 when installed, and the code falls back to manual scaled dot-product attention if the package is absent.
  • Sparse MoE feed-forward path — The feed-forward stack uses routed experts plus shared experts through n_experts, n_shared_experts, and n_experts_per_tok. That gives the architecture compute-adaptive behavior without requiring every token to activate every expert.
  • Explicit recurrence stability checks — The model exposes injection parameters such as A and B, and the example code checks the spectral radius of A with torch.linalg.eigvals(A). That matters because recurrence can drift if the hidden-state update is unstable.
  • Scale presets from 1B to 1T — The repo includes named configs like mythos_1b, mythos_3b, mythos_10b, all the way to mythos_1t. Larger presets expand context to 1M tokens and output capacity to 128k, which makes the family interesting for long-context experiments.
  • Training script included — OpenMythos ships a dedicated training entrypoint for the 3B model on FineWeb-Edu. The script uses AdamW, linear warmup, cosine decay, PyTorch DDP, and bf16 on H100/A100, so you are not starting from a blank repo.
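The attention-backend feature above is mostly a memory trade-off, which a little arithmetic makes concrete. The following is a minimal sketch of the standard KV-cache size calculation for one attention layer, assuming 2-byte (bf16/fp16) storage and head_dim = dim / n_heads. The helper name kv_cache_bytes is hypothetical and not part of OpenMythos, and the compressed cache used by MLA is not modeled here.

```python
def kv_cache_bytes(batch, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for ONE attention layer."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem


# Values borrowed from the small example config in this article.
dim, n_heads, n_kv_heads = 256, 8, 2
head_dim = dim // n_heads

# Full multi-head attention keeps one K/V pair per query head.
mha = kv_cache_bytes(batch=2, seq_len=128, n_kv_heads=n_heads, head_dim=head_dim)
# GQA shares each K/V head across a group of query heads.
gqa = kv_cache_bytes(batch=2, seq_len=128, n_kv_heads=n_kv_heads, head_dim=head_dim)

print(f"MHA-style cache: {mha} bytes, GQA cache: {gqa} bytes, ratio: {mha // gqa}x")
```

With n_heads=8 and n_kv_heads=2, the cache shrinks by the ratio n_heads / n_kv_heads. How the cache interacts with repeated loop passes depends on the repo's implementation and is not covered by this sketch.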

OpenMythos vs Alternatives

Tool | Best For | Key Differentiator | Pricing
OpenMythos | Recurrent-depth transformer research | Explicit Prelude / Recurrent Block / Coda design with MLA or GQA and MoE routing | Open-Source
Open R1 | Reasoning-focused model training and experiments | Better suited to reasoning and reinforcement research than architecture reconstruction | Open-Source
OpenSwarm | Multi-agent orchestration | Useful when the workflow is about coordinating agents, not inspecting model internals | Open-Source
Transformers | General-purpose model training and inference | Mature ecosystem, broad model support, and production-ready integration surface | Open-Source

Pick Open R1 when your goal is to optimize reasoning behavior, training recipes, or post-training workflows rather than inspect a new model topology. Pick OpenSwarm when the problem is orchestration across multiple agents, tools, or tasks, because OpenMythos stays focused on the internals of a single recurrent model.

Use standard Transformers when you want a well-known baseline, a huge model zoo, and minimal friction for deployment or benchmarking. OpenMythos is the better fit when the question is architectural: does looping the same block multiple times change compute, memory, or reasoning quality?

How OpenMythos Works

OpenMythos turns the recurrent-depth hypothesis into a concrete data path. Input tokens go through a Prelude that builds the base hidden representation once, then the Recurrent Block reprocesses that state for multiple iterations, and finally the Coda resolves the result into logits; the key idea is that the model reuses weights instead of growing depth with unique layers.
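To see what weight reuse buys, it helps to compare a looped layout with a conventional stack of the same effective depth. This is a rough sketch under loud assumptions: every layer has the same hypothetical size (params_per_layer), and the Recurrent Block is treated as a single layer stored once. The DepthPlan class is illustrative only and does not exist in OpenMythos.

```python
from dataclasses import dataclass


@dataclass
class DepthPlan:
    prelude_layers: int
    coda_layers: int
    n_loops: int
    params_per_layer: int  # hypothetical: assumes all layers are the same size

    def effective_depth(self) -> int:
        # Each loop pass counts as one layer of sequential compute.
        return self.prelude_layers + self.n_loops + self.coda_layers

    def looped_params(self) -> int:
        # The recurrent block is stored once, however many times it runs.
        return (self.prelude_layers + 1 + self.coda_layers) * self.params_per_layer

    def stacked_params(self) -> int:
        # A conventional stack needs a unique layer for every step of depth.
        return self.effective_depth() * self.params_per_layer


plan = DepthPlan(prelude_layers=1, coda_layers=1, n_loops=4, params_per_layer=1_000_000)
print(plan.effective_depth(), plan.looped_params(), plan.stacked_params())
```

Raising n_loops increases effective depth, and therefore compute and latency, while looped_params stays constant. That is the trade-off the loop-count experiments are meant to measure.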

The recurrent update is written as h_{t+1} = A·h_t + B·e + Transformer(h_t, e), where h_t is the current hidden state and e is the encoded input injected from the Prelude. That injection is the reason the recurrence does not lose the original prompt signal, and it is also why the example checks the spectral radius of A before trusting the run.
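The role of the spectral radius is easier to see in a toy version of the update. The sketch below is dependency-free and makes several assumptions that are not taken from the repo: A and B are diagonal (so the spectral radius of A is simply its largest absolute entry), and toy_block is a hypothetical bounded stand-in for the transformer term. It illustrates the stability argument only and is not the OpenMythos implementation.

```python
import math


def spectral_radius_diag(a_diag):
    """Spectral radius of a diagonal matrix: the largest absolute entry."""
    return max(abs(a) for a in a_diag)


def toy_block(h, e):
    """Hypothetical stand-in for Transformer(h, e): a small bounded nonlinearity."""
    return [0.1 * math.tanh(hi + ei) for hi, ei in zip(h, e)]


def recurrent_step(h, e, a_diag, b_diag, block=toy_block):
    """One update h_next = A*h + B*e + block(h, e) with diagonal A and B."""
    update = block(h, e)
    return [a * hi + b * ei + ui
            for a, b, hi, ei, ui in zip(a_diag, b_diag, h, e, update)]


def run_loops(e, a_diag, b_diag, n_loops):
    h = [0.0] * len(e)  # start from a zero hidden state
    for _ in range(n_loops):
        h = recurrent_step(h, e, a_diag, b_diag)
    return h


e = [1.0, -0.5, 0.25]          # injected input, held fixed across loops
b = [1.0, 1.0, 1.0]
stable_a = [0.9, 0.5, -0.3]    # spectral radius 0.9: contracting
unstable_a = [1.2, 0.5, -0.3]  # spectral radius 1.2: expanding

stable_norm = max(abs(x) for x in run_loops(e, stable_a, b, 200))
unstable_norm = max(abs(x) for x in run_loops(e, unstable_a, b, 200))
print(spectral_radius_diag(stable_a), stable_norm)
print(spectral_radius_diag(unstable_a), unstable_norm)
```

With a spectral radius below 1 the hidden state stays bounded while B·e keeps re-injecting the prompt signal. Above 1, the first coordinate grows geometrically with the loop count.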

Attention and expert routing are the two other design decisions that matter. mla and gqa let you compare memory and compute trade-offs, while the sparse MoE feed-forward path activates only a subset of experts per token, which is the main reason the family can scale in a compute-adaptive way.
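As an illustration of sparse routing, here is a generic top-k mixture-of-experts sketch in plain Python. It assumes softmax router scores, renormalization over the selected experts, and shared experts that always run. The source does not confirm that OpenMythos routes exactly this way, and the function names and scalar "experts" are hypothetical. Only the parameter roles (n_experts, n_shared_experts, n_experts_per_tok) mirror the config fields.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k best experts; weights sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, routed_experts, shared_experts, k):
    """Shared experts always run; only k routed experts run for this token."""
    out = sum(f(x) for f in shared_experts)
    for i, w in route_top_k(router_logits, k):
        out += w * routed_experts[i](x)
    return out


# Toy setup: 8 routed experts, 1 shared expert, 2 experts per token.
routed = [lambda x, s=i + 1: s * x for i in range(8)]  # expert i scales x by i + 1
shared = [lambda x: x]
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, 0.0, 0.0]     # router scores for one token

selected = route_top_k(logits, k=2)
y = moe_forward(1.0, logits, routed, shared, k=2)
print(selected, y)
```

Only two of the eight routed experts execute for this token, which is why active compute per token can stay far below what the total parameter count suggests.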

import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=1000,
    dim=256,
    n_heads=8,
    max_seq_len=128,
    max_loop_iters=4,
    prelude_layers=1,
    coda_layers=1,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=64,
    lora_rank=8,
    attn_type='gqa',
    n_kv_heads=2,
)

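model = OpenMythos(cfg)

# Count parameters (assumes OpenMythos is a standard torch.nn.Module).
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")

# Forward pass on random token ids: batch of 2, sequence length 16.
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4)
print(logits.shape)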

This example instantiates the smallest practical research configuration, runs a forward pass, and exercises the loop counter so you can inspect output shape and parameter count. If you switch to mla, the config uses the extra KV and RoPE head dimensions described in the repo, and if you install the optional Flash Attention package, gqa can use the CUDA-accelerated path.

The training side follows the same philosophy: make the architecture observable. The included training/3b_fine_web_edu.py script uses PyTorch DDP with torchrun, FineWeb-Edu streaming data, MythosTokenizer, and a warmup-plus-cosine schedule, so the codebase is aimed at controlled experiments rather than opaque model serving.
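A warmup-plus-cosine schedule of the kind mentioned above can be sketched in a few lines. This is a generic version and not the code from training/3b_fine_web_edu.py: the function name lr_at, its parameters, and the example values (peak 3e-4, floor 3e-5, 100 warmup steps out of 1000) are all assumptions chosen for illustration.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 at step 0 up to peak_lr at warmup_steps.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    lr = lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5)
    print(step, f"{lr:.2e}")
```

In a PyTorch training loop, a function like this is typically applied by setting the optimizer's learning rate each step or by wrapping it in torch.optim.lr_scheduler.LambdaLR.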

Pros and Cons of OpenMythos

Pros:

  • Transparent architecture — the Prelude/Recurrent/Coda split is readable in code, which makes ablations and papers easier to defend.
  • PyTorch-native implementation — you can use standard debugging, profiling, and distributed training tools without a custom runtime.
  • Attention flexibility — mla and gqa support side-by-side comparisons, and the Flash Attention 2 path is available for CUDA environments.
  • Sparse MoE routing — token-level expert selection makes it possible to explore compute-adaptive inference without activating every parameter.
  • Large preset coverage — the family spans 1B to 1T configurations, which is unusually broad for a public research repo.
  • Training assets included — the repo is not just model code; it includes dataset guidance and a concrete fine-tuning/training entrypoint.

Cons:

  • Theoretical reconstruction — OpenMythos is explicitly not affiliated with Anthropic, so it should not be treated as an official Claude implementation.
  • No pretrained checkpoint in the scraped page — you get the architecture and scripts, but not a ready-to-serve model artifact.
  • High hardware demand at scale — the 100B+ presets and 1M-context variants are not for casual local runs.
  • CUDA dependency for the fast path — Flash Attention 2 needs CUDA and build tools, which increases setup friction.
  • Research-first docs — the documentation points to API reference and dataset notes, not a product guide or deployment handbook.

Getting Started with OpenMythos

pip install open-mythos
# Optional: the flash extra enables the Flash Attention 2 path (needs CUDA and build tools)
pip install "open-mythos[flash]"

python - <<'PY'
import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=1000,
    dim=256,
    n_heads=8,
    max_seq_len=128,
    max_loop_iters=4,
    prelude_layers=1,
    coda_layers=1,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=64,
    lora_rank=8,
    attn_type='gqa',
    n_kv_heads=2,
)

model = OpenMythos(cfg)
ids = torch.randint(0, cfg.vocab_size, (2, 16))
print(model(ids, n_loops=4).shape)
PY

After that command finishes, you should have a working local install and a forward pass that proves the model and its dependencies are wired correctly. If you need the accelerated attention path, keep the optional flash extra; if you want to experiment with mla, change the config fields to the MLA-specific values from the repo’s example.

Verdict

OpenMythos is a strong option for researchers who want a PyTorch reference implementation for validating recurrent-depth transformer ideas rather than a production model. Its strength is the explicit looped architecture with MoE and dual attention backends; the caveat is that it is a theoretical reconstruction, not an official vendor release, and the page lists no pretrained checkpoint. Use it for experiments, benchmarks, and architecture work.
