Natural Language Autoencoders: Open-Source LLM Research Library

NLA turns a hidden residual-stream vector into a natural-language explanation and reconstructs it back, so you can test whether the words preserve the activation direction instead of just sounding plausible.

What Is Natural Language Autoencoders?

Natural Language Autoencoders (NLA) is an open-source LLM interpretability library from kitft, built around Anthropic's 2026 Transformer Circuits post. It converts residual-stream activation vectors into natural language and back, and it is aimed at mechanistic interpretability researchers and LLM engineers. The repo ships eight released checkpoints across the Qwen2.5, Gemma-3, and Llama-3.3 families and measures whether an explanation preserves vector direction rather than just producing fluent text.

The design is straightforward and testable. An activation verbalizer (AV) maps a vector to text, while an activation reconstructor (AR) maps that text back to a vector, so the round trip becomes a concrete metric instead of a vague qualitative claim.

Quick Overview

Attribute | Details
Type | LLM Interpretability Libraries
Best For | mechanistic interpretability researchers, LLM researchers, and ML engineers
Language/Stack | Python, PyTorch, Hugging Face Transformers, SGLang, Ray/Miles, Megatron, FSDP2, GRPO
License | N/A
GitHub Stars | N/A as of Feb 2026
Pricing | Open-Source
Last Release | N/A

Who Should Use Natural Language Autoencoders?

  • Mechanistic interpretability researchers who need a bidirectional check on whether a verbal explanation actually preserves the information in a residual-stream vector.
  • LLM infra teams that already work with PyTorch, Hugging Face, SGLang, or distributed RL stacks and want a reproducible research pipeline instead of a toy demo.
  • Research engineers building activation-analysis tooling around Parquet datasets, layer-wise probes, and checkpoint evaluation.
  • Teams studying model internals at scale who need data generation, supervised fine-tuning, and GRPO-style reinforcement learning in one codebase.

Not ideal for:

  • Product teams that want a drop-in analytics SDK with no GPU orchestration, no model serving, and no research workflow.
  • Small apps that only need explanation text and do not care about reconstruction error or latent direction fidelity.
  • Teams without Anthropic API access or multi-GPU hardware, because the training path assumes real infrastructure and the docs mention H100-class runs.

Key Features of Natural Language Autoencoders

  • Dual-model AV/AR architecture — NLA splits the problem into two fine-tuned language models. The AV turns a vector into text, and the AR turns that text back into a vector, which makes the explanation testable with a round-trip score instead of a subjective review.
  • Raw activation injection through input_embeds — The trainer constructs the embed sequence on its side, splices the activation vector into a fixed prompt, and sends the finished tensor to SGLang over HTTP. That means SGLang never needs special support for NLA semantics.
  • Direction-only reconstruction metric — Both vectors are L2-normalized before comparison, so MSE(reconstructed, original) = 2(1 - cos) measures direction agreement only. Low error means the verbal explanation preserved the direction of the hidden state.
  • Checkpoint coverage across major model families — The repo publishes eight checkpoints in the kitft/nla-models Hugging Face collection, including Qwen2.5-7B-Instruct, Gemma-3-12B-IT, Gemma-3-27B-IT, and Llama-3.3-70B-Instruct. The extraction layer sits roughly two-thirds through each model, which keeps the residual stream semantically rich without collapsing toward the unembedding.
  • Full training pipeline in one repo — NLA includes data generation, AR supervised fine-tuning, AV supervised fine-tuning, GRPO reinforcement learning, and checkpoint conversion. That is useful when you want to reproduce the research path rather than only consume finished weights.
  • Miles and SGLang integration — Training rides on Miles for orchestration and SGLang for rollout serving. The repo uses extension points like --custom-rm-path, --data-source-path, and --custom-generate-function-path, then adds its own --custom-actor-cls-path and --force-use-critic hooks without patching upstream in place.
  • Sidecar metadata instead of hardcoded assumptions — Each checkpoint ships an nla_meta.yaml file with the prompt template, injection token IDs, and scale factors. That matters because the docs explicitly say to load the metadata rather than hardcode model-specific constants.
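
The identity behind the direction-only metric follows from expanding the squared distance between two unit vectors. A short derivation, assuming \hat{a} and \hat{b} denote the L2-normalized reconstructed and original vectors and \theta the angle between them (notation introduced here for illustration), with the error summed over coordinates:

```latex
\begin{aligned}
\lVert \hat{a} - \hat{b} \rVert^2
  &= \lVert \hat{a} \rVert^2 + \lVert \hat{b} \rVert^2 - 2\,\hat{a}\cdot\hat{b} \\
  &= 1 + 1 - 2\cos\theta \\
  &= 2\,(1 - \cos\theta)
\end{aligned}
```

If the error is averaged over the d coordinates instead of summed, the result picks up a factor of 1/d. Which convention the repo uses is not stated here, so treat the constant as something to confirm in the code.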

Natural Language Autoencoders vs Alternatives

Tool | Best For | Key Differentiator | Pricing
Natural Language Autoencoders | bidirectional explanation of residual-stream activations | learns a natural-language verbalizer plus a reconstructor, then scores the round trip | Open-Source
Sparse Autoencoders (SAEs) | latent feature discovery in activations | produces sparse feature dictionaries, not human-language explanations | Open-Source
Neuronpedia | browsing and inspecting learned features | hosted UI for exploring features and examples without running your own training stack | Freemium
OpenTrace | tracing activation provenance and system behavior | better for observability and debugging than for explanation generation | Open-Source

Pick Sparse Autoencoders when you want compact, sparse feature bases and you do not need the model to speak English. Pick Neuronpedia when you want a browser for features and examples rather than a training repo.

Pick OpenTrace when the core problem is tracing where an activation came from or how it changed across a system boundary. If your activation corpora live in Parquet and need storage or retrieval around the training loop, DataHaven is the adjacent tool that fits the data layer better than NLA.

How Natural Language Autoencoders Works

NLA is built around a simple core abstraction: a hidden-state vector becomes text, and that text must reconstruct the original direction well enough to be useful. The AV takes a residual-stream activation vector, injects it as a single token embedding into a fixed prompt, and autoregresses a natural-language description. The AR uses a truncated K+1-layer LM plus a Linear(d, d) head and reads the final token to recover the vector.

The key technical decision is that the system measures semantic faithfulness through geometry, not prose quality. Because vectors are normalized before comparison, the loss focuses on direction agreement, which makes the metric stable across scale changes and lets the authors compare models with different hidden sizes such as 3584 for Qwen2.5-7B and 8192 for Llama-3.3-70B.
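
To make the scale-invariance point concrete, here is a minimal dependency-free sketch of a direction-only loss. The function names (`normalize`, `cosine`, `direction_loss`) and the random test vectors are illustrative assumptions, not the repo's API, and the loss is summed rather than averaged over coordinates.

```python
import math
import random


def normalize(v):
    """Return v rescaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def direction_loss(reconstructed, original):
    """Summed squared error between the L2-normalized vectors."""
    a, b = normalize(reconstructed), normalize(original)
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
for d in (3584, 8192):  # hidden sizes quoted in the text
    original = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Hypothetical noisy reconstruction, only to have something to score.
    reconstructed = [x + rng.gauss(0.0, 0.5) for x in original]
    loss = direction_loss(reconstructed, original)
    rescaled = direction_loss([10.0 * x for x in reconstructed], original)
    print(d, loss, rescaled, 2 * (1 - cosine(reconstructed, original)))
```

Multiplying the reconstruction by 10 leaves the loss unchanged up to floating-point error, and the value depends only on the angle between the vectors, which is why it can be compared across hidden sizes.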

The serving stack is intentionally boring in the right way. The trainer builds [seq, d] tensors, looks up prompt tokens in the model's own embedding table, inserts the activation vector at the injection slot, and sends the completed embedding sequence to SGLang over HTTP. That keeps the implementation modular, so future adapter ideas like an affine W·v + b transform can stay trainer-side only.
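
The trainer-side splice described above can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: plain Python lists stand in for tensors, and `splice_activation`, `slot_id`, `scale`, and the toy embedding table are hypothetical names. In practice the injection token ID and scale factor would come from the checkpoint's nla_meta.yaml.

```python
def splice_activation(prompt_ids, embedding_table, activation, slot_id, scale=1.0):
    """Build a [seq, d] embedding sequence with the activation at the slot.

    prompt_ids:      token IDs of the fixed prompt, containing slot_id once
    embedding_table: mapping from token ID to its d-wide embedding row
    activation:      the d-wide residual-stream vector to inject
    slot_id:         hypothetical placeholder token marking the injection slot
    scale:           hypothetical stand-in for a per-model scale factor
    """
    if prompt_ids.count(slot_id) != 1:
        raise ValueError("prompt must contain exactly one injection slot")
    d = len(activation)
    rows = []
    for tok in prompt_ids:
        if tok == slot_id:
            # Replace the placeholder with the (scaled) activation vector.
            rows.append([scale * x for x in activation])
        else:
            row = embedding_table[tok]
            if len(row) != d:
                raise ValueError("embedding width does not match activation width")
            rows.append(list(row))
    return rows


# Toy 3-dimensional example with a made-up vocabulary.
toy_table = {0: [0.1, 0.0, 0.0], 2: [0.0, 0.0, 0.3]}
embeds = splice_activation([0, 99, 2], toy_table, [1.0, 2.0, 3.0], slot_id=99, scale=2.0)
print(len(embeds), len(embeds[0]))
```

Because the result is just a finished embedding sequence, a change such as an affine transform on the vector would only touch the line that builds the injected row.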

python -m sglang.launch_server --model-path kitft/nla-qwen2.5-7b-L20-av --port 30000 --disable-radix-cache &
python nla_inference.py kitft/nla-qwen2.5-7b-L20-av --sglang-url http://localhost:30000 --parquet path/to/activations.parquet

The first command starts an SGLang server on the released AV checkpoint, and the second command sends activation vectors from a Parquet file for inference. Expect the runner to read an activation_vector column, feed each vector through the AV, and return explanations that you can score with the AR or inspect manually.
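
For readers who want to see roughly what travels over HTTP, here is a hedged sketch of a request body that carries an embedding sequence. The field names (`input_embeds`, `sampling_params`, `max_new_tokens`, `temperature`) are assumptions about SGLang's native generate API, and `build_generate_payload` is a hypothetical helper; check both against the installed SGLang version and nla_inference.py before relying on them.

```python
import json


def build_generate_payload(input_embeds, max_new_tokens=64, temperature=0.0):
    """Assemble a JSON-serializable body carrying a [seq, d] embedding sequence."""
    if not input_embeds:
        raise ValueError("input_embeds must not be empty")
    width = len(input_embeds[0])
    if any(len(row) != width for row in input_embeds):
        raise ValueError("all rows in input_embeds must have the same width")
    return {
        # Assumed field name for pre-built embeddings.
        "input_embeds": input_embeds,
        # Assumed sampling options; adjust to the server's actual schema.
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


payload = build_generate_payload([[0.1, 0.2], [0.3, 0.4]])
body = json.dumps(payload)
print(len(body))
```

The serialized body could then be POSTed to the server started by the first command, for example with httpx from the install list. The endpoint path is deliberately left out because it is not given in the text.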

Pros and Cons of Natural Language Autoencoders

Pros:

  • Bidirectional verification — NLA does not stop at a generated explanation; it checks whether the explanation can reconstruct the original activation direction.
  • Released checkpoints across three model families — The repo already includes Qwen2.5, Gemma-3, and Llama-3.3 variants, which makes it usable without a fresh training run.
  • Production-grade training plumbing — Miles, SGLang, FSDP2, Megatron, and GRPO give the project real-scale training and serving pathways.
  • Explicit metadata handling — The nla_meta.yaml sidecar prevents brittle copy-paste constants and makes checkpoint loading less error-prone.
  • Trainer-side injection design — Using input_embeds keeps the serving layer simple and avoids special server patches for vector injection semantics.
  • Good fit for research workflows — The repository includes docs, worked examples, and a minimal inference path, which makes reproduction less painful.

Cons:

  • Heavy infrastructure requirements — The training notes mention multi-H100 setups, so this is not a laptop tool.
  • Research-first ergonomics — The codebase assumes you are comfortable with RL fine-tuning, checkpoint conversion, and model-serving primitives.
  • Model-specific gotchas — The docs call out scale-factor handling and the Gemma sqrt(d) embedding-scale issue, which means naïve inference can be wrong.
  • Depends on the right activation format — If your data is not already organized as activation_vector rows with the expected dimensionality, you need a preprocessing step.
  • Not a general-purpose explanation API — NLA is designed for residual-stream activations, not arbitrary embeddings or black-box text summaries.

Getting Started with Natural Language Autoencoders

For inference, install the runtime dependencies, start an AV checkpoint behind SGLang, then point the runner at a Parquet file of activations. The repo explicitly recommends loading the released metadata and using a file with an activation_vector column of d_model-wide float lists.

uv pip install torch transformers safetensors httpx orjson pyyaml numpy
uv pip install "sglang[all]"

python -m sglang.launch_server --model-path kitft/nla-qwen2.5-7b-L20-av --port 30000 --disable-radix-cache &
python nla_inference.py kitft/nla-qwen2.5-7b-L20-av --sglang-url http://localhost:30000 --parquet path/to/activations.parquet

After those commands run, NLA will send activations to the AV and stream back explanations for review or downstream scoring. If you do not already have a Parquet corpus, the repo documents how to fabricate a small one from hidden states, and the sidecar metadata tells you which prompt template and injection scale to use.
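
Before pointing the runner at a corpus, it can help to check that every row matches the expected shape. The following is a minimal sketch, assuming the rows have already been loaded into a list of dicts (for example via a Parquet reader); `validate_activation_rows` is a hypothetical helper, not part of the repo.

```python
import math


def validate_activation_rows(rows, d_model):
    """Return indices of rows whose activation_vector is not a
    d_model-wide list of finite floats."""
    bad = []
    for i, row in enumerate(rows):
        vec = row.get("activation_vector")
        ok = (
            isinstance(vec, list)
            and len(vec) == d_model
            and all(isinstance(x, float) and math.isfinite(x) for x in vec)
        )
        if not ok:
            bad.append(i)
    return bad


# Toy rows with d_model = 4; real corpora would use the model's hidden size.
rows = [
    {"activation_vector": [0.0, 1.0, 2.0, 3.0]},
    {"activation_vector": [0.0, 1.0, 2.0]},                   # wrong width
    {"other_column": 1},                                      # missing column
    {"activation_vector": [0.0, float("nan"), 0.0, 0.0]},     # non-finite value
]
print(validate_activation_rows(rows, d_model=4))
```

The check is deliberately strict about floats, so integer-valued lists are flagged too. Loosen it if your preprocessing step stores a different numeric type.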

If you want to reproduce training instead of inference, the repository points to docs/setup.md, docs/design.md, and configs/TRAINING_NOTES.md for the full Miles + SGLang + RL path. That route is heavier, but it is the only way to regenerate checkpoints and study the full AV/AR training loop.

Verdict

Natural Language Autoencoders is a strong fit for interpreting residual-stream activations when you need a bidirectional check on explanation quality. Its main strength is the AV/AR round trip over normalized vectors; the caveat is the GPU-heavy Miles plus SGLang stack. Use it for serious interpretability work, not lightweight app analytics.
