SSD: Best LLM Inference Engines for AI Inference Engineers in 2026

SSD runs drafting and verification of speculative decoding in parallel on separate hardware, eliminating drafting overhead when anticipation matches outcomes.

Pricing: Open-Source
Tech Stack: Python/CUDA/PyTorch
Target: AI inference engineers
Category: LLM Inference Engines

What Is SSD?

SSD (speculative speculative decoding) is an inference engine for large language models, published by tanishqkumar as an open-source reference implementation. It supports the Qwen3 and Llama 3 model families and includes tensor parallelism, PagedAttention, CUDA graphs, torch.compile, and prefix caching. SSD is one of the best LLM inference engines for AI inference engineers optimizing throughput on H100 GPUs: its output is exact, with no quality loss relative to the target model, and it runs faster than standard speculative decoding because drafting and verification proceed in parallel on separate hardware.

Quick Overview

Attribute | Details
Type | LLM Inference Engines
Best For | AI inference engineers
Language/Stack | Python/CUDA/PyTorch
License | MIT
GitHub Stars | 546 as of Oct 2024
Pricing | Open-Source
Last Release | N/A (latest commit Oct 2024)

Who Should Use SSD?

  • AI inference engineers on multi-GPU clusters who tune LLM serving for production workloads and need 2x+ speedups from parallel speculation.
  • ML researchers benchmarking decoding algorithms on datasets like HumanEval and Alpaca who need baselines for autoregressive, standard SD, and SSD modes.
  • HPC teams with H100/A100 hardware running tensor-parallel Llama 3 70B or Qwen3 32B models under CUDA 12.8+.
  • Indie AI hackers prototyping fast local inference without vLLM overhead, with a focus on exact speculative methods.

Not ideal for:

  • CPU-only environments, as SSD mandates CUDA 12.8+ and GPU architectures like sm_90 (H100).
  • Single-GPU low-memory setups under 80GB, due to tensor parallelism and model loading demands.
  • Teams needing broad model compatibility beyond Qwen3/Llama 3, since GPT and Mistral families are not supported yet.

Key Features of SSD

  • Parallel Speculative Decoding: A small draft model anticipates likely verification outcomes across branches on separate GPUs while the target model verifies. When the anticipation is correct, the next draft is ready immediately and drafting adds no sequential overhead.
  • Optimized Baselines: Includes autoregressive and standard speculative decoding modes for fair benchmarking, all run under torch.compile for kernel fusion.
  • Tensor Parallelism: Splits Llama 3 70B or Qwen3 32B across multiple GPUs, targeting H100 clusters (compute capability sm_90).
  • PagedAttention Integration: Uses vLLM-style PagedAttention for KV cache management, reducing memory fragmentation during long-sequence generation.
  • CUDA Graphs and Prefix Caching: Captures repetitive kernel launches in CUDA graphs for a 20-30% latency reduction, and caches prefixes to skip recomputation on repeated prompts.
  • Benchmark Suite: Built-in evaluation on HumanEval, Alpaca, and other datasets via the bench/ scripts, with an --all flag for 4-dataset averages and --numseqs to set how many sequences are sampled.
  • uv Dependency Management: Uses uv for fast Python 3.11+ environment sync, including extras for the download scripts that fetch models from the Hugging Face Hub.
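The PagedAttention bullet above is easier to picture with a toy model of the bookkeeping. The sketch below is not taken from the SSD codebase: the class name, method names, and block sizes are all hypothetical, and it tracks only which fixed-size physical block each token's KV entry would land in, not the tensors themselves.

```python
# Toy paged KV-cache bookkeeping: fixed-size blocks plus per-sequence block tables.
# Every name here is hypothetical and chosen for illustration only.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # unused physical block ids
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> physical blocks
        self.lengths: dict[str, int] = {}              # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache is out of blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("a") for _ in range(3)]  # 3 tokens occupy 2 blocks
cache.append_token("b")                               # 1 token occupies 1 block
blocks_in_use = 4 - len(cache.free_blocks)
cache.free("a")                                       # blocks of "a" become reusable
```

Because sequences only ever hold whole blocks and release them on completion, memory is never stranded in per-sequence contiguous buffers, which is the fragmentation benefit the bullet refers to.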

SSD vs Alternatives

Tool | Best For | Key Differentiator | Pricing
SSD | Parallel speculation on multi-GPU | Anticipatory branching eliminates draft overhead | Open-Source
vLLM | High-throughput serving | PagedAttention + continuous batching, broader models | Open-Source
TensorRT-LLM | NVIDIA-optimized latency | Engine compilation for single-GPU peaks | Open-Source
SGLang | Runtime optimizations | Zero-overhead loop + RadixAttention | Open-Source

vLLM excels at dynamic batching for API servers, but its speculative decoding drafts and verifies sequentially, which typically caps gains at 1.5-2x over autoregressive decoding; pick it over SSD when you need Mistral or GPT-family support rather than SSD's Qwen/Llama focus. TensorRT-LLM delivers sub-ms token latencies on A100s via compiled engines and supports conventional speculative decoding, but not SSD-style parallel drafting, so it suits latency-critical chat workloads. SGLang combines a low-overhead runtime with RadixAttention for up to 3x batching gains and handles variable-length inference better than SSD's fixed parallelism.

How SSD Works

SSD extends speculative decoding by having the draft model precompute tokens for the likely verification outcomes on auxiliary hardware while the target model is still verifying. When the actual outcome matches an anticipated branch, it commits instantly; on a mismatch, SSD falls back to single-step generation. The possible outcomes form a tree of Borges-style forking paths, and low-probability branches are pruned so that speculation does not grow exponentially.
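The anticipate-then-verify loop can be sketched with two toy greedy "models". This is a minimal illustration under heavy assumptions, not SSD's implementation: the models are plain functions, only one anticipated branch is kept per round (all drafts accepted plus the drafter's guess at the bonus token), and the two steps that would overlap on separate hardware run one after the other here. All function names are hypothetical.

```python
def target(ctx: list[int]) -> int:
    """Toy 'large' model: greedy next token as a fixed function of the context."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft(ctx: list[int]) -> int:
    """Toy 'small' model: agrees with target except at every 4th position."""
    guess = target(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess


def draft_k(ctx: list[int], k: int) -> list[int]:
    """Draft k tokens autoregressively with the small model."""
    out: list[int] = []
    for _ in range(k):
        out.append(draft(ctx + out))
    return out


def verify(ctx: list[int], proposal: list[int]) -> list[int]:
    """Accept the longest correct prefix, then add one token from the target."""
    accepted: list[int] = []
    for tok in proposal:
        expected = target(ctx + accepted)
        if tok != expected:
            return accepted + [expected]            # correction token
        accepted.append(tok)
    return accepted + [target(ctx + accepted)]      # bonus token


def generate_ssd(prompt: list[int], n_new: int, k: int = 2):
    ctx = list(prompt)
    anticipated: dict[tuple, list[int]] = {}
    hits = misses = 0
    while len(ctx) < len(prompt) + n_new:
        proposal = anticipated.get(tuple(ctx))
        if proposal is None:                        # anticipation missed: draft now
            proposal = draft_k(ctx, k)
            misses += 1
        else:                                       # draft was ready in advance
            hits += 1
        # In a real system the next two statements would run concurrently.
        guess_ctx = ctx + proposal + [draft(ctx + proposal)]
        anticipated = {tuple(guess_ctx): draft_k(guess_ctx, k)}
        ctx += verify(ctx, proposal)
    return ctx[: len(prompt) + n_new], hits, misses


def generate_ar(prompt: list[int], n_new: int) -> list[int]:
    """Plain autoregressive decoding with the target model, for comparison."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target(ctx))
    return ctx


prompt = [1, 2, 3]
ssd_tokens, hits, misses = generate_ssd(prompt, n_new=20)
assert ssd_tokens == generate_ar(prompt, 20)  # speculation never changes the output
```

Every token still passes through `verify`, so the result matches autoregressive decoding regardless of how often the anticipation hits; the hit rate only changes how much drafting sits on the critical path.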

The core data flow uses PyTorch tensors sharded via tensor parallelism: draft logits feed a speculation tree, and each branch is verified against the target logits in one fused forward pass. PagedAttention manages KV caches across devices, and CUDA graphs capture the load, compile, and generate loop for steady-state throughput. torch.compile fuses operations such as softmax and cross-entropy, yielding 15-25% kernel speedups on sm_90.
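For readers new to tensor parallelism, the sharding idea can be shown without any GPU. The sketch below assumes the common column-parallel layout, where each device holds a slice of a weight matrix's columns and the partial outputs are concatenated; whether SSD shards its layers exactly this way is an assumption, and the helper names are hypothetical.

```python
# Column-parallel linear layer simulated with plain Python lists.
# Each "device" holds a vertical slice of the weight matrix.

def matmul(x, w):
    """x: [n][d], w: [d][m] -> [n][m]."""
    return [
        [sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for row in x
    ]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def column_parallel_linear(x, w, parts):
    shards = split_columns(w, parts)                 # one shard per simulated device
    partials = [matmul(x, shard) for shard in shards]  # independent local matmuls
    # Stand-in for an all-gather: concatenate each row's partial outputs.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
assert column_parallel_linear(x, w, parts=2) == matmul(x, w)
```

The shards never need each other's weights during the matmul, which is what lets a 70B-parameter model fit across several GPUs; the cost moves into the gather step between layers.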

# Clone and set up
git clone https://github.com/tanishqkumar/ssd && cd ssd
uv sync
export SSD_HF_CACHE=/data/huggingface/hub
export SSD_CUDA_ARCH=9.0  # H100
# Download models, then run the SSD benchmark
python scripts/download_from_hf.py llama
cd bench
python -O bench.py --model llama-3-8b --method ssd --numseqs 128

This installs dependencies, downloads the Llama models to the Hugging Face cache, and benchmarks SSD mode with Llama-3-8B on 128 sequences per dataset. Expect a 2-5 minute warmup for CUDA graph capture and torch.compile on the first run; later runs reach peak tokens/sec, which depends on how predictable the dataset is.
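To compare SSD against the two baselines, the same command can be generated once per method. The helper below is a hypothetical convenience wrapper that only builds the command lines; the --model, --method, and --numseqs flags come from the commands in this article, and the `ar` and `sd` method names are taken from the Pros list, so check them against the argument parser in bench.py before relying on them.

```python
import shlex


def bench_commands(model: str, numseqs: int, methods=("ar", "sd", "ssd")) -> list[list[str]]:
    """Build one bench.py invocation per decoding method (nothing is executed)."""
    return [
        ["python", "-O", "bench.py",
         "--model", model, "--method", method, "--numseqs", str(numseqs)]
        for method in methods
    ]


for cmd in bench_commands("llama-3-8b", 128):
    print(shlex.join(cmd))
```

To actually run the benchmarks, each list can be passed to `subprocess.run(cmd, cwd="bench", check=True)` from the repository root.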

Pros and Cons of SSD

Pros:

  • Achieves up to 3x autoregressive speeds on predictable data like code (HumanEval), where speculation hits most often.
  • Parallel hardware utilization maximizes H100 cluster throughput, unlike sequential SD in vLLM.
  • Reference-quality baselines enable research: toggle --method ar/sd/ssd for apples-to-apples metrics.
  • Minimal deps beyond PyTorch/CUDA, with uv sync under 30s for reproducible envs.
  • Exact inference guarantees no quality loss, critical for eval benchmarks.
  • Prefix caching + PagedAttention handles 8k+ contexts without OOM on 80GB GPUs.

Cons:

  • Limited to Qwen3/Llama3; no GPT/Mistral sharding yet, requiring model conversion.
  • High setup barrier: CUDA 12.8+, Python 3.11, H100-optimal (sm_90), no CPU fallback.
  • No production serving: bench-focused, lacks HTTP endpoints or dynamic batching.
  • Warmup/compile times 3-5min for 70B models, unsuitable for cold-start APIs.
  • Dataset-specific gains: under 1.5x on random prompts, versus the much larger gains on code and math.
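The dependence on data predictability follows from simple arithmetic. The sketch below uses the standard expected-tokens formula for speculative decoding, (1 - α^(k+1)) / (1 - α), which assumes each drafted token is accepted independently with probability α. The SSD cost model on top of it is a deliberate simplification and an assumption of this sketch: drafting costs nothing on the critical path when the anticipation hits and the full k·c otherwise, where c is the draft model's cost relative to one target forward pass.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced per verification step with k drafted tokens."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def sd_speedup(alpha: float, k: int, c: float) -> float:
    """Standard speculative decoding: drafting (k * c) precedes every verification."""
    return expected_tokens(alpha, k) / (k * c + 1)


def ssd_speedup(alpha: float, k: int, c: float, hit_rate: float) -> float:
    """Simplified SSD model: drafting is hidden whenever the anticipation hits."""
    return expected_tokens(alpha, k) / (1 + (1 - hit_rate) * k * c)


# Illustrative inputs only; none of these values are measurements.
for alpha in (0.5, 0.8, 0.95):
    print(alpha,
          round(sd_speedup(alpha, k=4, c=0.1), 2),
          round(ssd_speedup(alpha, k=4, c=0.1, hit_rate=0.9), 2))
```

Low acceptance rates shrink the numerator for both methods, which is why unpredictable prompts gain little, while a high hit rate only removes the drafting term from the denominator.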

Getting Started with SSD

Start on a CUDA 12.8+ machine with H100/A100 GPUs. Install uv if missing, clone the repo, and sync the virtualenv.

git clone https://github.com/tanishqkumar/ssd && cd ssd
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
uv sync
source .venv/bin/activate
python -c "from ssd import LLM; print('ok')"
export SSD_HF_CACHE=/path/to/huggingface/hub
export SSD_DATASET_DIR=/path/to/processed_datasets
export SSD_CUDA_ARCH=9.0  # Adjust for A100=8.0
python scripts/download_from_hf.py llama
HF_DATASETS_CACHE=/path/to python scripts/get_data_from_hf.py --num-samples 10000
cd bench
python -O bench.py --model llama-3-8b --method ssd --numseqs 64

The commands create a .venv, download the Llama models to $SSD_HF_CACHE, process datasets into $SSD_DATASET_DIR, and then run the SSD benchmark. The initial run compiles kernels and builds CUDA graphs, then prints tokens/sec per method and dataset. Scale up with --all to run the full suite across HumanEval, Alpaca, and the other datasets, and monitor GPU utilization with nvidia-smi.
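Since the setup relies on several environment variables, a quick preflight check can catch a missing export before a long download or compile starts. This is a hypothetical helper, not part of SSD; the variable names are the ones exported in the commands above, and treating all three as required is an assumption.

```python
import os

# Variables exported in the setup commands above; whether SSD strictly
# requires each one is an assumption of this sketch.
REQUIRED_VARS = ("SSD_HF_CACHE", "SSD_DATASET_DIR", "SSD_CUDA_ARCH")


def missing_vars(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("environment looks complete")
```

Passing a plain dictionary to `missing_vars` makes the check easy to exercise without touching the real environment.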

Verdict

SSD is the strongest option for AI inference engineers benchmarking parallel speculative decoding on Llama/Qwen models when targeting H100 clusters under research constraints. Its anticipatory branching delivers unmatched speed on predictable data, backed by clean PyTorch baselines. The tradeoff is narrow model support and no serving layer: pair it with OpenSwarm for agentic workloads, but skip it for general-purpose APIs.
