

Speculative Decoding

A technique that uses a small draft model to speed up a large model's generation.

Speculative decoding is an inference optimization in which a small, fast "draft" model proposes several candidate tokens ahead of time, and the large target model verifies them all in a single forward pass. Verification is much cheaper than generation: autoregressive decoding needs one forward pass of the large model per token, whereas a transformer can score a whole span of already-proposed tokens in one parallel pass. When the draft model's guesses are accepted most of the time, the effective throughput of the large model rises significantly, and the accept/reject rule is designed so the output still matches what the target model would have produced on its own.
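A minimal sketch of one decoding round, in the greedy-verification variant, makes the flow concrete. The model interfaces below (target_logits_fn, draft_logits_fn) are placeholders rather than any particular library's API, and the published method additionally uses a rejection-sampling rule so that sampled (non-greedy) outputs also match the target model's distribution.

```python
import numpy as np

def speculative_decode_step(target_logits_fn, draft_logits_fn, context, k=4):
    """One round of speculative decoding with greedy verification.

    target_logits_fn / draft_logits_fn are placeholder callables that take a
    non-empty token sequence and return per-position logits of shape
    [len(seq), vocab_size]; any model wrapper with that interface would do.
    """
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft_tokens = []
    seq = list(context)
    for _ in range(k):
        logits = draft_logits_fn(seq)           # [len(seq), vocab_size]
        next_tok = int(np.argmax(logits[-1]))   # greedy draft token
        draft_tokens.append(next_tok)
        seq.append(next_tok)

    # 2. The large target model scores context + all k draft tokens in ONE
    #    forward pass, instead of k separate generation passes.
    target_logits = target_logits_fn(list(context) + draft_tokens)

    # 3. Accept the longest prefix of draft tokens that matches the target's
    #    own greedy choice at each position; on the first mismatch, keep the
    #    target's token instead and stop. This reproduces exactly what greedy
    #    decoding with the target model alone would have produced.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_choice = int(np.argmax(target_logits[len(context) + i - 1]))
        if tok == target_choice:
            accepted.append(tok)
        else:
            accepted.append(target_choice)
            return accepted

    # 4. All k drafts accepted: the target's logits at the final position
    #    yield one extra "bonus" token at no additional cost.
    accepted.append(int(np.argmax(target_logits[-1])))
    return accepted
```

The key point is step 2: the target model touches the whole batch of draft tokens with a single expensive forward pass, so every accepted token is a target-model pass saved.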

Typical speedups are 2-3× in token generation rate at unchanged output quality. Acceptance rates are highest when the draft model shares the target model's tokenizer and training lineage, for example a Llama 3 8B draft paired with a Llama 3 70B target.
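The 2-3× figure follows from a simple back-of-envelope estimate. If each draft token is accepted with probability alpha (treated as independent, which is a rough assumption) and k draft tokens are proposed per round, the expected number of tokens emitted per expensive target-model pass is (1 - alpha^(k+1)) / (1 - alpha). The numbers below are illustrative, not measured.

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each draft
    token is independently accepted with probability alpha, k draft tokens are
    proposed per round, and one bonus token is emitted when all k are accepted.
    """
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# e.g. alpha = 0.8, k = 4  ->  ~3.4 tokens per expensive target pass, which is
# roughly where the commonly quoted 2-3x end-to-end speedups land once the
# draft model's own (small) cost is subtracted.
print(expected_tokens_per_target_pass(0.8, 4))
```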

From a buyer's perspective: this is an inference-layer optimization you get for free when providers implement it. Some providers (Google with Gemini) mention it explicitly; others simply pass the throughput benefit through. For self-hosted deployments with vLLM or TensorRT-LLM, speculative decoding is a configurable option worth enabling for latency-sensitive workloads.
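For a self-hosted vLLM deployment, enabling it can be as small a change as pointing the engine at a draft model. The snippet below is a sketch only: the parameter names shown (speculative_model, num_speculative_tokens) come from older vLLM releases and have changed across versions (newer releases use a speculative_config dict), so check the documentation for the version you run.

```python
from vllm import LLM, SamplingParams

# Sketch of a speculative-decoding setup in vLLM; parameter names for the
# draft model are version-dependent assumptions, not a fixed API.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",              # large target model
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",   # small draft model
    num_speculative_tokens=5,   # draft tokens proposed per verification step
    tensor_parallel_size=4,
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```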
