Inference
Running a trained model to generate output. The expensive part of AI in production.
Latency
How long a model takes to respond. Measured as time to first token (TTFT) and total generation time.
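A minimal sketch of measuring both numbers, with a stand-in generator in place of a real streaming client:

```python
import time

def fake_stream():
    # Stand-in for a streaming LLM response; any token iterator works here.
    for token in ["The", " answer", " is", " 42", "."]:
        time.sleep(0.05)  # simulate per-token generation delay
        yield token

start = time.monotonic()
ttft = None
for token in fake_stream():
    if ttft is None:
        ttft = time.monotonic() - start  # time to first token
total = time.monotonic() - start

print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s")
```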
Throughput
How many requests or tokens a system can serve per second.
AI Cost (Per-Token Pricing)
You pay per million input and output tokens. Output tokens typically cost 3-5× as much as input tokens.
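The arithmetic is worth internalizing. A sketch with placeholder prices (check your provider's current rate card):

```python
# Placeholder prices, not any provider's actual rates.
PRICE_PER_MTOK_INPUT = 3.00    # USD per million input tokens
PRICE_PER_MTOK_OUTPUT = 15.00  # USD per million output tokens (5x input here)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK_INPUT
            + output_tokens * PRICE_PER_MTOK_OUTPUT) / 1_000_000

# A typical RAG request: large prompt, short answer.
print(f"${request_cost(12_000, 500):.4f}")  # $0.0435
```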
Prompt Caching
Reusing cached computation for repeated prompt prefixes, which can cut the cost of the cached portion by 80-90%.
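A sketch of the savings, assuming a 90% discount on cached input tokens (discounts and caching rules vary by provider):

```python
PRICE_PER_MTOK_INPUT = 3.00  # placeholder price, USD per million input tokens
CACHE_DISCOUNT = 0.90        # assumed discount on cached prefix tokens

def input_cost(prefix_tokens: int, suffix_tokens: int, prefix_cached: bool) -> float:
    prefix_price = PRICE_PER_MTOK_INPUT * ((1 - CACHE_DISCOUNT) if prefix_cached else 1)
    return (prefix_tokens * prefix_price + suffix_tokens * PRICE_PER_MTOK_INPUT) / 1_000_000

# 50k-token system prompt and tool definitions, 200-token user question:
print(f"cold:   ${input_cost(50_000, 200, prefix_cached=False):.4f}")  # $0.1506
print(f"cached: ${input_cost(50_000, 200, prefix_cached=True):.4f}")   # $0.0156
```

The win depends on the prefix staying byte-identical across requests, so put stable content first and variable content last.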
Streaming
Receiving model output token-by-token as it generates, not waiting for the full response.
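A sketch against the OpenAI Python SDK's streaming interface (the model name is a placeholder; other SDKs follow the same pattern):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Explain KV caches in one paragraph."}],
    stream=True,
)

chunks = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render tokens as they arrive
        chunks.append(delta)
full_response = "".join(chunks)
```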
Model Routing
Sending requests to different models based on complexity, cost, or content type.
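A toy router; the model names, threshold, and keyword heuristic are all placeholders (production routers often use a small classifier instead):

```python
CHEAP_MODEL = "small-fast-model"     # placeholder names
STRONG_MODEL = "large-capable-model"

def route(prompt: str) -> str:
    # Crude heuristic: long prompts or "hard" keywords go to the strong model.
    hard_markers = ("prove", "refactor", "analyze", "step by step")
    if len(prompt) > 2_000 or any(m in prompt.lower() for m in hard_markers):
        return STRONG_MODEL
    return CHEAP_MODEL

print(route("What's the capital of France?"))         # small-fast-model
print(route("Analyze this contract for liability."))  # large-capable-model
```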
Rate Limiting (AI APIs)
The caps providers set on requests and tokens per minute, and strategies for staying under them.
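The standard coping strategy is retry with exponential backoff and jitter. A minimal sketch (the exception class is a stand-in for your SDK's 429 error):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in: substitute the HTTP 429 exception your SDK raises."""

def with_backoff(call, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            delay = min(2 ** attempt, 30) + random.uniform(0, 1)  # cap + jitter
            time.sleep(delay)
    return call()  # final attempt; let any error propagate to the caller
```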
LLMOps
The operational practice of running LLM-based systems in production: monitoring, versioning, and iteration.
Observability (AI Systems)
The ability to understand what your AI system is doing in production: inputs, outputs, latency, cost.
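At minimum that means one structured record per request. A sketch, with illustrative field choices:

```python
import json
import time
import uuid

def log_llm_call(model, prompt, response, input_tokens, output_tokens, started_at):
    record = {
        "id": str(uuid.uuid4()),
        "model": model,
        "latency_s": round(time.monotonic() - started_at, 3),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "prompt": prompt[:500],     # truncate; full payloads go to cold storage
        "response": response[:500],
    }
    print(json.dumps(record))  # in production, ship to your logging pipeline

started = time.monotonic()
log_llm_call("some-model", "Hi", "Hello!", input_tokens=1, output_tokens=2,
             started_at=started)
```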
Tracing (AI / LLM)
Recording the full execution path of an AI request: every LLM call, tool call, and intermediate step.
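A minimal sketch of span recording with a context manager (real tracers use parent/child span IDs, as in OpenTelemetry; this flat list is a simplification):

```python
import time
from contextlib import contextmanager

trace = []  # completed spans, in completion order

@contextmanager
def span(name, **attrs):
    entry = {"name": name, "attrs": attrs}
    start = time.monotonic()
    try:
        yield
    finally:
        entry["duration_s"] = round(time.monotonic() - start, 3)
        trace.append(entry)

with span("request", user="u123"):
    with span("retrieval", k=5):
        time.sleep(0.01)  # stand-in for a vector-store query
    with span("llm_call", model="some-model"):
        time.sleep(0.02)  # stand-in for the generation call

for s in trace:
    print(s)
```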
KV Cache
The attention keys and values a transformer stores during generation so it doesn't recompute them for every new token.
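Its size is easy to estimate: 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element. For a 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16):

```python
layers, kv_heads, head_dim = 32, 32, 128  # 7B-class transformer
bytes_per_elem = 2                        # fp16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(per_token / 1024)            # 512.0 KiB per token
print(per_token * 4096 / 1024**3)  # 2.0 GiB for a 4096-token context
```

This is why long contexts and large batches are memory-bound.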
Speculative Decoding
A technique that speeds up generation: a small draft model proposes several tokens and the large model verifies them in a single pass.
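A toy version of the greedy variant, with both models as stand-in functions; real implementations verify the draft's tokens in one batched forward pass of the target model:

```python
K = 4  # draft tokens proposed per step

def target_next(ctx):
    # Stand-in for the large model's greedy next token (here: count upward).
    return ctx[-1] + 1

def draft_next(ctx):
    # Stand-in for the cheap draft model: agrees with the target most of the
    # time, but is wrong whenever the last token is a multiple of 4.
    return ctx[-1] + (1 if ctx[-1] % 4 else 2)

def speculative_decode(ctx, n_tokens):
    while n_tokens > 0:
        # 1) Draft model proposes up to K tokens autoregressively (cheap).
        proposal = []
        for _ in range(min(K, n_tokens)):
            proposal.append(draft_next(ctx + proposal))
        # 2) Target model checks every proposed position (one pass in practice);
        #    accept the agreeing prefix, then substitute its own token and stop.
        accepted = []
        for i, tok in enumerate(proposal):
            correct = target_next(ctx + proposal[:i])
            accepted.append(tok if tok == correct else correct)
            if tok != correct:
                break
        ctx = ctx + accepted
        n_tokens -= len(accepted)
    return ctx

print(speculative_decode([0], 10))  # [0, 1, 2, ..., 10]
```

The output is identical to decoding with the target model alone; the speedup comes from accepting several draft tokens per target pass.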