Inference
Running a trained model to generate output. The expensive part of AI in production.
Latency
How long a model takes to respond. Measured as time to first token (TTFT) and total generation time.
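A minimal sketch of measuring both numbers, with a stand-in generator in place of a real streaming client:

```python
import time

def fake_stream():
    # Stand-in for a streaming LLM response; any token iterator works here.
    for token in ["The", " answer", " is", " 42", "."]:
        time.sleep(0.05)  # simulate per-token generation delay
        yield token

start = time.monotonic()
ttft = None
for token in fake_stream():
    if ttft is None:
        ttft = time.monotonic() - start  # time to first token
total = time.monotonic() - start

print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s")
```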
Throughput
How many requests or tokens a system can serve per second.
AI Cost (Per-Token Pricing)
You pay per million input and output tokens. Output tokens typically cost 3-5× as much as input tokens.
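The arithmetic is worth internalizing. A sketch with placeholder prices (check your provider's current rate card):

```python
# Placeholder prices, not any provider's actual rates.
PRICE_PER_MTOK_INPUT = 3.00    # USD per million input tokens
PRICE_PER_MTOK_OUTPUT = 15.00  # USD per million output tokens (5x input here)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK_INPUT
            + output_tokens * PRICE_PER_MTOK_OUTPUT) / 1_000_000

# A typical RAG request: large prompt, short answer.
print(f"${request_cost(12_000, 500):.4f}")  # $0.0435
```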
Prompt Caching
Reusing cached computation for repeated prompt prefixes, which can cut the cost of the cached portion by 80-90%.
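A sketch of the savings, assuming a 90% discount on cached input tokens (discounts and caching rules vary by provider):

```python
PRICE_PER_MTOK_INPUT = 3.00  # placeholder price, USD per million input tokens
CACHE_DISCOUNT = 0.90        # assumed discount on cached prefix tokens

def input_cost(prefix_tokens: int, suffix_tokens: int, prefix_cached: bool) -> float:
    prefix_price = PRICE_PER_MTOK_INPUT * ((1 - CACHE_DISCOUNT) if prefix_cached else 1)
    return (prefix_tokens * prefix_price + suffix_tokens * PRICE_PER_MTOK_INPUT) / 1_000_000

# 50k-token system prompt and tool definitions, 200-token user question:
print(f"cold:   ${input_cost(50_000, 200, prefix_cached=False):.4f}")  # $0.1506
print(f"cached: ${input_cost(50_000, 200, prefix_cached=True):.4f}")   # $0.0156
```

The win depends on the prefix staying byte-identical across requests, so put stable content first and variable content last.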
Streaming
Receiving model output token-by-token as it generates, not waiting for the full response.
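A sketch against the OpenAI Python SDK's streaming interface (the model name is a placeholder; other SDKs follow the same pattern):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Explain KV caches in one paragraph."}],
    stream=True,
)

chunks = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render tokens as they arrive
        chunks.append(delta)
full_response = "".join(chunks)
```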
Model Routing
Sending requests to different models based on complexity, cost, or content type.
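A toy router; the model names, threshold, and keyword heuristic are all placeholders (production routers often use a small classifier instead):

```python
CHEAP_MODEL = "small-fast-model"     # placeholder names
STRONG_MODEL = "large-capable-model"

def route(prompt: str) -> str:
    # Crude heuristic: long prompts or "hard" keywords go to the strong model.
    hard_markers = ("prove", "refactor", "analyze", "step by step")
    if len(prompt) > 2_000 or any(m in prompt.lower() for m in hard_markers):
        return STRONG_MODEL
    return CHEAP_MODEL

print(route("What's the capital of France?"))         # small-fast-model
print(route("Analyze this contract for liability."))  # large-capable-model
```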
Rate Limiting (AI APIs)
The caps providers set on requests and tokens per minute, and strategies for staying under them.
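The standard coping strategy is retry with exponential backoff and jitter. A minimal sketch (the exception class is a stand-in for your SDK's 429 error):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in: substitute the HTTP 429 exception your SDK raises."""

def with_backoff(call, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            delay = min(2 ** attempt, 30) + random.uniform(0, 1)  # cap + jitter
            time.sleep(delay)
    return call()  # final attempt; let any error propagate to the caller
```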
LLMOps
The operational practice of running LLM-based systems in production: monitoring, versioning, and iteration.
Observability (AI Systems)
The ability to understand what your AI system is doing in production: inputs, outputs, latency, cost.
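At minimum that means one structured record per request. A sketch, with illustrative field choices:

```python
import json
import time
import uuid

def log_llm_call(model, prompt, response, input_tokens, output_tokens, started_at):
    record = {
        "id": str(uuid.uuid4()),
        "model": model,
        "latency_s": round(time.monotonic() - started_at, 3),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "prompt": prompt[:500],     # truncate; full payloads go to cold storage
        "response": response[:500],
    }
    print(json.dumps(record))  # in production, ship to your logging pipeline

started = time.monotonic()
log_llm_call("some-model", "Hi", "Hello!", input_tokens=1, output_tokens=2,
             started_at=started)
```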
Tracing (AI / LLM)
Recording the full execution path of an AI request: every LLM call, tool call, and intermediate step.
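A minimal sketch of span recording with a context manager (real tracers use parent/child span IDs, as in OpenTelemetry; this flat list is a simplification):

```python
import time
from contextlib import contextmanager

trace = []  # completed spans, in completion order

@contextmanager
def span(name, **attrs):
    entry = {"name": name, "attrs": attrs}
    start = time.monotonic()
    try:
        yield
    finally:
        entry["duration_s"] = round(time.monotonic() - start, 3)
        trace.append(entry)

with span("request", user="u123"):
    with span("retrieval", k=5):
        time.sleep(0.01)  # stand-in for a vector-store query
    with span("llm_call", model="some-model"):
        time.sleep(0.02)  # stand-in for the generation call

for s in trace:
    print(s)
```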
KV Cache
The attention keys and values a transformer stores during generation so it doesn't recompute them for every new token.
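Its size is easy to estimate: 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element. For a 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16):

```python
layers, kv_heads, head_dim = 32, 32, 128  # 7B-class transformer
bytes_per_elem = 2                        # fp16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(per_token / 1024)            # 512.0 KiB per token
print(per_token * 4096 / 1024**3)  # 2.0 GiB for a 4096-token context
```

This is why long contexts and large batches are memory-bound.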
Speculative Decoding
A technique that speeds up generation: a small draft model proposes several tokens and the large model verifies them in a single pass.
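A toy version of the greedy variant, with both models as stand-in functions; real implementations verify the draft's tokens in one batched forward pass of the target model:

```python
K = 4  # draft tokens proposed per step

def target_next(ctx):
    # Stand-in for the large model's greedy next token (here: count upward).
    return ctx[-1] + 1

def draft_next(ctx):
    # Stand-in for the cheap draft model: agrees with the target most of the
    # time, but is wrong whenever the last token is a multiple of 4.
    return ctx[-1] + (1 if ctx[-1] % 4 else 2)

def speculative_decode(ctx, n_tokens):
    while n_tokens > 0:
        # 1) Draft model proposes up to K tokens autoregressively (cheap).
        proposal = []
        for _ in range(min(K, n_tokens)):
            proposal.append(draft_next(ctx + proposal))
        # 2) Target model checks every proposed position (one pass in practice);
        #    accept the agreeing prefix, then substitute its own token and stop.
        accepted = []
        for i, tok in enumerate(proposal):
            correct = target_next(ctx + proposal[:i])
            accepted.append(tok if tok == correct else correct)
            if tok != correct:
                break
        ctx = ctx + accepted
        n_tokens -= len(accepted)
    return ctx

print(speculative_decode([0], 10))  # [0, 1, 2, ..., 10]
```

The output is identical to decoding with the target model alone; the speedup comes from accepting several draft tokens per target pass.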