Inference is the act of running a trained model on new input, the part you actually do in production, after training is done. Once a model is in users' hands, inference, not training, accounts for the large majority of ongoing AI compute cost. Optimizing inference (batching, quantization, smart routing) is usually a bigger lever than switching to a different model.
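As one concrete example, here is a minimal sketch of post-training dynamic quantization with PyTorch. The toy model and layer sizes are illustrative, not from any real deployment; the point is that `quantize_dynamic` converts `Linear` weights to int8 with no retraining, shrinking the model and speeding up CPU inference.

```python
# Minimal sketch: dynamic quantization with PyTorch (toy model, illustrative sizes).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model.eval()  # inference mode: fixed weights, no dropout/batch-norm updates

# Convert Linear weights to int8; activations are quantized on the fly at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = quantized(torch.randn(1, 512))  # same call signature, cheaper matmuls
print(out.shape)  # torch.Size([1, 10])
```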
Two metrics matter: latency (how long a single response takes) and throughput (how many requests you can serve per second). They trade off: bigger batches raise throughput, but every request waits longer in the queue, which worsens latency. Streaming responses improve perceived latency by showing output as it's generated, without changing total time. The sketch below makes the trade-off concrete.
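This is a toy micro-batching loop in pure Python, with no real model behind it; the names `queue`, `batcher`, and `handle` are invented for illustration. Raising `max_batch` or `max_wait` improves throughput by amortizing one model call over more requests, but each request then sits longer in the queue, which is exactly the latency cost described above.

```python
# Toy dynamic batcher: collects requests for up to max_wait seconds (or until
# max_batch is reached), then "runs the model" once on the whole batch.
import asyncio
import time

async def handle(queue: asyncio.Queue, request):
    """Enqueue one request and wait for its batched result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((request, fut))
    return await fut

async def batcher(queue: asyncio.Queue, max_batch: int = 8, max_wait: float = 0.01):
    while True:
        batch = [await queue.get()]                # block until one request arrives
        deadline = time.monotonic() + max_wait
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                batch.append(queue.get_nowait())   # sweep up anything already waiting
            except asyncio.QueueEmpty:
                await asyncio.sleep(0.001)         # brief pause before re-checking
        # Stand-in for model.forward(batch): one pass amortized over the batch.
        for req, fut in batch:
            fut.set_result(f"response:{req}")

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(handle(queue, i) for i in range(5))))

asyncio.run(main())
```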
Bring this to your business
Knowing the term is one thing. Shipping it is another.
We do two-week AI Sprints — one term, one workflow, into production by Day 10.