Chapter 1: Prerequisites

Scale and Specialization

AI inference at scale is fundamentally different from running a model on your laptop. When you move from development to production, you encounter challenges around latency, throughput, cost, and reliability that don't exist in a notebook environment.

Key Concepts:
- Scale refers to handling thousands or millions of requests per second. A single GPU can serve roughly 10-50 LLM requests per second, so serving millions of users requires careful orchestration.
- Specialization means choosing the right model size and architecture for your specific task. A 70B parameter model is overkill for sentiment classification — a fine-tuned 1B model may outperform it at 1/70th the cost.
- The inference cost curve is non-linear: doubling model size often more than doubles inference cost due to memory bandwidth bottlenecks.
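The first point above lends itself to a quick back-of-envelope capacity calculation. The sketch below is illustrative only: the per-GPU throughput and the 70% utilization headroom are assumptions, not measurements from any particular hardware.

```python
import math

def gpus_needed(target_rps: float, rps_per_gpu: float, headroom: float = 0.7) -> int:
    """GPUs required to sustain target_rps, running each GPU at
    `headroom` fraction of its peak throughput (illustrative assumption)."""
    return math.ceil(target_rps / (rps_per_gpu * headroom))

# Assume a single GPU sustains ~30 requests/second (mid-range of 10-50).
print(gpus_needed(target_rps=5_000, rps_per_gpu=30))
```

Running the headroom below 100% is deliberate: serving at full utilization leaves no slack for traffic spikes or tail latency, which is part of what "careful orchestration" entails.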

Rule of thumb: Always start with the smallest model that meets your quality bar, then scale up only if needed.