Inference

Running a trained model to produce outputs — as opposed to training, which updates its weights.

Inference is the serving phase: weights are frozen and the model generates predictions for new inputs. Its cost is dominated by memory bandwidth and the number of tokens processed, which is why throughput (tokens/second) and latency (time to first token) matter for deployment.