Tokenization

Splitting text into the discrete units a model actually reads.

Before a model sees text, a tokenizer breaks it into tokens — usually subword pieces — and maps each to an integer. Methods like byte-pair encoding ensure any string can be represented while keeping common words compact.

Tokenization shapes context-window limits, pricing (which is per-token), and how a model handles rare words, numbers, and other languages.