FF's Notes

Token

Jun 21, 2025

A token is the smallest unit of text that a model treats as an atomic symbol during training or inference.

It can be an entire word ("language", "modeling") or a sub-word piece ("lan", "guage"), depending on how a tokenization algorithm such as byte-pair encoding (BPE) splits the text.
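To make the sub-word idea concrete, here is a minimal sketch of how a BPE-style tokenizer could split one word. The merge list `MERGES` is made up for illustration (a real tokenizer learns its merges from a corpus), and `bpe_split` is a hypothetical helper name, not any library's API.

```python
def bpe_split(word, merges):
    """Split `word` into sub-word pieces by applying BPE merges in priority order."""
    # Earlier merges in the list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)  # start from single characters
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is mergeable any more
        merged, i = [], 0
        while i < len(pieces):
            if i < len(pieces) - 1 and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


# Hypothetical merge list, hand-written so that "language" ends up as "lan" + "guage".
MERGES = [("l", "a"), ("la", "n"), ("g", "e"), ("u", "a"), ("g", "ua"), ("gua", "ge")]

pieces = bpe_split("language", MERGES)
print(pieces)
```

With a different merge list the same word could stay whole or break into other pieces, which is why the split is a property of the tokenizer and not of the word.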

A token has two representations:

  1. String form, the human-readable characters ("lan"),
  2. Integer ID, an index into the model's vocabulary.
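The two representations can be sketched as a pair of lookups. The vocabulary below is a toy example invented for illustration, and the `<unk>` fallback for unknown strings is an assumption (many tokenizers handle unknown input differently).

```python
# Toy vocabulary: the position of each string in the list is its integer ID.
VOCAB = ["<unk>", "lan", "guage", "model", "ing"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def encode(tokens):
    """Map string-form tokens to integer IDs; unknown strings map to <unk>."""
    unk_id = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(token, unk_id) for token in tokens]


def decode(ids):
    """Map integer IDs back to their string form."""
    return [VOCAB[i] for i in ids]


ids = encode(["lan", "guage"])
print(ids)          # [1, 2]
print(decode(ids))  # ['lan', 'guage']
```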

During computation, each integer ID selects a row of an embedding matrix $E \in \mathbb{R}^{|V|\times d}$, and the resulting continuous vectors are fed to the network. Here $|V|$ is the vocabulary size and $d$ is the embedding dimension (the length of each token's vector).
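The lookup itself is plain row indexing. The sketch below uses only the standard library; the sizes (`VOCAB_SIZE = 5`, `EMBED_DIM = 4`) and the random initialization are assumptions for illustration, since a trained model learns the entries of $E$. It also checks that indexing row $i$ gives the same vector as multiplying a one-hot vector by $E$.

```python
import random

VOCAB_SIZE = 5  # |V|, number of rows in E (assumed toy value)
EMBED_DIM = 4   # d, length of each embedding vector (assumed toy value)

# Randomly initialized stand-in for the embedding matrix E of shape |V| x d.
rng = random.Random(0)
E = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]


def embed(token_ids):
    """Look up one d-dimensional row of E per token ID."""
    return [E[i] for i in token_ids]


def one_hot_matmul(token_id):
    """Compute one_hot(token_id) @ E explicitly, to show it matches row indexing."""
    one_hot = [1.0 if j == token_id else 0.0 for j in range(VOCAB_SIZE)]
    return [
        sum(one_hot[j] * E[j][k] for j in range(VOCAB_SIZE))
        for k in range(EMBED_DIM)
    ]


vectors = embed([1, 2])
print(len(vectors), len(vectors[0]))  # 2 4
```

Indexing avoids building the one-hot vector, so in practice the lookup costs one row read per token regardless of vocabulary size.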