Tokenization and Encoding, A Primer
Lately, all we have heard about is tokenization and embeddings and the role they play in the greater LLM and AI ecosystem. These two concepts are among the most fundamental in language modeling and remain the foundation of the technology we interact with on a daily basis. In this article, we will cover the basics of tokenizing and embedding sequences of text and some of their nuances.
This article assumes that the reader has a basic understanding of Python.
Tokenization
At its core, tokenization converts a string (a sequence of words) into a numerical representation (a sequence of integers). A language model then places a probability distribution over this sequence of integers. A common name for these integers is tokens, hence the term tokenizer.
Now, in order to turn strings into these tokens, we need a process that encodes strings into tokens; we also need to be able to decode tokens back into strings. A term you will see frequently is vocabulary size, which denotes the number of possible tokens. For example, the BERT tokenizer uses ~30,000 tokens while Transformer-XL uses ~267,000 tokens. Tokenizers don’t have to be limited to whole words; they can also cover parts of words (subwords), pairs of characters (or bytes), and even special symbols. What you choose depends on your use case.
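For a quick feel of a real tokenizer, here is a short example (assuming the Hugging Face transformers library is installed; it downloads the BERT tokenizer on first use):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size)   # 30522 -- the ~30,000 tokens mentioned above

ids = tokenizer.encode("I love to eat pie, do you?")
print(ids)                    # a list of integer token IDs (including special tokens)
print(tokenizer.decode(ids))  # the text back (lowercased by this tokenizer), wrapped in [CLS] ... [SEP]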
If you want to experiment with tokenizers and see how they work, here is a cool site to visualize them.
Common Tokenizers
In this section, we will cover, at a very high level, how a tokenizer works and how a few different tokenizers differ. In practice, you will rarely find yourself implementing a tokenizer unless you have a bespoke use case; this is merely an exercise to solidify our knowledge.
Word-based Tokenizer
We first start off with a basic example. Let us take the sentence “I love to eat pie, do you?”. The simplest way to tokenize it is to split it on spaces:
sentence = "I love to eat pie, do you?"
tokens = sentence.split(' ')
Now, we obtain a result of:
['I', 'love', 'to', 'eat', 'pie,', 'do', 'you?']
Each one of these tokens is then mapped to an integer, which might give us something like:
[31, 998, 211, 11, 4, 67, 33]
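As a rough sketch, we could build such a mapping ourselves (the IDs here are simply assigned in sorted order, so they differ from the made-up IDs above):

# Continuing from the snippet above: assign each unique token an arbitrary integer ID.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

def encode(text):
    # Raises a KeyError on any word not seen before -- one of the problems noted below.
    return [vocab[token] for token in text.split(' ')]

def decode(ids):
    id_to_token = {idx: token for token, idx in vocab.items()}
    return ' '.join(id_to_token[i] for i in ids)

encoded = encode(sentence)
print(encoded)          # [0, 3, 5, 2, 4, 1, 6]
print(decode(encoded))  # "I love to eat pie, do you?"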
Great! We have created our first tokenizer. However, there are a few problems:
- Word-based tokenizing creates a large vocabulary, which can lead to increased memory usage and time complexity. There is also no fixed bound on the vocabulary size.
- Languages that don’t natively use spaces between words (Chinese, Japanese, Thai, etc.) may not be represented well.
- This approach will not handle unknown or rare words (or niche dialects) effectively, since they appear with low frequency, if at all, in the training distribution.
Byte-based Tokenizer
Let’s try a different approach. Instead of looking at each word, let us map each character to its underlying byte values (for example, using the ASCII/UTF-8 encoding). Some characters require several bytes and some require only one, but at the end of the day, we are able to map characters to bytes. For example,
"I love pie" -> b"\x49\x20\x6c\xcf\x76\x65\x20\x70\x69\x65
This is great: since a byte can take at most 256 values, it allows us to keep a small vocabulary size. However, the compression ratio is poor (every character becomes at least one token), which makes raw bytes a less viable candidate for transformer architectures with their limited context windows.
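A minimal sketch of this idea, using Python's built-in UTF-8 encoding:

sentence = "I love pie"

# Each byte of the UTF-8 encoding becomes a token ID between 0 and 255.
byte_tokens = list(sentence.encode("utf-8"))
print(byte_tokens)                          # [73, 32, 108, 111, 118, 101, 32, 112, 105, 101]

# Decoding is simply the reverse operation.
print(bytes(byte_tokens).decode("utf-8"))   # "I love pie"

Notice that the token sequence is at least as long as the text itself, which is exactly the poor compression ratio mentioned above.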
Byte-pair Encoding
Byte-pair encoding takes a different approach compared to the two above. The basic idea is that there is no static tokenizer; instead, we “train” a tokenizer to determine the vocabulary that is generated. The core of the algorithm is that common pairs of bytes (or characters) can be represented as a single token. We can take this a step further and have longer sequences represented by these tokens.
As an example, assume that we have the string “aaabbcabd”. We can go through it and form these pairs (leaving a lone “d” at the end):
"aa", "ab", "bc", "ab"
By doing so, we can now create a mapping of byte pairs to tokens:
aa -> X
ab -> Y
bc -> Z
With this mapping, the encoded string becomes “XYZYd”, with the character “d” left as-is. We can now recurse by mapping XY -> A, which results in “AZYd”. The conversion table is stored along with the tokenized text, which means that by applying the mappings in reverse, we are able to recover the original text exactly.
Another benefit of this approach is that it captures the statistics of the entire body of text being encoded, since the most frequent byte pairs are the ones that get merged into single tokens.
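As a rough sketch of what “training” such a tokenizer looks like (greatly simplified, and ignoring many details that production tokenizers handle), we can repeatedly count adjacent pairs and merge the most frequent one:

from collections import Counter

def train_bpe(text, num_merges):
    # Start from individual characters; a byte-level BPE would start from bytes instead.
    seq = list(text)
    merges = {}
    for _ in range(num_merges):
        # Count every adjacent pair in the current sequence.
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        # Represent the merged pair with a new token (here, just the concatenation).
        new_token = best[0] + best[1]
        merges[best] = new_token
        # Replace every occurrence of the pair with the new token.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(new_token)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

encoded, merges = train_bpe("aaabbcabd", num_merges=2)
print(encoded)   # ['aa', 'ab', 'b', 'c', 'ab', 'd']
print(merges)    # {('a', 'a'): 'aa', ('a', 'b'): 'ab'}

Note that this version merges whichever pair is most frequent rather than pairing characters strictly left to right, so its output differs slightly from the hand-worked example above; the merge table plays the role of the conversion table described earlier.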
Embeddings
Great, we now know how to tokenize and encode text. However, these sequences of tokens do not have any meaning attached to them. This is where embeddings come into play. Within natural language processing, “embeddings refer to the process of mapping non-vectorized data, such as tokens, into a vector space that has meaning1.” Through embeddings, models are able to learn the relationships between words and their meanings without humans having to manually craft these relationships.
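As a minimal sketch (assuming PyTorch is installed; the vocabulary size and dimensionality below are made up for illustration), an embedding layer is essentially a lookup table from token IDs to learned vectors:

import torch
import torch.nn as nn

# A hypothetical 30,000-token vocabulary, with each token mapped to a 256-dimensional vector.
embedding = nn.Embedding(num_embeddings=30_000, embedding_dim=256)

# A toy sequence of token IDs (the output of a tokenizer).
token_ids = torch.tensor([31, 998, 211, 11, 4, 67, 33])

vectors = embedding(token_ids)
print(vectors.shape)   # torch.Size([7, 256])

# The vectors start out random; training nudges them so that related tokens
# end up close to each other in the vector space.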
Common Embedders
While embeddings can be used in many domains, such as text, video, and imagery, we focus mainly on text embeddings. Additionally, given the large amount of research and content on the topic, we merely provide an overview of embeddings and link to other resources2.
Word2Vec3
Word2Vec is a technique used in NLP for obtaining vector representations of words. As discussed before, the generated vectors capture contextual information about the words. Word2Vec was one of the first embedding models, developed by researchers at Google.
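As a quick illustration (assuming the gensim library, version 4.x, is installed; the toy corpus here is obviously far too small to learn anything meaningful):

from gensim.models import Word2Vec

# A tiny toy corpus of pre-tokenized sentences.
corpus = [
    ["i", "love", "to", "eat", "pie"],
    ["do", "you", "love", "pie"],
]

# Train a small Word2Vec model: 50-dimensional vectors, context window of 2.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

print(model.wv["pie"].shape)         # (50,) -- the learned vector for "pie"
print(model.wv.most_similar("pie"))  # tokens closest to "pie" in the vector space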
GloVe4
Another algorithm is GloVe (Global Vectors for Word Representation). GloVe is an unsupervised learning algorithm that produces vector representations for words.
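Unlike Word2Vec, GloVe is trained on global word co-occurrence counts. Pretrained GloVe vectors are distributed as plain text files (one word and its vector per line), so loading them is straightforward; the file name below is one of the standard downloads from the Stanford NLP project and is assumed to be present locally:

# Load pretrained GloVe vectors from a local text file.
# Format: each line is a word followed by its vector components, separated by spaces.
def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = [float(v) for v in values]
    return vectors

glove = load_glove("glove.6B.50d.txt")   # assumed to be downloaded beforehand
print(len(glove["pie"]))                 # 50 dimensions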
Wrap-Up5
We covered a lot of content, at a high level, about the building blocks of LLMs: tokenization and embeddings. There is still much more to these topics than what is in this article. However, as a summary to recap what we learned:
- Input text is tokenized into a sequence of tokens.
- Each token is encoded, i.e. mapped to a unique integer ID.
- The encoded tokens are then embedded into an n-dimensional vector.
- These vectors are used to train a model.
And that’s all! Stay tuned for more fun articles around concepts in language modeling.