Building a Tokenizer for an LLM

Welcome to Cloud Dude's Blog page! This blog post is based on the second video in our series on building Large Language Models (LLMs) from scratch, following the book "Build a Large Language Model from Scratch" by Sebastian Raschka, available at Manning.com.

Introduction

In this series, we're diving deep into the world of LLMs, like ChatGPT, to understand how they work and how to build one from scratch. This post covers Chapter 2 of the book, which focuses on building a dataset for training our LLM.

Chapter 2: Building a Dataset

Understanding the Dataset

The core idea behind an LLM is to take user input, process it, and generate meaningful output. For instance, if you ask the model to write a book about different breeds of cats, it needs to understand the request, fetch relevant information, and then generate the content. This process involves training the model on a dataset that acts as its knowledge base.

Below is an image of how this might look at a high level.

Training the Dataset

Training on a dataset starts with encoding a large amount of text into tokens that the model can understand and use. Essentially, you have to give the model a vocabulary so that it can go out into the world and process the information it meets. I mentioned in the YouTube video that I believe you should give it the English dictionary, because then it would have all the words and letters it needs to process that information.

In the book, Sebastian Raschka uses "The Verdict", a short story by Edith Wharton, as an example dataset. The goal is to encode this text into tokens, which the model can then use to process other tasks.
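To make the idea of "encoding text into tokens" concrete, here is a minimal sketch. It assumes the same regular expression used later in this post, and the sample sentence is made up for illustration (it is not from "The Verdict").

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "Hello, world. Is this-- a test?"

# Split on punctuation, double dashes, and whitespace. The capture group
# keeps the separators, so each punctuation mark becomes its own token.
pieces = re.split(r'([,.?_!"()\']|--|\s)', sample)
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# Give every unique token an integer ID, in sorted order.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Encoding is then just a dictionary lookup per token.
token_ids = [vocab[t] for t in tokens]
print(token_ids)
# [4, 0, 9, 2, 5, 8, 1, 6, 7, 3]
```

The vocabulary here is nothing more than a dictionary from token to number, which is why every token the model will see needs an entry in it.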

Practical Implementation

Using Jupyter Notebooks

The book provides code snippets in Jupyter Notebooks, which are great for running pieces of code interactively. However, as a developer, you might prefer to build a complete program. This involves creating classes and importing them into a main file to run the entire process as a cohesive unit.

Your folder structure would look like so:

┣ 📜main.py
┣ 📜simple_tokenizer.py

You would have the class in simple_tokenizer.py and that would look like this:

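import re


class SimpleTokenizerV1:
    def __init__(self, vocab, unknown_token="<|unk|>"):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}
        # Placeholder for words that are missing from the vocabulary.
        # It must itself be a key in vocab. Without this attribute, the
        # fallback in encode() raises an AttributeError.
        self.unknown_token = unknown_token

    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace, keeping the separators.
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only items.
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # The original one-liner raised a KeyError on the first unknown word:
        #ids = [self.str_to_int[s] for s in preprocessed]
        ids = []
        for s in preprocessed:
            if s in self.str_to_int:
                ids.append(self.str_to_int[s])
            else:
                print(f"Warning: '{s}' not found in vocabulary. Using '{self.unknown_token}' token.")
                ids.append(self.str_to_int[self.unknown_token])
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() put in front of punctuation.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text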

Then, as we are working in Python, you would import that class into your main.py file, like so:

import simple_tokenizer

To use that class to build the vocabulary and turn the text into tokens, you would write something similar to the following:

import urllib.request
import re
import simple_tokenizer

## Download the Text File
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

## Read the Text File
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

## Define the Vocab
# A small hand-written vocab like the one below is not enough to process a book:
# the tokenizer needs an entry for every word and punctuation mark it will meet.
# To get this working I had to build the vocab from the book itself before
# processing the book, which felt counterintuitive at first.
#vocab = {
#    'The': 1, 'verdict': 2, 'is': 3, 'in': 4, '.': 5,
#    'It': 6, 'was': 7, 'a': 8, 'great': 9, 'success': 10, 's':11
#}

## Now Process the Book and convert it to Tokens.
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

## Build the token list: set() removes duplicates and sorted() orders the tokens.
## (Whitespace was already dropped by the strip() filter above.)
all_tokens = sorted(set(preprocessed))
## Add special tokens for end-of-text and for words missing from the vocab.
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab_size = len(all_tokens)
print(vocab_size)

## Map each token to an integer ID, then print the first 51 entries.
vocab = {token: integer for integer, token in enumerate(all_tokens)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

## Initialise the tokenizer
tokenizetext = simple_tokenizer.SimpleTokenizerV1(vocab)

## Encode the Text
converttoids = tokenizetext.encode(raw_text)
print("Encoded Text:", converttoids)

## Decode the Text
decodeids = tokenizetext.decode(converttoids)
print("Decoded Text", decodeids)

This is enough to build a very basic vocabulary from the book and use it to encode and decode the text.
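One thing worth checking is whether decoding gives back exactly what was encoded. The sketch below reuses the two regular expressions from the tokenizer in standalone helper functions. The names `tokenize` and `detokenize` and the sample sentence are my own, chosen for illustration.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (same regex as encode)."""
    pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]

def detokenize(tokens):
    """Join tokens with spaces, then remove spaces before punctuation (as in decode)."""
    text = " ".join(tokens)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Hypothetical sample sentence, used only for illustration.
sample = "Hello, world. Is this-- a test?"
restored = detokenize(tokenize(sample))

print(restored)            # Hello, world. Is this -- a test?
print(restored == sample)  # False: a space now appears before the "--"
```

So the round trip is close but not exact: the original spacing is not stored anywhere, and decode can only guess at it. For this simple tokenizer that seems an acceptable trade-off, but it is worth knowing before comparing decoded text against the source.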

Handling Errors

One common issue when working with datasets is handling unknown tokens. For example, if the model encounters a word that isn't in its vocabulary, it should handle this gracefully rather than throwing an error. In Python, this can be managed by checking for the presence of each token in the vocabulary and using a placeholder for unknown tokens.

You could do this like so:

import re


class SimpleTokenizerBetterErrorV1:
    def __init__(self, vocab):
        # Copy the vocabulary so the caller's dictionary is not modified.
        self.str_to_int = dict(vocab)
        self.unknown_token = '<UNK>'
        # Register the placeholder under the next free ID if it is not already present.
        if self.unknown_token not in self.str_to_int:
            next_id = max(self.str_to_int.values(), default=-1) + 1
            self.str_to_int[self.unknown_token] = next_id
        # Build the reverse mapping last so that it includes the placeholder.
        self.int_to_str = {i: s for s, i in self.str_to_int.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = []
        for s in preprocessed:
            if s in self.str_to_int:
                ids.append(self.str_to_int[s])
            else:
                print(f"Warning: '{s}' not found in vocabulary. Using '{self.unknown_token}' token.")
                ids.append(self.str_to_int[self.unknown_token])
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() put in front of punctuation.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
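To see the fallback in action without the whole class, here is a compact sketch of the same idea using `dict.get`. The function name `encode_with_fallback` and the toy vocabulary are assumptions for illustration, not code from the book.

```python
import re

def encode_with_fallback(text, vocab, unknown_token="<UNK>"):
    """Map tokens to IDs, substituting the unknown-token ID for missing words."""
    pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    unknown_id = vocab[unknown_token]
    # dict.get returns the fallback ID instead of raising a KeyError.
    return [vocab.get(t, unknown_id) for t in tokens]

# Hypothetical toy vocabulary; the IDs are arbitrary.
toy_vocab = {"The": 0, "verdict": 1, "is": 2, "in": 3, ".": 4, "<UNK>": 5}

ids = encode_with_fallback("The verdict is out.", toy_vocab)
print(ids)  # [0, 1, 2, 5, 4]  ("out" is not in the vocabulary, so it maps to 5)
```

The word that is missing from the vocabulary no longer stops the program. It is simply encoded as the placeholder ID.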

Without that check, Python just spits out a traceback that really does not tell me much about what went wrong:

Traceback (most recent call last):
  File "/home/jason/Dev_Ops/LLMS/MyLLMVersion/Ch2/main.py", line 52, in <module>
    converttoids = tokenizetext.encode(raw_text)
  File "/home/jason/Dev_Ops/LLMS/MyLLMVersion/Ch2/simple_tokenizer.py", line 12, in encode
    ids = [self.str_to_int[s] for s in preprocessed]
  File "/home/jason/Dev_Ops/LLMS/MyLLMVersion/Ch2/simple_tokenizer.py", line 12, in <listcomp>
    ids = [self.str_to_int[s] for s in preprocessed]
KeyError: 'I'

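With the vocabulary check in place, the first line of output is a warning that makes much better sense. The AttributeError that follows it is a separate bug in the version I ran: SimpleTokenizerV1 never set self.unknown_token in __init__, so the fallback lookup itself failed. Setting that attribute in __init__ fixes it.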

Warning: 'I' not found in vocabulary. Using '<UNK>' token.
Traceback (most recent call last):
  File "/home/jason/Dev_Ops/LLMS/MyLLMVersion/Ch2/main.py", line 52, in <module>
    converttoids = tokenizetext.encode(raw_text)
  File "/home/jason/Dev_Ops/LLMS/MyLLMVersion/Ch2/simple_tokenizer.py", line 19, in encode
    ids.append(self.str_to_int[self.unknown_token])
AttributeError: 'SimpleTokenizerV1' object has no attribute 'unknown_token'

As I touched on in the video, I find this a real letdown in Python: nothing makes you deal with an error until it surfaces as a traceback at runtime. In Go, you can't get away with this so easily. Go functions return errors as values, which gets you, as a developer, to think about what type of error is going to come back to the user. That essentially makes you a better developer.
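Python can still get closer to that explicit style if you choose to. The sketch below is one possible approach, not something from the book: it raises a single custom exception that names every missing token, so the caller can decide what message the user sees. `UnknownTokenError`, `encode_strict`, and the toy vocabulary are hypothetical names introduced for illustration.

```python
class UnknownTokenError(ValueError):
    """Raised when text contains tokens that are missing from the vocabulary."""

    def __init__(self, missing):
        self.missing = missing
        super().__init__(
            f"{len(missing)} token(s) not in vocabulary: {', '.join(missing)}"
        )

def encode_strict(tokens, vocab):
    """Return token IDs, or raise one error that names every missing token."""
    missing = sorted({t for t in tokens if t not in vocab})
    if missing:
        raise UnknownTokenError(missing)
    return [vocab[t] for t in tokens]

# Hypothetical toy vocabulary; the IDs are arbitrary.
toy_vocab = {"The": 0, "verdict": 1, "is": 2, "in": 3, ".": 4}

try:
    encode_strict(["I", "was", "in", "."], toy_vocab)
except UnknownTokenError as err:
    message = str(err)
    print(f"Could not encode text: {message}")
    # Could not encode text: 2 token(s) not in vocabulary: I, was
```

Compared with a bare KeyError, the caller gets one readable message and a `missing` list it can act on, for example by extending the vocabulary.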

Conclusion

I would highly recommend buying this book; you can get it from here: https://www.manning.com/books/build-a-large-language-model-from-scratch

To begin building an LLM, you first need to build a tokenizer that can take in human language and convert it into numbers (token IDs) using a vocabulary, which essentially acts as a lookup table. Then, you can ask it to process data and decode the IDs back into human-readable results.

I hope that was insightful, and I look forward to writing the next post about this project.

Happy coding,
Cloud Dude
