Tokenization Explorer: Understanding How LLMs Read Text#


What is Tokenization?#

Tokenization is the process of breaking text into smaller units called tokens, each of which is mapped to an integer ID that the model processes. Think of it as teaching a computer how to “read” text.

Why Does This Matter?#

  1. LLMs don’t read characters - They read tokens (which can be parts of words, whole words, or even punctuation)

  2. Cost implications - Most LLM APIs bill by token count, not character count

  3. Context limits - Models have maximum token limits (e.g., GPT-4 shipped with 8K and 32K context windows, and GPT-4 Turbo with 128K)

  4. Language efficiency - Different languages require different numbers of tokens for the same meaning

Key Concepts:#

  • A token ≈ 0.75 words in English (roughly 4 characters)

  • Different tokenizers split text differently

  • Common words are often single tokens: "hello" → [hello]

  • Uncommon words get split: "tokenization" → [token, ization]

  • Non-English text often requires more tokens
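
The rules of thumb above can be turned into a quick estimator. This is a minimal sketch: `estimate_tokens` is a hypothetical helper, not part of any library, and it only applies the approximate English ratios quoted in the list. Real counts must come from the tokenizer itself.

```python
def estimate_tokens(text: str) -> dict:
    """Rough English-only token estimates from the two rules of thumb."""
    chars = len(text)
    words = len(text.split())
    return {
        "from_chars": round(chars / 4),     # ~4 characters per token
        "from_words": round(words / 0.75),  # ~0.75 words per token
    }


sample = "Artificial intelligence is transforming technology."
estimates = estimate_tokens(sample)
print(estimates)
```

The two heuristics can disagree noticeably on short or long-worded text, which is a useful reminder that they are budgeting aids rather than measurements.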


Setup: Installing Required Libraries#

Local setup

This notebook is excluded from the website build and is not executed during normal site generation. To run it locally from the course repository, install the optional LLM dependency group first:

poetry install --with llm

If you are using a separate Jupyter or Colab environment, install the notebook packages there instead:

%pip install tiktoken transformers

# Import dependencies
import tiktoken
from transformers import AutoTokenizer

Part 1: Basic Tokenization#

Let’s start by seeing how a simple sentence is tokenized.

# Load OpenAI's GPT-4 tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")  # Used by GPT-4 and GPT-3.5-turbo

# Enter a sentence you wish to tokenize
text = "Artificial intelligence is transforming technology."

# Tokenize
tokens = tokenizer.encode(text)
token_strings = [tokenizer.decode([token]) for token in tokens]

print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Token strings: {token_strings}")
Original text: Artificial intelligence is transforming technology.
Tokens: [9470, 16895, 11478, 374, 46890, 5557, 13]
Token strings: ['Art', 'ificial', ' intelligence', ' is', ' transforming', ' technology', '.']

Observation#

Notice how:

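  • Common words are typically one token, but "Artificial" is split into 'Art' and 'ificial'

  • The final period is a separate token

  • A space is attached to the start of the following token (e.g., ' is') rather than stored as a token of its own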
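
One way to see why the spaces matter: concatenating the token strings with no separator should give back the original sentence. The sketch below reuses the token strings printed above as a hard-coded list, so it runs without any tokenizer installed.

```python
text = "Artificial intelligence is transforming technology."

# Token strings copied from the cl100k_base output above
token_strings = ["Art", "ificial", " intelligence", " is",
                 " transforming", " technology", "."]

# Spaces live inside the tokens, so a plain join restores the text
reconstructed = "".join(token_strings)
print(reconstructed == text)

# Tokens that carry a leading space mark the start of a new word
word_starts = [t for t in token_strings if t.startswith(" ")]
print(word_starts)
```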


Part 2: Comparing Different Tokenizers#

Different models use different tokenizers. Let’s see how the same text is tokenized differently!

# Sample text
sample_text = "Artificial intelligence is transforming technology."

# Initialize different tokenizers
gpt2_tok   = AutoTokenizer.from_pretrained("gpt2")
bert_tok   = AutoTokenizer.from_pretrained("bert-base-uncased")
t5_tok     = AutoTokenizer.from_pretrained("t5-small")
deep_tok  = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

# Tokenize with each
gpt2_tokens = gpt2_tok.encode(sample_text, add_special_tokens=False)
bert_tokens = bert_tok.encode(sample_text, add_special_tokens=False)
t5_tokens = t5_tok.encode(sample_text, add_special_tokens=False)
deep_tokens = deep_tok.encode(sample_text, add_special_tokens=False)

print(sample_text)
print("")
print(gpt2_tokens)
print(bert_tokens)
print(t5_tokens)
print(deep_tokens)
Artificial intelligence is transforming technology.

[8001, 9542, 4430, 318, 25449, 3037, 13]
[7976, 4454, 2003, 17903, 2974, 1012]
[24714, 6123, 19, 3, 21139, 748, 5]
[9470, 16895, 93375, 3843, 598, 55857, 59342, 13]

We can reverse the process and decode the tokens from their numerical IDs back to natural language, provided we use the same tokenizer that produced them:

print("Correct tokenizer:")
print(gpt2_tok.decode(gpt2_tokens))
print("")
print("Incorrect tokenizer:")
print(bert_tok.decode(gpt2_tokens))
Correct tokenizer:
Artificial intelligence is transforming technology.

Incorrect tokenizer:
revised luxuryah [unused313] truncated interest [unused12]
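
The reason the wrong tokenizer produces nonsense is that a token ID is only an index into one particular vocabulary. The following is a toy sketch with two made-up vocabularies (`vocab_a`, `vocab_b`, and `decode` are hypothetical and far smaller than any real tokenizer's tables):

```python
# Two toy vocabularies: the same IDs map to unrelated strings in each
vocab_a = {0: "Art", 1: "ificial", 2: " intelligence", 3: "."}
vocab_b = {0: "revised", 1: " luxury", 2: "ah", 3: " [unused]"}


def decode(token_ids, vocab):
    """Look each ID up in the given vocabulary and concatenate the pieces."""
    return "".join(vocab[i] for i in token_ids)


ids = [0, 1, 2, 3]  # "encoded" with vocab_a

right = decode(ids, vocab_a)  # same vocabulary: round-trips
wrong = decode(ids, vocab_b)  # valid IDs, unrelated text
print(right)
print(wrong)
```

Nothing fails in the second call: every ID is valid in both tables, so the mismatch shows up only as unrelated text.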

Exercise 1: Decode the Individual Tokens#

Task: Instead of decoding the whole token list at once to reproduce the original text, decode each token one at a time so you can see the individual pieces.

Questions to answer:

  1. Do all tokenizers work the same way?

  2. How is punctuation handled differently?

Hint:

for token in token_list:
    token_string = tokenizer.decode([token])
    print(f"{token}\t -> \t'{token_string}'")

Try this with different tokenizers.

# using gpt2
for token in gpt2_tokens:
    token_string = gpt2_tok.decode([token])
    print(f"{token}\t -> \t'{token_string}'")
8001	 -> 	'Art'
9542	 -> 	'ificial'
4430	 -> 	' intelligence'
318	 -> 	' is'
25449	 -> 	' transforming'
3037	 -> 	' technology'
13	 -> 	'.'
# using deepseek
for token in deep_tokens:
    token_string = deep_tok.decode([token])
    print(f"{token}\t -> \t'{token_string}'")
9470	 -> 	'Art'
16895	 -> 	'ificial'
93375	 -> 	'intelligence'
3843	 -> 	'istr'
598	 -> 	'ans'
55857	 -> 	'forming'
59342	 -> 	'technology'
13	 -> 	'.'

Key Takeaway#

  • Different models use different tokenization strategies (This is why token counts can vary when using different models!)

  • Tokenization breaks text into chunks that a model can process.

  • Tokens capture punctuation, letter case, and symbols, not just words.
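
The first takeaway can be checked directly against the token IDs recorded in Part 2. This sketch hard-codes those lists (copied from the output above) and counts them, so it needs no model downloads.

```python
# Token IDs copied from the Part 2 output for the same sentence
recorded = {
    "gpt2": [8001, 9542, 4430, 318, 25449, 3037, 13],
    "bert-base-uncased": [7976, 4454, 2003, 17903, 2974, 1012],
    "t5-small": [24714, 6123, 19, 3, 21139, 748, 5],
    "DeepSeek-R1-Distill-Llama-8B": [9470, 16895, 93375, 3843, 598,
                                     55857, 59342, 13],
}

# Same sentence, different token counts per tokenizer
counts = {name: len(ids) for name, ids in recorded.items()}
for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} {n} tokens")
```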


Part 3: Hands-On Exercises#

Now it’s your turn! Try these exercises to deepen your understanding.

Exercise 1: Analyze Your Own Text#

Task: Take a paragraph from your research, documentation, or a recent email and analyze its tokenization.

Questions to answer:

  1. How many tokens does it use?

  2. What’s the character-to-token ratio?

  3. Which words get split into multiple tokens?

  4. How much would it cost to process this text 1000 times with GPT-4, given gpt4_input_price_per_1k_tokens_usd = 0.03?

# Exercise 1: Analyze your own text (worked example)

encoding = tiktoken.get_encoding("cl100k_base")

my_text = (
    "In our LLM workshop, we compare tokenization strategies to understand cost, "
    "latency, and context-window limits. A single sentence can split into very "
    "different token sequences depending on punctuation, capitalization, and "
    "compound terms such as fine-tuning and retrieval-augmented generation."
)

# 1. How many tokens does it use? 
token_ids = encoding.encode(my_text)
token_count = len(token_ids)
print(f"1) Token count: {token_count}")

# 2. What is the character to token ratio for your chosen tokenizer?
char_count = len(my_text)
char_to_token_ratio = char_count / token_count if token_count else 0
print(f"2) Character-to-token ratio: {char_to_token_ratio:.2f} ({char_count} chars / {token_count} tokens)")


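# 3. Which words get split into multiple tokens?
# Caveat: each word is encoded on its own, without the leading space it has
# inside the sentence, so the pieces can differ from the in-context tokens.
words = [w.strip(".,!?;:\"'()[]{}") for w in my_text.split()]
split_words = []
for w in words:
    if not w:
        continue
    w_tokens = encoding.encode(w)
    if len(w_tokens) > 1:
        split_words.append((w, len(w_tokens), w_tokens))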



# 4. How much would it cost to process this text 1000 times with GPT-4 given `gpt4_input_price_per_1k_tokens_usd = 0.03`. 
gpt4_input_price_per_1k_tokens_usd = 0.03
cost_per_run_usd = (token_count / 1000) * gpt4_input_price_per_1k_tokens_usd
cost_1000_runs_usd = cost_per_run_usd * 1000

print("3) Words split into multiple tokens:")
if split_words:
    for word, n_tokens, ids in split_words:
        print(f"   - {word}: {n_tokens} tokens -> {ids}")
else:
    print("   - None")

print(
    f"4) Cost for 1000 runs with GPT-4 input pricing (${gpt4_input_price_per_1k_tokens_usd}/1K tokens): "
    f"${cost_1000_runs_usd:.4f}"
)
1) Token count: 54
2) Character-to-token ratio: 5.41 (292 chars / 54 tokens)
3) Words split into multiple tokens:
   - LLM: 2 tokens -> [4178, 44]
   - workshop: 2 tokens -> [1816, 8845]
   - tokenization: 2 tokens -> [5963, 2065]
   - strategies: 2 tokens -> [496, 70488]
   - understand: 2 tokens -> [8154, 2752]
   - latency: 2 tokens -> [5641, 2301]
   - context-window: 2 tokens -> [2196, 42866]
   - punctuation: 2 tokens -> [79, 73399]
   - capitalization: 2 tokens -> [66163, 2065]
   - fine-tuning: 3 tokens -> [63157, 2442, 38302]
   - retrieval-augmented: 6 tokens -> [265, 9104, 838, 7561, 773, 28078]
4) Cost for 1000 runs with GPT-4 input pricing ($0.03/1K tokens): $1.6200
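
The worked answer to question 3 encodes each word in isolation, so it can report a split that does not occur when the word sits inside a sentence with its leading space. A hedged alternative is to tokenize the full text once and group the resulting token strings back into words. `group_tokens_into_words` is a hypothetical helper, and the sample token strings are copied from the Part 1 output so the sketch runs without a tokenizer.

```python
def group_tokens_into_words(token_strings):
    """Group token strings into words.

    A new group starts at a token with a leading space, or at a token
    made only of punctuation; any other token joins the current group.
    """
    groups = []
    for tok in token_strings:
        is_punct = tok.strip() != "" and not any(c.isalnum() for c in tok)
        starts_word = tok.startswith(" ") or not groups
        if starts_word or is_punct:
            groups.append([tok])
        else:
            groups[-1].append(tok)
    return groups


# Token strings from the Part 1 cl100k_base output
token_strings = ["Art", "ificial", " intelligence", " is",
                 " transforming", " technology", "."]

split_words = [
    ("".join(pieces).strip(), pieces)
    for pieces in group_tokens_into_words(token_strings)
    if len(pieces) > 1
]
for word, pieces in split_words:
    print(f"{word!r} spans {len(pieces)} tokens: {pieces}")
```

With a real tokenizer, the input list would be the per-token decodes of the full text, as in Part 1.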

Exercise 2: Optimize a Prompt#

Task: You have this verbose prompt. Can you reduce the token count while keeping the same meaning?

Original prompt:

"Please provide me with a comprehensive and detailed explanation regarding 
the various different ways in which machine learning algorithms can be 
utilized and applied in the field of healthcare and medical diagnostics."

Goal: Reduce tokens by at least 30% without losing meaning.

# Exercise 2: Your code here
encoding = tiktoken.get_encoding("cl100k_base")

original_prompt = """Please provide me with a comprehensive and detailed explanation 
regarding the various different ways in which machine learning algorithms can be 
utilized and applied in the field of healthcare and medical diagnostics."""

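# Replace the placeholder below with your rewrite; until then, the reported
# reduction only reflects the length of the placeholder string.
optimized_prompt = """[Your optimized version here]"""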

# Compare token counts
original_tokens = len(encoding.encode(original_prompt))
optimized_tokens = len(encoding.encode(optimized_prompt))

print(f"Original: {original_tokens} tokens")
print(f"Optimized: {optimized_tokens} tokens")
print(f"Reduction: {((original_tokens - optimized_tokens) / original_tokens * 100):.1f}%")
Original: 37 tokens
Optimized: 6 tokens
Reduction: 83.8%
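
The 83.8% above appears to come from the unedited placeholder string rather than a real rewrite. To make the 30% goal concrete, the arithmetic can be sketched as follows; `max_tokens_for_reduction` and `reduction_pct` are hypothetical helpers, and 37 is the original token count from the output.

```python
import math


def max_tokens_for_reduction(original_tokens: int, target_reduction: float) -> int:
    """Largest token count that still meets the target fractional reduction."""
    return math.floor(original_tokens * (1 - target_reduction))


def reduction_pct(original_tokens: int, optimized_tokens: int) -> float:
    """Percentage of tokens saved relative to the original."""
    return (original_tokens - optimized_tokens) / original_tokens * 100


original_tokens = 37  # from the output above
token_budget = max_tokens_for_reduction(original_tokens, 0.30)
print(f"Rewrite must use at most {token_budget} tokens")
print(f"That is a {reduction_pct(original_tokens, token_budget):.1f}% reduction")
```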

Key Takeaways#

By now, you should understand:

  • What tokenization is and why LLMs need it

  • How different tokenizers work and produce different results

  • Token counts directly impact costs - optimize your prompts!

  • Context windows are measured in tokens - manage them carefully

Practical Applications:#

  1. Cost Optimization: Shorter, clearer prompts = lower costs

  2. Multilingual Apps: Budget more for non-English languages

  3. Context Management: Track tokens to avoid hitting limits
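
As a rough illustration of the cost and context points, the sketch below keeps a running token total against a limit and a price. `TokenBudget` is a hypothetical helper, the 8,192-token limit is an assumed value for an "8K" model, and the $0.03 per 1K price is the figure used in the exercise above. Token counts are passed in as plain integers, so any tokenizer can supply them.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Tracks tokens used against a context limit and an input price."""
    context_limit: int       # maximum tokens the model accepts (assumed)
    price_per_1k_usd: float  # input price per 1,000 tokens
    used: int = 0

    def add(self, n_tokens: int) -> None:
        """Record n_tokens, refusing anything that would exceed the limit."""
        if self.used + n_tokens > self.context_limit:
            raise ValueError(
                f"{self.used + n_tokens} tokens would exceed the "
                f"{self.context_limit}-token limit"
            )
        self.used += n_tokens

    @property
    def remaining(self) -> int:
        return self.context_limit - self.used

    @property
    def cost_usd(self) -> float:
        return self.used / 1000 * self.price_per_1k_usd


budget = TokenBudget(context_limit=8_192, price_per_1k_usd=0.03)
budget.add(54)  # token count of the worked example text
budget.add(37)  # token count of the verbose prompt
print(f"{budget.remaining} tokens left, ${budget.cost_usd:.5f} so far")
```

For multilingual text, the same tracker applies unchanged; only the counts fed into `add` grow.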