Part 2: The Building Blocks · Chapter 5 of 12

Speaking AI's Language - How Computers Read Text

Tokenization: How AI Reads Text

Computers only understand numbers. So how do they read "Hello, world"?

It's like translating English to Morse code, but more complex. Let me show you exactly how AI converts your words into numbers it can understand.

🔢 The Translation Problem

Human sees:
"Hello"
Computer sees:
[15496]
Human sees:
"I love pizza"
Computer sees:
[23, 5937, 48902]

Every word, every letter, every space gets converted to numbers. This process is called tokenization. (The token IDs in this chapter are illustrative - every model's tokenizer assigns its own.)

Tokenization: Breaking Words into Pieces

AI doesn't always see whole words. It breaks text into "tokens" - a bit like syllables, though the splits follow how common letter sequences are, not pronunciation:

"Understanding" might become:
→ ["Under", "stand", "ing"]
→ [8721, 1843, 292]
"Unbelievable" might become:
→ ["Un", "believ", "able"]
→ [642, 11237, 1490]
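
Want to see real splits instead of illustrative ones? Here's a minimal Python sketch using OpenAI's open-source tiktoken library (assumes pip install tiktoken; cl100k_base is one real encoding, and other models split differently):

import tiktoken

# Load one real tokenizer; every model family ships its own.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["Understanding", "Unbelievable"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # the text behind each ID
    print(word, "->", pieces, ids)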

Why Not Just Use Letters?

You might wonder: why not just use A=1, B=2, etc.?

The Problem with Letters

"Cat" = [3, 1, 20]
"Act" = [1, 3, 20]
Same numbers in a different order - but completely different meanings!
Plus: a letter carries no meaning by itself - the "C" in "Cat" and the "C" in "Ocean" don't even sound alike

The Token Solution

"Cat" = [8934]
"Act" = [2341]
"Ocean" = [54209]
Each meaningful unit gets its own unique number!

The Dictionary: AI's Vocabulary

Every AI model has a fixed vocabulary - typically 30,000 to 100,000 tokens (GPT-2 used about 50,000; many newer models use around 100,000):

Token Dictionary Example

Token 1: "the"
Token 2: " and" (notice the space!)
Token 3: "ing"
Token 4: " is"
Token 5: "er"
...
Token 48,293: "pizza"
Token 48,294: "🍕" (yes, emojis too!)
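
You can peek at a real vocabulary yourself. A minimal sketch with the tiktoken library (the entries listed above are illustrative; cl100k_base is one real encoding with roughly 100,000 tokens):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)         # total vocabulary size (about 100,000 here)
print(enc.decode([1000]))  # the text one particular token ID maps back to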

Real Example: How ChatGPT Sees Your Message

You type:

"I'm thinking about adopting a cat! 🐱"

ChatGPT sees:

"I" → [40]
"'m" → [2846]
" thinking" → [7831]
" about" → [9274]
" adopt" → [3556]
"ing" → [292]
" a" → [102]
" cat" → [8923]
"!" → [341]
" 🐱" → [49281]
Final sequence:
[40, 2846, 7831, 9274, 3556, 292, 102, 8923, 341, 49281]
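
The IDs above are illustrative, but you can reproduce this kind of breakdown for real. A short sketch, again assuming tiktoken is installed:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "I'm thinking about adopting a cat! 🐱"

# Decode each token ID on its own to see the exact pieces the model reads.
for token_id in enc.encode(text):
    print(token_id, repr(enc.decode([token_id])))

# Note: the emoji spans several bytes, so decoding its tokens one at a
# time can print partial characters (shown as the replacement symbol).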

Word Embeddings: Mapping Meaning in Space

Here's where it gets interesting. Each token isn't just a number - the model learns a set of coordinates (a vector) for it, so every token has a location in "meaning space":

Imagine a Map of Words

Close together (similar meanings):

  • • "King" and "Queen" are neighbors
  • • "Happy" and "Joyful" are near each other
  • • "Cat" and "Kitten" are close

Far apart (different meanings):

  • • "King" and "Bicycle" are distant
  • • "Happy" and "Thermometer" are far
  • • "Cat" and "Mathematics" are separated
🧮 The Famous Word Math

This actually works in AI:

King - Man + Woman = Queen
Paris - France + Japan = Tokyo
Puppy - Dog + Cat = Kitten

How? Because words are stored as coordinates in space - and coordinates can be added and subtracted!
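
Here's a toy sketch of that arithmetic, using made-up three-number coordinates (real models use hundreds of dimensions; these vectors are invented purely for illustration):

import numpy as np

# Tiny hand-made "embeddings" - illustrative coordinates, not from a model.
vecs = {
    "king":    np.array([0.9, 0.8, 0.1]),
    "man":     np.array([0.5, 0.2, 0.1]),
    "woman":   np.array([0.5, 0.2, 0.9]),
    "queen":   np.array([0.9, 0.8, 0.9]),
    "bicycle": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    # Directional similarity: 1.0 means "pointing exactly the same way".
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = max(vecs, key=lambda w: cosine(vecs[w], target))
print(nearest)  # queen

With these toy numbers the arithmetic lands exactly on "queen"; in a real model it only lands nearby, and you take the closest word.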

Languages: Same Concepts, Different Tokens

English: "Love"
→ Token 5937
Spanish: "Amor"
→ Token 8234
Japanese: "愛"
→ Token 42311
French: "Amour"
→ Token 7625

But in meaning space, they all occupy similar positions! This is why AI can translate between languages.
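
You can test this with a multilingual embedding model. A sketch using the sentence-transformers library (assumes pip install sentence-transformers; the model named below is one real multilingual option and downloads on first use):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
words = ["love", "amor", "愛", "amour", "bicycle"]
embeddings = model.encode(words)

# Cosine similarity of every word against English "love".
for word, emb in zip(words, embeddings):
    print(word, float(util.cos_sim(embeddings[0], emb)))

The four "love" words should score noticeably closer to each other than "bicycle" does.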

Special Tokens: AI's Stage Directions

AI uses special tokens like stage directions in a play:

[START] - Begin generating text
[END] - Stop generating
[PAD] - Fill empty space
[MASK] - Hidden word (for training)
[SEP] - Separate different sections
[USER] - Human is speaking
[ASSISTANT] - AI is responding
Example conversation in tokens:
[START][USER] What's 2+2? [SEP][ASSISTANT] 2+2 equals 4. [END]
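
The bracketed names above are illustrative - every model family defines its own markers. One real example you can try: the cl100k_base encoding reserves <|endoftext|> as a special token, and tiktoken makes you opt in before it will encode one (a safety default, so ordinary user text can't smuggle control tokens in):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# By default, special tokens in input text are rejected:
# enc.encode("<|endoftext|>")  # raises ValueError
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # a single reserved token ID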

The Tokenization Challenge: Different Languages

English is relatively easy. Other languages are harder:

Language   Word            Token Count
English    "Hello"         1 token
Chinese    "你好"           2-3 tokens
Arabic     "مرحبا"          3-4 tokens
Emoji      "😀"             1 token
Code       "function()"    2-3 tokens

This is why AI sometimes struggles with non-English languages - they need more tokens for the same meaning!
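
You can reproduce the table above yourself. A minimal sketch with tiktoken (exact counts depend on which encoding you load):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello",
    "Chinese": "你好",
    "Arabic":  "مرحبا",
    "Emoji":   "😀",
    "Code":    "function()",
}

# Same kind of meaning, very different token counts.
for language, text in samples.items():
    print(language, repr(text), "->", len(enc.encode(text)), "token(s)")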

🎯 Hands-On: Play with Tokenization

Try This Tokenizer Experiment:

1. Go to: the OpenAI Tokenizer at platform.openai.com/tokenizer (free online tool)

Search for "OpenAI Tokenizer" to find the tool

2. Type: "Hello world"

See: 2 tokens

3. Type: "Supercalifragilisticexpialidocious"

See: Gets broken into chunks

4. Type: "你好世界" (Hello world in Chinese)

See: More tokens than English

5. Type: Some emojis

See: Each emoji is often a single token (some take 2-3)

Interesting Discoveries:

  • Spaces matter! "Hello" vs " Hello" are different tokens (see the sketch after this list)
  • Capital letters often create new tokens
  • Common words = fewer tokens
  • Rare words = broken into pieces
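
Here's a quick sketch that confirms the space and capitalization discoveries in code (tiktoken, cl100k_base encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A leading space is part of the token itself, so the IDs differ.
print(enc.encode("Hello"))
print(enc.encode(" Hello"))
print(enc.encode("hello"))  # capitalization changes the token too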

Why This Matters for You

Understanding tokens helps you:

  • Write better prompts - fewer tokens means faster responses
  • Understand costs - APIs charge per token, for input and output alike (see the cost sketch below)
  • Debug weird behavior - odd word splits sometimes explain odd outputs
  • Optimize performance - concise prompts leave more of the context window for the answer
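
For the cost point, here's a minimal estimator sketch. The price constant is a placeholder, not a real quote - check your provider's current per-token rates:

import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.0005  # hypothetical USD rate - NOT real pricing

enc = tiktoken.get_encoding("cl100k_base")

def estimate_input_cost(prompt: str) -> float:
    # Rough input-side cost for one prompt; output tokens bill separately.
    return len(enc.encode(prompt)) / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(estimate_input_cost("I'm thinking about adopting a cat!"))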

🎓 Key Takeaways

  • Computers only understand numbers - text must be converted to tokens
  • Tokenization breaks words into pieces - not always full words
  • Embeddings create meaning space - similar words are close together
  • Word math actually works - King - Man + Woman = Queen
  • Languages use different token counts - current tokenizers are usually most efficient on English
  • Tokens matter for cost and performance - understand them to optimize

Ready to Understand Neural Networks?

In Chapter 6, discover how neural networks work through simple Lego-block analogies - the architecture behind AI's "brain"!

Continue to Chapter 6