One of the cardinal rules of investing is: if it’s too good to be true, it probably is. When it comes to AI, I don’t think it falls in the “too good to be true” category. Yet. But it’s damn good. And I use it all the time. Still, I don’t want to be blind and naive about its limitations. And while I want to understand what’s happening under the hood, I have no appetite for the super technical details. Andrej Karpathy was on the founding team of OpenAI and has the most incredible video primer on how LLMs work. However, it’s 3 hours and 30 minutes long and I don’t think most people will put in the work. (Your loss.) But it inspired me to use Karpathy’s framework to explain to smart (and non-technical) professionals what’s really happening when you engage with ChatGPT.

Meet the Most Knowledgeable Person in the World

This is Aiepon (pronounced “AY-uh-pon”), who shares her name with OpenAI.
Aiepon will go from a tiny child to the Most Knowledgeable Person in the World. It turns out that her learning journey will closely resemble how LLMs are built and trained for us to use. We’ll cover:

Phase 1: Pre-training (“The Sponge”)
Phase 2: Supervised Fine-Tuning (“The Apprentice”)
Phase 3: Inference (“The Wise Sage”)
Phase 1: Pre-training and “The Sponge”

From a young age, Aiepon has always been a sponge. A bookworm. An Internet nerd. A voracious learner. And a perceptive observer of the world around her.
Your favorite LLM works the same way. During the Pre-Training Phase, it ingests billions of documents from across the Internet, including:

- Websites and web crawls
- Books and academic papers
- Wikipedia
- Code repositories
Try it yourself: visit huggingface.co to browse the open-source datasets used to train these models. One example is the IMDB Large Movie Review Dataset: 25,000 highly polar movie reviews for training, and 25,000 for testing.
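If you want to poke at that dataset programmatically, here’s a minimal sketch using Hugging Face’s datasets library, assuming you’ve run pip install datasets:

```python
# Load the IMDB movie review dataset mentioned above and peek at one example.
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)                      # shows the train/test splits and their sizes
sample = imdb["train"][0]
print(sample["text"][:200])      # the first 200 characters of one review
print(sample["label"])           # 0 = negative, 1 = positive
```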
These datasets are massive and must be cleaned up and distilled before the model can be trained.

From Words to Tokens: Breaking Language into Smaller Chunks

But Aiepon doesn’t just memorize entire textbooks or academic papers. She breaks them into digestible chapters, sections, flash cards and cheat sheets so that the information becomes easier to process, access and retain.
LLMs have a similar process called tokenization. ChatGPT doesn't process full sentences or paragraphs at once. It breaks language down into tokens: smaller units that might be words, parts of words, or even punctuation. For example: “ChatGPT is amazing!” might be broken into something like: "Chat", "GPT", " is", " amazing", "!" This tokenization is crucial for a few reasons:

- It gives the model a fixed, manageable vocabulary to work with
- It handles rare or made-up words by breaking them into familiar pieces
- It keeps the underlying math efficient
There are free websites that let you try out tokenization (like the OpenAI Tokenizer):
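And if you’d rather tokenize in code, OpenAI’s open-source tiktoken library does the same thing as the web tool. A minimal sketch, assuming pip install tiktoken:

```python
import tiktoken

# cl100k_base is the tokenizer family used by GPT-4-era OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("ChatGPT is amazing!")
print(tokens)                              # a short list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the text chunk each ID maps back to
print(len(tokens), "tokens")
```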
According to OpenAI, a helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. This translates to roughly ¾ of a word (so 100 tokens ≈ 75 words). What’s interesting is that “KHE” (an uncommon word) is the only sequence represented by two tokens. You’ll often hear about tokens in the context of “context windows”: the amount of information that can go into a prompt. Here’s a quick table of token sizes:
(And for context, GPT-4o has a context window of 128,000 tokens, so one novel and 5 research papers.)

Neural Networks: The Architecture of the Brain

It’s not enough for Aiepon to have all this information in her brain. She needs a way of connecting it all in a cohesive manner. She hunts for patterns, spotting how words and sentences fit together. For example, she notices that the phrase “I love” is often followed by something nice like “dogs” or “ice cream.” She quizzes herself by covering portions of text and guessing what comes next. She realizes that a phrase like “The cat and the ____” is usually followed by the word “hat.” And that “Donald” is usually followed by “J. Trump” or “Duck.” With repetition, her guesses get better and better. And finally, she puts together a mental map of all the knowledge, recognizing that “cat” and “dog” are both pets and that “sadness” and “anger” are both negative emotions.
At the heart of an LLM is a neural network: a mathematical structure loosely inspired by our brains. Picture it as a vast web of interconnected nodes (or "neurons") organized in layers. Each neuron is like a tiny decision-maker that takes some input, applies a weight (a number that gets adjusted during training), and passes its output to the next layer. Modern AI systems have billions of these adjustable "dials" (called parameters). During training, the system tweaks these dials to get better at predicting text. This approach works particularly well with sequential data (like text) since it focuses on the relationships between words, no matter how far apart they are in a sentence. Here are a few examples of what’s happening during this step:
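For instance, you can sketch the “cover the text and guess what comes next” game with a toy counting model. This is a drastic simplification (a real model learns billions of weights via gradient descent, not word-pair counts), and the tiny corpus below is made up:

```python
from collections import Counter, defaultdict

# Toy "pre-training" corpus (a real model ingests billions of documents)
corpus = "i love dogs . i love ice cream . the cat and the hat".split()

# Count how often each word follows each other word (a bigram model)
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Turn raw counts into probabilities over the possible next words
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(predict_next("love"))  # {'dogs': 0.5, 'ice': 0.5}
print(predict_next("the"))   # {'cat': 0.5, 'hat': 0.5}
```

A real neural network does something far richer: instead of literal word pairs, it learns abstract patterns that generalize to sentences it has never seen. But the objective is the same: predict what comes next.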
The neural network phase of pre-training is massive. GPT-4 reportedly cost over $100 million to train over a 3-month period. The energy used to train these models could power hundreds of homes for a year.

Be careful of "knowledge cutoff" dates

Because pre-training happens at a fixed point in time, the model only knows about the world up to the date its training data was collected. Ask about anything after that and it’s flying blind.
The world’s best autocomplete

At this point, Aiepon has essentially become the world's best autocomplete. She can predict with uncanny accuracy what words should come next in any sequence.
But while she's brilliant at completing sentences, she still lacks the judgment to determine what's actually helpful. This brings us to the second phase: learning not just what she could say, but what she should say.

Phase 2: Supervised Fine-Tuning and “The Apprentice”

Despite being extremely well read, Aiepon begins to experience the limitations of raw knowledge. Yes, she “knows” all the source material, but she rambles, goes off topic and sometimes makes things up. In Phase 2, Supervised Fine-Tuning (aka SFT), the LLM learns to provide useful outputs with the help of human feedback and training.
Supervised fine-tuning, explained

SFT is the bridge between a raw, pre-trained model and one that can actually follow instructions. And guess who plays a critical role in this phase? Humans. Ah yes, the irony. SFT involves human experts creating thousands of “sample pairings” of prompt + ideal response to guide the model into producing more useful and coherent outputs.

Meet the human annotators

These annotators are often contractors with specific domain expertise, and they write out example prompts, such as:
A typical LLM would use between 10,000 and 100,000 pairings (once again, all created by humans), a tiny fraction of the size of the pre-training data. An example prompt + ideal response pairing:

Prompt: “Can you help me plan a weekend trip to a nearby city?”

Response: “I’d love to help! Could you share where you’re located or what kind of vibe you’re looking for—like a cozy small town, a bustling city, or somewhere with great hiking? That’ll help me suggest the perfect spot.”

Behind the scenes:
These prompts are then paired with a specific (human-generated) response, and the goal is to prioritize clarity, helpfulness and formatting in the responses. Compared to the pre-training phase, SFT is surprisingly straightforward. The model continues to “autocomplete” (i.e. predict the next token) but now it has an additional “instruction following” lens. This phase is significantly cheaper (in both time and GPU usage) and delivers substantial improvements with just a fraction of the compute power. The key challenge isn’t the technical implementation; it’s assembling the best dataset covering the widest range of human interactions.
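To make that concrete, here’s roughly what a single SFT training record looks like, using the trip-planning example above. The field names are hypothetical placeholders, not any provider’s actual schema:

```python
# One supervised fine-tuning record, expressed as a Python dict.
# The keys ("prompt", "ideal_response") are illustrative placeholders;
# real pipelines use provider-specific formats (often JSONL chat messages).
sft_example = {
    "prompt": "Can you help me plan a weekend trip to a nearby city?",
    "ideal_response": (
        "I'd love to help! Could you share where you're located or what kind "
        "of vibe you're looking for—like a cozy small town, a bustling city, "
        "or somewhere with great hiking? That'll help me suggest the perfect spot."
    ),
}

# An SFT dataset is just tens of thousands of records like this one.
# Training still means "predict the next token," but the loss is typically
# computed only on the response text, so the model learns to answer,
# not to echo prompts.
print(sft_example["prompt"])
```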
A quick sidebar: Reinforcement Learning

Aiepon is about to throw us a little curve ball: stand-up comedy. She’s going to try out a bunch of jokes in small clubs to see how the audience responds. After each set, she’ll review the feedback and keep only the jokes that landed.

This mirrors the next sub-step of SFT, Reinforcement Learning from Human Feedback (aka RLHF). Here’s how the process works:

1. The model generates several candidate responses to the same prompt.
2. Human evaluators rank those responses from best to worst.
3. Those rankings train a separate “reward model” that learns to predict which responses humans prefer.
4. The LLM is then optimized to produce responses the reward model scores highly.
Reinforcement Learning is an automated version of trial and error. It starts with human evaluators, but over time the LLM takes over, learning what humans value most in a conversation.
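Here’s a toy sketch of that trial-and-error loop in Python. The jokes, the simulated audience scores and the “keep the top half” rule are all made up for illustration; real RLHF replaces the scoring function with a learned reward model and actually updates the model’s weights:

```python
import random

random.seed(7)  # reproducible simulated "audiences"

jokes = [
    "Why did the neural network cross the road?",
    "A token walks into a context window...",
    "My cat understands me better than my chatbot.",
    "Two GPUs walk into a data center...",
]

def audience_score(joke):
    # HYPOTHETICAL stand-in for human feedback: real RLHF trains a separate
    # reward model on human rankings, then uses it to score new outputs.
    return sum(random.randint(1, 10) for _ in range(20)) / 20

# Trial and error: score every joke, keep the top half for the next set
ranked = sorted(jokes, key=audience_score, reverse=True)
keep = ranked[: len(ranked) // 2]
print("Keep for next set:", keep)
```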
What made DeepSeek so unique?

Remember how this Chinese AI startup came out of nowhere and made massive strides against its more established US competitors?

Phase 3: Inference and “The Wise Sage”

In our last phase, Aiepon is now ready to set sail and share her knowledge in the real world.
In her journey from sponge to wise sage, she no longer regurgitates memorized answers. Now, she creatively crafts new responses by assessing the context of a particular situation. For LLMs, this final process is called Inference. It’s the moment of truth where the model interacts with users in real time, generating responses based on its pre-trained knowledge and fine-tuned skills.

The input stage and context window

As you type your prompt, the LLM works out which pieces of information it needs to act on. The context window means the model can only “see” a certain amount of text at once. The model is constantly managing this window, deciding what to keep and what to discard.

The limitations of the context window
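One practical limitation: when a conversation outgrows the window, something has to be discarded. Here’s a toy sketch of that bookkeeping; the 50-token budget is made up, and word count stands in for a real tokenizer:

```python
CONTEXT_WINDOW = 50  # toy budget; GPT-4o's real window is 128,000 tokens

conversation = []  # (speaker, text) turns, oldest first

def add_turn(speaker, text):
    conversation.append((speaker, text))
    # Hypothetical token counter: word count as a rough stand-in
    # (recall the rule of thumb: a token is roughly 3/4 of an English word)
    while sum(len(t.split()) for _, t in conversation) > CONTEXT_WINDOW:
        old_speaker, old_text = conversation.pop(0)  # discard the oldest turn
        print(f"Dropped {old_speaker} turn from context: {old_text[:25]}...")

add_turn("user", "please remember my favorite color is blue " + "filler " * 25)
add_turn("assistant", "noted " * 30)  # budget exceeded -> the user turn is dropped
```

Which is exactly why a long chat can “forget” something you said at the start.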
The processing stage and token prediction

Armed with the context window, the model rapidly processes your prompt token by token to assess the next token (via a process called the attention mechanism). It then generates the output iteratively, predicting one token at a time. The model doesn’t just pick one answer; it creates a probability distribution over possible next tokens, which is why the same prompt often yields a different response.

The generation stage

Finally, words are generated one at a time in a feedback loop: each token the model produces becomes part of the input used to predict the next one. This happens in milliseconds, adding to the computational demands of the entire process. While still far less demanding than the pre-training phase, a single complex query can use about the same electricity as charging your smartphone!
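You can see why identical prompts diverge with a toy version of the sampling step. The candidate tokens and their probabilities below are invented for illustration:

```python
import random

# Made-up probability distribution over candidate next tokens
# (a real model computes this over a vocabulary of ~100,000 tokens)
next_token_probs = {" dogs": 0.45, " ice": 0.30, " my": 0.15, " you": 0.10}

def sample_next_token(probs):
    # Pick a token according to its probability, not always the top one,
    # which is why the same prompt can yield different responses
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "I love"
for _ in range(3):
    print(prompt + sample_next_token(next_token_probs))
# Possible output: "I love dogs", "I love ice", "I love dogs" -- it varies by run
```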
Tying it all together

Aiepon is all grown up and is now a wise sage… your wise sage.

But that doesn’t make her perfect. And it’s important to remember that while AI feels like magic, it’s just a bunch of math: pattern-matching on an incomprehensible scale. It’s not sentient, nor does it have consciousness. I hope this tutorial gives you a more robust understanding of the architecture of an LLM. Understanding this helps you make better decisions about AI’s use cases and limitations. Since you’ve made it this far, it’s clear that you want to become more AI-adaptive and resilient. I encourage you to ponder these 5 questions:
See you next week!

Khe

PS Did a friend send this to you? Sign up for our free newsletter.