Tokens, transformers, context windows, why LLMs hallucinate, and how to choose between Claude, GPT, and open-source models — explained in plain English.
You don't need to read research papers to use LLMs effectively. But you do need a working mental model of what they actually do — otherwise you'll fight the tool instead of using it.
This tutorial gives you that mental model in plain English. No math, no jargon you don't need.
At their core, large language models do one thing: given some text, predict the next chunk of text most likely to follow.
That chunk is called a token. A token is roughly 3–4 characters of English. The word "tokenization" is itself two tokens: "token" and "ization." The phrase "Hello world" is also two tokens.
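When you need a quick token budget without loading a real tokenizer, a character-count heuristic is good enough. This is a rough sketch — the 4-characters-per-token ratio is only a rule of thumb, and real counts vary by tokenizer and language:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters of English per token.
    Only for budgeting; the model's own tokenizer gives the real count."""
    return max(1, len(text) // 4)
```

For billing-accurate numbers, use the provider's tokenizer or token-counting endpoint instead.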
Pseudocode for what an LLM does:
def llm_generate(prompt, max_tokens=500):
    output = ""
    for _ in range(max_tokens):
        next_token = predict_next(prompt + output)  # The "AI magic"
        if next_token == END_TOKEN:
            break
        output += next_token
    return output
That's it. The "intelligence" is entirely in the predict_next function — and that function was trained on a huge corpus of text (books, websites, code, etc.). The model has learned statistical patterns of what tokens tend to follow what other tokens.
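To make that loop concrete, here is a toy, runnable version where predict_next is just a bigram lookup over a nine-word corpus instead of a neural network. Everything here is illustrative — the point is only that "learned statistical patterns of what follows what" is enough to generate text:

```python
from collections import Counter

# Toy corpus; a real model trains on trillions of tokens, not nine words.
corpus = "the cat sat on the mat the cat sat".split()

# Count which word follows which -- a tiny "bigram model".
following = {}
for a, b in zip(corpus, corpus[1:]):
    following.setdefault(a, []).append(b)

def predict_next(tokens):
    """Stand-in for the neural network: the most common follower of the last token."""
    candidates = following.get(tokens[-1])
    if not candidates:
        return None  # no data: treat as the end-of-text token
    return Counter(candidates).most_common(1)[0][0]

def llm_generate(prompt, max_tokens=5):
    tokens = prompt.split()
    output = []
    for _ in range(max_tokens):
        nxt = predict_next(tokens + output)
        if nxt is None:
            break
        output.append(nxt)
    return " ".join(output)
```

Starting from "the", this produces "cat sat on the" — fluent-looking text from nothing but follower counts, which is also a preview of why hallucination happens.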
Modern LLMs use an architecture called a transformer. The key trick is the attention mechanism — when predicting the next token, the model can "attend to" any previous token in the input, weighting each by how relevant it is. So if you write "The cat sat on the mat. The animal was happy.", the model can connect "animal" back to "cat" because it pays attention to that earlier token. That's it. Everything else is engineering details.
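Here is a minimal sketch of that attention idea — scaled dot-product attention for a single query vector, with toy two-dimensional vectors standing in for real embeddings:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over earlier tokens."""
    d = len(query)
    # How relevant is each earlier token to the query?
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is a relevance-weighted blend of the earlier tokens' values.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

A query that points toward the first key pulls the output toward the first value — that's "animal" attending back to "cat", in miniature.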
Two numbers matter for any LLM call: input tokens (everything you send — system prompt, conversation history, documents) and output tokens (everything the model generates).
You're billed for both, usually at different rates. Output tokens cost more because generating them takes more compute.
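Cost estimation is then simple arithmetic. The per-million-token prices below are placeholders for illustration, not real quotes — check your provider's pricing page:

```python
def call_cost(input_tokens, output_tokens,
              in_price_per_m=3.00, out_price_per_m=15.00):
    """Dollar cost of one call. Prices are illustrative ($/million tokens);
    note the output rate is several times the input rate."""
    return (input_tokens / 1_000_000 * in_price_per_m
            + output_tokens / 1_000_000 * out_price_per_m)
```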
The context window is the max number of tokens (input + output) the model can handle in one call. In 2026, frontier models commonly offer windows in the hundreds of thousands of tokens.
What this means in practice: a tutorial like this one is about 1,500 tokens. A typical web page is 2,000–5,000 tokens. So a 200K context window is enormous — you can stuff dozens of documents in.
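You can sanity-check that claim with arithmetic. The overhead numbers here are illustrative assumptions, not fixed requirements:

```python
CONTEXT_WINDOW = 200_000  # tokens, input + output combined

def docs_that_fit(doc_tokens, reserved_for_output=4_000, prompt_overhead=1_000):
    """How many documents of a given token size fit in one call,
    after setting aside room for the prompt and the model's answer."""
    budget = CONTEXT_WINDOW - reserved_for_output - prompt_overhead
    return budget // doc_tokens
```

At 5,000 tokens per web page, that budget holds 39 pages in a single call — "dozens of documents" is not an exaggeration.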
Because they're next-token predictors, LLMs always produce plausible-sounding text. They don't have a built-in "I don't know" mechanism. If you ask a question they have no training data for, they'll generate something that looks like a correct answer based on similar-sounding training examples.
This is hallucination, and it's the #1 thing to design around in production: the model will state wrong facts with exactly the same confident tone as right ones.
Mitigations — such as grounding answers in your own retrieved documents (RAG) and validating outputs against trusted sources — are covered in later tutorials.
LLMs are trained in three stages: pretraining (next-token prediction over a huge corpus), supervised fine-tuning (curated instruction/response examples teach it to follow directions), and reinforcement learning from human feedback (human ratings steer it toward helpful, accurate answers).
You almost never need to do any of this yourself. The right pattern in 2026 is: pick a hosted frontier model, prompt it well, augment with your data via RAG. Building or fine-tuning your own LLM is a research project, not a feature shipping next sprint.
Practical decision guide: for a Django app starting out, use Claude Sonnet 4.6 via the Anthropic API. It's easy to integrate, reliable, and plenty good for almost any use case.
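As a sketch of what integration involves, here's how a single-turn request could be assembled over plain HTTP. The endpoint, headers, and model id reflect Anthropic's public API as I understand it, but verify them against the current docs — and in a real Django app you'd more likely use the official `anthropic` Python SDK:

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"  # verify against current docs

def build_claude_request(prompt, api_key, model="claude-sonnet-4-6", max_tokens=500):
    """Assemble (not send) a single-turn request.
    The model id and API version string are assumptions to check."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )
```

Send it with `urllib.request.urlopen(...)` and parse the JSON response; the generated text comes back in the response's content list.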
LLMs are remarkable, statistical, fallible next-token predictors. They have no memory between calls, no real understanding (they're pattern matchers), no awareness of truth. Treat them as such and you'll build solid systems on top of them. Treat them as oracles and your users will catch the hallucinations before you do.
The next tutorial covers a more advanced topic: how reasoning models (Claude with extended thinking, OpenAI's o-series) try to improve on raw next-token prediction by "thinking out loud" before answering — and when that's worth the extra cost.