How LLMs Work: Tokens, Probabilities, and Prompts
A simple explanation of tokenization, probability distributions, pretraining, and inference—how LLMs generate sentences.
LLM at a Glance
An LLM is essentially a powerful “auto‑complete.” When you begin a sentence, the model continues it, filling in one tiny piece (a token) at a time until a full answer emerges. Tokens are often parts of words. Because the model has read a vast amount of text in advance, it develops a sense of “what usually comes next.” So when you ask a question, it keeps choosing the next token, then the next, until the answer is complete.
```mermaid
graph LR
    A[User input] --> B[Tokenization]
    B --> C{Next-token prediction}
    C -- 1: Probability computation --> D[Probability distribution]
    D -- 2: Token selection --> E[New token]
    E --> F{Answer complete?}
    F -- No --> C
    F -- Yes --> G[Decoding]
    G --> H[Final answer output]
```
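To make the loop in the diagram concrete, here is a toy sketch of generating one token at a time. The lookup‑table “model” and its tiny vocabulary are invented purely for illustration; a real LLM computes each probability distribution with a neural network over the whole context.

```python
# Toy illustration of the loop above: tokenize, predict, select, decode.
# The "model" is just a lookup table of next-token probabilities.

TOY_MODEL = {
    # context (last token) -> probability of each possible next token
    "<start>": {"The": 0.6, "A": 0.4},
    "The":     {"cat": 0.5, "dog": 0.3, "sky": 0.2},
    "cat":     {"sat": 0.7, "ran": 0.3},
    "sat":     {".": 0.9, "down": 0.1},
    ".":       {"<end>": 1.0},
}

def generate(max_tokens: int = 10) -> str:
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = TOY_MODEL[tokens[-1]]          # 1: probability computation
        next_token = max(dist, key=dist.get)  # 2: token selection (greedy here)
        if next_token == "<end>":             # answer complete?
            break
        tokens.append(next_token)             # new token, then loop again
    return " ".join(tokens[1:])               # "decoding": tokens back to text

print(generate())  # -> "The cat sat ."
```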
What Determines Generation Quality
Most of it comes down to three ideas:
- How bold it is: Settings like temperature or top‑p can make answers safer (more common phrasing) or more creative (less common phrasing); a small sampling sketch follows the diagram below.
- How much it can keep in mind: The model’s “short‑term memory” (its context window) is limited, so long inputs may need summarization or pruning.
- What it actually knows: The model can’t browse the web and only “remembers” training-time knowledge. Without reliable, up‑to‑date context, it can make things up.
```mermaid
mindmap
  root((Factors affecting generation quality))
    Boldness
      Temperature
      Top-p
    Memory
      Context window
      Summarization/Pruning
    Knowledge scope
      Training cutoff
      Lack of up-to-date info
```
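As a rough sketch of the “boldness” knobs, the snippet below applies temperature and top‑p to a made‑up set of token scores. The vocabulary and scores are invented for illustration, but the two steps mirror how samplers typically reshape and truncate the distribution before picking a token.

```python
import math
import random

def sample_next_token(scores: dict[str, float],
                      temperature: float = 1.0,
                      top_p: float = 1.0) -> str:
    """Toy sketch of temperature + top-p (nucleus) sampling over raw scores."""
    # Temperature rescales the scores: <1 sharpens (safer), >1 flattens (bolder).
    probs = {t: math.exp(s / temperature) for t, s in scores.items()}
    total = sum(probs.values())
    probs = {t: p / total for t, p in probs.items()}

    # Top-p keeps only the smallest set of tokens whose probability sums to top_p.
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break

    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights, k=1)[0]

scores = {"sat": 2.0, "ran": 1.0, "flew": 0.1, "exploded": -2.0}
print(sample_next_token(scores, temperature=0.7, top_p=0.9))
```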
Where RAG Connects
RAG (Retrieval‑Augmented Generation) reduces knowledge limits and hallucinations by retrieving relevant documents first and supplying them to the LLM as context before it answers. In other words, it makes the model “look things up” before responding.
```mermaid
sequenceDiagram
    participant User as User
    participant Retriever as Retriever
    participant LLM as LLM
    User->>Retriever: Ask a question
    Retriever->>Retriever: Retrieve relevant documents
    Retriever->>LLM: Pass retrieved results
    LLM->>User: Generate a context-grounded answer
```
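A minimal sketch of that flow might look like the following. The keyword‑overlap retriever and the `ask_llm` stub are stand‑ins chosen for illustration, not any specific library’s API; in practice the retriever is usually an embedding‑based vector search and `ask_llm` is your actual model call.

```python
# Minimal RAG sketch following the sequence above.

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 6pm.",
    "Shipping to Europe typically takes 5 to 7 business days.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Score each document by how many question words it shares; return top k."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question, DOCS))
    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)  # placeholder: your real LLM call goes here

def ask_llm(prompt: str) -> str:  # stub so the sketch runs end to end
    return f"[LLM would answer based on:\n{prompt}]"

print(answer_with_rag("How long do I have to return an item?"))
```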
An Easy Analogy for the Model
Inside, the model is a giant “pattern matcher.” It compares what you write with patterns it has seen before and selects the most likely continuation. You can imagine multiple “highlighters” that emphasize important parts of the sentence to keep the answer on topic.
```mermaid
graph LR
    A[User input] --> B[Pattern matching: compare with past data]
    B --> C{Select the most likely next token}
    C --> D[Highlighters: emphasize key parts]
    D --> E[Maintain topic & connect context]
    E --> F[Complete the final sentence]
    style A fill:#fdf5e6,stroke:#333,stroke-width:1px
    style B fill:#e6f7ff,stroke:#333,stroke-width:1px
    style C fill:#fff5e6,stroke:#333,stroke-width:1px
    style D fill:#f0ffe6,stroke:#333,stroke-width:1px
    style E fill:#f9e6ff,stroke:#333,stroke-width:1px
    style F fill:#e6ffe9,stroke:#333,stroke-width:1px
```
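The “highlighters” in this analogy correspond to the attention mechanism. Below is a tiny, self‑contained sketch of scaled dot‑product attention using random vectors, just to show how each word ends up weighting the others; the numbers themselves are meaningless and serve only as an illustration.

```python
import numpy as np

# Toy sketch of the "highlighter" idea: scaled dot-product attention.
# Each word asks (query) which other words (keys) are relevant, then mixes
# their information (values) according to those relevance weights.
rng = np.random.default_rng(0)
d = 4                                   # embedding size (tiny, for illustration)
words = ["the", "cat", "sat"]
Q = rng.normal(size=(len(words), d))    # queries
K = rng.normal(size=(len(words), d))    # keys
V = rng.normal(size=(len(words), d))    # values

scores = Q @ K.T / np.sqrt(d)           # relevance of every word to every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax: the "highlighting"
output = weights @ V                    # each word's new, context-aware representation

print(np.round(weights, 2))             # each row sums to 1: how strongly a word attends to the others
```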
Why It Sometimes Goes Wrong
Two common reasons:
- It fills gaps with guesses. If the needed information isn’t in its context (“short‑term memory”) or its training data, it may invent plausible‑sounding details.
- Outdated knowledge. Anything that appeared after training isn’t known unless you provide it.
Simple Tips for Better Answers
Keep these in mind:
- Be explicit about format, length, and tone.
- Provide small examples or snippets to follow.
- Attach policies/facts to reduce guessing.
- Encourage honesty: “If the info isn’t provided, say you don’t know.”
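One way to put these tips into practice is a small prompt‑building helper like the sketch below. The field names and wording are just one possible layout, assumed for illustration, not a required format.

```python
# Bake the tips above into a single prompt string.

def build_prompt(task: str, facts: list[str], example: str,
                 fmt: str = "3 bullet points", tone: str = "neutral") -> str:
    facts_block = "\n".join(f"- {f}" for f in facts) or "- (none provided)"
    return (
        f"Task: {task}\n"
        f"Format: {fmt}. Tone: {tone}.\n"                 # be explicit about format, length, tone
        f"Facts you may rely on:\n{facts_block}\n"         # attach facts to reduce guessing
        f"Example of the style to follow:\n{example}\n"    # small example to imitate
        "If the information you need is not provided above, say you don't know."
    )

print(build_prompt(
    task="Summarize our refund policy for customers.",
    facts=["Returns are accepted within 30 days.", "Refunds take 5 business days."],
    example="- Returns: 30 days\n- Refunds: ~5 business days",
))
```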