How LLMs Work: Tokens, Probabilities, and Prompts
A simple explanation of tokenization, probability distributions, pretraining, and inference—how LLMs generate sentences.
LLM at a Glance
An LLM is essentially a powerful “auto‑complete.” When you begin a sentence, the model continues it, filling in one tiny piece (a token) at a time until a full answer emerges. Tokens are often parts of words. Because the model has read a vast amount of text in advance, it develops a sense of “what usually comes next.” So when you ask a question, it keeps choosing the next token, then the next, until the answer is complete.
```mermaid
graph LR
    A[User input] --> B[Tokenization]
    B --> C{Next-token prediction}
    C -- 1: Probability computation --> D[Probability distribution]
    D -- 2: Token selection --> E[New token]
    E --> F{Answer complete?}
    F -- No --> C
    F -- Yes --> G[Decoding]
    G --> H[Final answer output]
```
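To make the loop in the diagram concrete, here is a toy sketch of generating one token at a time. The lookup‑table “model” and its tiny vocabulary are invented purely for illustration; a real LLM computes each probability distribution with a neural network over the whole context.

```python
# Toy illustration of the loop above: tokenize, predict, select, decode.
# The "model" is just a lookup table of next-token probabilities.

TOY_MODEL = {
    # context (last token) -> probability of each possible next token
    "<start>": {"The": 0.6, "A": 0.4},
    "The":     {"cat": 0.5, "dog": 0.3, "sky": 0.2},
    "cat":     {"sat": 0.7, "ran": 0.3},
    "sat":     {".": 0.9, "down": 0.1},
    ".":       {"<end>": 1.0},
}

def generate(max_tokens: int = 10) -> str:
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = TOY_MODEL[tokens[-1]]          # 1: probability computation
        next_token = max(dist, key=dist.get)  # 2: token selection (greedy here)
        if next_token == "<end>":             # answer complete?
            break
        tokens.append(next_token)             # new token, then loop again
    return " ".join(tokens[1:])               # "decoding": tokens back to text

print(generate())  # -> "The cat sat ."
```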
What Determines Generation Quality
Most of it comes down to three ideas:
- How bold it is: Settings like temperature or top‑p can make answers safer (more common phrasing) or more creative (less common phrasing); a small sampling sketch follows the diagram below.
- How much it can keep in mind: The model’s “short‑term memory” (its context window) is limited, so long inputs may need summarization or pruning.
- What it actually knows: The model can’t browse the web and only “remembers” training-time knowledge. Without reliable, up‑to‑date context, it can make things up.
```mermaid
mindmap
  root((Factors affecting generation quality))
    Boldness
      Temperature
      Top-p
    Memory
      Context window
      Summarization/Pruning
    Knowledge scope
      Training cutoff
      Lack of up-to-date info
```
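As a rough sketch of the “boldness” knobs, the snippet below applies temperature and top‑p to a made‑up set of token scores. The vocabulary and scores are invented for illustration, but the two steps mirror how samplers typically reshape and truncate the distribution before picking a token.

```python
import math
import random

def sample_next_token(scores: dict[str, float],
                      temperature: float = 1.0,
                      top_p: float = 1.0) -> str:
    """Toy sketch of temperature + top-p (nucleus) sampling over raw scores."""
    # Temperature rescales the scores: <1 sharpens (safer), >1 flattens (bolder).
    probs = {t: math.exp(s / temperature) for t, s in scores.items()}
    total = sum(probs.values())
    probs = {t: p / total for t, p in probs.items()}

    # Top-p keeps only the smallest set of tokens whose probability sums to top_p.
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break

    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights, k=1)[0]

scores = {"sat": 2.0, "ran": 1.0, "flew": 0.1, "exploded": -2.0}
print(sample_next_token(scores, temperature=0.7, top_p=0.9))
```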
Where RAG Connects
RAG (Retrieval‑Augmented Generation) reduces knowledge limits and hallucinations by retrieving relevant documents first and supplying them to the LLM as context before it answers. In other words, it makes the model “look things up” before responding.
```mermaid
sequenceDiagram
    participant User as User
    participant Retriever as Retriever
    participant LLM as LLM
    User->>Retriever: Ask a question
    Retriever->>Retriever: Retrieve relevant documents
    Retriever->>LLM: Pass retrieved results
    LLM->>User: Generate a context-grounded answer
```
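A minimal sketch of that flow might look like the following. The keyword‑overlap retriever and the `ask_llm` stub are stand‑ins chosen for illustration, not any specific library’s API; in practice the retriever is usually an embedding‑based vector search and `ask_llm` is your actual model call.

```python
# Minimal RAG sketch following the sequence above.

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 6pm.",
    "Shipping to Europe typically takes 5 to 7 business days.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Score each document by how many question words it shares; return top k."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question, DOCS))
    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)  # placeholder: your real LLM call goes here

def ask_llm(prompt: str) -> str:  # stub so the sketch runs end to end
    return f"[LLM would answer based on:\n{prompt}]"

print(answer_with_rag("How long do I have to return an item?"))
```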
An Easy Analogy for the Model
Inside, the model is a giant “pattern matcher.” It compares what you write with patterns it has seen before and selects the most likely continuation. You can imagine multiple “highlighters” that emphasize important parts of the sentence to keep the answer on topic.
```mermaid
graph LR
    A[User input] --> B[Pattern matching: compare with past data]
    B --> C{Select the most likely next token}
    C --> D[Highlighters: emphasize key parts]
    D --> E[Maintain topic & connect context]
    E --> F[Complete the final sentence]
    style A fill:#fdf5e6,stroke:#333,stroke-width:1px
    style B fill:#e6f7ff,stroke:#333,stroke-width:1px
    style C fill:#fff5e6,stroke:#333,stroke-width:1px
    style D fill:#f0ffe6,stroke:#333,stroke-width:1px
    style E fill:#f9e6ff,stroke:#333,stroke-width:1px
    style F fill:#e6ffe9,stroke:#333,stroke-width:1px
```
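The “highlighters” in this analogy correspond to the attention mechanism. Below is a tiny, self‑contained sketch of scaled dot‑product attention using random vectors, just to show how each word ends up weighting the others; the numbers themselves are meaningless and serve only as an illustration.

```python
import numpy as np

# Toy sketch of the "highlighter" idea: scaled dot-product attention.
# Each word asks (query) which other words (keys) are relevant, then mixes
# their information (values) according to those relevance weights.
rng = np.random.default_rng(0)
d = 4                                   # embedding size (tiny, for illustration)
words = ["the", "cat", "sat"]
Q = rng.normal(size=(len(words), d))    # queries
K = rng.normal(size=(len(words), d))    # keys
V = rng.normal(size=(len(words), d))    # values

scores = Q @ K.T / np.sqrt(d)           # relevance of every word to every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax: the "highlighting"
output = weights @ V                    # each word's new, context-aware representation

print(np.round(weights, 2))             # each row sums to 1: how strongly a word attends to the others
```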
Why It Sometimes Goes Wrong
Two common reasons:
- It fills gaps with guesses. If the needed information isn’t in its context (“short‑term memory”) or its training data, it may invent plausible‑sounding details.
- Outdated knowledge. Anything that appeared after training isn’t known unless you provide it.
Simple Tips for Better Answers
Keep these in mind:
- Be explicit about format, length, and tone.
- Provide small examples or snippets to follow.
- Attach policies/facts to reduce guessing.
- Encourage honesty: “If the info isn’t provided, say you don’t know.”
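One way to put these tips into practice is a small prompt‑building helper like the sketch below. The field names and wording are just one possible layout, assumed for illustration, not a required format.

```python
# Bake the tips above into a single prompt string.

def build_prompt(task: str, facts: list[str], example: str,
                 fmt: str = "3 bullet points", tone: str = "neutral") -> str:
    facts_block = "\n".join(f"- {f}" for f in facts) or "- (none provided)"
    return (
        f"Task: {task}\n"
        f"Format: {fmt}. Tone: {tone}.\n"                 # be explicit about format, length, tone
        f"Facts you may rely on:\n{facts_block}\n"         # attach facts to reduce guessing
        f"Example of the style to follow:\n{example}\n"    # small example to imitate
        "If the information you need is not provided above, say you don't know."
    )

print(build_prompt(
    task="Summarize our refund policy for customers.",
    facts=["Returns are accepted within 30 days.", "Refunds take 5 business days."],
    example="- Returns: 30 days\n- Refunds: ~5 business days",
))
```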