How a Model Reads Your Input
The first step toward prompt engineering is developing an accurate mental model of what happens between the moment you press send and the moment the first token of a response appears. The gap between how this process is popularly described and how it actually works is significant, and closing that gap fundamentally changes how you think about prompt construction.
Tokenisation
Before a language model processes your text, it converts it into a sequence of tokens. Tokens are not words. They are subword units drawn from a vocabulary constructed during training, using algorithms such as Byte Pair Encoding (BPE) or the unigram model implemented in SentencePiece. Common words are typically represented as a single token. Uncommon words, technical terms, proper nouns, and non-English text are often split across multiple tokens.
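The merge behaviour described above can be sketched in a few lines. This is an illustrative toy, not any production tokenizer: the merge table below is invented for the example, whereas real tokenizers learn tens of thousands of merge rules from their training corpus.

```python
def bpe_tokenize(word, merges):
    """Greedily apply learned BPE merge rules to a character sequence.

    merges maps an adjacent pair of tokens to its merge rank
    (lower rank = learned earlier = applied first).
    """
    tokens = list(word)
    while True:
        # Find the best-ranked adjacent pair that has a merge rule.
        best = None
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            return tokens
        i, pair = best
        tokens = tokens[:i] + [pair[0] + pair[1]] + tokens[i + 2:]

# Hypothetical merge table for illustration only.
merges = {("t", "h"): 0, ("th", "e"): 1, ("i", "n"): 2, ("k", "in"): 3}

print(bpe_tokenize("the", merges))       # -> ['the']: common word, one token
print(bpe_tokenize("thinking", merges))  # -> ['th', 'in', 'kin', 'g']: rarer word, several tokens
```

Note how the common word collapses to a single token while the rarer word stays fragmented: this is exactly the frequency effect described above, emerging purely from which merges the vocabulary happens to contain.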
This has several practical implications. First, the model does not read your prompt the way a human reads it. There are no sentences or paragraphs in the model's representation of your input, only a sequence of token IDs passed through an embedding layer. The semantic structure you perceive when you read your own prompt must be recovered by the model through learned statistical patterns, not through syntactic parsing.
Second, token boundaries can affect model behaviour in ways that are not always predictable. Splitting a technical term across tokens can subtly reduce the model's confidence in applying knowledge associated with that term. Writing certain types of structured output, such as code or formatted tables, is more reliable when the expected format matches patterns the model encountered frequently during training, because those patterns are reinforced by token-level statistics.
Third, token count is the operative unit for context window limits, not word count or character count. A useful approximation is that one token equals approximately four characters of English text, or roughly three quarters of a word. However, this ratio varies significantly for non-English languages, specialised vocabulary, and code, where the number of tokens per character can be substantially higher.
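The four-characters-per-token rule of thumb can be turned into a rough budget check. The numbers here come directly from the approximation above; they are estimates, not measurements from any specific tokenizer, and the `reserved_for_output` figure is an arbitrary placeholder you should tune to your own use case.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token-count estimate for English prose (~4 chars per token)."""
    return max(1, round(len(text) / chars_per_token))

def fits_in_context(text, context_limit, reserved_for_output=1000):
    """Check whether a prompt likely fits, leaving room for the response."""
    return estimate_tokens(text) + reserved_for_output <= context_limit

prompt = "Summarise the quarterly report in three bullet points."
print(estimate_tokens(prompt))        # roughly len(prompt) / 4
print(fits_in_context(prompt, 8192))  # True: a short prompt fits easily
```

Treat the result as a sanity check, not a guarantee: for code or non-English text, the real token count can exceed this estimate considerably, so the only reliable number comes from the actual tokenizer.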
The Forward Pass and Probability Distribution
Once your input is tokenised and embedded, the model processes it through a deep neural network consisting of stacked transformer blocks. Each block applies a self-attention mechanism followed by a feed-forward network. The self-attention mechanism computes, for each token in the sequence, a weighted sum of representations from all other tokens. The weights, called attention scores, reflect how relevant each token is to the current one given the context of the full sequence.
This is the mechanism by which the model develops a contextualised representation of your input. The word "bank" in the phrase "river bank" receives different attention-weighted context than "bank" in "central bank," because the surrounding tokens shift the attention distribution toward semantically relevant parts of the input.
After processing through all transformer blocks, the model produces a probability distribution over its full vocabulary for the next token to generate. This distribution reflects the model's learned beliefs, given everything in the context window, about what token is most likely to follow. The model then samples from this distribution according to its sampling parameters, such as temperature, and the process repeats for each subsequent token until the response is complete.
The key insight here is that the model is not retrieving a pre-formed answer. It is constructing a response token by token, where each token is conditioned on everything that came before it, including your entire prompt and all tokens generated so far. The quality of your prompt determines the quality of the probability distribution the model starts from. A precise, well-contextualised prompt shifts the distribution toward high-quality outputs from the very first token.
Temperature, Top-P, and Sampling
The sampling parameters exposed in most AI interfaces, including GLBNXT Workspace, directly control how the model selects tokens from the probability distribution.
Temperature scales the distribution before sampling. A temperature of 1.0 samples from the raw distribution. Values below 1.0 sharpen the distribution, making high-probability tokens more likely and suppressing low-probability ones. This produces more deterministic, conservative output. Values above 1.0 flatten the distribution, increasing diversity and creativity at the cost of coherence and reliability.
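The scaling described above is simply a division of the raw logits by the temperature before the softmax normalisation. This sketch uses four invented logit values to show the sharpening and flattening effects; real vocabularies contain tens of thousands of entries, but the mechanics are identical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalise into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical next-token logits

print(softmax_with_temperature(logits, 1.0))  # raw distribution
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more diversity
```

At temperature 0.5 the most likely token captures noticeably more probability mass than at 1.0; at 2.0 the distribution moves toward uniform. As temperature approaches zero, sampling degenerates into always picking the single most likely token, which is why low temperatures produce near-deterministic output.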
Top-P sampling (also called nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability exceeds a threshold P. Setting top-P to 0.9 means the model only samples from tokens that together account for 90% of the probability mass, ignoring low-probability tail tokens regardless of temperature.
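The nucleus-filtering step can be sketched as follows. The five-token distribution below is invented for illustration; the function keeps the highest-probability tokens until their cumulative mass reaches the threshold, then renormalises over the survivors.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalise. Returns (token_index, probability) pairs."""
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return [(idx, prob / total) for idx, prob in kept]

# Hypothetical next-token distribution over a five-token vocabulary.
probs = [0.50, 0.30, 0.15, 0.04, 0.01]

print(top_p_filter(probs, p=0.9))  # keeps tokens 0-2 (95% of mass); tail dropped
```

The two lowest-probability tokens are excluded entirely, so no temperature setting can ever surface them. This is why top-P and temperature are complementary: temperature reshapes the distribution, while top-P truncates its tail.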
For tasks requiring precision, consistency, and factual accuracy, lower temperature values and conservative top-P settings are appropriate. For creative tasks where variation and novelty are desirable, higher temperature is warranted. Understanding this relationship allows you to set sampling parameters deliberately rather than accepting defaults that may not suit your use case.
Why Phrasing Choices Have Measurable Effects
Given the statistical nature of token prediction, it follows that the specific words you use in a prompt are not interchangeable with synonyms, even when the semantic intent is identical. The model's training data contains patterns associating certain phrasings with certain response types. A prompt phrased as a question activates patterns associated with explanatory, informational responses. A prompt phrased as a directive activates patterns associated with task execution. A prompt that includes domain-specific vocabulary signals to the model that the response should operate at a domain-expert level.
This is not a quirk or a limitation. It is the mechanism by which prompts work. Understanding it means understanding that prompt engineering is not about finding magic words. It is about aligning the statistical priors embedded in the model's training with the output characteristics you actually want.