How Models Use Context

Before covering the mechanics of how to provide context, it is worth understanding what the model does with it once it arrives. This is not an abstract concern. The way a language model processes context has direct, practical implications for how you should structure what you provide.

The Context Window as Working Memory

As established in the companion tutorial on prompt engineering fundamentals, every language model operates within a finite context window: the total amount of text it can hold in active attention at once. Everything in the context window (your system prompt, the conversation history, any documents or data you provide, and your current query) competes for that capacity.

When you add context to a conversation, you are not depositing it into a filing system the model can search at will. You are placing it into working memory. The model processes it once, alongside everything else in the context window, and generates a response. It does not re-read the document each time you ask a new question. It draws on a contextualised representation built during the initial processing pass.

This distinction matters because it changes how you think about context preparation. The question is not simply "did I include the relevant document?" but "did I structure that document so that the relevant parts are accessible to the model's attention mechanism when it processes my query?"

Attention Distribution and the Lost in the Middle Effect

The self-attention mechanism at the core of transformer-based language models does not distribute attention uniformly across the context window. Content at the beginning and end of the context window consistently receives stronger attention than content in the middle. This has been empirically documented in research as the "lost in the middle" effect (Liu et al., 2023), and it has significant implications for how you position critical information within a long context.

The practical consequence is this: if you paste a long document into a conversation and your query depends on information that happens to fall in the middle of that document, the model may fail to retrieve or weight it appropriately, even though the information is technically present in the context. This is not a failure of model intelligence. It is a structural property of attention computation over long sequences.

The countermeasure is not to avoid long context but to be deliberate about where critical information is positioned. Placing the most important content at the beginning or end of a pasted document, or explicitly flagging its location in your task instruction, meaningfully improves retrieval reliability.
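One way to apply this is to assemble the pasted context programmatically so the critical passage sits in the high-attention regions at the start and end of the prompt. The sketch below is a minimal illustration; the section labels it uses are illustrative conventions, not a standard.

```python
def assemble_context(critical: str, background: list[str]) -> str:
    """Place the critical passage first, then background material,
    then restate the critical passage at the end, so it occupies
    the start and end positions where attention is strongest.

    A minimal sketch; labels like "KEY FACTS" are illustrative.
    """
    parts = ["KEY FACTS (read these first):", critical, ""]
    parts.append("BACKGROUND MATERIAL:")
    parts.extend(background)
    parts += ["", "REMINDER OF KEY FACTS:", critical]
    return "\n".join(parts)
```

The same idea works manually: put the paragraph your query depends on at the top of the paste, and mention in your instruction where it is.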

More Context Is Not Always Better

There is an intuitive assumption that more context produces better answers. The model has more to work with, so it should perform better. This assumption is wrong in several common situations.

First, irrelevant context dilutes the attention the model gives to relevant content. A document that is ninety percent background and ten percent the specific information your query depends on is harder for the model to use than a document containing only that ten percent. Pre-processing context to remove irrelevant material before injecting it is almost always worthwhile for complex or repeated tasks.
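For repeated tasks, that pre-processing step can be automated. The sketch below uses a deliberately crude keyword overlap to drop paragraphs unrelated to the query; a real pipeline might use embeddings instead, but even simple filtering removes most of the dilution. The function name and scoring rule are illustrative assumptions, not a prescribed method.

```python
def filter_relevant(paragraphs: list[str], query_terms: set[str]) -> list[str]:
    """Keep only paragraphs that share vocabulary with the query.

    A crude sketch: scores each paragraph by how many query terms
    appear in it, and drops paragraphs with no overlap at all.
    """
    def score(paragraph: str) -> int:
        words = set(paragraph.lower().split())
        return len(words & query_terms)

    return [p for p in paragraphs if score(p) > 0]
```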

Second, contradictory context confuses the model. If you provide two documents that contain conflicting information on the same topic without explicit guidance on which to trust, the model will attempt to reconcile the conflict, often producing a hedged or incoherent response. Establishing precedence explicitly (stating which source should take priority when sources conflict) is essential in multi-source contexts.
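Stating precedence can be as simple as one instruction at the top of the assembled prompt. The sketch below shows one way to do it; the delimiter format and wording are illustrative assumptions, not a required syntax.

```python
def build_multisource_prompt(sources: dict[str, str],
                             precedence: list[str],
                             query: str) -> str:
    """Concatenate sources with an explicit precedence rule so the
    model knows which document wins when they conflict.

    `precedence` lists source names from most to least authoritative.
    """
    order = " > ".join(precedence)
    lines = [f"If the sources below conflict, trust them in this order: {order}.", ""]
    for name in precedence:
        lines += [f"--- SOURCE: {name} ---", sources[name], ""]
    lines.append(f"Question: {query}")
    return "\n".join(lines)
```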

Third, context that exceeds the context window will be truncated. The model will simply not see the content that falls outside the window limit. Depending on the interface and configuration, truncation may happen silently, without any warning that part of your document was dropped. Knowing the context window limits of the model you are using, and staying well within them, is a basic discipline of context management.
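A rough pre-flight check catches silent truncation before it happens. The sketch below uses the common approximation of about four characters per token for English text; real tokenizers vary by model and language, which is why it checks against a safety margin rather than the hard limit. The heuristic constant and margin are assumptions, not properties of any specific model.

```python
def fits_context_window(text: str,
                        window_tokens: int,
                        chars_per_token: float = 4.0,
                        safety_margin: float = 0.8) -> bool:
    """Estimate whether `text` fits comfortably within a context
    window of `window_tokens` tokens.

    Uses a ~4 chars/token heuristic (an approximation for English),
    and stays within a margin rather than the hard limit.
    """
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= window_tokens * safety_margin
```

For precise counts, the model provider's own tokenizer is the reliable tool; a heuristic like this is only a guard against gross overruns.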

The Model's Relationship to Retrieved Context

When context is retrieved automatically through a RAG pipeline rather than pasted manually, the same principles apply, but with an additional layer of complexity. The model does not retrieve the content. The retrieval system does. The model then processes whatever was retrieved, within the constraints described above.

This means the quality of the model's response in a RAG-enabled workflow depends on two independent factors: the quality of the retrieval (did the right content surface?) and the quality of the model's processing of that content (did it use it correctly?). Failures can occur at either stage, and diagnosing which stage is responsible is a skill that Chapter 4 covers in detail.
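When you know a specific fact the answer should contain, the two stages can be separated with a crude check: if no retrieved chunk mentions the fact, retrieval failed; if a chunk contains it but the answer does not, processing failed. The sketch below uses simple substring matching as a stand-in for that judgment; it is a diagnostic aid under that assumption, not the diagnostic method Chapter 4 develops.

```python
def diagnose_rag_failure(retrieved_chunks: list[str],
                         expected_fact: str,
                         model_answer: str) -> str:
    """Attribute a wrong answer to the retrieval or processing stage.

    Crude sketch: substring matching stands in for checking whether
    the fact was actually surfaced and actually used.
    """
    in_context = any(expected_fact.lower() in chunk.lower()
                     for chunk in retrieved_chunks)
    in_answer = expected_fact.lower() in model_answer.lower()
    if not in_context:
        return "retrieval failure"
    if not in_answer:
        return "processing failure"
    return "ok"
```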
