rectangle-betaTesting and Iterating

A system prompt is not a static document. It is a hypothesis about how to get consistent, useful output from a model, and like any hypothesis, it needs to be tested against evidence and revised when it falls short.

Establishing a Test Set

Before writing your system prompt, define a set of representative queries that reflect the real tasks you will use the model for. This becomes your test set. A good test set includes:

  • Typical queries that represent the most common use case

  • Edge cases that sit at the boundary of your defined scope

  • Adversarial inputs that might expose gaps in your behavioural rules or constraints

  • Ambiguous inputs that test how the model handles underspecification

Run every version of your system prompt against the full test set before committing to it. This prevents the common mistake of optimising for a single query type at the expense of overall robustness.

Reading Failure Modes

When a model response fails to meet your expectations, the failure almost always has a diagnosable cause traceable to a gap in the system prompt. Common failure modes and their root causes include:

  • Inconsistent tone or formality: missing or underspecified behavioural rules for communication style

  • Wrong level of detail: no output length or depth specification

  • Off-topic responses: insufficient scope constraints

  • Excessive hedging or caveats: no epistemic stance specified

  • Format drift across a long conversation: format rules need to be stated more forcefully, or repeated in the prompt

Treat each failure as a diagnostic signal, not a model limitation. In the majority of cases, a targeted addition or clarification to the system prompt will resolve the issue.

Versioning Your Prompts

As you iterate, maintain versioned copies of your system prompt. This is particularly important in GLBNXT Workspace, where system prompts may be shared across a team or embedded in a workflow. Without versioning, it is easy to lose track of what changed between a version that worked well and a version that introduced a regression.

A simple versioning convention is sufficient: append a date and a one-line change note to the top of each saved version. This overhead is minimal and the diagnostic value is significant.

Knowing When to Stop

There is a point of diminishing returns in system prompt iteration. The goal is not a perfect prompt; it is a prompt that produces acceptable output across your full test set with a failure rate you can tolerate. Over-engineering a system prompt can introduce its own problems: excessive constraints that prevent the model from handling novel requests gracefully, or so many conflicting rules that the model starts to behave inconsistently.

A useful heuristic: if you have iterated more than five or six times on the same failure mode without resolving it, the problem may not be the system prompt. It may be the model itself, the context retrieval configuration, or the nature of the query. Recognising that boundary is part of the skill.

Last updated

Was this helpful?