Testing and Iterating
A system prompt is not a static document. It is a hypothesis about how to get consistent, useful output from a model, and like any hypothesis, it needs to be tested against evidence and revised when it falls short.
Establishing a Test Set
Before writing your system prompt, define a set of representative queries that reflect the real tasks you will use the model for. This becomes your test set. A good test set includes:
Typical queries that represent the most common use case
Edge cases that sit at the boundary of your defined scope
Adversarial inputs that might expose gaps in your behavioural rules or constraints
Ambiguous inputs that test how the model handles underspecification
Run every version of your system prompt against the full test set before committing to it. This prevents the common mistake of optimising for a single query type at the expense of overall robustness.
Reading Failure Modes
When a model response fails to meet your expectations, the failure almost always has a diagnosable cause traceable to a gap in the system prompt. Common failure modes and their root causes include:
Inconsistent tone or formality: missing or underspecified behavioural rules for communication style
Wrong level of detail: no output length or depth specification
Off-topic responses: insufficient scope constraints
Excessive hedging or caveats: no epistemic stance specified
Format drift across a long conversation: format rules need to be stated more forcefully, or repeated in the prompt
Treat each failure as a diagnostic signal, not a model limitation. In the majority of cases, a targeted addition or clarification to the system prompt will resolve the issue.
Versioning Your Prompts
As you iterate, maintain versioned copies of your system prompt. This is particularly important in GLBNXT Workspace, where system prompts may be shared across a team or embedded in a workflow. Without versioning, it is easy to lose track of what changed between a version that worked well and a version that introduced a regression.
A simple versioning convention is sufficient: append a date and a one-line change note to the top of each saved version. This overhead is minimal and the diagnostic value is significant.
Knowing When to Stop
There is a point of diminishing returns in system prompt iteration. The goal is not a perfect prompt; it is a prompt that produces acceptable output across your full test set with a failure rate you can tolerate. Over-engineering a system prompt can introduce its own problems: excessive constraints that prevent the model from handling novel requests gracefully, or so many conflicting rules that the model starts to behave inconsistently.
A useful heuristic: if you have iterated more than five or six times on the same failure mode without resolving it, the problem may not be the system prompt. It may be the model itself, the context retrieval configuration, or the nature of the query. Recognising that boundary is part of the skill.
Last updated
Was this helpful?