Model Evaluation & Versioning
Model evaluation and versioning are the practices that ensure the AI models powering your applications continue to perform well over time. Evaluation gives your team a systematic way to measure model quality against defined criteria, identify regressions when models are updated, and compare models before deciding which to use in production. Versioning provides the control and traceability needed to manage model changes safely in environments where output quality and consistency matter.
On GLBNXT Platform, evaluation and versioning capabilities are available through the observability and model management layer of your environment, with tooling that captures the data needed to assess model behaviour and the processes needed to manage model changes without disrupting production workloads.
Why Evaluation Matters
Deploying a model that works well in testing is the beginning, not the end, of managing model quality. AI models do not behave identically across all inputs. Edge cases, distribution shifts in real user queries, and changes introduced by model updates can all affect output quality in ways that are not visible from aggregate metrics alone.
Systematic evaluation gives your team confidence that a model is performing as expected across the range of inputs it will encounter in production, and provides an early warning system when performance degrades. Without evaluation, model quality issues are typically discovered through user complaints or downstream effects rather than through proactive monitoring.
Evaluation Approaches
Offline Evaluation
Offline evaluation assesses model performance against a fixed dataset of test cases before a model is deployed or updated. It is the primary quality gate for new models and model versions entering your environment.
An offline evaluation dataset consists of representative inputs paired with expected outputs or quality criteria. The model being evaluated is run against each test case, and its outputs are scored against the defined criteria. Scores are compared against a baseline, typically the current production model, to determine whether the new model meets the quality bar required for deployment.
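The loop described above can be sketched as a minimal harness. The dataset shape, the exact-match scorer, and the `meets_quality_bar` threshold logic here are illustrative assumptions, not platform APIs:

```python
# Minimal offline evaluation sketch: run candidate and baseline models over a
# fixed dataset, score each output against the expected answer, and compare
# mean scores. The exact-match scorer is a stand-in for real quality criteria.

def score_output(output: str, expected: str) -> float:
    """Toy scorer: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def evaluate(model, dataset) -> float:
    """Return the mean score of `model` over (input, expected) pairs."""
    scores = [score_output(model(inp), expected) for inp, expected in dataset]
    return sum(scores) / len(scores)

def meets_quality_bar(candidate, baseline, dataset, tolerance: float = 0.0) -> bool:
    """A candidate passes if it scores at least as well as the baseline,
    within an optional tolerance."""
    return evaluate(candidate, dataset) >= evaluate(baseline, dataset) - tolerance
```

In practice the scorer would be replaced by the quality criteria defined for your use case, and the dataset would be loaded from your evaluation tooling rather than constructed inline.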
Effective evaluation datasets are built from real inputs drawn from production usage rather than synthetic examples constructed during development. Real inputs capture the distribution of queries and edge cases that matter in your specific use case. Synthetic datasets tend to overrepresent idealised inputs and miss the long tail of real user behaviour.
Online Evaluation
Online evaluation assesses model performance in production using real user interactions as they occur. It complements offline evaluation by capturing quality signals that only emerge under real usage conditions, including inputs that were not anticipated during dataset construction and quality dimensions that are difficult to assess without real user context.
Online evaluation on GLBNXT Platform is supported through LLM tracing tools available in your environment. These tools capture the full trace of each model interaction, including the input, the model output, and intermediate steps in any pipeline or agent workflow. Traces can be reviewed manually, scored using automated evaluation criteria, or sampled for human review on a defined schedule.
Automated Evaluation
For use cases where manual review of model outputs at scale is impractical, automated evaluation uses a secondary model or defined scoring functions to assess the quality of outputs produced by the primary model. Automated evaluation can measure dimensions such as factual accuracy against a reference, adherence to a defined output format, relevance of retrieved context in a RAG system, or compliance with behavioural guidelines defined in a system prompt.
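A defined scoring function can be very simple. As one hypothetical example of the format-adherence dimension mentioned above, a scorer that checks whether an output parses as JSON and contains a required set of keys:

```python
import json

def score_json_format(output: str, required_keys: set[str]) -> float:
    """Score adherence to a JSON output format: 1.0 if the output parses
    as a JSON object containing every required key, 0.0 otherwise."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if required_keys <= parsed.keys() else 0.0
```

Dimensions such as factual accuracy or guideline compliance would instead be scored by a secondary model, with the same 0-to-1 output convention so scores can be aggregated uniformly.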
Automated evaluation is not a complete substitute for human judgement, particularly for use cases where quality is subjective or context-dependent. It is most effective as a high-volume screening mechanism that identifies outputs requiring closer review, combined with periodic human evaluation of a sampled subset.
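The screening-plus-sampling pattern can be sketched as follows; the threshold and sample rate are illustrative parameters your team would tune:

```python
import random

def select_for_review(traces, auto_scores, threshold=0.5,
                      sample_rate=0.05, seed=None):
    """Flag every output whose automated score falls below `threshold`,
    and add a random sample of the remainder for periodic human review."""
    rng = random.Random(seed)
    flagged = [t for t, s in zip(traces, auto_scores) if s < threshold]
    passed = [t for t, s in zip(traces, auto_scores) if s >= threshold]
    sampled = [t for t in passed if rng.random() < sample_rate]
    return flagged + sampled
```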
Evaluation Tooling on GLBNXT Platform
GLBNXT Platform supports model evaluation through Langfuse and Opik, both available as managed services within your environment depending on your configuration.
Langfuse provides LLM tracing, evaluation scoring, and dataset management for AI applications and pipelines. It captures traces of model interactions, allows evaluation scores to be attached to individual traces, and provides dashboards for tracking quality metrics over time. Langfuse is well suited for teams that want a comprehensive view of model performance across their applications with both automated and human evaluation workflows.
Opik provides evaluation and observability capabilities focused on pipeline quality and output assessment. It supports the construction of evaluation datasets, automated scoring against defined criteria, and comparison of model versions across evaluation runs. Opik is well suited for teams running structured evaluation experiments as part of a model selection or update process.
Both tools integrate with the model endpoints and application pipelines in your platform environment. Instrumentation is added at the application layer to capture the interaction data that evaluation workflows operate on.
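Langfuse and Opik each provide their own SDKs for this instrumentation. As a generic illustration of what application-layer instrumentation captures, not the API of either tool, a hypothetical decorator that records the interaction data an evaluation workflow operates on:

```python
import functools
import time

def traced(trace_store: list):
    """Decorator that records the input, output, and latency of a model
    call into `trace_store` -- the kind of record evaluation tooling
    consumes. A real SDK would ship this to a tracing backend instead."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            output = fn(*args, **kwargs)
            trace_store.append({
                "name": fn.__name__,
                "input": {"args": args, "kwargs": kwargs},
                "output": output,
                "latency_s": time.monotonic() - start,
            })
            return output
        return wrapper
    return decorator
```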
Model Versioning
Model versioning provides the traceability and control needed to manage model changes safely. Every model in your Model Hub has a version that identifies the specific weights, configuration, and serving runtime in use. When a model is updated, the previous version is retained until the new version has been validated and promoted to production.
Version Management Process
When a new version of a model becomes available, whether through an update to an open-source model, a new release from a model provider, or a custom fine-tuned model developed by your team, GLBNXT follows a defined process before the new version reaches your production endpoint.
The process follows these steps:
Staging deployment: the new model version is deployed to a staging endpoint within your environment that is separate from the production endpoint your applications currently use
Evaluation: your team runs offline evaluation against the staged version using your evaluation dataset, comparing scores against the current production baseline
Review: evaluation results are reviewed against the quality bar defined for your use case. Any regressions in key quality dimensions are investigated before proceeding
Promotion: once the new version meets the required quality bar, it is promoted to the production endpoint. The transition is managed by GLBNXT without requiring changes to your application configuration.
Rollback retention: the previous version remains available for a defined period after promotion, allowing your team to request a rollback if unexpected issues emerge in production
Application Impact of Version Changes
Because GLBNXT Platform manages model endpoints as stable API addresses, a model version update does not change the endpoint URL your applications call. Your application continues to call the same endpoint before and after a version update. The change is transparent at the infrastructure level.
However, a model version update may change the outputs your application produces, even for identical inputs. This is the reason evaluation is required before promotion. For applications where output consistency is critical, such as those subject to regulatory requirements or integrated into downstream processes that depend on specific output formats, model version updates should be treated as controlled changes subject to your organisation's change management processes.
Custom and Fine-Tuned Model Versioning
If your team develops custom or fine-tuned models to deploy into the Model Hub, the same versioning principles apply. Each iteration of a custom model should be treated as a distinct version, evaluated against your quality criteria before deployment, and managed through the staging and promotion process rather than deployed directly to production endpoints.
Maintaining a clear version history for custom models provides the traceability needed to understand how model behaviour has changed over time, identify the version responsible for a specific output if an issue is raised, and roll back to a previous version if a new iteration performs worse than expected.
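The version record behind that traceability can be sketched as follows; the field names and the `VersionHistory` helper are hypothetical, not a platform API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelVersion:
    """A minimal version record: enough to trace which weights and
    configuration produced a given output, and to roll back."""
    version: str
    weights_ref: str   # e.g. an artifact digest or registry path
    config: dict
    eval_score: float
    promoted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class VersionHistory:
    """Ordered record of promoted versions for one model."""
    def __init__(self):
        self._versions: list[ModelVersion] = []

    def promote(self, version: ModelVersion) -> None:
        self._versions.append(version)

    def current(self) -> ModelVersion:
        return self._versions[-1]

    def rollback_target(self) -> ModelVersion:
        """The version that was in production before the current one."""
        return self._versions[-2]
```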
Building an Evaluation Practice
Model evaluation delivers the most value when it is a consistent practice embedded in the development and deployment process rather than an occasional activity. The following principles support an effective evaluation practice on GLBNXT Platform.
Start with a small, high-quality dataset. A focused set of representative test cases covering the core tasks and key edge cases of your use case is more valuable than a large dataset of low-quality or poorly representative examples. Grow the dataset over time by adding real production cases that surface quality issues.
Define quality criteria before deploying. Agreeing what good looks like before a model goes into production gives your team a clear standard to evaluate against and prevents subjective judgements from varying across evaluation runs.
Evaluate before every significant change. Model updates, prompt changes, retrieval configuration changes, and system prompt modifications can all affect output quality. Treating evaluation as a prerequisite for any significant change to the AI layer of an application catches regressions before they reach users.
Track quality metrics over time. Point-in-time evaluation scores are less informative than trend data. Tracking evaluation metrics across model versions and over time provides the longitudinal view needed to understand whether model quality is improving, stable, or degrading as your application and its usage patterns evolve.
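A simple form of this longitudinal view compares the mean of the most recent evaluation runs against the window before them; the window size and the stability margin here are illustrative:

```python
def quality_trend(scores: list[float], window: int = 3,
                  margin: float = 0.01) -> str:
    """Classify the recent trend of evaluation scores by comparing the
    mean of the latest `window` runs against the mean of the window
    before it, treating differences within `margin` as stable."""
    if len(scores) < 2 * window:
        return "insufficient data"
    recent = sum(scores[-window:]) / window
    previous = sum(scores[-2 * window:-window]) / window
    if recent > previous + margin:
        return "improving"
    if recent < previous - margin:
        return "degrading"
    return "stable"
```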
For guidance on the observability tooling that supports evaluation workflows, see the Observability and Monitoring section. For guidance on how model serving and version management works at the infrastructure level, see the Model Serving and Routing section.