Model serving & routing
Model serving and routing is the layer of GLBNXT Platform that makes AI models accessible to your applications. It handles everything between an inference request leaving your application and a response being returned: model loading, compute allocation, request routing, load balancing, and failover. Your development team interacts with a clean, stable model endpoint. The platform manages everything behind it.
This section explains how model serving works on GLBNXT Platform, how routing is configured, and what your team needs to understand to work effectively with models in your environment.
How Model Serving Works
Every model available in your environment is served through a managed inference layer. When a model is made available in your Model Hub, GLBNXT handles the deployment of that model onto the appropriate compute resources, the configuration of the serving runtime, and the exposure of a stable API endpoint that your applications can call.
Model serving on GLBNXT Platform is built on open-source inference runtimes. Depending on the model type and the performance requirements of your use case, models are served through Ollama for open-source language models or NVIDIA NIM for production inference workloads that require optimised throughput and latency. Both are managed entirely at the platform level.
Your application does not need to know which serving runtime is being used. It calls an endpoint, and the platform returns a response.
Model Endpoints
Each model available in your environment is accessible via a dedicated API endpoint. Endpoints follow a consistent format and are compatible with standard AI development frameworks and tooling, meaning that code written to call a GLBNXT-hosted model can be migrated or adapted with minimal changes if your requirements evolve.
Endpoints are listed in the Model Hub area of your platform console, along with the model name, version, and any relevant configuration details. Access to individual endpoints is governed by the role-based access controls configured for your environment.
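As a rough illustration of what calling a model endpoint looks like, the sketch below builds a request against a hypothetical chat-style endpoint. The URL, payload schema, and token handling are assumptions for illustration only; check the Model Hub in your console for the actual endpoint address and request format for your environment.

```python
import json
import urllib.request

# Hypothetical endpoint URL -- the real address is listed in your Model Hub.
ENDPOINT = "https://your-environment.example/v1/chat/completions"

def build_chat_request(model: str, prompt: str, token: str) -> urllib.request.Request:
    """Build an HTTP request for a GLBNXT-hosted model endpoint.

    Assumes an OpenAI-style chat payload; your endpoint's schema may differ.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Access is governed by the role-based controls in your environment.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# Sending the request is a plain HTTP call, e.g.:
# with urllib.request.urlopen(build_chat_request("llama3", "Summarise ...", token)) as resp:
#     reply = json.load(resp)
```

Because the endpoint is stable, this is typically the only model-specific code your application needs; the serving runtime behind it can change without touching it.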
Model Routing
Model routing manages how inference requests are directed across available compute resources and model instances. When your application sends a request to a model endpoint, the routing layer handles the following automatically:
Load balancing: requests are distributed across available model instances to ensure consistent response times under load
Failover: if a model instance becomes unavailable, requests are automatically redirected to a healthy instance without interruption to your application
Resource allocation: inference requests are matched to the appropriate compute resources based on model requirements and available capacity
Priority handling: in environments with multiple teams or applications sharing compute, routing can be configured to apply priority rules that ensure critical workloads receive resources first
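To make the load-balancing and failover behaviours above concrete, here is a deliberately simplified toy router: round-robin distribution across instances, with unhealthy instances skipped automatically. This is an illustration of the concept only; the platform's actual routing is managed by GLBNXT and is not something your code implements.

```python
class ModelRouter:
    """Toy sketch: round-robin load balancing with automatic failover.

    Illustration of the routing concept only -- not the platform's
    real routing implementation.
    """

    def __init__(self, instances):
        self._order = list(instances)
        self._healthy = set(instances)
        self._i = 0

    def mark_unhealthy(self, instance):
        """Simulate a health check removing an instance from rotation."""
        self._healthy.discard(instance)

    def route(self):
        """Return the next healthy instance, skipping failed ones."""
        for _ in range(len(self._order)):
            instance = self._order[self._i % len(self._order)]
            self._i += 1
            if instance in self._healthy:
                return instance
        raise RuntimeError("no healthy model instances available")
```

When an instance is marked unhealthy, subsequent requests simply flow to the remaining instances; from the caller's point of view nothing changes, which is the property the failover guarantee above describes.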
Routing configuration is managed centrally by GLBNXT and can be updated as your environment grows or your workload requirements change. Application code does not need to be updated when routing configuration changes.
Ollama for Open-Source Models
Ollama is the primary serving runtime for open-source language models on GLBNXT Platform. It supports a wide range of models from the open-source ecosystem and makes them available through a consistent API interface. Models served through Ollama can be used for conversational applications, document analysis, code generation, summarisation, and any other language task your use case requires.
GLBNXT manages Ollama deployment, model loading, and version management. When a new model version is available or a model update is required, GLBNXT handles the update without requiring changes to your application configuration.
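Ollama's documented streaming format returns newline-delimited JSON, with each line carrying an incremental `response` fragment and a final line marked `done`. The helper below joins such a stream into a single string. Note this follows the upstream Ollama format; the exact response schema exposed through your managed endpoint may differ, so treat it as a sketch.

```python
import json

def collect_ollama_stream(lines):
    """Join incremental 'response' fragments from an Ollama-style
    streaming reply (newline-delimited JSON, one object per line)."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk signals end of generation
            break
    return "".join(parts)
```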
NVIDIA NIM for Production Inference
For workloads that require optimised inference performance, GLBNXT Platform includes support for NVIDIA NIM. NIM provides hardware-optimised model serving on NVIDIA GPU infrastructure, delivering lower latency and higher throughput than standard serving runtimes for demanding production workloads.
NIM is suited for use cases where response time is critical, such as real-time user-facing applications, high-volume API services, or workloads processing large numbers of concurrent requests. Your GLBNXT contact can advise on when NIM is the appropriate serving configuration for your specific requirements.
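When weighing up whether NIM is warranted, it helps to measure latency percentiles (rather than averages) on representative traffic before raising the question with your GLBNXT contact. A minimal nearest-rank percentile helper, purely as a measurement aid:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (e.g. ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(math.ceil(p / 100 * len(ordered)) - 1, 0)
    return ordered[k]
```

A p95 or p99 well above your response-time budget under normal load is the kind of evidence that makes the NIM conversation concrete.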
Adding and Updating Models
The models available in your environment are configured during onboarding based on your use case requirements. If your team needs access to additional models or wants to update a model to a newer version, this is managed through a request to your GLBNXT contact.
GLBNXT validates, deploys, and tests new models before making them available in your Model Hub, ensuring that every model endpoint your team works with is stable and correctly configured. Custom or fine-tuned models developed by your team can also be deployed into the serving layer. See the Model Hub section for further guidance on custom model deployment.
What Your Team Needs to Know
For most development work on GLBNXT Platform, the model serving and routing layer operates transparently in the background. Your team works with model endpoints in the same way it would work with any API, without needing to understand the infrastructure behind them.
The key points to keep in mind are:
Model endpoints are stable and consistent. Routing and failover changes do not affect endpoint addresses or API behaviour.
Access to model endpoints is governed by role-based access controls. If your team cannot access a model it expects to see, check role assignments with your platform administrator.
Performance characteristics of a given model may vary based on current compute load. If your use case has strict latency requirements, discuss serving configuration options with your GLBNXT contact during onboarding.
All inference requests are logged by the platform. Model usage is visible in the Monitoring and Observability area of the console and forms part of your environment's audit trail.
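Although failover is handled at the platform level, bounded client-side retries with backoff remain good practice for absorbing momentary network blips between your application and the endpoint. A hedged sketch (the function name and defaults are illustrative, not a platform API):

```python
import time

def call_with_retry(send_request, attempts=3, backoff_s=0.5):
    """Invoke a zero-argument request callable with bounded retries.

    Retries on any exception, sleeping with exponential backoff between
    attempts; re-raises after the final attempt fails.
    """
    for attempt in range(attempts):
        try:
            return send_request()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s * (2 ** attempt))
```

Keep retry counts small and backoff short: the platform's own failover is already redirecting traffic, so aggressive client retries mostly add load without improving success rates.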