Resource & compute management
Compute resources are the foundation that AI workloads run on. Managing them well means ensuring that your applications have the capacity they need to perform reliably, that allocation is efficient across the workloads in your environment, and that your team has the visibility to understand consumption patterns and plan for growth. On GLBNXT Platform, GLBNXT manages the compute infrastructure, but your team plays an important role in understanding how resources are used and communicating when allocation adjustments are needed.
This section explains how compute resources are structured on GLBNXT Platform, how resource allocation works, how to monitor consumption, and how to approach capacity planning as your workloads grow.
Compute Resource Types
GLBNXT Platform provides two categories of compute resource that underpin all workloads in your environment.
GPU compute is the primary resource for AI inference, model serving, and training workloads. Language models, embedding models, and other AI components require GPU acceleration to serve requests at the latency and throughput levels that production applications demand. GPU resources are allocated to your environment based on the model serving requirements agreed during onboarding and are managed by GLBNXT to ensure availability for your inference workloads.
CPU compute handles the supporting workloads that do not require GPU acceleration, including workflow automation execution, API processing, database operations, ingestion pipelines, observability components, and the Kubernetes orchestration layer that manages all containerised workloads in your environment. CPU resources are provisioned alongside GPU resources as part of your environment's complete compute allocation.
Both resource types are provisioned exclusively for your environment. Your workloads do not compete for compute resources with other organisations on the platform. Isolation is enforced at the infrastructure level.
How Resource Allocation Works
Your compute allocation is configured during onboarding based on the workloads your team plans to run in the environment. The allocation reflects your anticipated model serving requirements, the number and complexity of concurrent workflows, the data volumes your pipelines will process, and the number of users and applications that will generate requests against your environment.
GLBNXT monitors resource utilisation continuously and manages the allocation to ensure that your workloads have the capacity they need. Autoscaling is applied within your allocation to handle fluctuations in workload demand, distributing available resources dynamically across competing workloads based on their priority and resource requirements.
When demand consistently approaches the limits of your current allocation, GLBNXT will advise your team and work with you to adjust the allocation to match your actual usage patterns. Resource allocation adjustments are managed through your GLBNXT contact and take effect following a brief provisioning process.
GPU Resource Management
GPU compute is the most constrained and most critical resource in an AI platform environment. Managing GPU allocation effectively has a direct impact on the inference performance of your applications and the cost efficiency of your environment.
Model Serving and GPU Allocation
Each model deployed in your Model Hub is allocated GPU resources sufficient to serve inference requests at the performance levels required for your use case. The serving runtime, model size, and expected request throughput all influence how much GPU resource is required for a given model deployment.
Larger models require more GPU memory and typically more compute per inference request than smaller models. Running multiple large models simultaneously in the same environment requires proportionally more GPU resource. During onboarding, GLBNXT works with your team to configure model deployments and GPU allocation in a way that balances performance requirements with resource efficiency.
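As an illustration of how model size drives GPU requirements, the back-of-envelope sketch below estimates serving memory from parameter count. The 2-bytes-per-parameter figure assumes fp16/bf16 weights, and the overhead factor is a hypothetical planning input; actual requirements depend on the serving runtime, batch size, and context length, so treat this as a first-order sizing aid rather than a platform formula.

```python
def estimate_gpu_memory_gb(params_billion: float,
                           bytes_per_param: int = 2,
                           overhead_factor: float = 1.2) -> float:
    """Rough GPU memory estimate for serving a model.

    Assumes fp16/bf16 weights (2 bytes per parameter) and an
    illustrative 20% overhead for activations, KV cache, and
    runtime buffers. Real requirements vary with serving runtime,
    batch size, and context length.
    """
    weights_gb = params_billion * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gb * overhead_factor

# A 7B-parameter model at fp16, before runtime-specific tuning
print(round(estimate_gpu_memory_gb(7), 1))
```

Doubling the parameter count roughly doubles the estimate, which is why running several large models concurrently requires proportionally more GPU resource.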
Concurrent Workload Management
In environments where multiple models and applications are running concurrently, GPU resources are shared across active inference workloads within your allocation. The platform's model routing layer manages request distribution to ensure that workloads receive GPU resources in accordance with their priority configuration. High-priority production inference requests are served ahead of lower-priority background processing tasks.
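The priority ordering described above can be modelled as a simple priority queue: higher-priority requests are always dispatched first, with first-in-first-out ordering within a tier. The workload tier names below are illustrative, not actual GLBNXT configuration values.

```python
import heapq
import itertools

# Lower number = higher priority; tier names are illustrative only.
PRIORITY = {"production": 0, "background": 1}

class InferenceQueue:
    """Toy model of priority-ordered request dispatch: production
    requests are served before background ones, mirroring how a
    routing layer can order competing workloads within a fixed
    GPU allocation."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a tier

    def submit(self, workload: str, request: str) -> None:
        heapq.heappush(self._heap,
                       (PRIORITY[workload], next(self._seq), request))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

q = InferenceQueue()
q.submit("background", "nightly-batch-1")
q.submit("production", "chat-req-42")
print(q.next_request())  # the production request is dispatched first
```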
If your environment runs both user-facing applications with strict latency requirements and background processing workloads with more flexible timing, discuss priority configuration with your GLBNXT contact during onboarding to ensure that resource allocation reflects the relative importance of each workload.
GPU Scaling for Variable Workloads
Some workloads have predictable usage patterns with clear peaks and troughs, while others have variable demand that is difficult to forecast. For environments with predictable patterns, compute allocation can be configured to align with anticipated peak demand. For environments with highly variable demand, GLBNXT can advise on allocation strategies that balance performance during peak periods with cost efficiency during quieter periods.
CPU Resource Management
CPU resources support the non-inference workloads that run alongside your AI applications. These include workflow automation, API request handling, data ingestion pipelines, observability components, and the platform services that manage orchestration and routing.
CPU resource consumption is generally more predictable and more gradual in its growth than GPU consumption. As you add more workflows, more users, and more API integrations, CPU demand grows proportionally. GLBNXT monitors CPU utilisation as part of the continuous infrastructure observability layer and manages scaling within your allocation to accommodate growing workloads.
For environments running high-volume workflow automation or data ingestion at scale, CPU resource requirements can become significant. If your use case involves processing large document volumes, running many concurrent workflow executions, or serving high API request volumes, discuss CPU allocation requirements with your GLBNXT contact during the solution design phase rather than after performance issues emerge in production.
Storage Resource Management
In addition to compute, your environment's storage capacity is a managed resource that requires attention as your data volumes grow.
Object storage capacity in MinIO, database storage in Postgres, and vector database storage in Weaviate or Qdrant are all provisioned based on your anticipated data volumes. As your knowledge bases grow, your conversation history accumulates, and your application data expands, storage consumption increases. GLBNXT monitors storage utilisation and alerts your team when consumption approaches the limits of your current provisioning.
Storage capacity adjustments are managed through your GLBNXT contact. For environments with rapidly growing data volumes, establishing a regular cadence for reviewing storage consumption and projecting future requirements helps avoid last-minute capacity requests.
Data that is no longer needed should be removed from platform storage rather than retained indefinitely. Accumulated redundant data increases storage costs, reduces retrieval quality in vector databases, and can complicate compliance with data retention obligations. Implementing data lifecycle policies that archive or delete data according to defined retention schedules keeps storage consumption under control and supports your data minimisation obligations.
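A lifecycle policy of the kind described can be as simple as selecting objects whose age exceeds a retention window. The sketch below works on plain (key, last-modified) pairs; in practice the listing and deletion would go through your object store's API, and the 90-day window is a placeholder for whatever your retention schedule requires.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # illustrative retention window

def expired_keys(objects, now=None):
    """Return keys of stored objects older than the retention window.

    `objects` is any iterable of (key, last_modified) pairs, e.g. a
    listing from your object store's API. This sketch only selects
    what a lifecycle policy would archive or delete.
    """
    now = now or datetime.now(timezone.utc)
    return [key for key, modified in objects if now - modified > RETENTION]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
listing = [
    ("kb/old-policy.pdf", datetime(2025, 1, 10, tzinfo=timezone.utc)),
    ("kb/current-faq.md", datetime(2025, 5, 20, tzinfo=timezone.utc)),
]
print(expired_keys(listing, now))  # only the January object is expired
```

Running a selection like this on a schedule, with the results archived or deleted per your retention policy, keeps storage consumption and vector database contents aligned with the data you actually need.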
Monitoring Resource Consumption
Resource consumption data is visible through the Monitoring and Observability area of the platform console. Key metrics available for your environment include the following.
GPU utilisation shows the percentage of allocated GPU resources consumed by active inference workloads over time. Consistently high GPU utilisation across your allocation indicates that workloads are resource-constrained and that a capacity review may be appropriate.
CPU utilisation shows consumption across your CPU allocation by workload type, including model serving infrastructure, workflow automation, API processing, and platform services. Spikes in CPU utilisation that correlate with workflow execution peaks or ingestion pipeline runs help identify where CPU demand is concentrated.
Memory consumption shows RAM utilisation across your compute allocation. High memory pressure, particularly during large model loading or high-concurrency inference periods, can affect performance and should be reviewed if it occurs consistently.
Storage consumption shows used and available capacity across each storage service in your environment, including object storage, relational database storage, and vector database storage.
Model inference metrics show request volumes, average latency, token consumption, and error rates for each model endpoint in your Model Hub. These metrics are the primary signal for assessing whether model serving resources are appropriately sized for your inference workload.
Your team should review resource consumption metrics regularly, not only when performance issues arise. Understanding your environment's normal consumption patterns makes it significantly easier to identify anomalies, plan for growth, and make informed decisions about when allocation adjustments are needed.
Capacity Planning
Capacity planning is the practice of anticipating future resource requirements based on current consumption trends and projected workload growth. Proactive capacity planning avoids situations where resource constraints cause performance degradation or service disruption in production environments.
A practical capacity planning approach for GLBNXT Platform environments involves the following.
Establish consumption baselines during the initial weeks of production operation, documenting normal resource utilisation patterns across GPU, CPU, memory, and storage. Baselines give you the reference point needed to identify growth trends and detect anomalies.
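A baseline can be as simple as a mean and a high percentile over the samples collected in those initial weeks. A minimal sketch, using a nearest-rank 95th percentile and illustrative utilisation figures:

```python
def baseline(samples):
    """Summarise utilisation samples (0-100 %) into a baseline:
    the mean and a nearest-rank 95th percentile. Recorded during
    early production operation, these give a reference point for
    spotting drift and anomalies later."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(0.95 * len(s)))  # nearest-rank P95
    return {"mean": round(sum(s) / len(s), 1), "p95": s[idx]}

# Illustrative daily GPU utilisation readings (percent)
gpu_util = [52, 48, 61, 55, 70, 58, 49, 63, 57, 66]
print(baseline(gpu_util))
```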
Project growth from known drivers such as planned increases in user numbers, new applications being deployed, higher document ingestion volumes, or additional models being added to your Model Hub. Map each growth driver to its resource implications and estimate the timeline over which increased demand will materialise.
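One way to make this mapping concrete is to express each driver as a multiplier on current demand. The drivers and multipliers below are hypothetical planning inputs, and linear scaling is itself an assumption that should be sanity-checked against observed behaviour.

```python
# Current utilisation and growth drivers are illustrative planning
# inputs, not platform figures.
current = {"gpu_util_pct": 55, "cpu_util_pct": 40}

drivers = [
    # (description, resource affected, multiplier on current demand)
    ("onboard second business unit (+50% users)", "gpu_util_pct", 1.5),
    ("new ingestion pipeline (+25% CPU load)", "cpu_util_pct", 1.25),
]

def project(current, drivers):
    """Apply each driver's multiplier to the resource it affects,
    assuming demand scales linearly with the driver."""
    projected = dict(current)
    for _desc, resource, factor in drivers:
        projected[resource] = round(projected[resource] * factor, 1)
    return projected

print(project(current, drivers))
# GPU: 55 * 1.5 = 82.5, which would exceed a 70% review threshold
```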
Request allocation adjustments ahead of need rather than reactively when performance is already affected. Compute allocation changes require a provisioning process. Engaging your GLBNXT contact when consumption reaches a defined threshold, such as sustained utilisation above seventy percent of your current allocation, gives adequate lead time for the adjustment to be in place before resources become a constraint.
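A sustained-threshold check of this kind distinguishes a genuine capacity trend from a one-off spike. A minimal sketch, assuming daily utilisation readings and the seventy percent threshold mentioned above:

```python
THRESHOLD_PCT = 70      # review threshold from your capacity plan
SUSTAINED_SAMPLES = 5   # e.g. five consecutive daily readings

def needs_capacity_review(utilisation_series,
                          threshold=THRESHOLD_PCT,
                          window=SUSTAINED_SAMPLES):
    """True if utilisation stayed above the threshold for `window`
    consecutive samples -- a sustained breach rather than a spike."""
    streak = 0
    for value in utilisation_series:
        streak = streak + 1 if value > threshold else 0
        if streak >= window:
            return True
    return False

# Illustrative daily GPU utilisation readings (percent)
daily_gpu = [62, 68, 74, 73, 76, 79, 81, 75]
print(needs_capacity_review(daily_gpu))  # sustained breach detected
```

When a check like this fires, that is the point to engage your GLBNXT contact, while there is still lead time for the provisioning process.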
Review capacity after significant changes such as deploying a new application, adding a large model to the Model Hub, onboarding a major new user group, or running a high-volume batch process. Significant changes can shift resource consumption patterns materially and should prompt a review of whether the current allocation remains appropriate.
For guidance on the observability tooling that surfaces resource consumption data, see the Observability and Monitoring section. For guidance on model serving configuration and its relationship to GPU resource requirements, see the Model Serving and Routing section.