RAG & Knowledge Systems
Retrieval-augmented generation is the architectural pattern that enables AI models to work with your organisation's own data. Rather than relying solely on what a model learned during training, RAG systems retrieve relevant information from your data sources at query time and provide it as context to the model, grounding responses in accurate, current, and domain-specific content that the model was never trained on.
On GLBNXT Platform, every component needed to build a production RAG system is available as a managed service. Document ingestion, embedding generation, vector storage, retrieval configuration, and language model integration are all supported within your platform environment, with the infrastructure managed by GLBNXT and the solution design in the hands of your team.
This section explains how RAG works, how to implement it on GLBNXT Platform, and the architectural considerations that determine how well a RAG system performs in production.
What This Pattern Is For
RAG and knowledge systems are the right architectural choice when the quality and accuracy of your AI application's outputs depend on access to specific information that exists in your organisation's data rather than in a model's trained knowledge.
Common use cases in this category include:
Contract review and analysis using your organisation's existing agreements and legal reference material
Financial document search and analysis across reports, filings, and internal financial data
Internal knowledge base search giving employees access to policies, procedures, and operational documentation
Compliance and regulatory research grounded in current regulatory texts and internal compliance frameworks
Tender and procurement analysis working across large volumes of structured proposal and specification documents
Policy analysis for public sector and regulated industry organisations that need to work with complex, frequently updated policy documentation
If your use case requires the AI to answer questions, summarise content, or generate outputs based on documents or data your organisation owns, RAG is the foundational pattern.
How RAG Works
A RAG system operates in two distinct phases: ingestion and retrieval. Understanding both phases is essential for designing a system that performs well.
Ingestion Phase
The ingestion phase processes your source documents into a form that enables fast and accurate retrieval at query time. This is a preparation step that runs before the RAG system is available to users, and it needs to be re-run whenever source documents are added, updated, or removed.
The ingestion pipeline follows these steps:
Document loading: source documents are loaded from their storage location, which on GLBNXT Platform is typically MinIO object storage or a connected external data source
Chunking: documents are split into smaller segments of a defined size. Chunk size affects retrieval quality significantly. Chunks that are too large reduce retrieval precision. Chunks that are too small lose the surrounding context needed for coherent answers.
Embedding generation: each chunk is passed through an embedding model that converts the text into a numerical vector representing its semantic meaning. These vectors are what enable similarity-based retrieval.
Indexing: the embeddings, along with metadata about their source document and position, are stored in the vector database. The index built during this step is what the retrieval layer searches at query time.
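The four ingestion steps above can be sketched as a minimal in-memory pipeline. This is an illustration, not the platform's implementation: the fixed-size character splitter and the `toy_embed` function are simplified stand-ins for the platform's chunking configuration and Model Hub embedding model, and the `index` list stands in for the vector database.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split a document into fixed-size character chunks with overlap,
    so information near a boundary appears in both neighbouring chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def ingest(documents: dict[str, str], embed, index: list) -> None:
    """Chunk each document, embed every chunk, and store it with metadata
    about its source document and position."""
    for doc_id, text in documents.items():
        for position, chunk in enumerate(chunk_text(text)):
            index.append({
                "doc_id": doc_id,        # source document reference
                "position": position,    # chunk position within the document
                "text": chunk,
                "vector": embed(chunk),  # semantic embedding of the chunk
            })

# Toy embedding for illustration only; a real pipeline calls the
# embedding model deployed in your Model Hub.
def toy_embed(text: str) -> list[float]:
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

index: list[dict] = []
ingest({"doc-1": "Retrieval grounds model output in your own data. " * 20},
       toy_embed, index)
```

The stored `doc_id` and `position` metadata is what later lets retrieval results be traced back to their source documents.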
Retrieval Phase
The retrieval phase runs at query time, every time a user submits a request to the RAG system. It follows these steps:
Query embedding: the user's query is passed through the same embedding model used during ingestion, generating a vector representation of the query
Similarity search: the query vector is compared against the indexed document embeddings in the vector database. The most semantically similar chunks are identified and retrieved.
Context assembly: the retrieved chunks are assembled into a context block that is provided to the language model alongside the original query
Response generation: the language model generates a response grounded in the retrieved context. Because the relevant information is provided directly in the prompt, the model can answer accurately from your data without having been trained on it.
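The retrieval steps above can likewise be sketched with a small in-memory index. Cosine similarity here stands in for the vector database's similarity search, and `build_prompt` shows one simple way to assemble retrieved chunks into a grounded prompt; the toy two-dimensional vectors are assumptions for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], index: list[dict], top_k: int = 3) -> list[dict]:
    """Return the top_k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble retrieved chunks into a context block for the language model."""
    context = "\n\n".join(c["text"] for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical indexed chunks with toy embeddings.
index = [
    {"text": "Invoices are due in 30 days.", "vector": [1.0, 0.0]},
    {"text": "Refunds require manager approval.", "vector": [0.0, 1.0]},
    {"text": "Late invoices incur a 2% fee.", "vector": [0.9, 0.1]},
]
hits = retrieve([1.0, 0.0], index, top_k=2)
prompt = build_prompt("When are invoices due?", hits)
```

The assembled `prompt` is what the language model receives: because the relevant chunks are included directly, the model can answer from your data without having been trained on it.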
Platform Components for RAG
Building a RAG system on GLBNXT Platform uses the following managed components.
MinIO object storage is the primary location for source documents before and after ingestion. Raw documents are stored in MinIO, and processed artefacts such as chunked content and metadata can be staged there between pipeline steps.
Weaviate or Qdrant provides the vector database layer where embeddings are stored and similarity search is performed. The choice between them depends on your use case requirements. Weaviate is well suited to applications that need hybrid search combining vector similarity with structured metadata filtering. Qdrant is well suited to high-performance applications where retrieval latency under load is a critical requirement. Your GLBNXT contact can advise on the appropriate choice for your environment.
Embedding models are available through the Model Hub in your environment. The embedding model used during ingestion must be the same model used to embed queries at retrieval time. Changing the embedding model requires re-ingesting all source documents to ensure consistency.
Language models are connected to the retrieval output to generate responses. The language model receives the retrieved context and the user query and produces the final output. Model selection affects response quality, reasoning capability, and latency. Larger models generally produce higher quality responses but at greater inference cost and latency.
Langflow provides a visual builder for constructing and testing RAG pipeline configurations without writing custom code for every connection. It is particularly useful for prototyping pipeline variations and testing different chunking strategies, retrieval configurations, and prompt designs before committing to a production architecture.
Elasticsearch supports hybrid retrieval configurations where semantic vector search is combined with full-text keyword matching. This is valuable for use cases where users may search using exact terms, document identifiers, or structured queries that benefit from keyword precision alongside semantic relevance.
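One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion, sketched below. This is a general technique, not the specific fusion method the platform's Weaviate or Elasticsearch configuration uses; the chunk IDs and the two result lists are hypothetical.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists with reciprocal rank fusion:
    each item scores 1 / (k + rank) in every list it appears in,
    so items ranked well by both retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical chunk IDs ranked by vector similarity and by
# keyword (BM25-style) relevance respectively.
vector_hits = ["c2", "c7", "c1"]
keyword_hits = ["c7", "c3", "c2"]
fused = rrf_fuse([vector_hits, keyword_hits])
```

Here `c7` wins because it appears near the top of both lists, which is the behaviour you want when a chunk is both semantically relevant and an exact keyword match.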
Chunking Strategy
Chunking is one of the most consequential design decisions in a RAG system. It determines the granularity of the information units being indexed and retrieved, and it has a direct impact on both retrieval accuracy and response quality.
There is no universally correct chunk size. The right approach depends on your documents and your queries. As a general principle:
Short, specific queries benefit from smaller, more precise chunks that contain targeted information
Broad, analytical queries benefit from larger chunks that preserve more surrounding context
Documents with strong structural organisation, such as contracts or policy documents, often benefit from chunk boundaries that respect the document structure rather than fixed character counts
A common approach is to start with a moderate chunk size and evaluate retrieval quality against a representative set of test queries before tuning. The Building AI Solutions section includes guidance on evaluation patterns that can help you assess and improve RAG system quality systematically.
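For structured documents such as contracts, structure-aware chunking can be as simple as splitting on clause headings instead of fixed character counts. The regular expression below is a sketch that assumes numbered headings like "1. Parties"; real documents will need a pattern matched to their actual structure.

```python
import re

def chunk_by_clause(document: str) -> list[str]:
    """Split on numbered clause headings (e.g. '2. Term' at the start of a
    line) so each chunk is a complete clause rather than an arbitrary
    character window."""
    parts = re.split(r"(?m)^(?=\d+\.\s)", document)
    return [p.strip() for p in parts if p.strip()]

contract = """1. Parties
This agreement is between A and B.
2. Term
The term is twelve months.
"""
clauses = chunk_by_clause(contract)
```

Each retrieved chunk now carries its own heading, which tends to make both retrieval matching and the model's answers more coherent for clause-level questions.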
Retrieval Configuration
Beyond chunk size, several retrieval parameters affect the quality and performance of a RAG system.
Top-k retrieval determines how many chunks are retrieved for each query. Retrieving more chunks gives the model more context to work with but increases the size of the prompt and the cost of each inference call. A top-k value between three and ten is common for most use cases.
Similarity threshold sets a minimum relevance score below which retrieved chunks are excluded from the context, even if they are among the top-k results. Setting a similarity threshold prevents low-relevance content from polluting the model's context when no highly relevant chunks exist for a given query.
Hybrid search configuration determines the balance between vector similarity and keyword matching when both are used. For use cases involving precise terminology, document references, or structured identifiers, weighting keyword matching more heavily can improve retrieval accuracy for specific query types.
Metadata filtering allows retrieval to be scoped to a subset of the indexed content based on document metadata such as source type, date, department, or access classification. This is important for use cases where different users should be able to retrieve information only from documents relevant to their role or context.
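The four retrieval parameters above can be combined in a single query function, sketched here against a plain Python list. The dot-product scoring, the `metadata` field names, and the example departments are assumptions for illustration; in production these parameters map onto the query options of your vector database rather than application code.

```python
def configured_retrieve(query_vec, index, top_k=5, min_score=0.25, filters=None):
    """Apply metadata filters, drop chunks below the similarity threshold,
    and return at most top_k chunks, best first."""
    def score(vec):  # dot product stands in for the database's similarity score
        return sum(q * v for q, v in zip(query_vec, vec))

    filters = filters or {}
    candidates = [
        (score(entry["vector"]), entry)
        for entry in index
        if all(entry["metadata"].get(key) == value for key, value in filters.items())
    ]
    candidates = [(s, entry) for s, entry in candidates if s >= min_score]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in candidates[:top_k]]

# Hypothetical index entries with department metadata.
index = [
    {"text": "HR leave policy", "vector": [1.0, 0.0], "metadata": {"department": "hr"}},
    {"text": "Expense policy", "vector": [0.9, 0.1], "metadata": {"department": "finance"}},
    {"text": "Old HR memo", "vector": [0.1, 0.0], "metadata": {"department": "hr"}},
]
hits = configured_retrieve([1.0, 0.0], index, filters={"department": "hr"})
```

In this example the finance chunk is excluded by the metadata filter and the weakly related HR memo by the similarity threshold, so only genuinely relevant content reaches the model's context.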
Data Freshness and Re-ingestion
A RAG system is only as current as its most recent ingestion run. If source documents are updated and the index is not refreshed, the system will retrieve outdated content and generate responses based on stale information. Designing an ingestion strategy that keeps the index current is an important operational consideration for production RAG systems.
Common approaches include scheduled ingestion runs that process new or updated documents on a defined cadence, event-driven ingestion triggered by document update events in the source system, and incremental ingestion that processes only changed documents rather than re-ingesting the entire corpus on each run.
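Incremental ingestion typically rests on change detection. A minimal sketch, assuming document content is available as text and that the hash store is persisted between runs (here it is just an in-memory dict), is to hash each document and re-ingest only those whose hash has changed:

```python
import hashlib

def changed_documents(documents: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return IDs of new or modified documents and update the stored hashes,
    so each ingestion run processes only what actually changed."""
    changed = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            seen_hashes[doc_id] = digest
    return changed

seen_hashes: dict[str, str] = {}  # persisted between runs in practice
first_run = changed_documents({"policy.pdf": "v1", "guide.pdf": "v1"}, seen_hashes)
second_run = changed_documents({"policy.pdf": "v2", "guide.pdf": "v1"}, seen_hashes)
```

A run would then re-chunk, re-embed, and re-index only the returned documents, replacing their existing chunks in the vector database.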
Workflow automation available in your platform environment can be used to implement any of these patterns, connecting document update events or schedules to ingestion pipeline executions without requiring custom infrastructure.
Compliance and Data Access
RAG systems that operate on sensitive organisational data require careful attention to data access governance. The retrieval layer does not inherently enforce document-level access controls. If a user submits a query, the system retrieves the most relevant chunks from the index regardless of whether the user would be permitted to access the source documents directly.
For deployments where different users should have access to different subsets of the knowledge base, access control must be implemented explicitly in the RAG architecture. Common approaches include maintaining separate indexes for different access tiers, applying metadata filters at retrieval time based on the authenticated user's role or permissions, and scoping vector database collections by access classification.
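The metadata-filter approach can be sketched as below. The role names, classifications, and the `ROLE_ACCESS` mapping are all hypothetical; a real deployment would derive permissions from the authenticated user's identity and, importantly, push the filter into the vector database query itself rather than filtering after retrieval, so that permitted results are not crowded out by inaccessible ones.

```python
# Hypothetical role-to-classification mapping; in practice this would be
# derived from the authenticated user's identity provider claims.
ROLE_ACCESS = {
    "analyst": {"public", "internal"},
    "legal": {"public", "internal", "restricted"},
}

def access_filter(chunks: list[dict], role: str) -> list[dict]:
    """Keep only chunks whose classification the user's role may read;
    unknown roles fall back to public content only."""
    allowed = ROLE_ACCESS.get(role, {"public"})
    return [c for c in chunks if c["metadata"]["classification"] in allowed]

retrieved = [
    {"text": "Press release", "metadata": {"classification": "public"}},
    {"text": "Litigation memo", "metadata": {"classification": "restricted"}},
]
analyst_view = access_filter(retrieved, "analyst")
legal_view = access_filter(retrieved, "legal")
```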
Your GLBNXT contact can advise on the appropriate access control pattern for your specific compliance requirements. For a broader overview of compliance considerations on the platform, see the Security and Compliance section.
Getting Started
The recommended path to building your first RAG system on GLBNXT Platform is to begin with a small, well-defined document corpus and a clear set of test queries. This allows you to evaluate retrieval quality and iterate on chunking and retrieval configuration before scaling to your full document set.
A practical first build follows this sequence:
Upload a representative sample of your source documents to MinIO
Configure an ingestion pipeline that chunks, embeds, and indexes the documents into your vector database
Connect the vector database to a language model endpoint in your Model Hub
Test the system against your representative queries and evaluate retrieval quality and response accuracy
Iterate on chunking strategy, retrieval configuration, and prompt design until quality meets your requirements
Scale ingestion to your full document corpus and configure ongoing data freshness processes
For guidance on connecting your RAG layer to a conversational assistant interface, see the AI Assistants and Chat Interfaces section.