RAG & Knowledge
RAG (Retrieval-Augmented Generation) grounds agent responses in your actual Knowledge Base articles instead of relying solely on the LLM's training data. When a customer asks about your return policy, the agent finds and cites your specific policy — not a generic answer.
How It Works
Customer: "What's your return policy?"
↓
1. Embed the question into a vector
2. Search Qdrant for similar KB article chunks (filtered by tenant_id)
3. Top matches injected as context into the agent's prompt
4. Agent responds using your actual policy content
5. Response includes source references
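The retrieval flow above can be sketched in a few lines of Python. The embed and vector_search callables here are hypothetical stand-ins for the real OpenAI and Qdrant calls, injected so the shape of the pipeline is visible:

```python
# Sketch of the retrieval step. `embed` and `vector_search` are hypothetical
# placeholders for the actual embedding-provider and Qdrant service calls.

def retrieve_context(question, tenant_id, embed, vector_search, top_k=5):
    """Embed the question, run a tenant-scoped search, build prompt context."""
    query_vector = embed(question)                  # 1. embed the question
    hits = vector_search(                           # 2. filtered vector search
        vector=query_vector,
        filter={"tenant_id": tenant_id},
        limit=top_k,
    )
    # 3. format top matches as context for the agent prompt
    sections = [
        f"### {h['title']} (relevance: {h['score']:.0%})\n{h['chunk_text']}"
        for h in hits
    ]
    return "## Relevant Knowledge Base Articles\n\n" + "\n\n".join(sections)
```

The real service adds deduplication and source references on top of this, but the embed-search-format skeleton is the same.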
Architecture
┌─────────────────┐
Index articles → │ Python Service │ → Generate embeddings (OpenAI API)
│ │ → Store in Qdrant (tenant_id filter)
└─────────────────┘
┌─────────────────┐
Chat with RAG → │ Python Service │ → Embed query → Search Qdrant
│ │ → Format as context → Agent prompt
└─────────────────┘
┌─────────────────┐
│ Qdrant │ Single collection: autocom_kb
│ │ Tenant isolation via payload filter
│ tenant_id: acme │ Keyword index on tenant_id
│ tenant_id: xyz │ Keyword index on knowledge_base_id
└─────────────────┘
Indexing Articles
Bulk Index (API)
Trigger re-indexing of all published KB articles:
POST /api/v1/ai/agent/knowledge/reindex
X-Tenant: your-tenant-id
Authorization: Bearer your-token
Response:
{
  "message": "Indexing queued",
  "count": 42,
  "batches": 1
}
This dispatches a queued job that:
- Loads all published articles from the tenant DB
- Sends them to the Python service in batches of 50
- Chunks each article (~500 characters, with overlap)
- Embeds the chunks using the tenant's OpenAI-compatible provider
- Stores the vectors in Qdrant with tenant metadata
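The batching step above is simple to picture: article IDs are grouped into batches of 50 before being sent to the Python service. A minimal sketch:

```python
# Sketch of the batching done by the queued reindex job: published article
# IDs are split into groups of 50 per request to the Python service.

BATCH_SIZE = 50

def batch_articles(article_ids, batch_size=BATCH_SIZE):
    """Split a list of article IDs into batches for the indexing service."""
    return [
        article_ids[i : i + batch_size]
        for i in range(0, len(article_ids), batch_size)
    ]
```

This is why the reindex response above reports count: 42 with batches: 1; 42 articles fit in a single batch of 50.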
What Gets Indexed
Each article chunk is stored with this metadata:
{
  "tenant_id": "acme-corp",
  "article_id": "uuid",
  "knowledge_base_id": "uuid",
  "title": "Return Policy",
  "category": "Policies",
  "tags": ["returns", "refunds"],
  "chunk_index": 0,
  "chunk_text": "The actual article text for this chunk..."
}
Automatic Indexing
To automatically index articles when they're created or updated, dispatch the job from your KB controller:
use Modules\AI\App\Jobs\IndexKnowledgeArticleJob;
// After article create/update:
IndexKnowledgeArticleJob::dispatch(tenant()->id, [$article->id]);
Multi-Tenancy
Qdrant uses a single shared collection with tenant isolation via payload filtering:
- Every vector has tenant_id in its payload
- tenant_id has a keyword index for fast filtered search
- Searches always include a tenant_id filter, so one tenant never sees another's data
- A knowledge_base_id filter scopes searches within a specific KB
This is Qdrant's recommended multi-tenancy pattern — more efficient than separate collections per tenant.
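In Qdrant's REST filter syntax, the tenant-isolation filter described above is a "must" clause on tenant_id, with knowledge_base_id appended when a search is scoped to one KB. A sketch of how such a filter payload could be built:

```python
# Sketch of the payload filter sent with every Qdrant search, using Qdrant's
# REST filter syntax. tenant_id is always enforced; knowledge_base_id is
# optional and narrows the search to a single KB.

def build_search_filter(tenant_id, knowledge_base_id=None):
    """Build a Qdrant payload filter enforcing tenant isolation."""
    must = [{"key": "tenant_id", "match": {"value": tenant_id}}]
    if knowledge_base_id:
        must.append({"key": "knowledge_base_id",
                     "match": {"value": knowledge_base_id}})
    return {"must": must}
```

Because both fields carry keyword indexes (see the architecture diagram above), these filtered searches stay fast even in a single shared collection.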
RAG in Chat
RAG is enabled by default. When include_rag: true (the default), the chat endpoint:
- Embeds the user's message using the tenant's embedding provider
- Searches Qdrant for the top 5 most relevant chunks
- Deduplicates by article (multiple chunks from the same article)
- Formats results as context prepended to the agent prompt:
## Relevant Knowledge Base Articles
Use these articles to inform your response. Cite article titles when applicable.
### Return Policy (relevance: 87%)
Items can be returned within 30 days of delivery...
### Refund Processing (relevance: 72%)
Refunds are processed within 5-7 business days...
- The agent sees this context and uses it to answer accurately
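The dedupe step above (several chunks of one article matching the same query) can be sketched as keeping only the best-scoring chunk per article. This assumes hits arrive sorted by score descending, which is how vector search results are returned:

```python
# Sketch of deduplication by article: when multiple chunks of the same
# article match, keep only the first (highest-scoring) chunk per article_id.
# Assumes `hits` is already sorted best-first.

def dedupe_by_article(hits):
    """Drop all but the top-scoring chunk for each article."""
    seen = set()
    unique = []
    for hit in hits:
        if hit["article_id"] not in seen:
            seen.add(hit["article_id"])
            unique.append(hit)
    return unique
```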
Disabling RAG
To skip retrieval for a message, set include_rag: false in the chat request:
{
  "message": "What time is it?",
  "include_rag": false
}
Embedding Requirements
RAG requires an OpenAI-compatible provider for generating embeddings. Anthropic (Claude) does not support embeddings.
If the tenant's default provider is Claude, the service automatically looks for any configured OpenAI-compatible provider (OpenAI, Groq, Together, etc.) to use for embeddings.
Default embedding model: text-embedding-3-small (1536 dimensions).
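The fallback logic above can be sketched as a simple provider scan. The provider dicts and the "type" field here are illustrative, not the service's real schema:

```python
# Sketch of embedding-provider selection, assuming each provider is a dict
# with a "type" field (illustrative schema, not the real one). Anthropic
# cannot embed, so the service falls back to any OpenAI-compatible provider.

OPENAI_COMPATIBLE = {"openai", "groq", "together"}

def pick_embedding_provider(providers, default):
    """Use the tenant default if it can embed, else any OpenAI-compatible one."""
    if default["type"] in OPENAI_COMPATIBLE:
        return default
    for p in providers:  # e.g. default is Claude: scan for a fallback
        if p["type"] in OPENAI_COMPATIBLE:
            return p
    raise RuntimeError("No embedding-capable provider configured")
```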
Managing Embeddings
Search Directly
POST /api/v1/embeddings/search
{
  "context": { ... },
  "query": "return policy",
  "limit": 5,
  "knowledge_base_id": "optional-kb-uuid",
  "score_threshold": 0.3
}
Delete Embeddings
POST /api/v1/embeddings/delete
{
  "context": { ... },
  "article_ids": ["uuid-1", "uuid-2"]
}
Or delete all embeddings for a tenant:
{
  "context": { ... },
  "delete_all": true
}
Check Stats
GET /api/v1/embeddings/stats?tenant_id=acme-corp
Returns the total vector count for the tenant.
Chunking Strategy
Articles are split into chunks of ~500 characters at natural boundaries:
- Split on paragraph breaks (\n\n) first
- If paragraphs are too large, split on sentences
- 50-character overlap between chunks to preserve context
The article title is prepended to the first chunk so it's included in the embedding.
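The strategy above can be sketched as a paragraph-first chunker with a carried overlap. This simplified version omits the sentence-level fallback for oversized paragraphs; the sizes and the title-prepending behavior match the description above:

```python
# Sketch of the chunking strategy: ~500-char chunks split on paragraph
# breaks, with a 50-char overlap carried between chunks. Sentence-level
# splitting of oversized paragraphs is omitted here for brevity.

CHUNK_SIZE = 500
OVERLAP = 50

def chunk_article(title, body, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """Split an article into overlapping chunks; title lands in chunk 0."""
    text = f"{title}\n\n{body}"  # title is prepended to the first chunk
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # carry overlap into the next chunk
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then embedded and stored with the metadata payload shown in "What Gets Indexed" above.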