RAG & Knowledge

RAG (Retrieval-Augmented Generation) grounds agent responses in your actual Knowledge Base articles instead of relying solely on the LLM's training data. When a customer asks about your return policy, the agent finds and cites your specific policy — not a generic answer.

How It Works

Customer: "What's your return policy?"
  ↓
1. Embed the question into a vector
2. Search Qdrant for similar KB article chunks (filtered by tenant_id)
3. Top matches injected as context into the agent's prompt
4. Agent responds using your actual policy content
5. Response includes source references
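The five steps above can be sketched end-to-end. This is a minimal illustration, not the service's actual API: `embed()` and the in-memory `index` list stand in for the OpenAI embeddings API and Qdrant, and all function names are assumptions.

```python
# Minimal sketch of the RAG flow; embed() and the in-memory "index" stand in
# for the OpenAI embeddings API and Qdrant. All names here are illustrative.

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding call (e.g. text-embedding-3-small):
    # hashes character trigrams into a tiny fixed-size vector.
    vec = [0.0] * 8
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 8] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def search(index, query_vec, tenant_id, limit=5):
    # Step 2: similarity search over chunks belonging to this tenant only.
    scored = [
        (sum(a * b for a, b in zip(query_vec, chunk["vector"])), chunk)
        for chunk in index
        if chunk["tenant_id"] == tenant_id  # tenant isolation
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:limit]

index = [
    {"tenant_id": "acme-corp", "title": "Return Policy",
     "chunk_text": "Items can be returned within 30 days of delivery.",
     "vector": embed("Items can be returned within 30 days of delivery.")},
]

# Steps 1-3: embed the question, search, inject matches as prompt context.
hits = search(index, embed("What's your return policy?"), "acme-corp")
context = "\n".join(f"### {c['title']}\n{c['chunk_text']}" for _, c in hits)
```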

Architecture

                    ┌──────────────────┐
  Index articles →  │  Python Service  │ → Generate embeddings (OpenAI API)
                    │                  │ → Store in Qdrant (tenant_id filter)
                    └──────────────────┘

                    ┌──────────────────┐
  Chat with RAG →   │  Python Service  │ → Embed query → Search Qdrant
                    │                  │ → Format as context → Agent prompt
                    └──────────────────┘

                    ┌──────────────────┐
                    │      Qdrant      │  Single collection: autocom_kb
                    │                  │  Tenant isolation via payload filter
                    │ tenant_id: acme  │  Keyword index on tenant_id
                    │ tenant_id: xyz   │  Keyword index on knowledge_base_id
                    └──────────────────┘

Indexing Articles

Bulk Index (API)

Trigger re-indexing of all published KB articles:

POST /api/v1/ai/agent/knowledge/reindex
X-Tenant: your-tenant-id
Authorization: Bearer your-token

Response:

{
  "message": "Indexing queued",
  "count": 42,
  "batches": 1
}

This dispatches a queued job that:

  1. Loads all published articles from the tenant DB
  2. Sends them to the Python service in batches of 50
  3. Each article is chunked (~500 chars with overlap)
  4. Chunks are embedded using the tenant's OpenAI-compatible provider
  5. Vectors stored in Qdrant with tenant metadata
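The batching in step 2 can be sketched as follows; the batch size of 50 comes from the docs above, while the helper name and payload shape are assumptions.

```python
# Illustrative sketch of the indexing job's batching step: published article
# IDs are sent to the Python service in batches of 50 (batch size per the
# docs; the helper itself is an assumption, not the actual job code).

BATCH_SIZE = 50

def make_batches(article_ids: list[str], size: int = BATCH_SIZE) -> list[list[str]]:
    # Slice the ID list into consecutive groups of at most `size`.
    return [article_ids[i:i + size] for i in range(0, len(article_ids), size)]

batches = make_batches([f"uuid-{n}" for n in range(120)])
# 120 articles -> batches of 50, 50, and 20
```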

What Gets Indexed

Each article chunk is stored with this metadata:

{
  "tenant_id": "acme-corp",
  "article_id": "uuid",
  "knowledge_base_id": "uuid",
  "title": "Return Policy",
  "category": "Policies",
  "tags": ["returns", "refunds"],
  "chunk_index": 0,
  "chunk_text": "The actual article text for this chunk..."
}

Automatic Indexing

To automatically index articles when they're created or updated, dispatch the job from your KB controller:

use Modules\AI\App\Jobs\IndexKnowledgeArticleJob;

// After article create/update:
IndexKnowledgeArticleJob::dispatch(tenant()->id, [$article->id]);

Multi-Tenancy

Qdrant uses a single shared collection with tenant isolation via payload filtering:

  • Every vector has tenant_id in its payload
  • tenant_id has a keyword index for fast filtered search
  • Searches always include a tenant_id filter — one tenant never sees another's data
  • knowledge_base_id filter scopes searches within a specific KB

This is Qdrant's recommended multi-tenancy pattern — more efficient than separate collections per tenant.
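As a sketch of what a tenant-scoped search request could look like, the dict below mirrors Qdrant's documented payload-filter format for a points search (`must` conditions with `key`/`match`); the surrounding helper and its parameters are illustrative, not the service's actual code.

```python
# Sketch of a search body the service might send to Qdrant
# (POST /collections/autocom_kb/points/search). The filter structure follows
# Qdrant's documented payload-filter format; the helper is illustrative.

def build_search_body(query_vector, tenant_id, knowledge_base_id=None, limit=5):
    # Every search is tenant-scoped: the tenant_id condition is mandatory.
    must = [{"key": "tenant_id", "match": {"value": tenant_id}}]
    if knowledge_base_id:
        # Optional scoping to a single knowledge base.
        must.append({"key": "knowledge_base_id",
                     "match": {"value": knowledge_base_id}})
    return {
        "vector": query_vector,
        "filter": {"must": must},
        "limit": limit,
        "with_payload": True,
    }

body = build_search_body([0.1, 0.2], "acme-corp", "kb-uuid")
```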

RAG in Chat

RAG is enabled by default. When include_rag is true, the chat endpoint:

  1. Embeds the user's message using the tenant's embedding provider
  2. Searches Qdrant for the top 5 most relevant chunks
  3. Deduplicates by article, so multiple chunks from the same article appear as one result
  4. Formats results as context prepended to the agent prompt:
## Relevant Knowledge Base Articles
Use these articles to inform your response. Cite article titles when applicable.

### Return Policy (relevance: 87%)
Items can be returned within 30 days of delivery...

### Refund Processing (relevance: 72%)
Refunds are processed within 5-7 business days...
  5. The agent sees this context and uses it to answer accurately
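The dedup-and-format steps could look like the sketch below; the header wording is taken from the example above, but the function and the result shape are assumptions.

```python
# Sketch of formatting retrieved chunks into the context block shown above.
# The header wording comes from the docs; the function itself is illustrative.

def format_context(results: list[dict]) -> str:
    lines = [
        "## Relevant Knowledge Base Articles",
        "Use these articles to inform your response. "
        "Cite article titles when applicable.",
        "",
    ]
    seen = set()
    for r in sorted(results, key=lambda r: r["score"], reverse=True):
        if r["article_id"] in seen:  # dedupe: one entry per article
            continue
        seen.add(r["article_id"])
        lines += [f"### {r['title']} (relevance: {round(r['score'] * 100)}%)",
                  r["chunk_text"], ""]
    return "\n".join(lines)

context = format_context([
    {"article_id": "a1", "title": "Return Policy", "score": 0.87,
     "chunk_text": "Items can be returned within 30 days of delivery..."},
    {"article_id": "a1", "title": "Return Policy", "score": 0.61,
     "chunk_text": "Lower-scoring chunk from the same article."},
    {"article_id": "a2", "title": "Refund Processing", "score": 0.72,
     "chunk_text": "Refunds are processed within 5-7 business days..."},
])
```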

Disabling RAG

Set include_rag to false in the chat request to skip retrieval:

{
  "message": "What time is it?",
  "include_rag": false
}

Embedding Requirements

RAG requires an OpenAI-compatible provider for generating embeddings. Anthropic (Claude) does not support embeddings.

If the tenant's default provider is Claude, the service automatically looks for any configured OpenAI-compatible provider (OpenAI, Groq, Together, etc.) to use for embeddings.

Default embedding model: text-embedding-3-small (1536 dimensions).

Managing Embeddings

Search Directly

POST /api/v1/embeddings/search
{
  "context": { ... },
  "query": "return policy",
  "limit": 5,
  "knowledge_base_id": "optional-kb-uuid",
  "score_threshold": 0.3
}

Delete Embeddings

POST /api/v1/embeddings/delete
{
  "context": { ... },
  "article_ids": ["uuid-1", "uuid-2"]
}

Or delete all embeddings for a tenant:

{
  "context": { ... },
  "delete_all": true
}

Check Stats

GET /api/v1/embeddings/stats?tenant_id=acme-corp

Returns the total vector count for the tenant.

Chunking Strategy

Articles are split into chunks of ~500 characters at natural boundaries:

  1. Split on paragraph breaks (\n\n) first
  2. If paragraphs are too large, split on sentences
  3. 50-character overlap between chunks to preserve context

The article title is prepended to the first chunk so it's included in the embedding.
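The strategy above can be sketched as follows, assuming the ~500-character chunk size and 50-character overlap stated in the docs; the function, its parameter names, and the exact boundary handling are illustrative.

```python
# Minimal sketch of the chunking strategy: paragraphs first, sentences for
# oversized paragraphs, 50-char overlap, title prepended to the first chunk.
# Sizes come from the docs; the implementation details are assumptions.
import re

CHUNK_SIZE, OVERLAP = 500, 50

def chunk_article(title: str, body: str) -> list[str]:
    # 1-2: split on paragraph breaks, then on sentences if a paragraph is too big.
    pieces = []
    for para in body.split("\n\n"):
        if len(para) <= CHUNK_SIZE:
            pieces.append(para)
        else:
            pieces.extend(re.split(r"(?<=[.!?])\s+", para))

    # 3: pack pieces into ~CHUNK_SIZE chunks with OVERLAP carried forward.
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > CHUNK_SIZE:
            chunks.append(current)
            current = current[-OVERLAP:]  # overlap preserves context
        current = (current + " " + piece).strip()
    if current:
        chunks.append(current)

    if chunks:
        chunks[0] = f"{title}\n{chunks[0]}"  # title included in first chunk's embedding
    return chunks

chunks = chunk_article("Return Policy",
                       "Para one. " * 30 + "\n\n" + "Para two. " * 30)
```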