AI Capabilities
The platform integrates AI at multiple levels: document processing pipelines use OCR and embedding models to make documents searchable, the workflow engine orchestrates LLM calls across multiple providers, and MCP servers give AI agents direct access to your data. This page covers the AI architecture and how these capabilities work together.
Multi-Provider LLM Support
The platform supports multiple LLM providers, configurable per-node in workflows and per-tenant for embeddings. This avoids vendor lock-in and enables cost optimization by routing different tasks to the most appropriate model.
Supported Providers
| Provider | Capabilities | Typical Use Cases |
|---|---|---|
| OpenAI | Chat completion, structured output, embeddings | General-purpose reasoning, JSON output, embeddings |
| Anthropic | Chat completion, long-context analysis | Complex analysis, research synthesis, long documents |
| Mistral | Chat completion, OCR, embeddings | European data processing, document extraction, multilingual |
| Google Gemini | Chat completion, embeddings | Cost-effective batch processing, embeddings |
Provider Selection
Provider selection happens at two levels:
- Per-node in workflows — each node in the visual workflow editor can be configured with a specific provider and model. A single workflow can use OpenAI for summarization, Anthropic for analysis, and Mistral for translation.
- Per-tenant for embeddings — each data cluster tenant has a configured embedding provider. All documents in that tenant's datasets use the same embedding model for consistency in vector search.
Embedding provider and vector dimensions are configured when creating a data cluster. Changing the embedding provider after documents have been processed requires re-embedding all existing documents. Choose your embedding provider carefully at cluster creation time.
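The fixed-at-creation rule above can be sketched as an immutable config object. This is an illustrative sketch, not the platform's actual API: the field names, the `create_cluster_config` helper, and the model-to-dimension table are assumptions (though the dimension values shown are the real defaults for those models).

```python
# Hypothetical sketch of per-tenant embedding configuration, fixed at
# data cluster creation time. Field and function names are illustrative.
from dataclasses import dataclass

# Default vector dimensions for some common embedding models.
KNOWN_DIMENSIONS = {
    ("openai", "text-embedding-3-small"): 1536,
    ("mistral", "mistral-embed"): 1024,
}

@dataclass(frozen=True)  # frozen: the config must not change after creation
class EmbeddingConfig:
    provider: str
    model: str
    dimensions: int

def create_cluster_config(provider: str, model: str) -> EmbeddingConfig:
    """Resolve dimensions from the model so every vector in the tenant matches."""
    dims = KNOWN_DIMENSIONS[(provider, model)]
    return EmbeddingConfig(provider=provider, model=model, dimensions=dims)

config = create_cluster_config("openai", "text-embedding-3-small")
```

Making the config frozen mirrors the constraint in the text: changing the provider is not an in-place update but a re-embedding of the whole tenant.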
Embedding Generation
Embeddings are vector representations of text that enable semantic search — finding documents by meaning rather than exact keyword matches.
How Embedding Works in Pipelines
During document processing, the embedding step runs after chunking:
Each chunk produces a fixed-dimensional vector (the dimension depends on the model). These vectors are stored in Qdrant alongside the chunk text and metadata, enabling semantic similarity search.
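The embed-after-chunking step can be sketched as follows. The `embed` stub stands in for a real provider call; the point layout (id, vector, payload) mirrors what a vector store like Qdrant expects, but the helper names here are illustrative, not the platform's pipeline code.

```python
# Minimal sketch of the embedding step: one fixed-dimensional vector per
# chunk, stored together with the chunk text and metadata.
import hashlib

DIMENSIONS = 8  # real models use far more, e.g. 1024 or 1536

def embed(text: str) -> list[float]:
    """Deterministic stand-in for a provider embedding call."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:DIMENSIONS]]

def embed_chunks(chunks: list[str], document_id: str) -> list[dict]:
    points = []
    for i, chunk in enumerate(chunks):
        points.append({
            "id": f"{document_id}:{i}",
            "vector": embed(chunk),   # fixed-dimensional vector
            "payload": {              # chunk text + metadata travel with the vector
                "text": chunk,
                "document_id": document_id,
                "chunk_index": i,
            },
        })
    return points

points = embed_chunks(["Intro section.", "Methods section."], "doc-42")
```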
Embedding Providers
| Provider | Characteristics |
|---|---|
| OpenAI | High-quality embeddings, multiple model sizes available |
| Mistral | Strong multilingual support |
| Google Gemini | Cost-effective, competitive quality |
The embedding provider is configured per-tenant, ensuring all vectors in a collection use the same model and dimensions. This is critical for search quality — mixing embedding models in the same collection produces unreliable similarity scores.
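A toy cosine-similarity calculation shows why mixing models breaks: vectors from different models live in different embedding spaces, often with different dimensions, so a score between them is either undefined or meaningless. The vectors below are made up for illustration.

```python
# Why one collection must use one embedding model: cross-model vectors
# are not comparable. With different dimensions the math fails outright;
# even with matching dimensions the score carries no meaning.
import math

def cosine(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("dimension mismatch: vectors are not comparable")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

model_a_vec = [0.1, 0.9, 0.2]        # toy vector from "model A"
model_b_vec = [0.4, 0.4, 0.4, 0.4]   # different model, different dimension

try:
    cosine(model_a_vec, model_b_vec)
    comparable = True
except ValueError:
    comparable = False
```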
OCR via Mistral Document AI
The platform uses Mistral's Document AI service for optical character recognition (OCR). This is the first step in making PDF documents searchable.
What OCR Produces
| Output | Format | Description |
|---|---|---|
| Extracted text | Markdown | Full document text with heading structure preserved |
| Figures | Images (base64) | All images, diagrams, and figures extracted from the document |
| OCR metadata | JSON | Page-by-page extraction details |
The OCR output preserves document structure — headings, paragraphs, lists, and tables are represented in Markdown. This structure is important for the chunking step, which uses heading boundaries to create semantically coherent chunks.
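The role heading boundaries play in chunking can be sketched in a few lines. This is a simplified illustration, assuming the real chunker also enforces size limits and handles edge cases this version ignores.

```python
# Sketch of heading-aware chunking: split the OCR Markdown at heading
# boundaries so each chunk is a semantically coherent section.
def chunk_by_headings(markdown: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") and current:  # a heading starts a new chunk
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\nSome text.\n## Method\nDetails here.\n"
chunks = chunk_by_headings(doc)
```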
Figure Extraction
Figures extracted during OCR are processed through a figure linking step that:
- Resolves figure references in the Markdown text (e.g., "Figure 1" links to the actual image)
- Converts all images to a consistent format (PNG)
- Stores figures alongside the processed document in MinIO
- Makes figures accessible via the Data API
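The reference-resolution step in that list can be sketched with a regex pass. The `link_figures` helper and the storage path are hypothetical; they only illustrate the idea of turning "Figure N" mentions into links to the extracted images.

```python
# Illustrative sketch of figure linking: resolve textual references like
# "Figure 1" into Markdown links pointing at the stored figure files.
import re

def link_figures(markdown: str, figures: dict[int, str]) -> str:
    """Replace 'Figure N' mentions with links to extracted figure files."""
    def repl(match: re.Match) -> str:
        n = int(match.group(1))
        if n in figures:
            return f"[Figure {n}]({figures[n]})"
        return match.group(0)  # leave unresolved references untouched
    return re.sub(r"\bFigure (\d+)\b", repl, markdown)

text = "As shown in Figure 1, accuracy improves. Figure 3 is missing."
figures = {1: "figures/doc-42/figure-1.png"}  # hypothetical storage path
linked = link_figures(text, figures)
```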
The MCP Architecture
The platform implements the Model Context Protocol (MCP) — an open standard for giving AI agents structured access to external data and tools. MCP servers expose your data to AI assistants like Claude, GPT-4, and custom agents through a standardized tool interface.
How MCP Works
An AI agent connects to an MCP server, which translates the agent's tool calls into authenticated requests against the platform's APIs. The critical property of this architecture is that the agent accesses data with the authorizing user's permissions. The agent cannot see data the user cannot see, and every access is logged in the platform's audit trail.
Available MCP Servers
| Server | Tools | Data Source | Purpose |
|---|---|---|---|
| Data Cluster | 7 tools | Your document collections | Browse datasets, keyword search, semantic search, read documents, view figures |
| OpenAIRE | 29 tools | OpenAIRE Graph (600M+ products) | Literature review, citation analysis, author profiling, research trends |
| BnF | 15 tools | Bibliothèque nationale de France | Historical documents, bibliographic records, digitized collections |
MCP Authentication
MCP servers support three authentication paths, all of which enforce the same per-tool RBAC:
- OAuth PKCE — the standard path for interactive AI assistants. The user authorizes the agent through a browser-based OAuth flow.
- JWT relay — for web applications or services that already hold a valid JWT from the identity provider.
- API token — for enterprise integrations using platform API tokens.
In all cases, the MCP server extracts the user's identity and permissions from the token and enforces per-tool access control. Each tool declares what abilities it requires (e.g., dataset:read, entry:read), and the framework checks permissions before executing the tool.
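The declare-then-check pattern described above can be sketched as a decorator. This is a minimal illustration, not the platform's framework: the `requires` decorator and the way abilities are passed in are assumptions, while the ability names (`dataset:read`, `entry:read`) come from the text.

```python
# Sketch of per-tool RBAC: each tool declares the abilities it requires,
# and the framework checks them against the caller before executing.
import functools

def requires(*abilities: str):
    """Declare the abilities a tool needs; checked before execution."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(caller_abilities: set[str], *args, **kwargs):
            missing = set(abilities) - caller_abilities
            if missing:
                raise PermissionError(f"missing abilities: {sorted(missing)}")
            return tool(caller_abilities, *args, **kwargs)
        wrapper.required_abilities = set(abilities)
        return wrapper
    return decorator

@requires("dataset:read", "entry:read")
def read_entry(caller_abilities: set[str], entry_id: str) -> str:
    return f"contents of {entry_id}"

result = read_entry({"dataset:read", "entry:read"}, "entry-7")  # allowed
try:
    read_entry({"dataset:read"}, "entry-7")  # lacks entry:read
    denied = False
except PermissionError:
    denied = True
```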
How AI Agents Access Data Securely
The MCP architecture enforces multiple security boundaries:
- User-scoped access — the agent inherits the authorizing user's permissions
- Per-tool RBAC — each tool declares required abilities, checked before execution
- Proxy layer — all data access goes through the platform's authenticated proxy (no direct cluster access)
- Audit logging — every tool call is logged with the user identity, MCP session ID, and request details
- Organization scoping — agents can only access data within the user's organization
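Putting the audit-logging boundary in concrete terms, one record per tool call might look like the sketch below. The field names are illustrative; the platform's actual audit schema may differ, but per the list above it captures the user identity, the MCP session ID, and the request details.

```python
# Illustrative shape of an audit record for one MCP tool call.
import json
from datetime import datetime, timezone

def audit_record(user_id: str, session_id: str, tool: str, params: dict) -> str:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,           # the authorizing user, not the agent
        "mcp_session_id": session_id,
        "tool": tool,
        "params": params,             # request details
    }
    return json.dumps(record)

entry = json.loads(audit_record("user-1", "sess-abc", "semantic_search",
                                {"query": "budget"}))
```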
Workflow Orchestration
The visual workflow editor enables building multi-step AI pipelines without writing code. Workflows are defined as directed acyclic graphs (DAGs) where nodes represent operations and edges define data flow.
Node Categories
| Category | Examples | Description |
|---|---|---|
| Data Access | Vector search, keyword search, download entry | Retrieve data from your clusters |
| LLM | Chat completion, structured output | Call language models with configurable providers |
| Document Processing | Text splitter, summarizer, translator | Transform and analyze text |
| Audio | Text-to-speech | Generate audio from text |
| Research | OpenAIRE search, citation analysis | Access research intelligence tools |
| Agents | Agent node, group node | Multi-agent orchestration and composable sub-workflows |
| System | Conditional, loop, merge | Control flow and data routing |
Building a Workflow
1. Add nodes to the canvas — each node represents one operation
2. Connect nodes with edges — defines the data flow between steps
3. Configure parameters — set model, prompt, search query, etc. per node
4. Reference upstream outputs — use template syntax to pass data between nodes
5. Run the workflow — the platform executes the DAG, tracking cost and status per node
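The edge-following execution and the template references between nodes can be sketched in miniature. The `{{node.output}}` syntax here is an illustrative stand-in for the editor's real template syntax, and the two-node workflow is made up.

```python
# Toy DAG execution: nodes run in topological order, and each node's
# inputs may reference an upstream node's output via a template.
import re

workflow = {
    "search":    {"op": lambda ctx: "3 matching chunks", "inputs": {}},
    "summarize": {"op": lambda ctx: f"summary of: {ctx['query_result']}",
                  "inputs": {"query_result": "{{search.output}}"}},
}
order = ["search", "summarize"]  # a topological order of the DAG

def run(workflow: dict, order: list[str]) -> dict:
    outputs: dict[str, str] = {}
    for name in order:
        node = workflow[name]
        ctx = {}
        for key, template in node["inputs"].items():
            ref = re.fullmatch(r"\{\{(\w+)\.output\}\}", template)
            ctx[key] = outputs[ref.group(1)] if ref else template
        outputs[name] = node["op"](ctx)
    return outputs

outputs = run(workflow, order)
```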
Multi-Agent Orchestration
For complex tasks, the platform supports hierarchical multi-agent execution:
- Agent nodes execute autonomous AI agents with access to tools and sub-agents
- Group nodes define composable sub-workflows that dissolve into the parent DAG
- Tool routing — agents can call vector search, keyword search, and other data access tools as part of their reasoning
- State management — agent conversations and intermediate state are tracked for debugging
This enables building sophisticated AI applications — for example, a research agent that searches your document collection, cross-references findings with the global research graph (via OpenAIRE), and produces a structured analysis.
AI Cost Tracking
The platform tracks AI costs at multiple levels:
| Level | What Is Tracked |
|---|---|
| Per-node | Token counts (input/output), API call costs, model used |
| Per-job | Aggregate cost across all nodes in the workflow |
| Per-organization | Historical cost data for billing and budgeting |
Cost data is available in the job execution summary, allowing you to understand which steps in your workflows are most expensive and optimize accordingly — for example, by switching to a smaller model for simple classification tasks or batching multiple documents into a single LLM call.
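The roll-up from per-node to per-job cost can be sketched as below. The prices and model names are invented for illustration; only the aggregation structure (tokens and model per node, summed across the job) reflects the table above.

```python
# Sketch of cost aggregation: per-node token counts priced by model,
# summed into a per-job total. Prices here are made up.
PRICE_PER_1K = {  # (input price, output price) per 1K tokens, in dollars
    "small-model": (0.0005, 0.0015),
    "large-model": (0.0050, 0.0150),
}

def node_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICE_PER_1K[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

job_nodes = [
    {"model": "large-model", "input_tokens": 4000, "output_tokens": 1000},
    {"model": "small-model", "input_tokens": 2000, "output_tokens": 500},
]
job_cost = sum(node_cost(**n) for n in job_nodes)
```

A breakdown like this is what makes the optimizations mentioned above concrete: the expensive node is visible, so it is a candidate for a smaller model or for batching.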
Next Steps
- Processing Engine — How document processing pipelines and AI workflows execute
- AI Agent Integration — Connect AI assistants to your data via MCP
- Search and Query — Semantic and keyword search
- Compliance — How AI access is audited and secured