# Processing Engine
The platform has two distinct execution engines, each designed for a different type of workload:
- Argo Workflows — runs document processing pipelines (OCR, chunking, embedding) on data clusters, close to the data
- Workers — execute AI workflows (LLM calls, vector search, multi-agent orchestration) on the platform, with access to multiple providers
These engines complement each other: Argo Workflows turns raw documents into searchable knowledge bases, and Workers use that knowledge to power AI-driven analysis and automation.
## Two Execution Engines
| Aspect | Argo Workflows | Workers |
|---|---|---|
| Location | Data cluster (close to data) | Platform (close to AI providers) |
| Triggered by | File upload or manual trigger | User runs workflow from UI |
| Use case | Document ingestion and processing | AI analysis and automation |
| Parallelism | Kubernetes pods per pipeline step | Thread pool with async execution |
| Data access | Direct access to storage services | Via platform proxy to data clusters |
| Output | Stored in data cluster (MinIO, Qdrant, Meilisearch) | Returned as workflow result |
| Retry | Per-step retry with backoff | Per-node retry + queue-based redelivery |
## Argo Workflows: Document Processing
Argo Workflows is a Kubernetes-native workflow engine. Each pipeline runs as a series of pods in the tenant's namespace, with each step producing artifacts that are passed to the next step.
### How a Pipeline Runs
When a document is uploaded, the Data API checks whether the dataset has an auto-trigger pipeline configured. If it does, the pipeline is submitted to Argo Workflows.
Each step runs as a container with defined resource limits, retry policies, and artifact I/O. Steps execute in dependency order — the chunker waits for OCR to complete, the embedder waits for the chunker, and so on.
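A pipeline of this shape can be sketched as an Argo `Workflow` DAG. The template names below are illustrative stand-ins, not the platform's actual `WorkflowTemplate` names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pdf-ocr-            # illustrative preset name
spec:
  entrypoint: pipeline
  arguments:
    parameters:
      - name: entry-id
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: acquire
            templateRef: {name: data-api-entry-acquisition, template: main}
          - name: ocr
            dependencies: [acquire]
            templateRef: {name: mistral-ocr, template: main}
          - name: chunk
            dependencies: [ocr]
            templateRef: {name: markdown-chunker, template: main}
          - name: embed
            dependencies: [chunk]
            templateRef: {name: embedding-generator, template: main}
          - name: register
            dependencies: [embed]
            templateRef: {name: chunks-registration, template: main}
```

Argo resolves `dependencies` into the execution order described above; each `templateRef` points at a pre-deployed, versioned `WorkflowTemplate` on the data cluster.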
### Pipeline Components
Pipelines are composed of reusable components. Each component is a pre-deployed WorkflowTemplate on the data cluster, versioned and discoverable via the Data API.
Acquisition — get the document into the pipeline:
| Component | Input | Output | Description |
|---|---|---|---|
| Data API Entry Acquisition | Entry ID | File artifact | Downloads the original file from the Data API |
| S3 Acquisition | S3 URI | File artifact | Downloads directly from S3-compatible storage |
Processing — transform the document:
| Component | Input | Output | Description |
|---|---|---|---|
| Mistral OCR | PDF file | Markdown + figures + OCR JSON | Extracts text and images using Mistral Document AI |
| Figure Linker | Markdown + figures | Resolved markdown + PNG figures | Resolves figure references, converts images to PNG |
| Image Processor | Image files | Optimized images | Batch image optimization |
| Metadata Filter | Entry metadata | Filtered metadata | Selects fields relevant for vector storage |
Chunking and Embedding — prepare for search:
| Component | Input | Output | Description |
|---|---|---|---|
| Markdown Chunker | Markdown text | Chunk array (JSON) | Semantic splitting with heading awareness and token counting |
| Embedding Generator | Chunk array | Embedding vectors | Generates vector embeddings via configurable provider |
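The chunking strategy can be sketched as follows. This is an illustrative minimal version: a new chunk starts at each heading and whenever the token budget is exceeded, with token counts approximated by word counts (the real component's tokenizer and parameters are not shown here):

```python
import re

def chunk_markdown(text: str, max_tokens: int = 200) -> list[dict]:
    """Heading-aware semantic splitting with a per-chunk token budget.
    Tokens are approximated by whitespace-separated words in this sketch."""
    chunks: list[dict] = []
    current: list[str] = []
    heading = ""
    tokens = 0

    def flush():
        nonlocal current, tokens
        if current:
            chunks.append({"heading": heading, "text": "\n".join(current)})
        current, tokens = [], 0

    for line in text.splitlines():
        if re.match(r"#{1,6}\s", line):   # a heading closes the open chunk
            flush()
            heading = line.lstrip("#").strip()
        n = len(line.split())             # crude token count
        if tokens + n > max_tokens:       # budget exceeded: start a new chunk
            flush()
        current.append(line)
        tokens += n
    flush()
    return chunks
```

Each emitted chunk keeps its nearest heading as metadata, which is what makes the downstream embeddings heading-aware.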
Registration — store the results:
| Component | Input | Output | Description |
|---|---|---|---|
| Chunks Registration | Chunks + embeddings | Qdrant points | Upserts vectors with metadata into tenant's Qdrant collection |
| Processed Content Registration | Markdown + figures | MinIO objects | Stores processed text and figures in object storage |
| Processed Files Registration | File artifacts | MinIO objects | Uploads additional processed files |
| Entry Status Registration | Status update | DB update | Sets entry status to "processed" (or "error" on failure) |
### Pipeline Presets
The platform includes pre-built pipeline configurations for common document types:
| Preset | Steps | Use Case |
|---|---|---|
| PDF OCR | Acquire, OCR, Figure Link, Chunk, Embed, Register | PDFs and scanned documents |
| JATS XML | Acquire, Extract MECA, Parse JATS, Figure Link, Chunk, Embed, Register | Scientific articles in JATS XML format |
You can apply a preset to any dataset, and the pipeline will automatically trigger when documents are uploaded. Custom pipeline configurations can combine any available components in a DAG structure.
### Composability
Pipeline components are designed to be composed. Adding support for a new document format requires writing only the format-specific parser — the chunking, embedding, and registration components are reused. This means:
- New parsers integrate into existing pipelines by adding a single step
- Multiple pipeline configurations can share the same components
- Components are versioned independently — updating a parser does not affect the chunker
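The reuse claim can be shown in miniature. Every function here is a trivial stand-in for the corresponding pipeline component; only the format-specific parser differs between the two pipelines:

```python
def chunk(markdown: str) -> list[str]:
    """Stand-in for the Markdown Chunker component."""
    return [p for p in markdown.split("\n\n") if p]

def embed(chunks: list[str]) -> list[tuple[str, int]]:
    """Stand-in for the Embedding Generator (lengths instead of vectors)."""
    return [(c, len(c)) for c in chunks]

def register(embedded: list) -> int:
    """Stand-in for Chunks Registration; returns the upserted count."""
    return len(embedded)

def parse_pdf(raw: bytes) -> str:
    return raw.decode()        # the real component would run OCR

def parse_jats(raw: bytes) -> str:
    return raw.decode()        # the real component would parse JATS XML

def run(parser, raw: bytes) -> int:
    """Two formats, one pipeline: only the parser step changes."""
    return register(embed(chunk(parser(raw))))
```

Adding a third format means writing one more `parse_*` step; `chunk`, `embed`, and `register` are untouched.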
## Workers: AI Workflow Execution
Workers execute AI-powered workflows defined in the platform's visual workflow editor. Unlike Argo Workflows (which process documents), Workers orchestrate AI operations: LLM calls, vector search, multi-agent coordination, text-to-speech, and more.
### How Workers Execute Jobs
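The worker loop can be sketched as follows. The job shape, callbacks, and retry limit are assumptions for illustration; jobs run concurrently via async tasks, and failures are re-queued, mirroring the queue-based redelivery noted in the comparison table above:

```python
import asyncio

async def worker_loop(queue: asyncio.Queue, execute, publish_status,
                      max_attempts: int = 3):
    """Consume jobs from the queue and execute each as a concurrent task.
    `execute` runs the workflow DAG; `publish_status` stands in for the
    SSE status stream. Failed jobs are redelivered up to max_attempts."""
    async def handle(job):
        await publish_status(job["id"], "running")
        try:
            result = await execute(job)                  # run the workflow DAG
            await publish_status(job["id"], "done", result)
        except Exception:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < max_attempts:
                await queue.put(job)                     # redeliver via queue
            else:
                await publish_status(job["id"], "failed")

    tasks = []
    while True:
        job = await queue.get()
        if job is None:                                  # shutdown sentinel
            break
        tasks.append(asyncio.create_task(handle(job)))
    await asyncio.gather(*tasks)
```

Because jobs are handled as independent tasks, one slow LLM call never blocks the rest of the queue.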
### Node Types
Workers support a rich set of node types organized by category:
Data Access — retrieve data from clusters:
| Node | Description |
|---|---|
| Vector Search | Semantic search across one or more clusters, with automatic result merging |
| Keyword Search | Full-text search via Meilisearch |
| Download Entry | Retrieve document files via the platform proxy |
| Get Entries | Fetch entry metadata in batch |
LLM — interact with language models:
| Node | Description |
|---|---|
| Chat Completion | Call any supported LLM provider with configurable model, temperature, and prompt |
| Structured Output | LLM call with JSON schema enforcement |
| Multi-turn Conversation | Stateful conversation with context management |
Document Processing — transform content:
| Node | Description |
|---|---|
| Text Splitter | Split text into chunks with configurable strategy |
| Summarizer | Generate summaries using LLM |
| Translator | Translate text between languages |
Audio — voice and speech:
| Node | Description |
|---|---|
| Text-to-Speech | Generate audio from text with voice selection |
Research — specialized research tools:
| Node | Description |
|---|---|
| OpenAIRE Search | Query the global research graph (600M+ products) |
| Citation Analysis | Build citation networks and bibliometric profiles |
Agents — multi-agent orchestration:
| Node | Description |
|---|---|
| Agent Node | Hierarchical multi-agent execution with tool access |
| Group Node | Composable sub-workflow that dissolves into the parent DAG |
System — workflow control:
| Node | Description |
|---|---|
| Conditional | Branch execution based on conditions |
| Loop | Iterate over collections |
| Merge | Combine outputs from parallel branches |
### Multi-Provider AI Routing
Workers route LLM calls to the provider configured for each node. A single workflow can use different providers for different steps:
| Provider | Models | Use Cases |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | General-purpose reasoning, structured output |
| Anthropic | Claude Sonnet, Claude Opus | Long-context analysis, complex reasoning |
| Mistral | Mistral Large, Mistral Small | European data processing, multilingual tasks |
| Google | Gemini Pro, Gemini Flash | Cost-effective batch processing |
Provider selection is per-node, not per-workflow. This enables cost optimization — use a smaller model for simple tasks and a more capable model for complex reasoning, within the same workflow.
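Per-node routing can be sketched as a registry keyed by provider name. The callables here are stand-ins for real provider SDK clients, not the platform's actual client code:

```python
# Hypothetical dispatch table: each entry stands in for a provider SDK client.
PROVIDERS = {
    "openai":    lambda model, prompt: f"[openai/{model}] {prompt}",
    "anthropic": lambda model, prompt: f"[anthropic/{model}] {prompt}",
    "mistral":   lambda model, prompt: f"[mistral/{model}] {prompt}",
    "google":    lambda model, prompt: f"[google/{model}] {prompt}",
}

def run_llm_node(node: dict, prompt: str) -> str:
    """Dispatch on the node's own provider/model config, so two nodes in
    the same workflow can use different providers."""
    call = PROVIDERS[node["provider"]]
    return call(node["model"], prompt)

# e.g. a cheap classification step next to a heavyweight reasoning step:
classify = {"provider": "google", "model": "gemini-flash"}
reason   = {"provider": "anthropic", "model": "claude-opus"}
```

Because the provider is part of each node's config rather than the workflow's, swapping models is a per-step edit in the editor, not a rewrite.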
### The Visual Workflow Editor
The platform includes a visual editor for building AI workflows using a drag-and-drop canvas:
- Nodes represent operations (LLM calls, search, data access, etc.)
- Edges connect nodes to define data flow and dependencies
- Parameters are configured per-node (model selection, prompts, search queries)
- Template syntax allows nodes to reference outputs from upstream nodes
- DAG execution ensures nodes run in the correct dependency order
The editor produces a workflow definition (nodes + edges as JSON) that Workers execute as a directed acyclic graph. The same definition can be run multiple times with different inputs.
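A minimal executor for such a definition might look like this. The nodes+edges schema shown is an assumption about the JSON shape, not the platform's documented format:

```python
from graphlib import TopologicalSorter

def run_workflow(definition: dict, handlers: dict) -> dict:
    """Execute a nodes+edges definition as a DAG. Each node runs once all
    of its upstream dependencies have produced output; handlers maps a
    node type to the callable that implements it."""
    deps = {n["id"]: set() for n in definition["nodes"]}
    for edge in definition["edges"]:
        deps[edge["to"]].add(edge["from"])

    outputs = {}
    for node_id in TopologicalSorter(deps).static_order():
        node = next(n for n in definition["nodes"] if n["id"] == node_id)
        upstream = {d: outputs[d] for d in deps[node_id]}   # upstream results
        outputs[node_id] = handlers[node["type"]](node, upstream)
    return outputs
```

This sketch runs nodes sequentially; `TopologicalSorter` also exposes a `ready()`/`done()` protocol that allows independent branches to run concurrently, which is closer to how Workers behave.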
### Execution Characteristics
| Property | Behavior |
|---|---|
| Execution model | DAG with topological sort — nodes run as soon as all dependencies are satisfied |
| Parallelism | Independent branches execute concurrently |
| Cost tracking | Per-node cost recorded (LLM tokens, API calls) with execution summary |
| Status streaming | Real-time job status updates via server-sent events (SSE) |
| Context persistence | Execution context saved for debugging and replay |
| Error handling | Per-node retry with configurable policy; failed nodes can be inspected |
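A configurable per-node retry policy is commonly implemented as exponential backoff with jitter. The parameters below are illustrative defaults, not the platform's actual policy:

```python
import random
import time

def with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Wrap a node's handler with retry: exponential backoff plus jitter,
    re-raising on the final attempt so the failed node can be inspected."""
    def wrapped(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise                            # surface for inspection
                delay = base_delay * 2 ** attempt
                time.sleep(delay + random.uniform(0, delay / 2))
    return wrapped
```

The jitter term spreads retries out so that many nodes failing at once (e.g. during a provider outage) do not retry in lockstep.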
## How Processing Scales
### Argo Workflows (Document Processing)
- Concurrent workflows — the Argo controller manages multiple workflows simultaneously within configured limits
- Dynamic parallelism — batch processing steps can fan out across multiple pods (e.g., processing 100 papers from an archive in parallel)
- Per-workflow scratch space — each workflow gets dedicated temporary storage for intermediate artifacts
- Resource limits — each step has defined CPU and memory limits to prevent resource contention
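The fan-out pattern is typically expressed with Argo's `withParam`, which creates one task instance per element of a JSON array. Template names here are illustrative:

```yaml
- name: process-archive
  dag:
    tasks:
      - name: list-papers
        template: list-archive            # emits a JSON array of paper IDs
      - name: process
        dependencies: [list-papers]
        template: process-one-paper       # one pod per paper, run in parallel
        arguments:
          parameters:
            - {name: paper-id, value: "{{item}}"}
        withParam: "{{tasks.list-papers.outputs.result}}"
```

Argo schedules the fanned-out pods concurrently, subject to the configured parallelism limits and per-step resource limits.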
### Workers (AI Workflows)
- Queue-based scaling — KEDA monitors the job queue and scales worker pods based on pending job count
- Concurrent execution — each worker pod processes multiple jobs simultaneously
- Async I/O — network-bound operations (LLM API calls, proxy requests) use async execution to maximize throughput
- Provider rate limits — workers respect per-provider rate limits and implement backoff strategies
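Combining async throughput with a per-provider cap can be sketched with a semaphore. The concurrency limit here is an arbitrary example; real limits would follow each provider's published rate limits:

```python
import asyncio

async def gather_with_limit(provider_limit: int, calls):
    """Run network-bound calls concurrently, but cap the number of
    in-flight requests (e.g. to a single LLM provider) at provider_limit."""
    sem = asyncio.Semaphore(provider_limit)

    async def bounded(call):
        async with sem:              # blocks when the cap is reached
            return await call()

    return await asyncio.gather(*(bounded(c) for c in calls))
```

With one semaphore per provider, a workflow can saturate a fast provider while staying under a stricter provider's limit at the same time.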
## Next Steps
- AI Capabilities — Multi-provider LLM support, MCP servers, and AI agent access
- Pipelines — Pipeline configuration and preset details
- Infrastructure — Kubernetes orchestration and storage systems
- Search and Query — Using semantic and keyword search