
Pipelines

Pipelines transform raw documents into searchable, AI-queryable knowledge. When a PDF, DOCX, or XML file is uploaded to a dataset, a pipeline extracts its text, figures, and metadata, splits the content into chunks, generates embedding vectors, and registers everything into the search indexes. The result is a document that can be found via keyword search, semantic similarity, and AI agent queries.

What Pipelines Do

A pipeline takes a raw file and produces:

  • Extracted text — Structured markdown content stored as content.json
  • Figures — Extracted images saved as individual PNG files
  • Chunks — Semantically meaningful text segments for search
  • Embedding vectors — Neural representations of each chunk for similarity search
  • Search index entries — Keyword-searchable documents in the full-text engine

After pipeline processing, an entry transitions from uploaded to processed, and its content is searchable through all three search paths: keyword, semantic, and global discovery.

Pipeline Stages

Pipelines are fully configurable jobs built on Argo Workflows and can be composed as needed. The default pipeline follows a five-stage model, but this is only one example of how a pipeline can be structured. Custom pipelines can combine components in any order, skip stages, or add new ones.

1. Acquisition

Download the source file from storage so the pipeline can process it.

2. Processing

Extract text, figures, and metadata from the raw file. This is where format-specific logic runs — OCR for PDFs, XML parsing for JATS documents, text extraction for DOCX files.

3. Chunking

Split the extracted text into smaller segments suitable for embedding. The chunker is heading-aware and token-bounded, producing semantically coherent chunks that respect document structure.

4. Embedding

Convert text chunks into numerical vectors using a language model. These vectors capture semantic meaning and enable similarity search — documents with similar content produce similar vectors, even if they use different words.
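"Similar vectors" is typically measured with cosine similarity. As an illustrative sketch (not the platform's internal implementation):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|); close to 1.0 for
    # vectors pointing the same way, close to 0 for unrelated ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two chunks about the same topic should produce nearby vectors
# and therefore a similarity near 1.0, even with different wording.
```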

5. Registration

Store the results back into the data cluster: processed content to object storage, vectors to the vector database, metadata to the search engine, and status updates to the database.

Pipeline Components

The platform ships with 14 composable pipeline components across the five stages. Each component is a versioned Kubernetes WorkflowTemplate that can be combined to build custom pipelines.

Acquisition Components

  • Data API Entry Acquisition — Downloads the entry's original file from the data cluster's MinIO storage via the Data API
  • BioRxiv S3 Acquisition — Fetches files from BioRxiv's public S3 buckets (requester-pays)

Processing Components

  • Mistral OCR Processor — Sends PDFs to Mistral's OCR API, extracts text as markdown and images as base64
  • Figure Linker — Resolves figure references in markdown, converts images to PNG format
  • MECA Extractor — Unpacks MECA submission archives to extract JATS XML and supplementary files
  • JATS XML Parser — Parses JATS XML (academic publishing standard) into structured markdown with metadata

Chunking Components

  • Markdown Chunker — Splits markdown into semantic chunks using heading-aware, token-bounded splitting (tiktoken tokenizer)

Embedding Components

  • Data Cluster Embedding — Generates embeddings via the data cluster's built-in embedding service (supports multiple providers)

Registration Components

  • Processed Content Registration — Stores extracted content and figures in MinIO, triggers full-text indexing
  • Processed Files Registration — Uploads additional processed files to MinIO
  • Chunks Registration — Upserts embedding vectors and chunk metadata into Qdrant
  • Entry Status Registration — Updates the entry status to processed (or error on failure)

Pipeline Presets

For common document formats, the platform provides pre-configured pipeline presets that wire the right components together. Presets are available in the web application — select a preset when configuring a dataset's pipeline.

Available presets cover standard formats including PDF (with OCR), JATS XML (academic publishing), and MECA archives. Each preset handles the full pipeline from acquisition through embedding and registration.

Custom presets can be made available on request. Contact us to discuss your requirements.

How Pipelines Are Triggered

Pipelines can be triggered in two ways:

Automatic Trigger (On Upload)

When a dataset has a pipeline configured with trigger: "on_upload", the pipeline runs automatically after a file is uploaded. The Data API schedules the pipeline in the background immediately after the upload completes.

{
  "enabled": true,
  "trigger": "on_upload",
  "timeout": "30m",
  "steps": [ ... ]
}

This is the most common configuration — documents are processed as soon as they arrive.
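The steps array wires components together in execution order. A hypothetical configuration for a five-stage PDF pipeline might look like the following; the component identifiers and step schema here are illustrative assumptions, not the exact format:

```json
{
  "enabled": true,
  "trigger": "on_upload",
  "timeout": "30m",
  "steps": [
    { "component": "data-api-entry-acquisition" },
    { "component": "mistral-ocr-processor" },
    { "component": "figure-linker" },
    { "component": "markdown-chunker" },
    { "component": "data-cluster-embedding" },
    { "component": "processed-content-registration" },
    { "component": "chunks-registration" },
    { "component": "entry-status-registration" }
  ]
}
```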

Manual Trigger

Pipelines can also be triggered through the API for specific entries. This is useful for reprocessing entries after a pipeline configuration change, or for processing entries that were uploaded before a pipeline was configured.
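A manual trigger is an authenticated HTTP call against the Data API. As a rough sketch, with a hypothetical endpoint path (check the API reference for the actual route and payload):

```python
def build_trigger_request(entry_id: str) -> tuple[str, str]:
    # Build the (method, path) pair for a manual pipeline trigger.
    # The path below is a placeholder, not the documented route.
    return ("POST", f"/api/v1/entries/{entry_id}/pipeline/run")

method, path = build_trigger_request("entry-123")
# Send with any HTTP client, e.g.:
#   requests.post(base_url + path, headers={"Authorization": f"Bearer {token}"})
```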

Pipeline Execution

Pipelines run as Argo Workflows on the data cluster's Kubernetes infrastructure. Each pipeline run is a directed acyclic graph (DAG) of steps, where each step executes as a Kubernetes pod.

Execution Model

  • Each step runs in its own pod — with defined resource limits (CPU, memory) and a timeout
  • Artifacts pass between steps — output files from one step become input files for the next
  • Each workflow gets scratch space — a dedicated persistent volume for temporary files
  • Retries are automatic — failed steps retry with exponential backoff (up to 4 retries per step)
  • Failure handling — if any step fails after all retries, an exit handler sets the entry status to error
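Argo handles the retry logic natively; as a minimal illustration of what "exponential backoff with up to 4 retries" means (the code itself is not the platform's):

```python
import time

def run_with_retries(step, max_retries=4, base_delay=1.0, sleep=time.sleep):
    # Try the step once, then retry up to max_retries times,
    # doubling the delay after each failure (exponential backoff).
    attempt = 0
    while True:
        try:
            return step()
        except Exception:
            if attempt >= max_retries:
                raise  # exhausted: the exit handler would mark the entry as error
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```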

Resource Usage

Pipeline steps have defined resource requests and limits. For example, the OCR step typically uses 256Mi-512Mi of memory and 0.5-1 CPU core. The embedding step is I/O-bound (waiting for the embedding API) rather than compute-bound.

Multiple pipelines can run concurrently on the same data cluster, up to the configured concurrency limit. Dynamic parallelism allows batch processing of many entries simultaneously — for example, processing 100+ papers from an archive import in parallel.
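The scheduling itself is done by Argo on the cluster; conceptually, a concurrency cap over a batch behaves like a bounded worker pool, sketched here:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(entry_ids, process_one, concurrency_limit=8):
    # Process every entry, with at most concurrency_limit running at
    # once: analogous to the cluster's pipeline concurrency cap.
    with ThreadPoolExecutor(max_workers=concurrency_limit) as pool:
        return list(pool.map(process_one, entry_ids))
```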

info

Pipeline processing happens entirely on the data cluster. The OCR API call goes from the cluster to the external OCR provider — it does not pass through the platform. Embedding API calls follow the same pattern. The platform is not involved in pipeline execution.

Pipeline Configuration

Each dataset stores its pipeline configuration as a JSON structure. The configuration specifies which components to use, how they are wired together, and what parameters each step receives.

Configuring a Pipeline

Select a preset in the web application, or configure a custom pipeline for your use case. Custom pipelines can combine any available components in any order — the platform does not restrict you to the preset configurations.

note

A visual pipeline editor is coming soon. Currently, pipeline configuration is JSON-based. Contact us to configure a custom pipeline for your use case.

Discovering Available Components

The Data API exposes an endpoint that lists all available pipeline components on the cluster:

GET /api/v1/pipelines/components

This queries Kubernetes for WorkflowTemplates labeled as pipeline components and returns their names, versions, expected inputs, and outputs. Results are cached for 5 minutes.
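The 5-minute cache can be pictured as a simple time-to-live cache in front of the Kubernetes query; a minimal sketch, not the API's actual implementation:

```python
import time

class TTLCache:
    # Minimal TTL cache: recompute the value only once ttl has elapsed.
    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self.fetch, self.ttl, self.clock = fetch, ttl, clock
        self._value, self._stamp = None, None

    def get(self):
        now = self.clock()
        if self._stamp is None or now - self._stamp >= self.ttl:
            self._value, self._stamp = self.fetch(), now
        return self._value
```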

Pipeline Components and Versioning

Pipeline components are deployed as Kubernetes WorkflowTemplates. They are versioned, following a {name}-{version} naming convention (e.g., mistral-o-c-r-processor-1.0.0).

Components are exported from the pipeline source code, baked into the data cluster's Docker image, and applied as a Helm pre-install hook when the Data API chart is deployed. This means that component updates are deployed atomically with the data cluster upgrade — there is no version mismatch between the Data API and its pipeline components.

note

When a pipeline component is updated (e.g., improved OCR post-processing), existing entries are not automatically reprocessed. To apply the new component to existing documents, trigger reprocessing through the API.

What Happens at Each Stage

OCR (Mistral OCR)

The OCR step sends the PDF to Mistral's OCR API. The API returns:

  • Structured text as markdown, preserving headings, lists, tables, and paragraphs
  • Extracted images as base64-encoded data, one per figure in the document
  • Page-level structure for documents with complex layouts

Figure Linking

The figure linker resolves references between the extracted markdown and the extracted images. It:

  • Matches ![caption](image-N.ext) references to the correct extracted image
  • Converts all images to standardized PNG format
  • Generates a consistent naming scheme (fig_001.png, fig_002.png, ...)
  • Updates the markdown with resolved figure paths
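The reference-matching and renaming steps above can be sketched as a single markdown rewrite; the regex and naming details here are illustrative, not the component's actual code:

```python
import re

def relink_figures(markdown: str) -> tuple[str, dict]:
    # Rewrite ![caption](image-N.ext) references to a consistent
    # fig_001.png-style scheme, numbered by first appearance.
    mapping = {}

    def replace(match):
        caption, src = match.group(1), match.group(2)
        if src not in mapping:
            mapping[src] = f"fig_{len(mapping) + 1:03d}.png"
        return f"![{caption}]({mapping[src]})"

    relinked = re.sub(r"!\[([^\]]*)\]\((image-\d+\.\w+)\)", replace, markdown)
    return relinked, mapping
```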

Chunking

The markdown chunker splits the document into semantically meaningful chunks:

  • Heading-aware — chunks respect heading boundaries, so a section is not split mid-paragraph
  • Token-bounded — each chunk stays within a token limit suitable for the embedding model
  • Metadata-enriched — each chunk carries its position index and figure references
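Heading-aware, token-bounded splitting can be sketched as follows. This simplified version uses word count in place of a real tokenizer (the actual chunker counts tokens with tiktoken):

```python
def chunk_markdown(text: str, max_tokens: int = 200):
    # Start a new chunk at every heading, and also whenever the
    # current chunk would exceed max_tokens. Word count stands in
    # for real token counting here.
    chunks, current, count = [], [], 0
    for line in text.splitlines():
        tokens = len(line.split())
        if current and (line.startswith("#") or count + tokens > max_tokens):
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += tokens
    if current:
        chunks.append("\n".join(current))
    return chunks
```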

Embedding

The embedding step converts text chunks into numerical vectors:

  • Batch processing — chunks are sent in batches of 32 for efficiency
  • Concurrent requests — up to 5 concurrent batch requests to the embedding API
  • Cold start tolerance — timeout set to 300 seconds to handle serverless cold starts
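Splitting chunks into batches of 32 amounts to fixed-size slicing, sketched here (the batching helper itself is illustrative):

```python
def batched(items, batch_size=32):
    # Yield successive fixed-size batches; the last may be smaller.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```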

Registration

The registration steps store results in three systems:

  1. MinIO — content.json and figure PNGs are written to the entry's processed storage path
  2. Qdrant — embedding vectors are upserted with chunk text and metadata as payload
  3. Meilisearch — the entry is indexed for full-text keyword search with schema-aware field mapping

After registration completes, the entry status is updated to processed, and the document becomes searchable.

Next Steps