Configure a Pipeline
A pipeline defines how uploaded documents are processed — from raw file to searchable, AI-queryable content. Pipeline configuration happens during dataset creation and can be updated afterward.
This guide covers what each pipeline preset does, how to switch between presets, and how to request custom pipeline configurations.
What a Pipeline Controls
A pipeline determines:
- Which processing steps run — OCR, XML parsing, chunking, embedding, etc.
- In what order — Steps execute as a directed acyclic graph (DAG), with dependencies between stages.
- What embedding model is used — Controlled by the data cluster's embedding provider setting.
- When processing starts — Automatically on upload, or manually triggered.
The output of a pipeline is a fully indexed document: searchable by keyword (Meilisearch), by meaning (Qdrant vector search), and readable by AI agents via MCP.
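Because steps form a DAG, any execution order that respects the dependency edges is valid. A minimal sketch using the standard library (the step names and edges here are hypothetical, loosely based on the presets described below; the actual graph is defined by your pipeline configuration):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each step maps to the steps it depends on.
pipeline_dag = {
    "ocr": set(),
    "figure_linking": {"ocr"},
    "chunking": {"figure_linking"},
    "embedding": {"chunking"},
    "content_registration": {"figure_linking", "chunking"},
    "chunk_registration": {"embedding"},
}

# static_order() yields a valid execution order respecting every edge.
order = list(TopologicalSorter(pipeline_dag).static_order())
print(order)
```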
Using Pipeline Presets
Pipeline presets are ready-made configurations that cover the most common document formats. You select a preset during the pipeline step of the dataset creation wizard.
Available Presets
| Preset | Best For | Processing Steps |
|---|---|---|
| General Purpose | PDF, DOCX, images | OCR (Mistral), figure linking, chunking, embedding, registration |
| Scientific Articles | BioRxiv JATS XML, MECA archives | XML parsing, figure extraction, chunking, embedding, registration |
General Purpose is the right choice for most users. It handles PDFs of any complexity — scanned documents, multi-column layouts, documents with figures and tables — through OCR-based extraction.
What Each Preset Does
General Purpose (PDF + OCR)
This is the most commonly used preset. When a file is uploaded:
1. OCR — Sends the document to an OCR service that extracts text as structured markdown and identifies images.
2. Figure Linking — Resolves cross-references between text and figures, converts extracted images to PNG.
3. Chunking — Splits the markdown into semantic chunks using heading-aware, token-bounded splitting.
4. Embedding — Generates vector embeddings for each chunk using the cluster's configured embedding provider.
5. Content Registration — Stores processed text and figures, indexes content for keyword search.
6. Chunk Registration — Upserts embedding vectors into the vector database for semantic search.
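The chunking step can be sketched roughly as follows. This is a simplified illustration only, not the production splitter: the real step operates on the OCR markdown and uses a tokenizer, whereas this sketch approximates token counts with whitespace word counts.

```python
def chunk_markdown(markdown: str, max_tokens: int = 200) -> list[str]:
    """Split markdown into heading-aware, size-bounded chunks.

    Sections start at headings; long sections are split further so
    no chunk exceeds max_tokens (approximated by word count here).
    """
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for section in sections:
        words = section.split()
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

doc = "# Intro\nSome text here.\n# Methods\n" + "word " * 500
for chunk in chunk_markdown(doc):
    print(len(chunk.split()))
```

Heading-aware splitting keeps each chunk anchored to its section context, which generally improves both keyword and vector search relevance.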
Scientific Articles (JATS XML)
The Scientific Articles preset is available on demand — it is not enabled by default for all users. Contact us to have it activated for your organization.
Designed for academic publishing workflows:
1. JATS Parsing — Extracts structured content (title, abstract, sections, references) from JATS XML.
2. Figure Extraction — Extracts figures from the XML or MECA archive.
3. Chunking — Splits content into semantic chunks.
4. Embedding — Generates vector embeddings.
5. Content Registration — Stores and indexes the processed content.
6. Chunk Registration — Upserts vectors for semantic search.
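As a rough illustration, pulling the title and abstract out of a JATS document might look like this. This is a minimal sketch using the standard library; the actual parsing step handles the full JATS schema, references, and MECA packaging.

```python
import xml.etree.ElementTree as ET

JATS_SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Example Study</article-title></title-group>
      <abstract><p>We report an example finding.</p></abstract>
    </article-meta>
  </front>
</article>"""

def parse_jats(xml_text: str) -> dict:
    """Extract title and abstract text from a JATS XML string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title", default="")
    abstract = "".join(root.find(".//abstract").itertext()).strip()
    return {"title": title, "abstract": abstract}

print(parse_jats(JATS_SAMPLE))
```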
Auto-Trigger vs Manual Trigger
Each dataset's pipeline can be set to trigger automatically or manually.
Auto-Trigger (Default)
When enabled, the pipeline runs automatically every time a file is uploaded to the dataset. This is the recommended setting for most use cases — upload a file and it becomes searchable without additional steps.
Auto-trigger is enabled by setting the Trigger option to on_upload during dataset creation.
Manual Trigger
When auto-trigger is disabled, uploaded files remain in the Uploaded status until processing is explicitly triggered. This is useful when you want to upload a batch of files first and process them later, or when you need to review files before processing.
To trigger a pipeline manually via the API, call the trigger endpoint for each entry:
Python
from data_api_client import ApiClient, Configuration, EntriesApi

config = Configuration(
    host="https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy"
)
client = ApiClient(
    config,
    header_name="Authorization",
    header_value="Bearer oat_YOUR_API_TOKEN",
)
entries_api = EntriesApi(client)

# Trigger pipeline for a specific entry
result = entries_api.trigger_pipeline_api_v1_entries_entry_id_trigger_pipeline_post(
    entry_id=ENTRY_ID
)
workflow_name = result.get("workflow_name")
print(f"Pipeline triggered: {workflow_name}")
cURL
curl -X POST "https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy/api/v1/entries/ENTRY_ID/trigger-pipeline" \
-H "Authorization: Bearer oat_YOUR_API_TOKEN"
You can then monitor the workflow status by polling the entry's workflow status endpoint:
status = entries_api.get_workflow_status_api_v1_entries_entry_id_workflow_status_get(
    entry_id=ENTRY_ID
)
print(f"Workflow status: {status.get('workflow_status')}")
# Possible values: Pending, Running, Succeeded, Failed, Error
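In practice you would poll in a loop until the workflow reaches a terminal state. A minimal sketch, where the `fetch_status` callable stands in for the SDK call shown above, and the interval and timeout values are illustrative:

```python
import time

TERMINAL_STATES = {"Succeeded", "Failed", "Error"}

def wait_for_workflow(fetch_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status() until a terminal state or the timeout is hit."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(interval)
    raise TimeoutError("workflow did not reach a terminal state in time")
```

With the client above, `fetch_status` could be a small lambda that calls the workflow status endpoint for your entry and returns the `workflow_status` field.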
Changing Pipeline Configuration
Pipeline configuration cannot yet be changed in the web interface; for now it can only be updated programmatically via the API. A UI for pipeline configuration is coming soon.
To update the pipeline configuration for an existing dataset via the API:
from data_api_client import ApiClient, Configuration, PipelinesApi
from data_api_client.models.dataset_pipeline_config_input import DatasetPipelineConfigInput

config = Configuration(
    host="https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy"
)
client = ApiClient(
    config,
    header_name="Authorization",
    header_value="Bearer oat_YOUR_API_TOKEN",
)
pipelines_api = PipelinesApi(client)

# Update pipeline configuration for a dataset
pipeline_config = DatasetPipelineConfigInput(
    enabled=True,
    trigger="on_upload",  # or "manual"
    timeout="45m",
    steps=[ ... ],  # Your pipeline step configuration
)
pipelines_api.configure_pipeline_api_v1_pipelines_datasets_dataset_id_config_patch(
    dataset_id=YOUR_DATASET_ID,
    dataset_pipeline_config_input=pipeline_config,
)
Changing the pipeline configuration only affects newly uploaded files. Existing entries that have already been processed are not reprocessed automatically. However, you can retrigger processing on existing entries manually to have them reprocessed with the new pipeline configuration — see Manual Trigger above.
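Retriggering a set of existing entries is a simple loop over the trigger endpoint. A sketch, where `entry_ids` is a list you assemble yourself (for example from your entries view) and `trigger` stands in for the SDK call shown under Manual Trigger:

```python
def retrigger_entries(entry_ids, trigger):
    """Call the trigger endpoint for each entry; collect workflow names."""
    workflows = {}
    for entry_id in entry_ids:
        result = trigger(entry_id=entry_id)
        workflows[entry_id] = result.get("workflow_name")
    return workflows
```

Passing the SDK method in as a callable keeps the loop easy to test and lets you add rate limiting or error handling in one place.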
Custom Pipeline Configuration
The preset pipelines cover the most common workflows. If you need a different combination of processing steps — for example, a custom text extraction method, a different chunking strategy, or integration with a proprietary document format — contact us to discuss your requirements.
Contact: support@alien.club
A visual pipeline editor is coming soon, allowing you to compose custom pipelines by selecting and connecting processing components directly in the UI.
Pipeline Execution
Pipelines execute as workflows on your data cluster's infrastructure. Each step runs as an isolated container with its own resources. Key characteristics:
- Parallel processing — Multiple documents can be processed simultaneously.
- Automatic retry — Failed steps are retried automatically with exponential backoff.
- Isolated storage — Each pipeline run gets dedicated scratch space for intermediate files.
- Data sovereignty — All processing happens on your data cluster. Document content never leaves your infrastructure.
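Exponential backoff means each retry waits longer than the last, which avoids hammering a step that is temporarily failing. A typical schedule (illustrative values, not the platform's actual retry policy):

```python
def backoff_delays(base=1.0, factor=2.0, max_retries=5, cap=60.0):
    """Delay before each retry attempt: base * factor**attempt, capped."""
    return [min(base * factor ** attempt, cap) for attempt in range(max_retries)]

print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```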
For a deeper understanding of pipeline architecture and the available processing components, see Pipelines.
What's Next
Your pipeline is configured. Continue with:
- Upload Documents — Upload files and watch them get processed by your pipeline
- Search and Query — Search across your processed documents
- Pipelines (Concept) — Detailed pipeline architecture and component reference