Configure a Pipeline
A pipeline defines how uploaded documents are processed — from raw file to searchable, AI-queryable content. Pipeline configuration happens during dataset creation and can be updated afterward.
This guide covers what each pipeline preset does, how to switch between presets, and how to request custom pipeline configurations.
What a Pipeline Controls
A pipeline determines:
- Which processing steps run — OCR, XML parsing, chunking, embedding, etc.
- In what order — Steps execute as a directed acyclic graph (DAG), with dependencies between stages.
- What embedding model is used — Controlled by the data cluster's embedding provider setting.
- When processing starts — Automatically on upload, or manually triggered.
The output of a pipeline is a fully indexed document: searchable by keyword (Meilisearch), by meaning (Qdrant vector search), and readable by AI agents via MCP.
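Because steps form a DAG, any execution order that respects the dependency edges is valid. A minimal sketch using the standard library (the step names and edges here are hypothetical, loosely based on the presets described below; the actual graph is defined by your pipeline configuration):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each step maps to the steps it depends on.
pipeline_dag = {
    "ocr": set(),
    "figure_linking": {"ocr"},
    "chunking": {"figure_linking"},
    "embedding": {"chunking"},
    "content_registration": {"figure_linking", "chunking"},
    "chunk_registration": {"embedding"},
}

# static_order() yields a valid execution order respecting every edge.
order = list(TopologicalSorter(pipeline_dag).static_order())
print(order)
```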
Using Pipeline Presets
Pipeline presets are ready-made configurations that cover the most common document formats. You select a preset during the pipeline step of the dataset creation wizard.
Available Presets
| Preset | Best For | Processing Steps |
|---|---|---|
| General Purpose | PDF, DOCX, images | OCR (Mistral), figure linking, chunking, embedding, registration |
| Scientific Articles | BioRxiv JATS XML, MECA archives | XML parsing, figure extraction, chunking, embedding, registration |
General Purpose is the right choice for most users. It handles PDFs of any complexity — scanned documents, multi-column layouts, documents with figures and tables — through OCR-based extraction.
What Each Preset Does
General Purpose (PDF + OCR)
This is the most commonly used preset. When a file is uploaded:
1. OCR — Sends the document to an OCR service that extracts text as structured markdown and identifies images.
2. Figure Linking — Resolves cross-references between text and figures, converts extracted images to PNG.
3. Chunking — Splits the markdown into semantic chunks using heading-aware, token-bounded splitting.
4. Embedding — Generates vector embeddings for each chunk using the cluster's configured embedding provider.
5. Content Registration — Stores processed text and figures, indexes content for keyword search.
6. Chunk Registration — Upserts embedding vectors into the vector database for semantic search.
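The chunking step can be sketched roughly as follows. This is a simplified illustration only, not the production splitter: the real step operates on the OCR markdown and uses a tokenizer, whereas this sketch approximates token counts with whitespace word counts.

```python
def chunk_markdown(markdown: str, max_tokens: int = 200) -> list[str]:
    """Split markdown into heading-aware, size-bounded chunks.

    Sections start at headings; long sections are split further so
    no chunk exceeds max_tokens (approximated by word count here).
    """
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for section in sections:
        words = section.split()
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

doc = "# Intro\nSome text here.\n# Methods\n" + "word " * 500
for chunk in chunk_markdown(doc):
    print(len(chunk.split()))
```

Heading-aware splitting keeps each chunk anchored to its section context, which generally improves both keyword and vector search relevance.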
Scientific Articles (JATS XML)
The Scientific Articles preset is available on demand — it is not enabled by default for all users. Contact us to have it activated for your organization.
Designed for academic publishing workflows:
1. JATS Parsing — Extracts structured content (title, abstract, sections, references) from JATS XML.
2. Figure Extraction — Extracts figures from the XML or MECA archive.
3. Chunking — Splits content into semantic chunks.
4. Embedding — Generates vector embeddings.
5. Content Registration — Stores and indexes the processed content.
6. Chunk Registration — Upserts vectors for semantic search.
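As a rough illustration, pulling the title and abstract out of a JATS document might look like this. This is a minimal sketch using the standard library; the actual parsing step handles the full JATS schema, references, and MECA packaging.

```python
import xml.etree.ElementTree as ET

JATS_SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Example Study</article-title></title-group>
      <abstract><p>We report an example finding.</p></abstract>
    </article-meta>
  </front>
</article>"""

def parse_jats(xml_text: str) -> dict:
    """Extract title and abstract text from a JATS XML string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title", default="")
    abstract = "".join(root.find(".//abstract").itertext()).strip()
    return {"title": title, "abstract": abstract}

print(parse_jats(JATS_SAMPLE))
```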
Auto-Trigger vs Manual Trigger
Each dataset's pipeline can be set to trigger automatically or manually.
Auto-Trigger (Default)
When enabled, the pipeline runs automatically every time a file is uploaded to the dataset. This is the recommended setting for most use cases — upload a file and it becomes searchable without additional steps.
Auto-trigger is enabled by setting the Trigger option to on_upload during dataset creation.
Manual Trigger
When auto-trigger is disabled, uploaded files remain in the Uploaded status until processing is explicitly triggered. This is useful when you want to upload a batch of files first and process them later, or when you need to review files before processing.
To trigger a pipeline manually via the API, call the trigger endpoint for each entry:
Python
from data_api_client import ApiClient, Configuration, EntriesApi

config = Configuration(
    host="https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy"
)
client = ApiClient(
    config,
    header_name="Authorization",
    header_value="Bearer oat_YOUR_API_TOKEN",
)
entries_api = EntriesApi(client)

# Trigger pipeline for a specific entry
result = entries_api.trigger_pipeline_api_v1_entries_entry_id_trigger_pipeline_post(
    entry_id=ENTRY_ID
)
workflow_name = result.get("workflow_name")
print(f"Pipeline triggered: {workflow_name}")
cURL
curl -X POST "https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy/api/v1/entries/ENTRY_ID/trigger-pipeline" \
-H "Authorization: Bearer oat_YOUR_API_TOKEN"
You can then monitor the workflow status by polling the entry's workflow status endpoint:
status = entries_api.get_workflow_status_api_v1_entries_entry_id_workflow_status_get(
    entry_id=ENTRY_ID
)
print(f"Workflow status: {status.get('workflow_status')}")
# Possible values: Pending, Running, Succeeded, Failed, Error
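In practice you would poll in a loop until the workflow reaches a terminal state. A minimal sketch, where the `fetch_status` callable stands in for the SDK call shown above, and the interval and timeout values are illustrative:

```python
import time

TERMINAL_STATES = {"Succeeded", "Failed", "Error"}

def wait_for_workflow(fetch_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status() until a terminal state or the timeout is hit."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(interval)
    raise TimeoutError("workflow did not reach a terminal state in time")
```

With the client above, `fetch_status` could be a small lambda that calls the workflow status endpoint for your entry and returns the `workflow_status` field.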
Changing Pipeline Configuration
Pipeline configuration cannot yet be changed in the web interface; for now it can only be updated programmatically via the API. A UI for pipeline configuration is coming soon.
To update the pipeline configuration for an existing dataset via the API:
from data_api_client import ApiClient, Configuration, PipelinesApi
from data_api_client.models.dataset_pipeline_config_input import DatasetPipelineConfigInput

config = Configuration(
    host="https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy"
)
client = ApiClient(
    config,
    header_name="Authorization",
    header_value="Bearer oat_YOUR_API_TOKEN",
)
pipelines_api = PipelinesApi(client)

# Update pipeline configuration for a dataset
pipeline_config = DatasetPipelineConfigInput(
    enabled=True,
    trigger="on_upload",  # or "manual"
    timeout="45m",
    steps=[ ... ],  # Your pipeline step configuration
)
pipelines_api.configure_pipeline_api_v1_pipelines_datasets_dataset_id_config_patch(
    dataset_id=YOUR_DATASET_ID,
    dataset_pipeline_config_input=pipeline_config,
)
Changing the pipeline configuration only affects newly uploaded files. Existing entries that have already been processed are not reprocessed automatically. However, you can retrigger processing on existing entries manually to have them reprocessed with the new pipeline configuration — see Manual Trigger above.
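Retriggering a set of existing entries is a simple loop over the trigger endpoint. A sketch, where `entry_ids` is a list you assemble yourself (for example from your entries view) and `trigger` stands in for the SDK call shown under Manual Trigger:

```python
def retrigger_entries(entry_ids, trigger):
    """Call the trigger endpoint for each entry; collect workflow names."""
    workflows = {}
    for entry_id in entry_ids:
        result = trigger(entry_id=entry_id)
        workflows[entry_id] = result.get("workflow_name")
    return workflows
```

Passing the SDK method in as a callable keeps the loop easy to test and lets you add rate limiting or error handling in one place.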
Custom Pipeline Configuration
The preset pipelines cover the most common workflows. If you need a different combination of processing steps — for example, a custom text extraction method, a different chunking strategy, or integration with a proprietary document format — contact us to discuss your requirements.
Contact: support@alien.club
A visual pipeline editor is coming soon, allowing you to compose custom pipelines by selecting and connecting processing components directly in the UI.
Pipeline Execution
Pipelines execute as workflows on your data cluster's infrastructure. Each step runs as an isolated container with its own resources. Key characteristics:
- Parallel processing — Multiple documents can be processed simultaneously.
- Automatic retry — Failed steps are retried automatically with exponential backoff.
- Isolated storage — Each pipeline run gets dedicated scratch space for intermediate files.
- Data sovereignty — All processing happens on your data cluster. Document content never leaves your infrastructure.
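Exponential backoff means each retry waits longer than the last, which avoids hammering a step that is temporarily failing. A typical schedule (illustrative values, not the platform's actual retry policy):

```python
def backoff_delays(base=1.0, factor=2.0, max_retries=5, cap=60.0):
    """Delay before each retry attempt: base * factor**attempt, capped."""
    return [min(base * factor ** attempt, cap) for attempt in range(max_retries)]

print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```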
For a deeper understanding of pipeline architecture and the available processing components, see Pipelines.
What's Next
Your pipeline is configured. Continue with:
- Upload Documents — Upload files and watch them get processed by your pipeline
- Search and Query — Search across your processed documents
- Pipelines (Concept) — Detailed pipeline architecture and component reference