Upload Documents

Once you have a dataset with a configured pipeline, you can upload documents and have them automatically processed into searchable, AI-queryable content.

This guide covers uploading via the web interface and via the SDK, monitoring processing status, and troubleshooting failed uploads.

Prerequisites

An active data cluster with a dataset created. See Create a Dataset.
Write access to your organization.

Supported File Types

The file types your dataset accepts depend on the pipeline preset configured during dataset creation:

Pipeline Preset	Supported Files
General Purpose (PDF + OCR)	PDF, DOCX, images (PNG, JPG)
Scientific Articles (JATS XML)	JATS XML, MECA archives

tip

Most users should use the General Purpose preset, which handles the widest range of document formats through OCR-based extraction.
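A client-side check can catch unsupported files before any upload is attempted. The sketch below is illustrative only: the preset keys and the extension sets are assumptions (the table lists formats, not extensions, so `.jpeg`, `.xml`, `.meca`, and `.zip` are guesses you should adjust), and `is_supported` is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path

# Assumed mapping from preset to file extensions. The guide names formats only,
# so these extension sets are guesses and should be adjusted to your setup.
PRESET_EXTENSIONS = {
    "general_purpose": {".pdf", ".docx", ".png", ".jpg", ".jpeg"},
    "scientific_articles": {".xml", ".meca", ".zip"},
}


def is_supported(path: str, preset: str) -> bool:
    """Return True if the file's extension is in the preset's assumed set."""
    allowed = PRESET_EXTENSIONS.get(preset)
    if allowed is None:
        raise ValueError(f"Unknown preset: {preset!r}")
    # Compare case-insensitively so "paper.PDF" is treated like "paper.pdf".
    return Path(path).suffix.lower() in allowed
```

For example, `is_supported("paper.PDF", "general_purpose")` returns `True`, while the same call for `"paper.xml"` returns `False`.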

Upload via the Web Interface

Step 1: Navigate to Your Dataset

Go to Clusters in the left sidebar.
Click on the cluster that contains your dataset.
Click on the dataset name to open its detail page.

Dataset detail page showing entries list

Step 2: Upload Files

Click the Upload button.
Select one or more files from your computer, or drag and drop files into the upload area.
The upload begins immediately. A progress indicator shows the status of each file.

Each uploaded file becomes an entry in the dataset. Entries track metadata, processing state, and file storage.

Step 3: Monitor Processing

If your dataset has auto-trigger enabled (the default), uploaded files are automatically sent to the processing pipeline. You can monitor progress from the dataset detail page:

Uploaded — The file has been received and stored. Processing has not started yet.
Processing — The pipeline is actively running (OCR, chunking, embedding, etc.).
Processed — All pipeline stages completed successfully. The document is now searchable.
Failed — One or more pipeline stages encountered an error.

Entries list showing status badges

info

Processing time depends on the document size, the pipeline configuration, and queue availability. A typical 20-page PDF completes processing in under a minute. Large documents or batches may take longer. When many documents are being processed across your cluster at once, jobs may wait in the queue before they start.

Webhooks

The platform supports webhooks that send notifications on entry status changes. This lets you react programmatically when processing completes, fails, or transitions to any other state — without polling the API. Configure webhooks in your dataset settings to receive real-time updates.
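As a rough illustration of the receiving side, the sketch below uses only the Python standard library. The payload shape is an assumption: this guide does not document the webhook body, so the `entry_id` and `status` fields are hypothetical, as are the helper names `describe_event` and `serve`. Check the actual payload in your dataset settings before relying on it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def describe_event(payload: dict) -> str:
    """Turn a status-change payload into a one-line message.

    The "entry_id" and "status" keys are hypothetical field names.
    """
    entry_id = payload.get("entry_id")
    status = str(payload.get("status", "")).lower()
    if status == "processed":
        return f"entry {entry_id} is searchable"
    if status == "failed":
        return f"entry {entry_id} failed; check the entry detail page"
    return f"entry {entry_id} is now {status or 'unknown'}"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body; reject anything that is not a JSON object.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            self.send_response(400)
            self.end_headers()
            return
        print(describe_event(payload))
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Block and handle webhook deliveries on the given port."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. A production receiver would also verify that the request really comes from the platform, which this sketch does not attempt.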

Step 4: View Processed Content

Once an entry reaches the Processed status:

Click on the entry name to open its detail page.
The Content tab shows the extracted text, structured as markdown with headings and paragraphs.
The Files tab shows the original uploaded file and any extracted figures (images).
The Metadata tab shows entry-level metadata fields defined by the dataset schema.

Entry detail page showing processed content

Upload via the SDK

For programmatic uploads, use the Python or TypeScript SDK. All SDK requests go through the platform backend proxy, which handles authentication and routes to the correct data cluster.

caution

Entry creation and file upload are two separate API calls. First create the entry, then upload the file to it. The examples below show the correct two-step flow.

Python

from data_api_client import ApiClient, Configuration, EntriesApi
from data_api_client.models.entry_create_request import EntryCreateRequest

# Configure — requests go through the platform proxy
config = Configuration(
    host="https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy"
)

client = ApiClient(
    config,
    header_name="Authorization",
    header_value="Bearer oat_YOUR_API_TOKEN",
)

entries_api = EntriesApi(client)

# Step 1: Create the entry
request = EntryCreateRequest(
    dataset_id=YOUR_DATASET_ID,
    name="My Research Paper",
    slug="my-research-paper",
    description="Uploaded via SDK",
    metadata={},
)
response = entries_api.create_entry_api_v1_entries_post(request)
entry_id = response.entry.id
print(f"Created entry {entry_id}")

# Step 2: Upload the file to the entry
with open("paper.pdf", "rb") as f:
    file_bytes = f.read()

entries_api.upload_file_to_entry_api_v1_entries_entry_id_upload_post(
    entry_id=entry_id,
    file=("paper.pdf", file_bytes),
    file_type="original",
)
print(f"Uploaded file to entry {entry_id}")

TypeScript

import * as fs from "node:fs";
import { Configuration, EntriesApi } from "@alien/data-api-client";

const config = new Configuration({
  basePath: "https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy",
  headers: {
    Authorization: "Bearer oat_YOUR_API_TOKEN",
  },
});

const entriesApi = new EntriesApi(config);

// Step 1: Create the entry
const response = await entriesApi.createEntryApiV1EntriesPost({
  datasetId: YOUR_DATASET_ID,
  name: "My Research Paper",
  slug: "my-research-paper",
  description: "Uploaded via SDK",
  metadata: {},
});
const entryId = response.entry.id;
console.log(`Created entry ${entryId}`);

// Step 2: Upload the file to the entry
const fileBuffer = fs.readFileSync("paper.pdf");
await entriesApi.uploadFileToEntryApiV1EntriesEntryIdUploadPost(
  entryId,
  new File([fileBuffer], "paper.pdf"),
  "original"
);
console.log(`Uploaded file to entry ${entryId}`);

tip

For SDK installation and registry configuration, see the Install SDK guide.

Full Programmatic Workflow

This section shows the complete programmatic flow: creating a dataset, uploading documents, triggering pipelines, and validating the results with search. It is based on the working reference implementation we provide to clients.

Step 1: Initialize the Client

import os
from data_api_client import (
    ApiClient, Configuration,
    DatasetsApi, EntriesApi, PipelinesApi, SearchApi, HealthApi,
)

# Configuration from environment variables
backend_url = os.environ["BACKEND_API_URL"]  # https://api.alien.club
cluster_id = os.environ["CLUSTER_ID"]
token = os.environ["BEARER_TOKEN"]           # oat_...

base_url = f"{backend_url}/clusters/{cluster_id}/proxy"

config = Configuration(host=base_url)
client = ApiClient(
    config,
    header_name="Authorization",
    header_value=f"Bearer {token}",
)

datasets_api = DatasetsApi(client)
entries_api = EntriesApi(client)
pipelines_api = PipelinesApi(client)
search_api = SearchApi(client)
health_api = HealthApi(client)

Step 2: Create a Dataset with Pipeline Configuration

from data_api_client.models.dataset_create_request import DatasetCreateRequest

# Create the dataset
dataset_request = DatasetCreateRequest(
    name="My Document Collection",
    slug="my-document-collection",
    description="Documents uploaded via SDK",
    dataset_type="text",
    schema_definition={
        "schema_id": "general_v1",
        "version": "v1",
        "description": "General purpose schema",
        "original": {
            "required_files": ["*.pdf", "*.docx"],
            "optional_files": [],
            "metadata_schema": {
                "type": "object",
                "properties": {"source": {"type": "string"}},
                "required": [],
            },
        },
        "processed": {
            "content_schema": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "chunks": {"type": "array"},
                },
            },
            "required_files": [],
            "optional_files": ["*.png", "*.jpg"],
        },
        "processing": {"intermediate_files": [], "retention_days": 1},
    },
)

dataset = datasets_api.create_dataset_api_v1_datasets_post(dataset_request)
dataset_id = dataset.id
print(f"Created dataset: {dataset_id}")

info

Pipeline configuration is a separate step. See Configure a Pipeline for details on programmatic pipeline setup.

Step 3: Create Entries and Upload Files

from pathlib import Path
from data_api_client.models.entry_create_request import EntryCreateRequest

def upload_document(pdf_path: Path, dataset_id: int) -> int:
    """Create an entry and upload a file. Returns the entry ID."""
    # Create entry
    request = EntryCreateRequest(
        dataset_id=dataset_id,
        name=pdf_path.stem,
        slug=pdf_path.stem.lower()[:40],
        description="Uploaded via SDK",
        metadata={"source": "sdk-upload"},
    )
    response = entries_api.create_entry_api_v1_entries_post(request)
    entry_id = response.entry.id

    # Upload file
    with open(pdf_path, "rb") as f:
        file_bytes = f.read()
    entries_api.upload_file_to_entry_api_v1_entries_entry_id_upload_post(
        entry_id=entry_id,
        file=(pdf_path.name, file_bytes),
        file_type="original",
    )

    return entry_id

# Upload all PDFs in a directory
pdf_dir = Path("./documents")
entry_ids = []
for pdf_file in pdf_dir.glob("*.pdf"):
    entry_id = upload_document(pdf_file, dataset_id)
    entry_ids.append(entry_id)
    print(f"Uploaded {pdf_file.name} → entry {entry_id}")
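The loop above derives the slug with `pdf_path.stem.lower()[:40]`, which keeps spaces and punctuation from the file name. If slugs need to look like the earlier `my-research-paper` example, a small normalizer may help. This is a sketch under an assumption: the platform's actual slug rules are not stated in this guide, so the "lowercase letters, digits, and hyphens" rule and the `make_slug` helper are hypothetical. The 40-character cap mirrors the truncation used above.

```python
import re


def make_slug(stem: str, max_length: int = 40) -> str:
    """Lowercase the name, turn runs of other characters into "-", and trim."""
    # Assumed rule: only a-z, 0-9 and hyphens are allowed in a slug.
    slug = re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")
    # Truncate, then drop a trailing hyphen left by the cut.
    slug = slug[:max_length].rstrip("-")
    return slug or "untitled"
```

With this helper, `make_slug("My Research Paper (v2)")` yields `my-research-paper-v2`. Two files whose names normalize to the same slug would still collide, so a real uploader may need to append a suffix.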

Step 4: Trigger Pipelines

If your dataset uses manual triggering (not auto-trigger on upload), trigger each entry's pipeline explicitly:

for entry_id in entry_ids:
    result = entries_api.trigger_pipeline_api_v1_entries_entry_id_trigger_pipeline_post(
        entry_id=entry_id
    )
    workflow_name = result.get("workflow_name")
    print(f"Entry {entry_id} → pipeline {workflow_name}")

Step 5: Monitor Processing

Poll entry statuses until all entries are processed:

import time

def wait_for_processing(entry_ids: list[int], timeout: int = 1800):
    """Poll until all entries are processed or timeout."""
    start = time.time()
    while time.time() - start < timeout:
        statuses = {}
        for eid in entry_ids:
            entry = entries_api.get_entry_api_v1_entries_entry_id_get(eid)
            statuses[eid] = str(entry.status).split(".")[-1].upper()

        processed = sum(1 for s in statuses.values() if s == "PROCESSED")
        failed = sum(1 for s in statuses.values() if s == "FAILED")
        print(f"Processed: {processed}, Failed: {failed}, Total: {len(entry_ids)}")

        if processed + failed == len(entry_ids):
            return statuses
        time.sleep(10)

    raise TimeoutError(f"Processing did not complete within {timeout}s")

statuses = wait_for_processing(entry_ids)
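The fixed 10-second loop above re-fetches every entry on each round. One possible refinement, sketched here under assumptions, is to poll only the entries that are still pending and to double the wait between rounds. The `poll_until_done` name and its parameters are hypothetical. The status lookup is passed in as a callable, so the function does not depend on the SDK.

```python
import time
from typing import Callable, Dict, Iterable

# Statuses after which an entry no longer changes, per the lifecycle in this guide.
TERMINAL = {"PROCESSED", "FAILED"}


def poll_until_done(
    entry_ids: Iterable[int],
    get_status: Callable[[int], str],
    timeout: float = 1800,
    initial_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, str]:
    """Poll unfinished entries only, doubling the wait between rounds."""
    pending = set(entry_ids)
    statuses: Dict[int, str] = {}
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while pending:
        for eid in sorted(pending):
            statuses[eid] = get_status(eid)
        pending = {eid for eid in pending if statuses[eid] not in TERMINAL}
        if not pending:
            break
        # Give up if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(
                f"{len(pending)} entries still pending after {timeout}s"
            )
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return statuses
```

With the SDK, `get_status` could wrap the same call used above, for example `lambda eid: str(entries_api.get_entry_api_v1_entries_entry_id_get(eid).status).split(".")[-1].upper()`.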

Step 6: Validate with Search

from data_api_client.models.keyword_search_request import KeywordSearchRequest
from data_api_client.models.vector_search_request import VectorSearchRequest

# Keyword search
kw_request = KeywordSearchRequest(
    query="your search term",
    dataset_ids=[dataset_id],
    limit=10,
)
kw_results = search_api.keyword_search_api_v1_search_post(kw_request)
print(f"Keyword search: {kw_results.total} results")

# Vector (semantic) search
vec_request = VectorSearchRequest(
    query="semantic query about your documents",
    dataset_ids=[dataset_id],
    limit=5,
)
vec_results = search_api.vector_search_chunks_api_v1_vector_chunks_post(vec_request)
print(f"Vector search: {len(vec_results.results)} chunks")

Entry Lifecycle

Every uploaded file follows this lifecycle:

Status	Meaning
Uploaded	File received, SHA256 verified, stored in the data cluster. Waiting for pipeline.
Processing	Pipeline is running. Stages execute in order: OCR, figure linking, chunking, embedding, registration.
Processed	All stages completed. Content indexed in Meilisearch (keyword search) and Qdrant (semantic search).
Failed	A pipeline stage encountered an error. Check the error details on the entry detail page.
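Client code that reacts to these statuses may find it convenient to model them explicitly. The sketch below is one way to do that, not the SDK's own type: the `EntryStatus` enum, its string values, and the `summarize` helper are assumptions made for illustration. The parsing mirrors the `str(entry.status).split(".")[-1]` normalization used in the polling example.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class EntryStatus(Enum):
    UPLOADED = "uploaded"
    PROCESSING = "processing"
    PROCESSED = "processed"
    FAILED = "failed"

    @classmethod
    def parse(cls, raw: object) -> "EntryStatus":
        """Accept values such as "Processed", "FAILED" or "EntryStatus.PROCESSING"."""
        return cls(str(raw).split(".")[-1].strip().lower())

    @property
    def is_terminal(self) -> bool:
        """True once no further pipeline work is expected for the entry."""
        return self in (EntryStatus.PROCESSED, EntryStatus.FAILED)


def summarize(raw_statuses: Iterable[object]) -> Counter:
    """Count entries per status, tolerating mixed capitalization."""
    return Counter(EntryStatus.parse(s) for s in raw_statuses)
```

An unrecognized status raises `ValueError` from the enum lookup, which makes unexpected values visible instead of silently ignoring them.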

What Happens During Processing

When a file enters the pipeline, the following stages execute in order:

OCR / Text Extraction — Extracts text and images from the uploaded file.
Figure Linking — Resolves figure references in the extracted markdown and converts images to standard format.
Chunking — Splits the text into semantic chunks based on headings and token limits.
Embedding — Generates vector embeddings for each chunk using the configured embedding provider.
Registration — Stores processed content, indexes it for keyword search, and upserts embeddings for semantic search.

For a detailed breakdown of each stage, see Pipelines.
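To make the chunking stage more concrete, here is a toy version of "split on headings, then enforce a token limit". It is not the platform's chunker. It assumes markdown headings start with `#`, approximates tokens by whitespace-separated words, and collapses line breaks inside a section. The `chunk_markdown` name and the default limit of 200 are hypothetical.

```python
from typing import List


def chunk_markdown(text: str, max_tokens: int = 200) -> List[str]:
    """Split markdown at headings, then cut oversized sections by word count."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    # Pass 1: start a new section whenever a heading line is seen.
    sections: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)

    # Pass 2: words stand in for tokens; each chunk holds at most max_tokens.
    chunks: List[str] = []
    for section in sections:
        words = " ".join(section).split()
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

A real tokenizer counts tokens differently from a word split, so chunk boundaries produced by the pipeline will not match this sketch.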

Checking Processing Status

Via the Web Interface

The dataset detail page shows a status badge next to each entry. Click an entry to see its full processing history, including timestamps for each pipeline stage.

Via the API

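Python

# Check entry status
# Step 5 above uses the generated name get_entry_api_v1_entries_entry_id_get
# for the same lookup; use the method name your SDK version exposes.
entry = entries_api.get_entry(id=entry_id)
print(f"Status: {entry.status}")

TypeScript

const entry = await entriesApi.getEntry({ id: entryId });
console.log(`Status: ${entry.status}`);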

Batch Status Check

To check the status of many entries at once:

# Batch get entries with status
response = entries_api.batch_get_entries(
    dataset_id=YOUR_DATASET_ID,
    limit=50
)
for entry in response.entries:
    print(f"{entry.name}: {entry.status}")

Troubleshooting

Upload Fails Immediately

File too large — The maximum upload size is 100 MB per file. Check that your file does not exceed this limit.
Unsupported format — Verify the file type is supported by your dataset's pipeline preset.
Cluster offline — The data cluster must be in an active state. Check the cluster status in the Clusters section.
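The size limit can be checked locally before starting an upload. In the sketch below, `check_upload_size` is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the guide does not say whether the limit is decimal or binary.

```python
import os

# 100 MB per the guide; the 1024 * 1024 interpretation of "MB" is assumed.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def check_upload_size(path: str, limit: int = MAX_UPLOAD_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds the limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; the limit is {limit} bytes")
    return size
```

Calling this before creating the entry avoids leaving behind an entry with no file attached when the upload would be rejected anyway.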

Entry Stuck in "Processing"

Pipeline stages run as background jobs. If an entry remains in "Processing" for an extended period, it may indicate a transient infrastructure issue.
Check the entry detail page for error messages on individual pipeline stages.
Contact support if the issue persists.

Entry Shows "Failed"

Open the entry detail page to see which pipeline stage failed and the error message.
Common causes include corrupted files, unsupported PDF features, or temporary API outages from the embedding provider.
You can re-upload the file to retry processing.
Please contact us for help debugging failed entries.

Next Steps

Configure a Pipeline — Adjust pipeline settings and understand presets
Search and Query — Search across your processed documents
AI Agent Integration — Connect AI agents to your document collection
