Create a Dataset

A dataset is a collection of documents (entries) inside a data cluster. When you create a dataset, you define its metadata schema and configure how uploaded documents are processed through a pipeline.

Dataset creation is a four-step wizard that guides you through naming, schema selection, field customization, and pipeline configuration.

Prerequisites

An active data cluster to host the dataset. If you do not have one yet, follow Create a Data Cluster.
Write access to your organization.

Step 1: Open the Dataset Creation Wizard

Navigate to Clusters in the dashboard.
Click on the cluster where you want to create the dataset.
From the cluster detail page, click Create Dataset (or navigate to the Datasets tab and click the create button).

Cluster detail page with Create Dataset button

The dataset creation wizard opens with four steps: Create Dataset, Schema, Fields, and Pipeline.

Step 2: Name Your Dataset

Dataset wizard step 1 — name and description

Field	Description	Required
Dataset name	A descriptive name for the dataset (minimum 3 characters)	Yes
Description	A brief description of what this dataset contains	No

Examples:

Name: bioRxiv Preprints 2024 / Description: Pre-publication biology and life sciences papers from 2024
Name: Product Manuals / Description: Technical documentation for all product lines

Click Next to create the dataset on the cluster and proceed to schema selection.

info

The dataset is created on the cluster as soon as you complete this step. If you cancel the wizard after this point, the dataset still exists (with no pipeline configured) and can be configured later from the dataset settings page.

Step 3: Choose a Schema Preset

The schema defines the structure of entries in your dataset — what metadata fields each document has and what file types are expected.

Dataset wizard step 2 — schema preset selection

Two schema presets are available:

Default

A general-purpose schema for standard document processing. Expects a PDF file and supports basic metadata:

Metadata Field	Type	Required
`title`	String	Yes
`author`	String	No

Scientific Articles

An extended schema designed for academic publications. Includes fields commonly found in scientific papers:

Metadata Field	Type	Required
`doi`	String	Yes
`title`	String	Yes
`abstract`	String	No
`authors`	String	No
`published_date`	String	No
`journal`	String	No
`keywords`	String	No

Select the preset that best matches your data and click Next.

tip

If neither preset fits your needs, you have two options: select the closest preset and customize the fields in the next step (you can add, remove, or modify metadata fields), or choose the Custom option to build your schema entirely from scratch.

Step 4: Customize Fields

Why fields matter

Fields are more than labels — they drive key platform capabilities:

Faceted search — Fields power Meilisearch faceted search, letting users filter entries by specific field values (e.g., filter by author, journal, or date).
Structured retrieval — Fields enable precise, structured entry retrieval through the API and UI.
LLM-augmented understanding — AI agents use field metadata for richer context when answering questions about your data.
Well-defined fields make your data more discoverable, filterable, and useful across the platform.

This step lets you fine-tune the metadata fields from the schema preset.

Dataset wizard step 3 — field customization

You can:

Add fields — Define additional metadata properties for your entries
Remove fields — Remove preset fields that are not relevant
Change field types — Modify the data type of a field
Mark fields as required — Control which fields must be provided when creating entries

The field customizations are saved to the dataset schema. All entries in this dataset will follow this schema.

Click Next to proceed to pipeline configuration.

Step 5: Configure the Processing Pipeline

The pipeline defines how uploaded documents are automatically processed into searchable content. This is the most important configuration step — it determines what happens when you upload a file.

Dataset wizard step 4 — pipeline preset selection

Pipeline Preset: General Purpose

The platform ships with one production-ready pipeline preset:

General Purpose — Processes PDF, DOCX, and image files through a complete pipeline:

Upload -> OCR -> Link Figures -> Chunk -> Embed -> Store

This preset runs the following processing stages:

Stage	What It Does
Fetch Entry	Downloads the uploaded file from cluster storage
OCR	Extracts text and images using Mistral OCR
Link Figures	Resolves figure references in the extracted text and converts images to PNG
Register Figures	Saves extracted figures back to cluster storage
Chunk	Splits extracted text into semantic chunks (1,024 tokens with 128-token overlap)
Embed	Generates embedding vectors for each chunk using the cluster's configured embedding provider
Store Chunks	Saves chunk text and embedding vectors to the vector database
Save Content	Saves the processed text content to cluster storage
Update Status	Marks the entry as "processed"

Custom JSON

If you need a different pipeline configuration, select Custom JSON and paste your pipeline configuration. Custom pipelines compose the same underlying components in a different order or with different parameters.

Need a custom pipeline?

Contact us at contact@alien.club — we can provide tailored JSON pipeline configurations for your specific use case, including custom processing stages, parameters, and component ordering.

Pipeline Editor Coming Soon

A visual pipeline editor is under development. Currently, custom pipelines require a JSON configuration.

Auto-Trigger

A toggle at the bottom of the pipeline step controls automatic processing:

Enabled (default) — The pipeline runs automatically whenever a file is uploaded to this dataset
Disabled — Files are uploaded but not processed until you manually trigger the pipeline

tip

Keep auto-trigger enabled unless you need to batch-upload many files and trigger processing later. Auto-trigger ensures every uploaded document is immediately processed and becomes searchable.

Finish

Click Configure Pipeline to save the pipeline configuration and complete the wizard. You are redirected to the dataset detail page.

What Happens After Creation

Your dataset is now ready to receive documents. Here is the state after completing the wizard:

The dataset exists on your cluster with the configured schema
The pipeline is configured and (if auto-trigger is enabled) will process any uploaded files
No entries exist yet — the dataset is empty

From the dataset detail page, you can:

Upload documents — Add files that will be processed by the pipeline
View entries — See the list of documents and their processing status
Modify settings — Change the dataset description, schema, or pipeline configuration

Dataset detail page with entries list

Understanding Pipeline Configuration

The pipeline configuration is stored as JSON on the dataset. Each step in the pipeline specifies:

Component — The processing component to run (e.g., mistral-ocr-processor-1.0.0)
Dependencies — Which steps must complete before this step runs
Inputs — Artifacts from previous steps that this step consumes
Outputs — Artifacts this step produces for downstream steps
Parameters — Configuration specific to this component

Steps run as a directed acyclic graph (DAG) — steps with no dependency on each other can run in parallel. The platform handles scheduling, retries, and artifact passing automatically.

For details on pipeline architecture and all available components, see Pipelines (Concept).

Next Steps

Upload Documents — Add files to your dataset and monitor processing
Configure Pipeline — Modify pipeline settings after dataset creation
Datasets and Entries (Concept) — Understand the data model in depth
Pipelines (Concept) — Learn about all available pipeline components and presets

Prerequisites​

Step 1: Open the Dataset Creation Wizard​

Step 2: Name Your Dataset​

Step 3: Choose a Schema Preset​

Default​

Scientific Articles​

Step 4: Customize Fields​

Step 5: Configure the Processing Pipeline​

Pipeline Preset: General Purpose​

Custom JSON​

Auto-Trigger​

Finish​

What Happens After Creation​

Understanding Pipeline Configuration​

Next Steps​