Skip to main content

Create a Dataset

A dataset is a collection of documents (entries) inside a data cluster. When you create a dataset, you define its metadata schema and configure how uploaded documents are processed through a pipeline.

Dataset creation is a four-step wizard that guides you through naming, schema selection, field customization, and pipeline configuration.

Prerequisites

  • An active data cluster to host the dataset. If you do not have one yet, follow Create a Data Cluster.
  • Write access to your organization.

Step 1: Open the Dataset Creation Wizard

  1. Navigate to Clusters in the dashboard.
  2. Click on the cluster where you want to create the dataset.
  3. From the cluster detail page, click Create Dataset (or navigate to the Datasets tab and click the create button).

Cluster detail page with Create Dataset button

The dataset creation wizard opens with four steps: Create Dataset, Schema, Fields, and Pipeline.

Step 2: Name Your Dataset

Dataset wizard step 1 — name and description

FieldDescriptionRequired
Dataset nameA descriptive name for the dataset (minimum 3 characters)Yes
DescriptionA brief description of what this dataset containsNo

Examples:

  • Name: bioRxiv Preprints 2024 / Description: Pre-publication biology and life sciences papers from 2024
  • Name: Product Manuals / Description: Technical documentation for all product lines

Click Next to create the dataset on the cluster and proceed to schema selection.

info

The dataset is created on the cluster as soon as you complete this step. If you cancel the wizard after this point, the dataset still exists (with no pipeline configured) and can be configured later from the dataset settings page.

Step 3: Choose a Schema Preset

The schema defines the structure of entries in your dataset — what metadata fields each document has and what file types are expected.

Dataset wizard step 2 — schema preset selection

Two schema presets are available:

Default

A general-purpose schema for standard document processing. Expects a PDF file and supports basic metadata:

Metadata FieldTypeRequired
titleStringYes
authorStringNo

Scientific Articles

An extended schema designed for academic publications. Includes fields commonly found in scientific papers:

Metadata FieldTypeRequired
doiStringYes
titleStringYes
abstractStringNo
authorsStringNo
published_dateStringNo
journalStringNo
keywordsStringNo

Select the preset that best matches your data and click Next.

tip

If neither preset fits your needs, you have two options: select the closest preset and customize the fields in the next step (you can add, remove, or modify metadata fields), or choose the Custom option to build your schema entirely from scratch.

Step 4: Customize Fields

Why fields matter

Fields are more than labels — they drive key platform capabilities:

  • Faceted search — Fields power Meilisearch faceted search, letting users filter entries by specific field values (e.g., filter by author, journal, or date).
  • Structured retrieval — Fields enable precise, structured entry retrieval through the API and UI.
  • LLM-augmented understanding — AI agents use field metadata for richer context when answering questions about your data.
  • Well-defined fields make your data more discoverable, filterable, and useful across the platform.

This step lets you fine-tune the metadata fields from the schema preset.

Dataset wizard step 3 — field customization

You can:

  • Add fields — Define additional metadata properties for your entries
  • Remove fields — Remove preset fields that are not relevant
  • Change field types — Modify the data type of a field
  • Mark fields as required — Control which fields must be provided when creating entries

The field customizations are saved to the dataset schema. All entries in this dataset will follow this schema.

Click Next to proceed to pipeline configuration.

Step 5: Configure the Processing Pipeline

The pipeline defines how uploaded documents are automatically processed into searchable content. This is the most important configuration step — it determines what happens when you upload a file.

Dataset wizard step 4 — pipeline preset selection

Pipeline Preset: General Purpose

The platform ships with one production-ready pipeline preset:

General Purpose — Processes PDF, DOCX, and image files through a complete pipeline:

Upload -> OCR -> Link Figures -> Chunk -> Embed -> Store

This preset runs the following processing stages:

StageWhat It Does
Fetch EntryDownloads the uploaded file from cluster storage
OCRExtracts text and images using Mistral OCR
Link FiguresResolves figure references in the extracted text and converts images to PNG
Register FiguresSaves extracted figures back to cluster storage
ChunkSplits extracted text into semantic chunks (1,024 tokens with 128-token overlap)
EmbedGenerates embedding vectors for each chunk using the cluster's configured embedding provider
Store ChunksSaves chunk text and embedding vectors to the vector database
Save ContentSaves the processed text content to cluster storage
Update StatusMarks the entry as "processed"

Custom JSON

If you need a different pipeline configuration, select Custom JSON and paste your pipeline configuration. Custom pipelines compose the same underlying components in a different order or with different parameters.

Need a custom pipeline?

Contact us at contact@alien.club — we can provide tailored JSON pipeline configurations for your specific use case, including custom processing stages, parameters, and component ordering.

Pipeline Editor Coming Soon

A visual pipeline editor is under development. Currently, custom pipelines require a JSON configuration.

Auto-Trigger

A toggle at the bottom of the pipeline step controls automatic processing:

  • Enabled (default) — The pipeline runs automatically whenever a file is uploaded to this dataset
  • Disabled — Files are uploaded but not processed until you manually trigger the pipeline
tip

Keep auto-trigger enabled unless you need to batch-upload many files and trigger processing later. Auto-trigger ensures every uploaded document is immediately processed and becomes searchable.

Finish

Click Configure Pipeline to save the pipeline configuration and complete the wizard. You are redirected to the dataset detail page.

What Happens After Creation

Your dataset is now ready to receive documents. Here is the state after completing the wizard:

  • The dataset exists on your cluster with the configured schema
  • The pipeline is configured and (if auto-trigger is enabled) will process any uploaded files
  • No entries exist yet — the dataset is empty

From the dataset detail page, you can:

  • Upload documents — Add files that will be processed by the pipeline
  • View entries — See the list of documents and their processing status
  • Modify settings — Change the dataset description, schema, or pipeline configuration

Dataset detail page with entries list

Understanding Pipeline Configuration

The pipeline configuration is stored as JSON on the dataset. Each step in the pipeline specifies:

  • Component — The processing component to run (e.g., mistral-ocr-processor-1.0.0)
  • Dependencies — Which steps must complete before this step runs
  • Inputs — Artifacts from previous steps that this step consumes
  • Outputs — Artifacts this step produces for downstream steps
  • Parameters — Configuration specific to this component

Steps run as a directed acyclic graph (DAG) — steps with no dependency on each other can run in parallel. The platform handles scheduling, retries, and artifact passing automatically.

For details on pipeline architecture and all available components, see Pipelines (Concept).

Next Steps