Datasets and Entries

Datasets and entries are the core data model of the Alien Intelligence platform. A dataset is a collection of documents with shared configuration — schema, pipeline settings, and metadata. An entry is an individual document within a dataset, tracked through its entire lifecycle from upload to fully searchable.

Datasets

A dataset represents a logical grouping of documents. It defines what kind of documents it contains, how they should be processed, and how they are organized.

Dataset Properties

| Property | Description |
| --- | --- |
| Name | Human-readable name (e.g., "BioRxiv 2024 Archive") |
| Slug | URL-safe identifier, unique within the cluster |
| Type | Schema type: `text`, `audio`, `voice`, or `images` |
| Schema | Typed field definitions with versioning |
| Pipeline config | Processing pipeline assignment and trigger rules |
| Visibility | `public` or `private` |
| Description | Markdown-formatted description |
| License | License metadata for the collection |
| Provider | Source or provider information |
| Tags | Searchable tags for categorization |

Dataset Types

The dataset type determines how entries are processed and indexed:

| Type | Typical Content | Processing | Search Behavior |
| --- | --- | --- | --- |
| `text` | PDFs, DOCX, XML documents | OCR, chunking, embedding | Full-text + semantic search |
| `audio` | Audio recordings | Transcription, embedding | Transcript search |
| `voice` | Voice samples | Voice model training | Model-based lookup |
| `images` | Image collections | Vision processing, captioning | Caption + metadata search |

Dataset Schema

Each dataset has a typed schema that defines the structure of its entry metadata. Schemas are versioned — when the schema changes, the version increments, and existing entries can be identified as needing reprocessing.

```json
{
  "version": 2,
  "fields": {
    "title": { "type": "string", "required": true },
    "authors": { "type": "array", "items": "string" },
    "doi": { "type": "string" },
    "publication_date": { "type": "date" },
    "abstract": { "type": "text" }
  }
}
```

Schema field types influence how Meilisearch indexes the data — different fields are marked as searchable, filterable, or sortable based on their type.
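As an illustration, a schema-to-index-settings translation might look like the sketch below. The exact type-to-capability rules are not documented here, so the mapping in this function is an assumption, not the platform's actual behavior:

```python
# Sketch: derive Meilisearch-style index settings from a dataset schema.
# The field-type-to-capability mapping below is assumed for illustration.

def index_settings(schema: dict) -> dict:
    searchable, filterable, sortable = [], [], []
    for name, spec in schema["fields"].items():
        ftype = spec["type"]
        if ftype in ("string", "text"):
            searchable.append(name)       # free-text fields are searched
        if ftype in ("string", "array"):
            filterable.append(name)       # exact-value fields can filter
        if ftype == "date":
            sortable.append(name)         # dates support sorting and ranges
            filterable.append(name)
    return {
        "searchableAttributes": searchable,
        "filterableAttributes": filterable,
        "sortableAttributes": sortable,
    }
```

Run against the example schema above, `title`, `doi`, and `abstract` would become searchable while `publication_date` becomes sortable and filterable.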

Collections

Datasets can be organized into collections — higher-level groupings that help users navigate large numbers of datasets. A collection is a logical container; it does not affect how data is stored or processed.

Entries

An entry is a single document within a dataset. It tracks the document's files, processing state, and metadata through a manifest system.

Entry Properties

| Property | Description |
| --- | --- |
| Name | Document name (e.g., "paper-2024-001.pdf") |
| Dataset | Parent dataset reference |
| Status | Current lifecycle state |
| Manifest | JSONB structure tracking all files, hashes, and metadata |
| Version | Optimistic locking counter, incremented on every update |
| Storage path | Base path in MinIO for all entry files |
| MIME type | Primary file type (e.g., `application/pdf`) |
| File size | Size of the primary uploaded file |

Entry Lifecycle

Every entry moves through a defined set of states:

| Status | Meaning |
| --- | --- |
| `pending` | Entry record created, no file uploaded yet |
| `uploaded` | File uploaded to MinIO, ready for processing |
| `processing` | Pipeline is running (OCR, chunking, embedding) |
| `processed` | Pipeline complete — text extracted, vectors indexed, content searchable |
| `error` | Pipeline failed — can be retried manually |
> **Tip:** The status transition from `uploaded` to `processing` happens automatically when the dataset has a pipeline configured with `trigger: "on_upload"`. You can also trigger processing manually through the API.

The Manifest

The manifest is a JSONB structure stored on each entry that tracks every file associated with the document. It serves as the single source of truth for what files exist, where they are stored, and their integrity checksums.

```json
{
  "original": {
    "files": [
      {
        "key": "datasets/abc/entries/123/original/1711234567_paper.pdf",
        "filename": "paper.pdf",
        "mime_type": "application/pdf",
        "size": 2456789,
        "hash": "sha256:a1b2c3d4...",
        "created_at": "2026-03-26T10:00:00Z"
      }
    ]
  },
  "processed": {
    "files": [
      {
        "key": "datasets/abc/entries/123/processed/content.json",
        "filename": "content.json",
        "mime_type": "application/json",
        "size": 156000,
        "hash": "sha256:e5f6g7h8..."
      }
    ],
    "figures": [
      {
        "key": "datasets/abc/entries/123/processed/figures/fig_001.png",
        "filename": "fig_001.png",
        "mime_type": "image/png",
        "size": 45000
      }
    ]
  }
}
```

The manifest is updated atomically with row-level locking (SELECT FOR UPDATE) to prevent concurrent modification issues during pipeline processing.
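The merge that runs while the lock is held can be sketched as follows. The section and field names mirror the manifest example above, but the function itself is an illustrative stand-in for the JSONB update, not the platform's code:

```python
# Sketch of the manifest update that runs inside a transaction which
# first acquires the row lock, e.g.:
#   SELECT manifest, version FROM entries WHERE id = %s FOR UPDATE
# Table and column names here are assumptions for illustration.

def merge_into_manifest(manifest: dict, section: str, file_record: dict) -> dict:
    """Return a new manifest with file_record appended to the given section."""
    updated = {k: {**v} for k, v in manifest.items()}  # copy existing sections
    bucket = updated.setdefault(section, {"files": []})
    bucket["files"] = list(bucket.get("files", [])) + [file_record]
    return updated
```

Because the row is locked for the duration of the read-merge-write cycle, two pipeline steps cannot interleave their updates and drop each other's file records.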

File Storage Structure

Entry files are organized in MinIO with a predictable path hierarchy:

```
{bucket}/
  datasets/{dataset_id}/
    entries/{entry_id}/
      original/
        {timestamp}_{filename}    # Original uploaded file
      processing/
        {timestamp}_{filename}    # Intermediate pipeline artifacts
      processed/
        content.json              # Extracted text content
        figures/
          fig_001.png             # Extracted figures
          fig_002.png
```

Every file write includes a SHA256 integrity check. The manifest records the hash, allowing verification that stored files have not been corrupted.

Supported File Types

The platform supports a range of document formats through its pipeline system. OCR processing is powered by Mistral Document AI.

| Format | Extension | Processing Path |
| --- | --- | --- |
| PDF | `.pdf` | Mistral OCR extraction |
| DOCX | `.docx` | Mistral OCR extraction |
| PPTX | `.pptx` | Mistral OCR extraction |
| Images | `.png`, `.jpg`, `.jpeg`, `.avif` | Mistral OCR or vision processing |
| JATS XML | `.xml` | Structured XML parsing (academic publishing standard) |
| MECA archives | `.meca`, `.zip` | Archive extraction followed by JATS parsing |

Additional formats can be supported by adding new pipeline components. The platform's pipeline system is composable — a new format typically requires only a new processor component, while reusing the existing chunking, embedding, and registration infrastructure.
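The processor interface itself is not documented here, so the following is only a sketch of what a composable format processor might look like; the `Processor` protocol, its method names, and the return shape are all invented for illustration:

```python
from typing import Protocol

class Processor(Protocol):
    """Hypothetical interface for a format-specific processor component."""
    extensions: tuple[str, ...]

    def process(self, raw: bytes) -> dict:
        """Turn raw file bytes into extracted content for chunking."""
        ...

class PlainTextProcessor:
    """Trivial example: decode bytes and hand the text to the shared
    chunking/embedding stages downstream."""
    extensions = (".txt",)

    def process(self, raw: bytes) -> dict:
        return {"text": raw.decode("utf-8"), "figures": []}
```

Under this kind of design, chunking, embedding, and search registration stay format-agnostic: they consume the processor's output rather than the raw file.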

Need support for an additional format? Contact us to discuss your requirements.

Upload Flow

When a file is uploaded to an entry, the following happens:

  1. Authentication — The request is authenticated via the platform proxy (user token or API token)
  2. Row lock — The entry record is locked to prevent concurrent manifest updates
  3. Integrity check — SHA256 hash is computed for the uploaded file
  4. Storage write — File bytes are written to MinIO at the appropriate path
  5. Manifest update — The entry's manifest JSONB is updated with the new file's metadata
  6. Status transition — If this is the first original file and status is pending, it transitions to uploaded
  7. Version increment — The entry's version counter increments (used for optimistic locking and sync)
  8. Change log — A row is queued in the resource change log for batch sync to the platform
  9. Pipeline trigger — If the dataset has an auto-trigger pipeline, processing starts in the background
> **Info:** The upload endpoint accepts a `file_type` parameter: `original` (default), `processing`, or `processed`. This makes the platform flexible — you can use the built-in processing pipelines, bring your own external processing, or combine both approaches. Upload original files and let the platform process them, or directly upload processed results if you handle processing yourself. The platform does not lock you into any single processing path.
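The core of the flow (steps 3 through 7) can be sketched end to end. The dicts below are in-memory stand-ins for MinIO and the entry row, locking and the change log are elided, and all names are assumptions:

```python
import hashlib
import time

# In-memory sketch of the upload flow: integrity check, storage write,
# manifest update, status transition, and version increment.

def upload_original(entry: dict, storage: dict, filename: str, data: bytes) -> dict:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()      # integrity check
    key = f"{entry['storage_path']}/original/{int(time.time())}_{filename}"
    storage[key] = data                                        # storage write
    entry["manifest"].setdefault("original", {"files": []})["files"].append({
        "key": key, "filename": filename, "size": len(data), "hash": digest,
    })                                                         # manifest update
    if entry["status"] == "pending":
        entry["status"] = "uploaded"                           # status transition
    entry["version"] += 1                                      # optimistic locking
    return entry
```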

Metadata Sync

Entries and datasets are synchronized to the platform's catalog through a batch sync mechanism running every 30 seconds. The sync includes:

  • Dataset names, descriptions, entry counts, and sizes
  • Entry names, statuses, MIME types, and file sizes
  • Version numbers for conflict resolution

The platform uses this metadata for the catalog dashboard, search routing, and analytics. Document content is never included in the sync — only summary metadata.

> **Note:** Sync uses version-ordered conflict resolution: the platform only applies an update if the incoming version is greater than the existing version. This prevents out-of-order updates from overwriting newer data during network partitions.
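Version-ordered conflict resolution reduces to a single comparison, sketched here with an in-memory catalog standing in for the platform's store:

```python
# Sketch of version-ordered conflict resolution: an incoming record is
# applied only if its version is strictly greater than the stored one.

def apply_sync(catalog: dict, incoming: dict) -> bool:
    """Apply incoming entry metadata to the catalog; return True if applied."""
    current = catalog.get(incoming["id"])
    if current is not None and incoming["version"] <= current["version"]:
        return False          # stale or duplicate update: ignore it
    catalog[incoming["id"]] = incoming
    return True
```

Because entry versions only ever increment, a delayed update arriving after a newer one simply compares lower and is dropped.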

Next Steps

  • Pipelines — How entries are processed from raw files into searchable content
  • Search — How to search across entries using keyword and semantic search
  • Data Clusters — The infrastructure that hosts datasets and entries
  • Create a Dataset — Step-by-step guide to creating a dataset
  • Upload Documents — Upload files and monitor processing