Datasets and Entries

Datasets and entries are the core data model of the Alien Intelligence platform. A dataset is a collection of documents with shared configuration — schema, pipeline settings, and metadata. An entry is an individual document within a dataset, tracked through its entire lifecycle from upload to fully searchable.

Datasets

A dataset represents a logical grouping of documents. It defines what kind of documents it contains, how they should be processed, and how they are organized.

Dataset Properties

| Property | Description |
| --- | --- |
| Name | Human-readable name (e.g., "BioRxiv 2024 Archive") |
| Slug | URL-safe identifier, unique within the cluster |
| Type | Schema type: `text`, `audio`, `voice`, or `images` |
| Schema | Typed field definitions with versioning |
| Pipeline config | Processing pipeline assignment and trigger rules |
| Visibility | `public` or `private` |
| Description | Markdown-formatted description |
| License | License metadata for the collection |
| Provider | Source or provider information |
| Tags | Searchable tags for categorization |

Dataset Types

The dataset type determines how entries are processed and indexed:

| Type | Typical Content | Processing | Search Behavior |
| --- | --- | --- | --- |
| `text` | PDFs, DOCX, XML documents | OCR, chunking, embedding | Full-text + semantic search |
| `audio` | Audio recordings | Transcription, embedding | Transcript search |
| `voice` | Voice samples | Voice model training | Model-based lookup |
| `images` | Image collections | Vision processing, captioning | Caption + metadata search |

Dataset Schema

Each dataset has a typed schema that defines the structure of its entry metadata. Schemas are versioned — when the schema changes, the version increments, and existing entries can be identified as needing reprocessing.

```json
{
  "version": 2,
  "fields": {
    "title": { "type": "string", "required": true },
    "authors": { "type": "array", "items": "string" },
    "doi": { "type": "string" },
    "publication_date": { "type": "date" },
    "abstract": { "type": "text" }
  }
}
```

Schema field types influence how Meilisearch indexes the data — different fields are marked as searchable, filterable, or sortable based on their type.
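As an illustration, a schema-to-index-settings translation might look like the sketch below. The exact type-to-capability rules are not documented here, so the mapping in this function is an assumption, not the platform's actual behavior:

```python
# Sketch: derive Meilisearch-style index settings from a dataset schema.
# The field-type-to-capability mapping below is assumed for illustration.

def index_settings(schema: dict) -> dict:
    searchable, filterable, sortable = [], [], []
    for name, spec in schema["fields"].items():
        ftype = spec["type"]
        if ftype in ("string", "text"):
            searchable.append(name)       # free-text fields are searched
        if ftype in ("string", "array"):
            filterable.append(name)       # exact-value fields can filter
        if ftype == "date":
            sortable.append(name)         # dates support sorting and ranges
            filterable.append(name)
    return {
        "searchableAttributes": searchable,
        "filterableAttributes": filterable,
        "sortableAttributes": sortable,
    }
```

Run against the example schema above, `title`, `doi`, and `abstract` would become searchable while `publication_date` becomes sortable and filterable.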

Collections

Datasets can be organized into collections — higher-level groupings that help users navigate large numbers of datasets. A collection is a logical container; it does not affect how data is stored or processed.

Entries

An entry is a single document within a dataset. It tracks the document's files, processing state, and metadata through a manifest system.

Entry Properties

| Property | Description |
| --- | --- |
| Name | Document name (e.g., "paper-2024-001.pdf") |
| Dataset | Parent dataset reference |
| Status | Current lifecycle state |
| Manifest | JSONB structure tracking all files, hashes, and metadata |
| Version | Optimistic locking counter, incremented on every update |
| Storage path | Base path in MinIO for all entry files |
| MIME type | Primary file type (e.g., `application/pdf`) |
| File size | Size of the primary uploaded file |

Entry Lifecycle

Every entry moves through a defined set of states:

| Status | Meaning |
| --- | --- |
| `pending` | Entry record created, no file uploaded yet |
| `uploaded` | File uploaded to MinIO, ready for processing |
| `processing` | Pipeline is running (OCR, chunking, embedding) |
| `processed` | Pipeline complete — text extracted, vectors indexed, content searchable |
| `error` | Pipeline failed — can be retried manually |
> **Tip:** The status transition from `uploaded` to `processing` happens automatically when the dataset has a pipeline configured with `trigger: "on_upload"`. You can also trigger processing manually through the API.

The Manifest

The manifest is a JSONB structure stored on each entry that tracks every file associated with the document. It serves as the single source of truth for what files exist, where they are stored, and their integrity checksums.

```json
{
  "original": {
    "files": [
      {
        "key": "datasets/abc/entries/123/original/1711234567_paper.pdf",
        "filename": "paper.pdf",
        "mime_type": "application/pdf",
        "size": 2456789,
        "hash": "sha256:a1b2c3d4...",
        "created_at": "2026-03-26T10:00:00Z"
      }
    ]
  },
  "processed": {
    "files": [
      {
        "key": "datasets/abc/entries/123/processed/content.json",
        "filename": "content.json",
        "mime_type": "application/json",
        "size": 156000,
        "hash": "sha256:e5f6g7h8..."
      }
    ],
    "figures": [
      {
        "key": "datasets/abc/entries/123/processed/figures/fig_001.png",
        "filename": "fig_001.png",
        "mime_type": "image/png",
        "size": 45000
      }
    ]
  }
}
```

The manifest is updated atomically with row-level locking (SELECT FOR UPDATE) to prevent concurrent modification issues during pipeline processing.
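The merge that runs while the lock is held can be sketched as follows. The section and field names mirror the manifest example above, but the function itself is an illustrative stand-in for the JSONB update, not the platform's code:

```python
# Sketch of the manifest update that runs inside a transaction which
# first acquires the row lock, e.g.:
#   SELECT manifest, version FROM entries WHERE id = %s FOR UPDATE
# Table and column names here are assumptions for illustration.

def merge_into_manifest(manifest: dict, section: str, file_record: dict) -> dict:
    """Return a new manifest with file_record appended to the given section."""
    updated = {k: {**v} for k, v in manifest.items()}  # copy existing sections
    bucket = updated.setdefault(section, {"files": []})
    bucket["files"] = list(bucket.get("files", [])) + [file_record]
    return updated
```

Because the row is locked for the duration of the read-merge-write cycle, two pipeline steps cannot interleave their updates and drop each other's file records.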

File Storage Structure

Entry files are organized in MinIO with a predictable path hierarchy:

```
{bucket}/
  datasets/{dataset_id}/
    entries/{entry_id}/
      original/
        {timestamp}_{filename}    # Original uploaded file
      processing/
        {timestamp}_{filename}    # Intermediate pipeline artifacts
      processed/
        content.json              # Extracted text content
        figures/
          fig_001.png             # Extracted figures
          fig_002.png
```

Every file write includes a SHA256 integrity check. The manifest records the hash, allowing verification that stored files have not been corrupted.

Supported File Types

The platform supports a range of document formats through its pipeline system. OCR processing is powered by Mistral Document AI.

| Format | Extension | Processing Path |
| --- | --- | --- |
| PDF | `.pdf` | Mistral OCR extraction |
| DOCX | `.docx` | Mistral OCR extraction |
| PPTX | `.pptx` | Mistral OCR extraction |
| Images | `.png`, `.jpg`, `.jpeg`, `.avif` | Mistral OCR or vision processing |
| JATS XML | `.xml` | Structured XML parsing (academic publishing standard) |
| MECA archives | `.meca`, `.zip` | Archive extraction followed by JATS parsing |

Additional formats can be supported by adding new pipeline components. The platform's pipeline system is composable — a new format typically requires only a new processor component, while reusing the existing chunking, embedding, and registration infrastructure.
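The processor interface itself is not documented here, so the following is only a sketch of what a composable format processor might look like; the `Processor` protocol, its method names, and the return shape are all invented for illustration:

```python
from typing import Protocol

class Processor(Protocol):
    """Hypothetical interface for a format-specific processor component."""
    extensions: tuple[str, ...]

    def process(self, raw: bytes) -> dict:
        """Turn raw file bytes into extracted content for chunking."""
        ...

class PlainTextProcessor:
    """Trivial example: decode bytes and hand the text to the shared
    chunking/embedding stages downstream."""
    extensions = (".txt",)

    def process(self, raw: bytes) -> dict:
        return {"text": raw.decode("utf-8"), "figures": []}
```

Under this kind of design, chunking, embedding, and search registration stay format-agnostic: they consume the processor's output rather than the raw file.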

Need support for an additional format? Contact us to discuss your requirements.

Upload Flow

When a file is uploaded to an entry, the following happens:

  1. Authentication — The request is authenticated via the platform proxy (user token or API token)
  2. Row lock — The entry record is locked to prevent concurrent manifest updates
  3. Integrity check — SHA256 hash is computed for the uploaded file
  4. Storage write — File bytes are written to MinIO at the appropriate path
  5. Manifest update — The entry's manifest JSONB is updated with the new file's metadata
  6. Status transition — If this is the first original file and status is pending, it transitions to uploaded
  7. Version increment — The entry's version counter increments (used for optimistic locking and sync)
  8. Change log — A row is queued in the resource change log for batch sync to the platform
  9. Pipeline trigger — If the dataset has an auto-trigger pipeline, processing starts in the background
> **Info:** The upload endpoint accepts a `file_type` parameter: `original` (default), `processing`, or `processed`. This makes the platform flexible — you can use the built-in processing pipelines, bring your own external processing, or combine both approaches. Upload original files and let the platform process them, or directly upload processed results if you handle processing yourself. The platform does not lock you into any single processing path.
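The core of the flow (steps 3 through 7) can be sketched end to end. The dicts below are in-memory stand-ins for MinIO and the entry row, locking and the change log are elided, and all names are assumptions:

```python
import hashlib
import time

# In-memory sketch of the upload flow: integrity check, storage write,
# manifest update, status transition, and version increment.

def upload_original(entry: dict, storage: dict, filename: str, data: bytes) -> dict:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()      # integrity check
    key = f"{entry['storage_path']}/original/{int(time.time())}_{filename}"
    storage[key] = data                                        # storage write
    entry["manifest"].setdefault("original", {"files": []})["files"].append({
        "key": key, "filename": filename, "size": len(data), "hash": digest,
    })                                                         # manifest update
    if entry["status"] == "pending":
        entry["status"] = "uploaded"                           # status transition
    entry["version"] += 1                                      # optimistic locking
    return entry
```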

Metadata Sync

Entries and datasets are synchronized to the platform's catalog through a batch sync mechanism running every 30 seconds. The sync includes:

  • Dataset names, descriptions, entry counts, and sizes
  • Entry names, statuses, MIME types, and file sizes
  • Version numbers for conflict resolution

The platform uses this metadata for the catalog dashboard, search routing, and analytics. Document content is never included in the sync — only summary metadata.

> **Note:** Sync uses version-ordered conflict resolution: the platform only applies an update if the incoming version is greater than the existing version. This prevents out-of-order updates from overwriting newer data during network partitions.
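Version-ordered conflict resolution reduces to a single comparison, sketched here with an in-memory catalog standing in for the platform's store:

```python
# Sketch of version-ordered conflict resolution: an incoming record is
# applied only if its version is strictly greater than the stored one.

def apply_sync(catalog: dict, incoming: dict) -> bool:
    """Apply incoming entry metadata to the catalog; return True if applied."""
    current = catalog.get(incoming["id"])
    if current is not None and incoming["version"] <= current["version"]:
        return False          # stale or duplicate update: ignore it
    catalog[incoming["id"]] = incoming
    return True
```

Because entry versions only ever increment, a delayed update arriving after a newer one simply compares lower and is dropped.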

Next Steps

  • Pipelines — How entries are processed from raw files into searchable content
  • Search — How to search across entries using keyword and semantic search
  • Data Clusters — The infrastructure that hosts datasets and entries
  • Create a Dataset — Step-by-step guide to creating a dataset
  • Upload Documents — Upload files and monitor processing