Datasets and Entries
Datasets and entries are the core data model of the Alien Intelligence platform. A dataset is a collection of documents with shared configuration — schema, pipeline settings, and metadata. An entry is an individual document within a dataset, tracked through its entire lifecycle, from initial upload to fully searchable content.
Datasets
A dataset represents a logical grouping of documents. It defines what kind of documents it contains, how they should be processed, and how they are organized.
Dataset Properties
| Property | Description |
|---|---|
| Name | Human-readable name (e.g., "BioRxiv 2024 Archive") |
| Slug | URL-safe identifier, unique within the cluster |
| Type | Schema type: text, audio, voice, or images |
| Schema | Typed field definitions with versioning |
| Pipeline config | Processing pipeline assignment and trigger rules |
| Visibility | public or private |
| Description | Markdown-formatted description |
| License | License metadata for the collection |
| Provider | Source or provider information |
| Tags | Searchable tags for categorization |
Dataset Types
The dataset type determines how entries are processed and indexed:
| Type | Typical Content | Processing | Search Behavior |
|---|---|---|---|
| text | PDFs, DOCX, XML documents | OCR, chunking, embedding | Full-text + semantic search |
| audio | Audio recordings | Transcription, embedding | Transcript search |
| voice | Voice samples | Voice model training | Model-based lookup |
| images | Image collections | Vision processing, captioning | Caption + metadata search |
Dataset Schema
Each dataset has a typed schema that defines the structure of its entry metadata. Schemas are versioned — when the schema changes, the version increments, and existing entries can be identified as needing reprocessing.
```json
{
  "version": 2,
  "fields": {
    "title": { "type": "string", "required": true },
    "authors": { "type": "array", "items": "string" },
    "doi": { "type": "string" },
    "publication_date": { "type": "date" },
    "abstract": { "type": "text" }
  }
}
```
Schema field types influence how Meilisearch indexes the data — different fields are marked as searchable, filterable, or sortable based on their type.
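As a rough illustration of that mapping, the sketch below derives Meilisearch attribute lists from a dataset schema. The type-to-behavior rules (which types become searchable, filterable, or sortable) are assumptions for the example, not the platform's actual configuration.

```python
# Hypothetical mapping from schema field types to Meilisearch index settings.
SEARCHABLE_TYPES = {"string", "text"}
FILTERABLE_TYPES = {"string", "date", "array"}
SORTABLE_TYPES = {"date"}

def index_settings(schema: dict) -> dict:
    """Build Meilisearch attribute lists from a typed dataset schema."""
    searchable, filterable, sortable = [], [], []
    for name, spec in schema["fields"].items():
        ftype = spec["type"]
        if ftype in SEARCHABLE_TYPES:
            searchable.append(name)
        if ftype in FILTERABLE_TYPES:
            filterable.append(name)
        if ftype in SORTABLE_TYPES:
            sortable.append(name)
    return {
        "searchableAttributes": searchable,
        "filterableAttributes": filterable,
        "sortableAttributes": sortable,
    }

schema = {
    "version": 2,
    "fields": {
        "title": {"type": "string", "required": True},
        "authors": {"type": "array", "items": "string"},
        "publication_date": {"type": "date"},
        "abstract": {"type": "text"},
    },
}
settings = index_settings(schema)
```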
Collections
Datasets can be organized into collections — higher-level groupings that help users navigate large numbers of datasets. A collection is a logical container; it does not affect how data is stored or processed.
Entries
An entry is a single document within a dataset. It tracks the document's files, processing state, and metadata through a manifest system.
Entry Properties
| Property | Description |
|---|---|
| Name | Document name (e.g., "paper-2024-001.pdf") |
| Dataset | Parent dataset reference |
| Status | Current lifecycle state |
| Manifest | JSONB structure tracking all files, hashes, and metadata |
| Version | Optimistic locking counter, incremented on every update |
| Storage path | Base path in MinIO for all entry files |
| MIME type | Primary file type (e.g., application/pdf) |
| File size | Size of the primary uploaded file |
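The version counter drives optimistic locking: a writer records the version it read, and its update is accepted only if no one else has bumped the counter in the meantime. A minimal in-memory sketch of that rule (names are illustrative):

```python
def apply_update(entry: dict, expected_version: int, changes: dict) -> bool:
    """Apply changes only if no concurrent writer has bumped the version."""
    if entry["version"] != expected_version:
        return False  # stale read; caller must re-fetch and retry
    entry.update(changes)
    entry["version"] += 1  # every successful update increments the counter
    return True

entry = {"id": "123", "status": "pending", "version": 1}
assert apply_update(entry, expected_version=1, changes={"status": "uploaded"})
# A second writer still holding version 1 now loses:
assert not apply_update(entry, expected_version=1, changes={"status": "error"})
```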
Entry Lifecycle
Every entry moves through a defined set of states:
| Status | Meaning |
|---|---|
| pending | Entry record created, no file uploaded yet |
| uploaded | File uploaded to MinIO, ready for processing |
| processing | Pipeline is running (OCR, chunking, embedding) |
| processed | Pipeline complete — text extracted, vectors indexed, content searchable |
| error | Pipeline failed — can be retried manually |
The status transition from uploaded to processing happens automatically when the dataset has a pipeline configured with trigger: "on_upload". You can also trigger processing manually through the API.
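The lifecycle can be pictured as a small state machine. The transition table below is inferred from the states described above (e.g., a manual retry is modeled as error → processing); the exact rules are an assumption for illustration.

```python
# Assumed entry lifecycle transitions; "processed" is terminal.
TRANSITIONS = {
    "pending": {"uploaded"},
    "uploaded": {"processing"},
    "processing": {"processed", "error"},
    "processed": set(),
    "error": {"processing"},  # manual retry
}

def can_transition(current: str, target: str) -> bool:
    return target in TRANSITIONS.get(current, set())
```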
The Manifest
The manifest is a JSONB structure stored on each entry that tracks every file associated with the document. It serves as the single source of truth for what files exist, where they are stored, and their integrity checksums.
```json
{
  "original": {
    "files": [
      {
        "key": "datasets/abc/entries/123/original/1711234567_paper.pdf",
        "filename": "paper.pdf",
        "mime_type": "application/pdf",
        "size": 2456789,
        "hash": "sha256:a1b2c3d4...",
        "created_at": "2026-03-26T10:00:00Z"
      }
    ]
  },
  "processed": {
    "files": [
      {
        "key": "datasets/abc/entries/123/processed/content.json",
        "filename": "content.json",
        "mime_type": "application/json",
        "size": 156000,
        "hash": "sha256:e5f6g7h8..."
      }
    ],
    "figures": [
      {
        "key": "datasets/abc/entries/123/processed/figures/fig_001.png",
        "filename": "fig_001.png",
        "mime_type": "image/png",
        "size": 45000
      }
    ]
  }
}
```
The manifest is updated atomically with row-level locking (SELECT FOR UPDATE) to prevent concurrent modification issues during pipeline processing.
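The lock-then-update pattern looks roughly like the sketch below. The page describes PostgreSQL's SELECT ... FOR UPDATE; to keep the example self-contained it uses SQLite, where BEGIN IMMEDIATE plays the same serializing role. Table and column names are assumptions, not the platform's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, manifest TEXT, version INTEGER)")
conn.execute("INSERT INTO entries VALUES ('123', '{}', 1)")

def add_file(conn, entry_id, section, file_record):
    """Atomically merge one file record into the entry's manifest."""
    conn.execute("BEGIN IMMEDIATE")  # Postgres equivalent: SELECT ... FOR UPDATE row lock
    try:
        manifest_json, version = conn.execute(
            "SELECT manifest, version FROM entries WHERE id = ?", (entry_id,)
        ).fetchone()
        manifest = json.loads(manifest_json)
        manifest.setdefault(section, {}).setdefault("files", []).append(file_record)
        conn.execute(
            "UPDATE entries SET manifest = ?, version = ? WHERE id = ?",
            (json.dumps(manifest), version + 1, entry_id),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

add_file(conn, "123", "original", {"filename": "paper.pdf", "hash": "sha256:..."})
```

Holding the row lock for the duration of the read-modify-write means two pipeline steps touching the same entry serialize instead of overwriting each other's manifest changes.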
File Storage Structure
Entry files are organized in MinIO with a predictable path hierarchy:
```
{bucket}/
  datasets/{dataset_id}/
    entries/{entry_id}/
      original/
        {timestamp}_(unknown)    # Original uploaded file
      processing/
        {timestamp}_(unknown)    # Intermediate pipeline artifacts
      processed/
        content.json             # Extracted text content
        figures/
          fig_001.png            # Extracted figures
          fig_002.png
```
Every file write includes a SHA256 integrity check. The manifest records the hash, allowing verification that stored files have not been corrupted.
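The integrity check amounts to hashing the bytes on write, recording a "sha256:<hex>" tag in the manifest (matching the format shown above), and re-hashing on read to detect corruption. A minimal sketch:

```python
import hashlib

def sha256_tag(data: bytes) -> str:
    """Hash file bytes into the manifest's 'sha256:<hex>' format."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def verify(data: bytes, recorded_hash: str) -> bool:
    """Re-hash stored bytes and compare against the manifest record."""
    return sha256_tag(data) == recorded_hash

blob = b"%PDF-1.7 ..."
tag = sha256_tag(blob)
assert verify(blob, tag)
assert not verify(blob + b"corruption", tag)
```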
Supported File Types
The platform supports a range of document formats through its pipeline system. OCR processing is powered by Mistral Document AI.
| Format | Extension | Processing Path |
|---|---|---|
| PDF | .pdf | Mistral OCR extraction |
| DOCX | .docx | Mistral OCR extraction |
| PPTX | .pptx | Mistral OCR extraction |
| Images | .png, .jpg, .jpeg, .avif | Mistral OCR or vision processing |
| JATS XML | .xml | Structured XML parsing (academic publishing standard) |
| MECA archives | .meca, .zip | Archive extraction followed by JATS parsing |
Additional formats can be supported by adding new pipeline components. The platform's pipeline system is composable — a new format typically requires only a new processor component, while reusing the existing chunking, embedding, and registration infrastructure.
Need support for an additional format? Contact us to discuss your requirements.
Upload Flow
When a file is uploaded to an entry, the following happens:
- Authentication — The request is authenticated via the platform proxy (user token or API token)
- Row lock — The entry record is locked to prevent concurrent manifest updates
- Integrity check — SHA256 hash is computed for the uploaded file
- Storage write — File bytes are written to MinIO at the appropriate path
- Manifest update — The entry's manifest JSONB is updated with the new file's metadata
- Status transition — If this is the first original file and the status is pending, it transitions to uploaded
- Version increment — The entry's version counter increments (used for optimistic locking and sync)
- Change log — A row is queued in the resource change log for batch sync to the platform
- Pipeline trigger — If the dataset has an auto-trigger pipeline, processing starts in the background
The upload endpoint accepts a file_type parameter: original (default), processing, or processed. This makes the platform flexible — you can use the built-in processing pipelines, bring your own external processing, or combine both approaches. Upload original files and let the platform process them, or directly upload processed results if you handle processing yourself. The platform does not lock you into any single processing path.
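One way to picture how the file_type parameter relates to the storage layout shown earlier is the helper below. The path construction is an assumption based on the documented hierarchy (timestamped names under original/ and processing/, stable names under processed/), not the platform's actual implementation.

```python
import time

VALID_FILE_TYPES = {"original", "processing", "processed"}

def storage_key(dataset_id: str, entry_id: str, file_type: str, filename: str) -> str:
    """Assumed mapping from an upload's file_type to its MinIO object key."""
    if file_type not in VALID_FILE_TYPES:
        raise ValueError(f"unknown file_type: {file_type}")
    prefix = f"datasets/{dataset_id}/entries/{entry_id}/{file_type}"
    if file_type == "processed":
        return f"{prefix}/{filename}"  # stable names, e.g. content.json
    return f"{prefix}/{int(time.time())}_{filename}"  # timestamped uploads
```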
Metadata Sync
Entries and datasets are synchronized to the platform's catalog through a batch sync mechanism running every 30 seconds. The sync includes:
- Dataset names, descriptions, entry counts, and sizes
- Entry names, statuses, MIME types, and file sizes
- Version numbers for conflict resolution
The platform uses this metadata for the catalog dashboard, search routing, and analytics. Document content is never included in the sync — only summary metadata.
Sync uses version-ordered conflict resolution: the platform only applies an update if the incoming version is greater than the existing version. This prevents out-of-order updates from overwriting newer data during network partitions.
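The resolution rule reduces to a strictly-greater version comparison. A minimal sketch of the catalog side, with illustrative record shapes:

```python
def resolve(catalog: dict, incoming: dict) -> bool:
    """Apply an incoming record only if its version beats the stored one."""
    current = catalog.get(incoming["id"])
    if current is not None and incoming["version"] <= current["version"]:
        return False  # stale or duplicate update; drop it
    catalog[incoming["id"]] = incoming
    return True

catalog = {}
assert resolve(catalog, {"id": "e1", "version": 3, "status": "processed"})
# An out-of-order update with a lower version cannot clobber newer data:
assert not resolve(catalog, {"id": "e1", "version": 2, "status": "processing"})
```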
Next Steps
- Pipelines — How entries are processed from raw files into searchable content
- Search — How to search across entries using keyword and semantic search
- Data Clusters — The infrastructure that hosts datasets and entries
- Create a Dataset — Step-by-step guide to creating a dataset
- Upload Documents — Upload files and monitor processing