Entry
Entry schema - client cluster storage.
Philosophy:
- Manifest-based architecture ONLY
- No legacy Parquet fields
- All file locations tracked in manifest
- Fast queries via denormalized fields
name string
Entry name
slug string
URL-friendly slug
description object
Entry description
- string
- null
status string
Current processing status
Possible values: [pending, uploading, uploaded, processing, processed, error]
mime_type string
MIME type of primary file
id integer
Entry ID
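The status values listed above can be mirrored on the client as an enumeration, so that an unexpected value fails loudly instead of propagating. This is a minimal sketch; `EntryStatus` and `parse_status` are hypothetical names and not part of the API.

```python
from enum import Enum


class EntryStatus(str, Enum):
    """Status values as listed in the schema."""
    PENDING = "pending"
    UPLOADING = "uploading"
    UPLOADED = "uploaded"
    PROCESSING = "processing"
    PROCESSED = "processed"
    ERROR = "error"


def parse_status(raw: str) -> EntryStatus:
    """Convert the raw `status` string into an EntryStatus, rejecting unknown values."""
    try:
        return EntryStatus(raw)
    except ValueError:
        allowed = ", ".join(s.value for s in EntryStatus)
        raise ValueError(f"unknown status {raw!r}; expected one of: {allowed}") from None
```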
manifest object required
Entry manifest with all file locations and metadata
schema_version string
Schema version (e.g., 'v3')
dataset_schema_id string
Dataset schema identifier (e.g., 'arxiv_papers_ocr')
original object
Original files section
- OriginalManifest
- null
files object[]
List of original files
key string
S3 key for the file
size integer
File size in bytes
mime_type string
MIME type of the file
hash object
SHA256 hash of the file
- string
- null
created_at object
File creation timestamp
- string<date-time>
- null
expires_at object
File expiration timestamp (for processing artifacts)
- string<date-time>
- null
metadata object
Original metadata (title, author, etc.)
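A file entry carries a size and an optional SHA256 hash, which a client can use to check a downloaded file. The sketch below assumes the hash is stored as `sha256:` followed by the lowercase hex digest; the example response shows the `sha256:` prefix, but the digest encoding is an assumption. `FileEntry`, `sha256_tag`, and `verify` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileEntry:
    """One item of a manifest `files` list."""
    key: str
    size: int
    mime_type: str
    hash: Optional[str] = None
    created_at: Optional[str] = None
    expires_at: Optional[str] = None


def sha256_tag(data: bytes) -> str:
    # Assumed format: "sha256:" + lowercase hex digest.
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify(entry: FileEntry, data: bytes) -> bool:
    """Check downloaded bytes against the size and, when present, the hash."""
    if len(data) != entry.size:
        return False
    if entry.hash is None:
        return True  # hash is nullable, so only the size can be checked
    return entry.hash == sha256_tag(data)
```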
processing object
Processing artifacts section
- ProcessingManifest
- null
steps_completed string[]
List of completed processing steps
files object[]
Intermediate processing files
key string
S3 key for the file
size integer
File size in bytes
mime_type string
MIME type of the file
hash object
SHA256 hash of the file
- string
- null
created_at object
File creation timestamp
- string<date-time>
- null
expires_at object
File expiration timestamp (for processing artifacts)
- string<date-time>
- null
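Because `expires_at` is nullable, a cleanup job has to treat a missing or null value as "never expires". The following sketch lists the keys of processing files whose expiration has passed; `parse_ts` and `expired_keys` are hypothetical helpers, and the timestamps are assumed to be ISO 8601 with a trailing `Z`, as in the example response.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def expired_keys(files: list, now: datetime) -> list:
    """Return the S3 keys of files whose expires_at is set and not later than `now`."""
    keys = []
    for f in files:
        expires_at = f.get("expires_at")
        if expires_at is not None and parse_ts(expires_at) <= now:
            keys.append(f["key"])
    return keys
```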
processed object
Processed content section
- ProcessedManifest
- null
content_key object
S3 key for main content.json file
- string
- null
size object
Size of content.json in bytes
- integer
- null
fields_summary object
Quick stats for UI (text_length, chunk_count, etc.)
completed_at object
Processing completion timestamp
- string<date-time>
- null
additional_files object
Additional processed files (figures, etc.)
- object[]
- null
key string
S3 key for the file
size integer
File size in bytes
mime_type string
MIME type of the file
hash object
SHA256 hash of the file
- string
- null
created_at object
File creation timestamp
- string<date-time>
- null
expires_at object
File expiration timestamp (for processing artifacts)
- string<date-time>
- null
full_manifest_key object
S3 key if manifest >5KB (stored externally)
- string
- null
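The 5KB rule for `full_manifest_key` implies a size check on the serialized manifest. How the service measures the size is not documented here, so this sketch assumes compact UTF-8 JSON and reads "5KB" as 5 × 1024 bytes; `INLINE_LIMIT_BYTES`, `manifest_size`, and `needs_external_storage` are hypothetical names.

```python
import json

# Assumption: "5KB" is interpreted as 5 * 1024 bytes.
INLINE_LIMIT_BYTES = 5 * 1024


def manifest_size(manifest: dict) -> int:
    """Size in bytes of the manifest serialized as compact UTF-8 JSON."""
    return len(json.dumps(manifest, separators=(",", ":")).encode("utf-8"))


def needs_external_storage(manifest: dict) -> bool:
    """True when the manifest exceeds the inline limit and would be stored externally."""
    return manifest_size(manifest) > INLINE_LIMIT_BYTES
```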
storage_path string
Base storage path (e.g., 'datasets/123/entries/456')
primary_file_key object
S3 key of primary original file (cached)
- string
- null
processed_content_key object
S3 key of processed content (cached)
- string
- null
file_size_bytes object
Total size of all files in bytes (cached)
- integer
- null
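The three cached fields above are denormalized copies of values that can be derived from the manifest. The sketch below shows one plausible derivation. It assumes that the first original file is the primary file and that the total counts original files, processing files, content.json, and additional files; neither rule is stated in the schema. `cached_fields` is a hypothetical helper.

```python
def cached_fields(manifest: dict) -> dict:
    """Derive the denormalized top-level fields from a manifest dict."""
    original = manifest.get("original") or {}
    processing = manifest.get("processing") or {}
    processed = manifest.get("processed") or {}

    original_files = original.get("files") or []
    # Assumption: the first original file is treated as the primary file.
    primary_key = original_files[0]["key"] if original_files else None

    total = sum(f["size"] for f in original_files)
    total += sum(f["size"] for f in processing.get("files") or [])
    total += sum(f["size"] for f in processed.get("additional_files") or [])
    total += processed.get("size") or 0  # size of content.json

    return {
        "primary_file_key": primary_key,
        "processed_content_key": processed.get("content_key"),
        "file_size_bytes": total,
    }
```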
dataset_id integer
Parent dataset ID
created_at string<date-time>
Creation timestamp
updated_at string<date-time>
Last update timestamp
processing_completed_at object
Processing completion timestamp
- string<date-time>
- null
version integer
Version number for optimistic locking
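Optimistic locking with a version number usually works as a compare-and-increment: a write succeeds only if the caller's version matches the stored one, and the stored version is then bumped. The in-memory sketch below illustrates the idea; `EntryStore` and `VersionConflict` are hypothetical, and how the service itself reports a conflict is not specified here. In a real database the comparison and the write would need to happen atomically.

```python
class VersionConflict(Exception):
    """Raised when the caller's version does not match the stored version."""


class EntryStore:
    """Toy in-memory store demonstrating version-checked updates."""

    def __init__(self):
        self._rows = {}

    def put(self, entry: dict) -> None:
        self._rows[entry["id"]] = dict(entry)

    def get(self, entry_id: int) -> dict:
        return dict(self._rows[entry_id])

    def update(self, entry_id: int, expected_version: int, changes: dict) -> dict:
        current = self._rows[entry_id]
        if current["version"] != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current['version']}"
            )
        # Apply the changes and bump the version in a single step.
        updated = {**current, **changes, "version": expected_version + 1}
        self._rows[entry_id] = updated
        return dict(updated)
```

A caller that receives `VersionConflict` would typically re-read the entry, reapply its change to the fresh copy, and retry with the new version.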
{
  "name": "string",
  "slug": "string",
  "description": "string",
  "status": "pending",
  "mime_type": "string",
  "id": 0,
  "manifest": {
    "dataset_schema_id": "arxiv_papers_ocr",
    "original": {
      "files": [
        {
          "created_at": "2025-11-04T10:00:00Z",
          "hash": "sha256:abc123...",
          "key": "datasets/123/entries/456/original/paper.pdf",
          "mime_type": "application/pdf",
          "size": 2048000
        },
        {
          "key": "datasets/123/entries/456/original/thumbnail.jpg",
          "mime_type": "image/jpeg",
          "size": 50000
        }
      ],
      "metadata": {
        "arxiv_id": "2024.12345",
        "authors": [
          "John Doe",
          "Jane Smith"
        ],
        "published_date": "2024-11-01",
        "title": "Deep Learning for Computer Vision"
      }
    },
    "processed": {
      "additional_files": [
        {
          "key": "datasets/123/entries/456/processed/figures/fig_001.png",
          "mime_type": "image/png",
          "size": 80000
        }
      ],
      "completed_at": "2025-11-04T10:30:00Z",
      "content_key": "datasets/123/entries/456/processed/content.json",
      "fields_summary": {
        "chunk_count": 120,
        "figure_count": 8,
        "text_length": 45000
      },
      "size": 150000
    },
    "processing": {
      "files": [
        {
          "expires_at": "2025-12-04T10:00:00Z",
          "key": "datasets/123/entries/456/processing/embeddings.npy",
          "mime_type": "application/octet-stream",
          "size": 200000
        }
      ],
      "steps_completed": [
        "ocr",
        "chunking",
        "embedding"
      ]
    },
    "schema_version": "v3"
  },
  "storage_path": "string",
  "primary_file_key": "string",
  "processed_content_key": "string",
  "file_size_bytes": 0,
  "dataset_id": 0,
  "created_at": "2024-07-29T15:51:28.071Z",
  "updated_at": "2024-07-29T15:51:28.071Z",
  "processing_completed_at": "2024-07-29T15:51:28.071Z",
  "version": 1
}