Dataset

Dataset schema - client cluster storage.

Philosophy:

  • This is the LOCAL storage representation
  • Backend catalog maintains: licenses, organizations, providers, pricing, access control
  • This schema maintains: schema definitions, storage paths, data organization
name: Name (string), required

Dataset name

slug: Slug (string), required

URL-friendly slug
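As a rough illustration of what a URL-friendly slug can look like, the sketch below derives one from a dataset name. The `slugify` helper is hypothetical; the rules the service actually applies when generating slugs are not stated here.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Hypothetical helper: lowercase, ASCII-only, hyphen-separated."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower())
    # Remove hyphens left over at either end.
    return slug.strip("-")
```

Under these assumptions, the name "ArXiv Papers OCR" maps to the slug "arxiv-papers-ocr".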

description: Description (string), required

Dataset description

dataset_type: DatasetType (string), required

Type of dataset

Possible values: [text, audio, voice, images]

id: Id (integer), required

Dataset ID (synced from backend catalog)

size_bytes: Size Bytes (integer)

Total size in bytes

Default value: 0
entry_count: Entry Count (integer)

Number of entries (cached)

Default value: 0
schema_definition: object, required

Manifest-based schema definition

schema_id: Schema Id (string), required

Unique schema identifier

version: Version (string), required

Schema version (e.g., 'v3')

description: Description (string), required

Human-readable schema description

original: object, required

Original files schema

required_files: string[]

Required file patterns

optional_files: string[]

Optional file patterns
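To make the required and optional file patterns concrete, here is a minimal sketch that checks a file listing against them. It assumes the patterns are shell-style globs (for example `figures/*.png`), which is an inference from the example values and not something the schema states; both function names are hypothetical.

```python
from fnmatch import fnmatchcase


def missing_required(required_patterns: list, present_files: list) -> list:
    """Return the required patterns that no present file matches."""
    return [
        pattern
        for pattern in required_patterns
        if not any(fnmatchcase(path, pattern) for path in present_files)
    ]


def unexpected_files(required_patterns: list, optional_patterns: list,
                     present_files: list) -> list:
    """Return files matched by neither a required nor an optional pattern."""
    allowed = list(required_patterns) + list(optional_patterns)
    return [
        path
        for path in present_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed)
    ]
```

Note that `fnmatchcase` lets `*` match path separators as well, so a stricter matcher may be needed if patterns must stay within one directory level.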

metadata_schema: object

JSONSchema7 for metadata validation

property name*: any

JSONSchema7 for metadata validation
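The sketch below shows the kind of check a metadata schema enables. It is a deliberately simplified stand-in for a real JSON Schema validator: it handles only the `required` keyword and top-level primitive `type` keywords, and the `check_metadata` name is hypothetical. A production setup would more likely rely on a full JSON Schema draft-07 validator.

```python
# Subset of JSON Schema primitive type names mapped to Python types.
_JSON_TYPES = {
    "string": str,
    "array": list,
    "object": dict,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
}


def check_metadata(metadata: dict, schema: dict) -> list:
    """Return a list of human-readable problems (empty when none are found)."""
    errors = []
    # 1. Every property listed under "required" must be present.
    for key in schema.get("required", []):
        if key not in metadata:
            errors.append(f"missing required property: {key}")
    # 2. Present properties must match their declared primitive type.
    for key, rule in schema.get("properties", {}).items():
        expected = _JSON_TYPES.get(rule.get("type"))
        if key not in metadata or expected is None:
            continue
        value = metadata[key]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and rule["type"] in ("integer", "number")
        if not isinstance(value, expected) or bool_as_number:
            errors.append(f"{key}: expected {rule['type']}")
    return errors
```

Returning a list of problems instead of raising on the first one lets a caller report everything wrong with an entry in a single pass.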

processed: object, required

Processed content schema

content_schema: object

JSONSchema7 for content validation

property name*: any

JSONSchema7 for content validation

required_files: string[]

Required processed files

optional_files: string[]

Optional processed files

processing: object

Processing artifacts schema

anyOf
intermediate_files: string[]

Intermediate file patterns

retention_days: Retention Days (integer)

Days to retain processing artifacts

Default value: 7
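A retention window like this can be applied with a simple age check. The sketch below assumes the clock starts when the artifact is created and that timestamps are timezone-aware; neither detail is specified here, and `is_expired` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired(created_at: datetime, retention_days: int = 7,
               now: Optional[datetime] = None) -> bool:
    """True when an artifact is older than the retention window."""
    # Allow the caller to inject "now" so the check is easy to test.
    now = now or datetime.now(timezone.utc)
    return now - created_at > timedelta(days=retention_days)
```

A cleanup job could run this over each intermediate file and delete those for which it returns True.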
current_schema_version: Current Schema Version (string)

Current schema version. Entries can be migrated incrementally by comparing each entry's manifest->>'schema_version' against this value.

Default value: v1
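The expression `manifest->>'schema_version'` is PostgreSQL's JSON text-extraction operator, so the comparison most likely happens in SQL. The same selection logic can be sketched in Python as follows, assuming entries are dicts with a `manifest` dict, versions follow the `v<integer>` form seen in the examples, and a missing version is treated as `v1` (all assumptions; the helper names are hypothetical).

```python
def version_number(version: str) -> int:
    """Parse 'v3' into 3. Assumes the 'v<integer>' form."""
    return int(version.lstrip("v"))


def entries_needing_migration(entries: list, current_schema_version: str) -> list:
    """Select entries whose manifest is behind the dataset's current version."""
    target = version_number(current_schema_version)
    return [
        entry
        for entry in entries
        # Assumption: an entry without a recorded version is treated as v1.
        if version_number(entry["manifest"].get("schema_version", "v1")) < target
    ]
```

Because only out-of-date entries are selected, a migration can proceed in batches and be resumed safely after an interruption.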
storage_path: Storage Path (string), required

Base storage path in MinIO/S3 (e.g., 'datasets/123')

created_at: string<date-time>, required

Creation timestamp

updated_at: string<date-time>, required

Last update timestamp

last_synced_at: object

Last sync with backend catalog

anyOf
string<date-time>
version: Version (integer)

Version number for optimistic locking

Default value: 1
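Optimistic locking with a version number usually works as a compare-and-increment: a writer states which version it read, and the update is rejected if the stored version has moved on. The in-memory sketch below illustrates the idea; the `store` dict, `update_dataset`, and `StaleVersionError` are hypothetical and do not reflect how the service persists data.

```python
class StaleVersionError(Exception):
    """Raised when the caller's copy of the record is out of date."""


def update_dataset(store: dict, dataset_id: int, expected_version: int,
                   changes: dict) -> dict:
    """Apply changes only if the stored version matches the one the caller read."""
    current = store[dataset_id]
    if current["version"] != expected_version:
        raise StaleVersionError(
            f"expected version {expected_version}, found {current['version']}"
        )
    # Merge the changes and bump the version so concurrent writers are detected.
    updated = {**current, **changes, "version": expected_version + 1}
    store[dataset_id] = updated
    return updated
```

In a database, the same check is typically folded into the update statement itself so that the comparison and the write happen atomically.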
Dataset
{
  "created_at": "2025-01-01T00:00:00Z",
  "current_schema_version": "v3",
  "dataset_type": "text",
  "description": "OCR processed academic papers from ArXiv",
  "entry_count": 1500,
  "id": 123,
  "last_synced_at": "2025-01-10T00:00:00Z",
  "name": "ArXiv Papers OCR",
  "schema_definition": {
    "description": "Schema for ArXiv papers with OCR, chunking, and embeddings",
    "original": {
      "metadata_schema": {
        "properties": {
          "title": {
            "type": "string"
          },
          "authors": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "arxiv_id": {
            "type": "string"
          },
          "published_date": {
            "format": "date",
            "type": "string"
          }
        },
        "required": [
          "title",
          "arxiv_id"
        ],
        "type": "object"
      },
      "optional_files": [
        "thumbnail.jpg"
      ],
      "required_files": [
        "paper.pdf"
      ]
    },
    "processed": {
      "content_schema": {
        "properties": {
          "text": {
            "type": "string"
          },
          "chunks": {
            "type": "array"
          },
          "figures": {
            "type": "array"
          }
        },
        "required": [
          "text",
          "chunks"
        ],
        "type": "object"
      },
      "optional_files": [
        "figures/*.png"
      ],
      "required_files": [
        "content.json"
      ]
    },
    "processing": {
      "intermediate_files": [
        "embeddings.npy",
        "chunks.json"
      ],
      "retention_days": 7
    },
    "schema_id": "arxiv_papers_ocr",
    "version": "v3"
  },
  "size_bytes": 10485760,
  "slug": "arxiv-papers-ocr",
  "storage_path": "datasets/123",
  "updated_at": "2025-01-05T00:00:00Z"
}
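For a client consuming a payload like the one above, a light sanity check might look like the following sketch. The required-field list and the allowed `dataset_type` values are taken from the field descriptions; `check_dataset` and `parse_timestamp` are hypothetical helpers, not part of any documented client library.

```python
from datetime import datetime

REQUIRED_FIELDS = (
    "name", "slug", "description", "dataset_type", "id",
    "schema_definition", "storage_path", "created_at", "updated_at",
)
DATASET_TYPES = {"text", "audio", "voice", "images"}


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 date-time such as '2025-01-01T00:00:00Z'."""
    # datetime.fromisoformat accepts a trailing 'Z' only from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_dataset(payload: dict) -> list:
    """Return a list of problems found in a Dataset payload (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    if "dataset_type" in payload and payload["dataset_type"] not in DATASET_TYPES:
        problems.append(f"unknown dataset_type: {payload['dataset_type']}")
    for name in ("created_at", "updated_at"):
        if name in payload:
            try:
                parse_timestamp(payload[name])
            except (ValueError, AttributeError, TypeError):
                problems.append(f"invalid timestamp: {name}")
    return problems
```

This only covers top-level fields; nested checks on `schema_definition` would follow the same pattern.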