Dataset

Dataset schema - client cluster storage.

Philosophy:

This is the LOCAL storage representation.
The backend catalog maintains: licenses, organizations, providers, pricing, access control.
This schema maintains: schema definitions, storage paths, data organization.

name (string, required)

Dataset name

slug (string, required)

URL-friendly slug

description (string, required)

Dataset description

dataset_type (string, required)

Type of dataset

Possible values: [text, audio, voice, images]

id (integer, required)

Dataset ID (synced from backend catalog)

size_bytes (integer)

Total size in bytes

Default value: 0

entry_count (integer)

Number of entries (cached)

Default value: 0

schema_definition (object, required)

Manifest-based schema definition

schema_definition.schema_id (string, required)

Unique schema identifier

schema_definition.version (string, required)

Schema version (e.g., 'v3')

schema_definition.description (string, required)

Human-readable schema description

schema_definition.original (object, required)

Original files schema

schema_definition.original.required_files (string[])

Required file patterns

schema_definition.original.optional_files (string[])

Optional file patterns

schema_definition.original.metadata_schema (object)

JSONSchema7 for metadata validation

property name* (any)

JSONSchema7 for metadata validation

schema_definition.processed (object, required)

Processed content schema

schema_definition.processed.content_schema (object)

JSONSchema7 for content validation

property name* (any)

JSONSchema7 for content validation

schema_definition.processed.required_files (string[])

Required processed files

schema_definition.processed.optional_files (string[])

Optional processed files
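
The required_files and optional_files lists can be checked against the files actually present for an entry. The following is a minimal sketch, assuming the patterns use shell-style globs (the example value 'figures/*.png' suggests this, but the source does not state the pattern syntax). The helper name check_files and the sample file list are hypothetical.

```python
from fnmatch import fnmatchcase


def check_files(present, required_patterns, optional_patterns=()):
    """Return (missing, unexpected) for a list of relative file paths.

    missing:    required patterns that matched no file
    unexpected: files that matched neither a required nor an optional pattern

    Assumption: patterns are shell-style globs. Note that fnmatch lets '*'
    match '/' as well, which may be looser than the real matcher.
    """
    missing = [
        pattern for pattern in required_patterns
        if not any(fnmatchcase(path, pattern) for path in present)
    ]
    allowed = list(required_patterns) + list(optional_patterns)
    unexpected = [
        path for path in present
        if not any(fnmatchcase(path, pattern) for pattern in allowed)
    ]
    return missing, unexpected


# Hypothetical file listing for one processed entry.
missing, unexpected = check_files(
    present=["content.json", "figures/fig1.png", "notes.txt"],
    required_patterns=["content.json"],
    optional_patterns=["figures/*.png"],
)
print(missing)     # []
print(unexpected)  # ['notes.txt']
```

Whether unexpected files should be rejected or merely reported is not specified by the schema; the sketch only reports them.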

schema_definition.processing (DatasetSchemaProcessing | null)

Processing artifacts schema

schema_definition.processing.intermediate_files (string[])

Intermediate file patterns

schema_definition.processing.retention_days (integer)

Days to retain processing artifacts

Default value: 7
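
As an illustration of how retention_days might be applied, the sketch below decides whether a processing artifact has outlived its retention window. It assumes the window is measured from the artifact's creation time; the source does not say how or when retention is enforced, and the function name is_expired is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_expired(artifact_created_at, retention_days=7, now=None):
    """True if the artifact is older than retention_days.

    Assumption: age is measured from the artifact's creation timestamp.
    The default of 7 mirrors the documented default for retention_days.
    """
    now = now or datetime.now(timezone.utc)
    return now - artifact_created_at > timedelta(days=retention_days)


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(is_expired(created, now=datetime(2025, 1, 5, tzinfo=timezone.utc)))  # False
print(is_expired(created, now=datetime(2025, 1, 9, tzinfo=timezone.utc)))  # True
```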

current_schema_version (string)

Current schema version. Entries can be migrated incrementally by comparing each entry's manifest->>'schema_version' against this value.

Default value: v1
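
The incremental migration check can be sketched in Python as follows. This assumes version tags have the form 'v<integer>' (as in 'v1' and 'v3') and that each entry carries a manifest mapping with a schema_version key; the entry shape and both function names are hypothetical, and falling back to 'v1' for a missing key is an assumption based on the documented default.

```python
def version_number(tag):
    """Parse a tag such as 'v3' into the integer 3 (assumed tag format)."""
    if not tag.startswith("v") or not tag[1:].isdigit():
        raise ValueError(f"unexpected schema version tag: {tag!r}")
    return int(tag[1:])


def entries_needing_migration(entries, current_schema_version):
    """Return entries whose manifest schema_version is behind the dataset's."""
    target = version_number(current_schema_version)
    return [
        entry for entry in entries
        # Assumption: a manifest without schema_version is treated as 'v1'.
        if version_number(entry["manifest"].get("schema_version", "v1")) < target
    ]


# Hypothetical entries; only the manifest's schema_version matters here.
entries = [
    {"id": 1, "manifest": {"schema_version": "v1"}},
    {"id": 2, "manifest": {"schema_version": "v3"}},
    {"id": 3, "manifest": {}},
]
stale = entries_needing_migration(entries, "v3")
print([entry["id"] for entry in stale])  # [1, 3]
```

Parsing the tag to an integer avoids the string-comparison pitfall where 'v10' sorts before 'v2'.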

storage_path (string, required)

Base storage path in MinIO/S3 (e.g., 'datasets/123')

created_at (string<date-time>, required)

Creation timestamp

updated_at (string<date-time>, required)

Last update timestamp

last_synced_at (string<date-time> | null)

Last sync with backend catalog

version (integer)

Version number for optimistic locking

Default value: 1
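
Optimistic locking with a version counter generally works by accepting a write only if the caller's version still matches the stored one, then incrementing it. The sketch below shows the idea against an in-memory dict; it is not the service's implementation, and update_dataset, StaleVersionError, and the store layout are hypothetical. In a real database the check and the increment would have to happen in one atomic conditional update.

```python
class StaleVersionError(Exception):
    """Raised when the stored version no longer matches the caller's copy."""


def update_dataset(store, dataset_id, expected_version, changes):
    """Apply changes only if the stored version equals expected_version."""
    current = store[dataset_id]
    if current["version"] != expected_version:
        raise StaleVersionError(
            f"expected version {expected_version}, found {current['version']}"
        )
    # Bump the version so that concurrent writers holding the old one fail.
    updated = {**current, **changes, "version": expected_version + 1}
    store[dataset_id] = updated
    return updated


store = {123: {"name": "ArXiv Papers OCR", "entry_count": 1500, "version": 1}}

updated = update_dataset(store, 123, expected_version=1, changes={"entry_count": 1501})
print(updated["version"])  # 2

try:
    # A second writer still holding version 1 is rejected.
    update_dataset(store, 123, expected_version=1, changes={"entry_count": 1502})
except StaleVersionError as exc:
    print("conflict:", exc)
```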

Example Dataset
{
  "created_at": "2025-01-01T00:00:00Z",
  "current_schema_version": "v3",
  "dataset_type": "text",
  "description": "OCR processed academic papers from ArXiv",
  "entry_count": 1500,
  "id": 123,
  "last_synced_at": "2025-01-10T00:00:00Z",
  "name": "ArXiv Papers OCR",
  "schema_definition": {
    "description": "Schema for ArXiv papers with OCR, chunking, and embeddings",
    "original": {
      "metadata_schema": {
        "properties": {
          "title": {
            "type": "string"
          },
          "authors": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "arxiv_id": {
            "type": "string"
          },
          "published_date": {
            "format": "date",
            "type": "string"
          }
        },
        "required": [
          "title",
          "arxiv_id"
        ],
        "type": "object"
      },
      "optional_files": [
        "thumbnail.jpg"
      ],
      "required_files": [
        "paper.pdf"
      ]
    },
    "processed": {
      "content_schema": {
        "properties": {
          "text": {
            "type": "string"
          },
          "chunks": {
            "type": "array"
          },
          "figures": {
            "type": "array"
          }
        },
        "required": [
          "text",
          "chunks"
        ],
        "type": "object"
      },
      "optional_files": [
        "figures/*.png"
      ],
      "required_files": [
        "content.json"
      ]
    },
    "processing": {
      "intermediate_files": [
        "embeddings.npy",
        "chunks.json"
      ],
      "retention_days": 7
    },
    "schema_id": "arxiv_papers_ocr",
    "version": "v3"
  },
  "size_bytes": 10485760,
  "slug": "arxiv-papers-ocr",
  "storage_path": "datasets/123",
  "updated_at": "2025-01-05T00:00:00Z"
}
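
A client consuming this payload might check the required fields and the dataset_type enum before use, and fill in the documented defaults. The following is a minimal standard-library sketch, not a full validator: load_dataset is a hypothetical helper, the payload is a trimmed copy of the example above, and nested schema_definition fields are not checked.

```python
import json

# Field names, enum values, and defaults are taken from the schema listing.
REQUIRED_FIELDS = (
    "name", "slug", "description", "dataset_type", "id",
    "schema_definition", "storage_path", "created_at", "updated_at",
)
DATASET_TYPES = {"text", "audio", "voice", "images"}
DEFAULTS = {
    "size_bytes": 0,
    "entry_count": 0,
    "current_schema_version": "v1",
    "version": 1,
}


def load_dataset(raw):
    """Parse a Dataset JSON document, check top-level fields, apply defaults."""
    data = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if data["dataset_type"] not in DATASET_TYPES:
        raise ValueError(f"unknown dataset_type: {data['dataset_type']!r}")
    return {**DEFAULTS, **data}


raw = """{
  "name": "ArXiv Papers OCR",
  "slug": "arxiv-papers-ocr",
  "description": "OCR processed academic papers from ArXiv",
  "dataset_type": "text",
  "id": 123,
  "schema_definition": {},
  "storage_path": "datasets/123",
  "created_at": "2025-01-01T00:00:00Z",
  "updated_at": "2025-01-05T00:00:00Z",
  "current_schema_version": "v3"
}"""

dataset = load_dataset(raw)
print(dataset["current_schema_version"])  # v3
print(dataset["version"])                 # 1
```

The example payload above omits the top-level version field, so the documented default of 1 applies in this sketch.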