
DatasetPipelineConfig

Complete pipeline configuration for a dataset.

Defines a sequence of processing steps to be executed when entries are uploaded or manually triggered.

enabled (boolean)

Whether the pipeline is active for this dataset

Default value: true
trigger (string)

When to trigger the pipeline: automatically on upload or manually

Possible values: [on_upload, manual]

Default value: on_upload
steps (object[], required)

Ordered list of pipeline steps to execute

Possible values: >= 1 item

  • Array [
  • name (string, required)

    Unique name for this step within the pipeline (e.g., 'ocr', 'chunk', 'embed')

    Example: ocr

    component (string, required)

    Component identifier (WorkflowTemplate name in Argo)

    Example: mistral-ocr-processor-v1

    template (string, required)

    Template name to reference (usually the same as the component, without the version suffix)

    Example: mistral-ocr-processor

    parameters (object)

    Component-specific parameters

    property name* (any)

    Example: {"overlap":128,"size":1024}

    depends (string[])

    List of step names this step depends on (for ordering)

    Example: ["ocr"]

    workflow_parameters (string[])

    Workflow-level parameters to inject (e.g., entry_id, dataset_id)

    Example: ["entry_id"]

    inputs (object[])

    Explicit artifact inputs with source step and artifact name

  • Array [
  • name (string, required)

    Input artifact name expected by this step's template

    Example: input-text

    from_step (string, required)

    Name of the step that produces this artifact

    Example: ocr

    from_artifact (string, required)

    Output artifact name from the source step

    Example: output-markdown
  • ]
  • outputs (string[])

    Declared output artifact names (for documentation only, not validated)

    Example: ["output-markdown","output-figures"]
  • ]
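The `depends` and `inputs` fields both refer to other steps by name, so a client may want to check those references before submitting a configuration. The following is a minimal client-side sketch, not the service's own validation: `order_steps` is a hypothetical helper, and treating an input's `from_step` as an ordering constraint is an assumption.

```python
from graphlib import CycleError, TopologicalSorter


def order_steps(steps):
    """Return step names in an order that satisfies every dependency.

    Hypothetical helper: the service's real validation rules are not
    documented here.
    """
    names = [step["name"] for step in steps]
    if len(names) != len(set(names)):
        raise ValueError("step names must be unique within the pipeline")
    known = set(names)

    graph = {}
    for step in steps:
        deps = set(step.get("depends", []))
        # Assumption: consuming an artifact also implies running after its producer.
        deps |= {item["from_step"] for item in step.get("inputs", [])}
        unknown = deps - known
        if unknown:
            raise ValueError(
                f"step {step['name']!r} references unknown steps: {sorted(unknown)}"
            )
        graph[step["name"]] = deps

    try:
        # TopologicalSorter takes a mapping of node -> predecessors.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("step dependencies contain a cycle") from exc
```

Calling `order_steps` on a list where `chunk` depends on `ocr` yields `ocr` before `chunk`, regardless of the order in which the steps were declared.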
timeout (string)

Maximum execution time for the entire pipeline (e.g., '30m', '2h')

Possible values: must match the regular expression ^\d+[smh]$

Default value: 30m
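The timeout pattern accepts a single integer followed by one unit letter, so a combined value such as '1h30m' does not match. As an illustration, a client could validate and convert the value before sending it. This is a sketch only: `parse_timeout` is a hypothetical helper, and reading `s`, `m`, and `h` as seconds, minutes, and hours is an assumption based on the examples.

```python
import re
from datetime import timedelta

# Same shape as the documented pattern ^\d+[smh]$; fullmatch supplies the anchors.
_TIMEOUT_RE = re.compile(r"(\d+)([smh])", re.ASCII)
# Assumption: the unit letters mean seconds, minutes, and hours.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_timeout(value: str) -> timedelta:
    """Convert a timeout string such as '30m' into a timedelta."""
    match = _TIMEOUT_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid timeout {value!r}: expected digits followed by s, m, or h")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})
```

For example, `parse_timeout("2h")` returns a two-hour `timedelta`, while `parse_timeout("1h30m")` raises `ValueError`.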
Example DatasetPipelineConfig:

{
  "enabled": true,
  "steps": [
    {
      "component": "mistral-ocr-processor-v1",
      "depends": [],
      "inputs": [],
      "name": "ocr",
      "outputs": [
        "output-markdown",
        "output-figures"
      ],
      "parameters": {},
      "template": "mistral-ocr-processor",
      "workflow_parameters": []
    },
    {
      "component": "markdown-chunker-v1",
      "depends": [
        "ocr"
      ],
      "inputs": [
        {
          "from_artifact": "output-markdown",
          "from_step": "ocr",
          "name": "input-text"
        }
      ],
      "name": "chunk",
      "outputs": [
        "chunks"
      ],
      "parameters": {
        "overlap": 128,
        "size": 1024
      },
      "template": "markdown-chunker",
      "workflow_parameters": []
    }
  ],
  "timeout": "30m",
  "trigger": "on_upload"
}
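Only `steps` is required at the top level, and each step needs only `name`, `component`, and `template`, so a much shorter configuration than the example is possible. The sketch below applies the documented defaults to such a minimal configuration on the client side. `normalize_config` is a hypothetical helper, and whether the service fills in defaults the same way is an assumption.

```python
import json

# Defaults and allowed values as listed in the schema.
DEFAULTS = {"enabled": True, "trigger": "on_upload", "timeout": "30m"}
TRIGGERS = {"on_upload", "manual"}
REQUIRED_STEP_FIELDS = ("name", "component", "template")


def normalize_config(config: dict) -> dict:
    """Fill in documented defaults and check the required fields."""
    merged = {**DEFAULTS, **config}
    if merged["trigger"] not in TRIGGERS:
        raise ValueError(f"trigger must be one of {sorted(TRIGGERS)}")
    steps = merged.get("steps")
    if not steps:
        raise ValueError("steps is required and must contain at least one step")
    for index, step in enumerate(steps):
        missing = [field for field in REQUIRED_STEP_FIELDS if field not in step]
        if missing:
            raise ValueError(f"steps[{index}] is missing required fields: {missing}")
    return merged


minimal = {
    "steps": [
        {
            "name": "ocr",
            "component": "mistral-ocr-processor-v1",
            "template": "mistral-ocr-processor",
        }
    ]
}
print(json.dumps(normalize_config(minimal), indent=2))
```

The printed result contains the single `ocr` step plus `enabled`, `trigger`, and `timeout` set to their default values.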