DatasetPipelineConfig
Complete pipeline configuration for a dataset.
Defines a sequence of processing steps to be executed when entries are uploaded or manually triggered.
enabled
Whether the pipeline is active for this dataset.
Example: true
trigger
When to trigger the pipeline: automatically on upload or manually.
Possible values: [on_upload, manual]
Example: on_upload
steps object[] required
Ordered list of pipeline steps to execute.
Possible values: >= 1 (at least one step)
steps[].name
Unique name for this step within the pipeline (e.g., 'ocr', 'chunk', 'embed').
Examples:
- ocr
- chunk
- embed
- store-qdrant
steps[].component
Component identifier (WorkflowTemplate name in Argo).
Examples:
- mistral-ocr-processor-v1
- markdown-chunker-v1
- data-cluster-embedding-v1
steps[].template
Template name to reference (usually the same as the component without its version suffix).
Examples:
- mistral-ocr-processor
- markdown-chunker
- data-cluster-embedding
steps[].parameters object
Component-specific parameters.
Examples:
- {"overlap":128,"size":1024}
- {"batchsize":32,"concurrency":10}
steps[].depends
List of step names this step depends on (for ordering).
Examples:
- ["ocr"]
- ["chunk","embed"]
steps[].workflow_parameters
Workflow-level parameters to inject (e.g., entry_id, dataset_id).
Examples:
- ["entry_id"]
- ["entry_id","dataset_id"]
- []
steps[].inputs object[]
Explicit artifact inputs with source step and artifact name.
steps[].inputs[].name
Input artifact name expected by this step's template.
Examples:
- input-text
- input-pdf
- chunks
steps[].inputs[].from_step
Name of the step that produces this artifact.
Examples:
- ocr
- fetch-entry
- chunk
steps[].inputs[].from_artifact
Output artifact name from the source step.
Examples:
- output-markdown
- entry-file
- chunks
steps[].outputs
Declared output artifact names (for documentation only, not validated).
Examples:
- ["output-markdown","output-figures"]
- ["chunks"]
- ["embeddings"]
timeout
Maximum execution time for the entire pipeline (e.g., '30m', '2h').
Possible values: must match the regular expression ^\d+[smh]$
Example: 30m
{
  "enabled": true,
  "steps": [
    {
      "component": "mistral-ocr-processor-v1",
      "depends": [],
      "inputs": [],
      "name": "ocr",
      "outputs": [
        "output-markdown",
        "output-figures"
      ],
      "parameters": {},
      "template": "mistral-ocr-processor",
      "workflow_parameters": []
    },
    {
      "component": "markdown-chunker-v1",
      "depends": [
        "ocr"
      ],
      "inputs": [
        {
          "from_artifact": "output-markdown",
          "from_step": "ocr",
          "name": "input-text"
        }
      ],
      "name": "chunk",
      "outputs": [
        "chunks"
      ],
      "parameters": {
        "overlap": 128,
        "size": 1024
      },
      "template": "markdown-chunker",
      "workflow_parameters": []
    }
  ],
  "timeout": "30m",
  "trigger": "on_upload"
}
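
A configuration like the one above can be sanity-checked on the client before it is submitted. The following is a minimal Python sketch, not the service's own validation: the helper names `timeout_seconds` and `execution_order` are hypothetical, and whether the server performs equivalent checks is not stated in this reference. It assumes only what the schema describes: step names are unique, `depends` lists name other steps, `steps` has at least one entry, and `timeout` matches `^\d+[smh]$`.

```python
import re
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Pattern quoted from the schema's `timeout` constraint.
TIMEOUT_RE = re.compile(r"^\d+[smh]$")
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def timeout_seconds(timeout: str) -> int:
    """Convert a timeout such as '30m' or '2h' to seconds (hypothetical helper)."""
    if not TIMEOUT_RE.fullmatch(timeout):
        raise ValueError(f"invalid timeout: {timeout!r}")
    return int(timeout[:-1]) * UNIT_SECONDS[timeout[-1]]


def execution_order(config: dict) -> list[str]:
    """Return step names in an order that satisfies every `depends` list."""
    steps = config.get("steps", [])
    if not steps:
        raise ValueError("steps must contain at least one entry")

    names = [step["name"] for step in steps]
    if len(set(names)) != len(names):
        raise ValueError("step names must be unique within the pipeline")

    # Map each step to the set of steps it depends on.
    graph = {}
    for step in steps:
        depends = set(step.get("depends", []))
        unknown = depends - set(names)
        if unknown:
            raise ValueError(
                f"step {step['name']!r} depends on unknown steps: {sorted(unknown)}"
            )
        graph[step["name"]] = depends

    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("depends lists form a cycle") from exc


if __name__ == "__main__":
    # Trimmed version of the example configuration (only the fields used here).
    example = {
        "steps": [
            {"name": "ocr", "depends": []},
            {"name": "chunk", "depends": ["ocr"]},
        ],
        "timeout": "30m",
    }
    print(execution_order(example))             # ['ocr', 'chunk']
    print(timeout_seconds(example["timeout"]))  # 1800
```

The sketch deliberately does not check `inputs[].from_step` against the step list, because the schema's examples include a source step (`fetch-entry`) that may be defined outside the pipeline's own `steps`.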