docxExtract
Downloads a DOCX document from a URL and extracts its text with python-docx. All non-empty paragraphs are joined with double newlines. The section count is the number of paragraphs whose style name starts with "Heading"; if the document has no headings, the count falls back to 1.
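The extraction rule described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual implementation: the function names `summarize_paragraphs` and `extract_docx` are hypothetical, and it assumes that whitespace-only paragraphs count as empty and that every heading-styled paragraph is counted, empty or not.

```python
from io import BytesIO


def summarize_paragraphs(paragraphs):
    """Build the node output from (text, style_name) pairs.

    Hypothetical helper: the pair layout is an assumption made for this sketch.
    """
    pairs = list(paragraphs)
    # Keep only paragraphs that contain visible text (assumption: whitespace-only is empty).
    texts = [text for text, _ in pairs if text.strip()]
    # Count paragraphs whose style name starts with "Heading".
    headings = sum(1 for _, style in pairs if style and style.startswith("Heading"))
    # Fall back to 1 when the document has no headings.
    return {"text": "\n\n".join(texts), "numPages": max(headings, 1)}


def extract_docx(data: bytes):
    """Parse raw DOCX bytes with python-docx and summarize them."""
    from docx import Document  # python-docx, imported lazily so the helper above works without it

    doc = Document(BytesIO(data))
    pairs = [
        (p.text, (p.style.name or "") if p.style is not None else "")
        for p in doc.paragraphs
    ]
    return summarize_paragraphs(pairs)
```

For example, a document with a "Heading 1" paragraph, two body paragraphs (one empty), and a "Heading 2" paragraph would yield three joined text blocks and `numPages` of 2.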
Supports up to 10 concurrent downloads. Retries up to 3 times with exponential backoff (base 1s, multiplier 2×, max 8s).
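The retry and concurrency limits above could be approximated like this. It is a sketch under stated assumptions: "retries up to 3 times" is read as one initial attempt plus up to three retries, the helper names (`backoff_delays`, `with_retries`, `fetch_all`) are hypothetical, and the `fetch` callable stands in for whatever download function the node actually uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 10  # documented limit on concurrent downloads


def backoff_delays(retries=3, base=1.0, multiplier=2.0, cap=8.0):
    """Delay before each retry: base * multiplier**i, capped at `cap` seconds."""
    return [min(base * multiplier ** i, cap) for i in range(retries)]


def with_retries(fn, retries=3, sleep=time.sleep):
    """Call fn(); on failure wait and retry, re-raising after the last retry.

    Assumption: any exception is treated as retryable.
    """
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            sleep(delays[attempt])


def fetch_all(urls, fetch, sleep=time.sleep):
    """Download every URL with at most MAX_CONCURRENCY running at once."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        return list(pool.map(lambda u: with_retries(lambda: fetch(u), sleep=sleep), urls))


print(backoff_delays())  # [1.0, 2.0, 4.0]
```

With the documented settings the waits are 1s, 2s, and 4s, so the 8s cap would only come into play if more retries were configured.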
Parameters
| Param | Type | Required | Description |
|---|---|---|---|
| file_url | string (URL) | Yes | URL of the DOCX document to download and process |
| entry | object | Yes | Entry data dict containing dataset_id and entry_id for routing |
Output
| Field | Type | Description |
|---|---|---|
| text | string | The full extracted text: all non-empty paragraphs joined with `\n\n` |
| numPages | integer | Number of sections in the document (heading-based count, minimum 1) |
Example
{
  "id": "docxExtract",
  "type": "docxExtract",
  "data": {
    "label": "DOCX Extract",
    "isExecuted": false,
    "handles": ["inputs", "outputs"],
    "schema": {},
    "params": {
      "file_url": { "value": "{{ @downloadEntry.file_url }}", "isExpression": true, "isAttachedToInputNode": false },
      "entry": { "value": "{{ @fetchEntries }}", "isExpression": true, "isAttachedToInputNode": false }
    },
    "inputs": [], "outputs": [], "errors": []
  },
  "position": { "x": 300, "y": 0 },
  "isSelected": false,
  "isDragging": false
}