docxExtract

Extracts text from a DOCX document by downloading it from a URL and parsing it with python-docx. All non-empty paragraphs are joined with double newlines. The section count is the number of paragraphs whose style name starts with "Heading"; if no headings are found, it falls back to 1.
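The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual source: the function names `summarize_paragraphs` and `docx_extract`, the 30-second timeout, and the use of `str.strip()` to decide whether a paragraph is "non-empty" are all assumptions. Only `Document`, `paragraphs`, `text`, and `style.name` come from python-docx.

```python
import io
import urllib.request


def summarize_paragraphs(paragraphs) -> dict:
    """Build the node's output from paragraph-like objects.

    Each object needs a `.text` string and a `.style.name` string,
    which is what python-docx paragraphs expose.
    """
    paragraphs = list(paragraphs)

    # Assumption: "non-empty" means non-blank after stripping whitespace.
    texts = [p.text for p in paragraphs if p.text.strip()]

    # Count paragraphs whose style name starts with "Heading".
    headings = sum(
        1 for p in paragraphs if (p.style.name or "").startswith("Heading")
    )

    return {
        "text": "\n\n".join(texts),
        # Fall back to 1 when the document has no headings.
        "numPages": max(headings, 1),
    }


def docx_extract(file_url: str) -> dict:
    """Download a DOCX file and summarize it (hypothetical wrapper)."""
    # Imported lazily so the pure helper above works without python-docx.
    from docx import Document

    with urllib.request.urlopen(file_url, timeout=30) as resp:
        data = resp.read()

    # python-docx accepts a file-like object as well as a path.
    doc = Document(io.BytesIO(data))
    return summarize_paragraphs(doc.paragraphs)
```

Keeping the counting logic separate from the download makes it easy to check against hand-built paragraphs without network access.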

Supports up to 10 concurrent downloads. Retries up to 3 times with exponential backoff (base 1s, multiplier 2×, max 8s).
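To make the retry schedule concrete, here is a small sketch of exponential backoff with the stated parameters. It assumes "up to 3 times" means three retries after the initial attempt, and that the delay before retry i is `min(base * multiplier**i, cap)`; the helper names `backoff_delays` and `call_with_retries` are hypothetical and not part of the node.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(
    retries: int = 3,
    base: float = 1.0,
    multiplier: float = 2.0,
    cap: float = 8.0,
) -> List[float]:
    """Return the wait (in seconds) before each retry, capped at `cap`."""
    return [min(base * multiplier ** i, cap) for i in range(retries)]


def call_with_retries(
    fn: Callable[[], T],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with exponential backoff."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt == retries:
                raise
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Under these assumptions the waits are 1s, 2s, and 4s, so the 8s cap would only come into play if the retry count were raised.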

Parameters

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| `file_url` | string (URL) | Yes | URL of the DOCX document to download and process |
| `entry` | object | Yes | Entry data dict containing `dataset_id` and `entry_id` for routing |

Output

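| Field | Type | Description |
| --- | --- | --- |
| `text` | string | The full extracted text: all non-empty paragraphs joined with `\n\n` |
| `numPages` | integer | Number of sections in the document (heading-based count, minimum `1`). Despite the name, this is a section count, not a page count |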

Example

{
  "id": "docxExtract",
  "type": "docxExtract",
  "data": {
    "label": "DOCX Extract",
    "isExecuted": false,
    "handles": ["inputs", "outputs"],
    "schema": {},
    "params": {
      "file_url": { "value": "{{ @downloadEntry.file_url }}", "isExpression": true, "isAttachedToInputNode": false },
      "entry": { "value": "{{ @fetchEntries }}", "isExpression": true, "isAttachedToInputNode": false }
    },
    "inputs": [], "outputs": [], "errors": []
  },
  "position": { "x": 300, "y": 0 },
  "isSelected": false,
  "isDragging": false
}
