Parser Pipeline

Overview

The Parser Pipeline extracts readable text from documents so they can be processed by other pipelines such as embedding, indexing, or workflows.

It supports common formats including PDF, Word, CSV, Markdown, Text, and Excel, and produces clean, structured documents as output.

This pipeline is usually the starting point for any document-based processing flow.

What It Does

  • Once deployed, accepts files from chat uploads or datasets
  • Extracts readable text from supported formats
  • Outputs structured documents for downstream pipelines

Parsed documents can be passed to:

  • Embedder pipelines
  • Writer pipelines
  • Workflows
  • Document-based agents

Note: The Parser Pipeline only extracts content. For indexing or storage, connect it to pipelines like Embedder or Writer.
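
To make the extract-then-hand-off behaviour concrete, here is a minimal Python sketch of what a parsing step might look like for the simplest text-based formats. It is an illustration under stated assumptions, not the product's implementation: `ParsedDocument`, `parse_bytes`, and the metadata keys are hypothetical names, and PDF, Word, and Excel handling is left out because it would need third-party libraries.

```python
import csv
import io
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class ParsedDocument:
    """Hypothetical shape of a structured document (not the product's schema)."""
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


def parse_bytes(filename: str, data: bytes) -> ParsedDocument:
    """Extract readable text from a few text-based formats (illustrative only)."""
    suffix = PurePath(filename).suffix.lower()
    raw = data.decode("utf-8", errors="replace")

    if suffix in {".txt", ".md"}:
        # Plain text and Markdown are already readable; just trim whitespace.
        text = raw.strip()
    elif suffix == ".csv":
        # Flatten each CSV row into one line of pipe-separated cells.
        rows = csv.reader(io.StringIO(raw))
        text = "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)
    else:
        raise ValueError(f"unsupported format in this sketch: {suffix or filename}")

    return ParsedDocument(
        source=filename,
        text=text,
        metadata={"format": suffix.lstrip(".")},
    )
```

The function only returns the extracted document; storing or indexing it would be the job of whatever stage consumes the result, mirroring the separation described above.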

Using the Parser Pipeline

Add to DocProcessorAgent

  • Open Pipelines
  • Select Parser Pipeline
  • Drag and drop it into the DocProcessorAgent

Configure Parsing Mode (Optional)

Use OCR when working with scanned or image-based documents.

Parsing options:

  • OCR Disabled (default)
    Best for digital PDFs and text-based files
  • OCR Enabled
    Extracts text from scanned or image-only PDFs

[Figure: OCR configuration settings]
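
As a rough guide to choosing between the two modes, the sketch below shows one common heuristic: if direct text extraction yields little or no text on most pages, the file is probably scanned and OCR is worth enabling. `ParserConfig`, `needs_ocr`, and the `min_chars_per_page` threshold are hypothetical names and values chosen for illustration; the product's actual settings are configured in the UI and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParserConfig:
    """Hypothetical config object; OCR is off by default, as documented."""
    ocr_enabled: bool = False


def needs_ocr(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a document is scanned, based on per-page extracted text.

    Assumption: a page with fewer than `min_chars_per_page` characters of
    directly extracted text is treated as image-only.
    """
    if not page_texts:
        # Nothing could be extracted at all, so OCR is the only option left.
        return True
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    # Recommend OCR when more than half of the pages look image-only.
    return sparse_pages / len(page_texts) > 0.5


def choose_config(page_texts: list[str]) -> ParserConfig:
    """Build a config that enables OCR only when the heuristic suggests it."""
    return ParserConfig(ocr_enabled=needs_ocr(page_texts))
```

Keeping OCR off unless it is needed matches the documented default and avoids running a slower recognition step on files that already contain a text layer.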

Output

After execution:

  • One or more structured documents are produced
  • Output automatically flows to the next connected pipeline
  • An error is returned if no readable content is found
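
The three outcomes above can be sketched as a small function. This is a hedged illustration rather than the pipeline's real interface: `NoReadableContentError`, `run_parser_stage`, and the `next_stage` callback are hypothetical names, and documents are represented as plain strings for brevity.

```python
from typing import Callable, Iterable


class NoReadableContentError(Exception):
    """Hypothetical error for input that yields no readable text."""


def run_parser_stage(
    extracted_texts: Iterable[str],
    next_stage: Callable[[list[str]], None],
) -> list[str]:
    """Forward readable documents to the next stage, or fail if there are none."""
    # Keep only documents that contain something other than whitespace.
    readable = [text for text in extracted_texts if text.strip()]
    if not readable:
        raise NoReadableContentError("no readable content found")
    # Hand the output to whatever pipeline is connected downstream.
    next_stage(readable)
    return readable
```

A caller could pass an embedder or writer entry point as `next_stage`; in the sketch any callable that accepts a list of strings will do.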

Common Use Cases

  • Parsing uploaded documents in chat
  • Preparing documents for vector indexing
  • Extracting text from scanned PDFs
  • Feeding documents into workflows and agents

Summary

The Parser Pipeline converts raw documents into structured text.
With support for multiple formats and optional OCR, it typically serves as the entry point for document-processing workflows.