Tokenizer Pipeline

Overview

The Tokenizer Pipeline prepares documents for downstream processing by cleaning text and splitting it into smaller chunks.

It is commonly used after the Parser Pipeline and before pipelines such as the Embedder or Writer, where chunked text is required for indexing, search, or LLM processing.

What It Does

  • Cleans extracted document text
  • Removes unnecessary formatting and noise
  • Splits large documents into smaller, manageable chunks

The output of this pipeline is a list of tokenized document chunks that can be safely passed to:

  • Embedder pipelines
  • Writer pipelines
  • Vector indexing
  • Workflows and agents

Note

The Tokenizer Pipeline does not extract text from files.
It expects parsed documents as input, typically produced by the Parser Pipeline.
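
To make the data flow concrete, here is a minimal sketch of the input and output shapes, using hypothetical dictionaries rather than the platform's actual document objects:

```python
# A parsed document, as produced upstream by the Parser Pipeline
# (field names here are illustrative, not the platform's real schema)
parsed_doc = {
    "id": "doc-1",
    "text": "Full text extracted from the source file...",
}

# After the Tokenizer Pipeline runs, one document becomes many chunks,
# each of which is treated as an individual document downstream
chunk_docs = [
    {"id": "doc-1-chunk-0", "text": "First chunk of cleaned text...", "source": "doc-1"},
    {"id": "doc-1-chunk-1", "text": "Second chunk of cleaned text...", "source": "doc-1"},
]
```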

Using the Tokenizer Pipeline

Add to DocProcessorAgent

  • Go to Pipelines
  • Select Tokenizer Pipeline
  • Drag and drop it into DocProcessorAgent
  • Connect it after the Parser Pipeline

Configure Text Cleaning (Optional)

The pipeline automatically cleans text before splitting.

Common cleaning behaviors include:

  • Removing extra whitespace
  • Removing empty lines
  • Normalizing text characters

These defaults work well for most documents and usually do not require changes.
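
As a rough illustration of these defaults, the cleaning stage behaves something like the following Python sketch (an approximation, not the pipeline's actual implementation):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Approximate the pipeline's default cleaning behavior."""
    # Normalize text characters (e.g. compatibility forms, smart quotes)
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the empty ones
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```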

Configure Splitting Behavior

You can control how documents are split into chunks.

Common options:

  • Split by word, sentence, line, or page
  • Control chunk size (length)
  • Optionally add overlap between chunks

Guidance on chunk size:

  • Smaller chunks → better for embeddings and search
  • Larger chunks → better for summarization and context retention
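
For intuition, word-based splitting with overlap can be sketched like this (chunk_size and overlap are measured in words here; the pipeline's actual units, defaults, and option names may differ):

```python
def split_by_word(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks, repeating `overlap` words
    at each boundary so context is not lost between chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks
```

Overlap trades a little redundancy for continuity: a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks.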

Output

After execution:

  • Documents are cleaned and split into multiple chunks
  • Each chunk becomes an individual document
  • Output flows automatically to the next connected pipeline
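
Putting the pieces together, the output stage fans one cleaned document out into many chunk documents, roughly like this (reusing the hypothetical clean_text and split_by_word sketches above):

```python
def tokenize_document(doc: dict) -> list[dict]:
    """Turn one parsed document into a list of chunk documents."""
    cleaned = clean_text(doc["text"])
    return [
        {"id": f"{doc['id']}-chunk-{i}", "text": chunk, "source": doc["id"]}
        for i, chunk in enumerate(split_by_word(cleaned))
    ]

# Each resulting chunk document flows on to the next connected pipeline,
# such as an Embedder or Writer.
```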

Common Use Cases

  • Preparing documents for vector embeddings
  • Splitting long documents for LLM processing
  • Cleaning noisy OCR or parsed text
  • Improving retrieval quality in RAG workflows

Summary

The Tokenizer Pipeline cleans and splits documents into structured chunks.
It improves downstream processing quality and is a key step in search, RAG, and document understanding workflows.