Tokenizer Pipeline

Overview

The Tokenizer Pipeline prepares documents for downstream processing by cleaning text and splitting it into smaller chunks.

It is commonly used after the Parser Pipeline and before pipelines such as the Embedder or Writer, where chunked text is required for indexing, search, or LLM processing.

What It Does

  • Cleans extracted document text
  • Removes unnecessary formatting and noise
  • Splits large documents into smaller, manageable chunks

The output of this pipeline is a list of tokenized document chunks that can be safely passed to:

  • Embedder pipelines
  • Writer pipelines
  • Vector indexing
  • Workflows and agents

Note

The Tokenizer Pipeline does not extract text from files.
It expects parsed documents as input, typically produced by the Parser Pipeline.
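
To make the data flow concrete, here is a minimal sketch of the input and output shapes, using hypothetical dictionaries rather than the platform's actual document objects:

```python
# A parsed document, as produced upstream by the Parser Pipeline
# (field names here are illustrative, not the platform's real schema)
parsed_doc = {
    "id": "doc-1",
    "text": "Full text extracted from the source file...",
}

# After the Tokenizer Pipeline runs, one document becomes many chunks,
# each of which is treated as an individual document downstream
chunk_docs = [
    {"id": "doc-1-chunk-0", "text": "First chunk of cleaned text...", "source": "doc-1"},
    {"id": "doc-1-chunk-1", "text": "Second chunk of cleaned text...", "source": "doc-1"},
]
```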

Using the Tokenizer Pipeline

Add to DocProcessorAgent

  • Go to Pipelines
  • Select Tokenizer Pipeline
  • Drag and drop it into DocProcessorAgent
  • Connect it after the Parser Pipeline

Configure Text Cleaning (Optional)

The pipeline automatically cleans text before splitting.

Common cleaning behaviors include:

  • Removing extra whitespace
  • Removing empty lines
  • Normalizing text characters

These defaults work well for most documents and usually do not require changes.
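
As a rough illustration of these defaults, the cleaning stage behaves something like the following Python sketch (an approximation, not the pipeline's actual implementation):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Approximate the pipeline's default cleaning behavior."""
    # Normalize text characters (e.g. compatibility forms, smart quotes)
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the empty ones
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```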

Configure Splitting Behavior

You can control how documents are split into chunks.

Common options:

  • Split by word, sentence, line, or page
  • Control chunk size (length)
  • Optionally add overlap between chunks

Guidance on chunk size:

  • Smaller chunks → better for embeddings and search
  • Larger chunks → better for summarization and context retention
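
For intuition, word-based splitting with overlap can be sketched like this (chunk_size and overlap are measured in words here; the pipeline's actual units, defaults, and option names may differ):

```python
def split_by_word(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks, repeating `overlap` words
    at each boundary so context is not lost between chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks
```

Overlap trades a little redundancy for continuity: a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks.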

Output

After execution:

  • Documents are cleaned and split into multiple chunks
  • Each chunk becomes an individual document
  • Output flows automatically to the next connected pipeline
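
Putting the pieces together, the output stage fans one cleaned document out into many chunk documents, roughly like this (reusing the hypothetical clean_text and split_by_word sketches above):

```python
def tokenize_document(doc: dict) -> list[dict]:
    """Turn one parsed document into a list of chunk documents."""
    cleaned = clean_text(doc["text"])
    return [
        {"id": f"{doc['id']}-chunk-{i}", "text": chunk, "source": doc["id"]}
        for i, chunk in enumerate(split_by_word(cleaned))
    ]

# Each resulting chunk document flows on to the next connected pipeline,
# such as an Embedder or Writer.
```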

Common Use Cases

  • Preparing documents for vector embeddings
  • Splitting long documents for LLM processing
  • Cleaning noisy OCR or parsed text
  • Improving retrieval quality in RAG workflows

Summary

The Tokenizer Pipeline cleans and splits documents into structured chunks.
It improves downstream processing quality and is a key step in search, RAG, and document understanding workflows.