Skip to main content

Email Processor Pipeline

Overview

The Email Processor Pipeline processes parsed email documents and stores them in a vector-enabled document store for search and retrieval.

It takes output from the Email Parser Pipeline, performs cleaning, tokenization, splitting, embedding, and indexes both email content and attachments.

This pipeline makes emails searchable, retrievable, and usable in RAG workflows.

What It Does

  • Accepts parsed email documents as input
  • Cleans content (whitespace, empty lines, ASCII filtering)
  • Splits documents into chunks (words, sentences, passages)
  • Generates embeddings using a configurable model
  • Indexes:
    • Email content into a content index
    • Attachments into a separate attachments index

Using the Email Processor Pipeline

Add to Agent

  • Go to Pipelines
  • Select Email Processor Pipeline
  • Drag it into DocProcessorAgent
  • Place after the Email Parser Pipeline

Configure Dataset and Indices

Configure the agent dataset with the following:

  • Content index for email body and metadata
  • Attachments index for email attachments
  • These indices are selected in the dataset configuration and used by the pipeline to write processed data
Email Processor dataset and indices configuration

Input & Output

  • Input: Parsed email documents from the Email Parser Pipeline
  • Output: Embedded and indexed documents and attachments; errors reported for duplicates or runtime issues

Pipeline Flow

  1. Email Parser Pipeline extracts content and attachments
  2. Email Processor Pipeline:
    • Cleans content
    • Splits into chunks
    • Generates embeddings
    • Writes to configured indices
  3. Indexed data becomes available for search and retrieval

Common Use Cases

  • Indexing enterprise inboxes for semantic search
  • Preparing email data for RAG-based agents
  • Storing attachments separately for structured retrieval

Summary

The Email Processor Pipeline enables end-to-end preparation of email data for vector search and intelligent querying.
It ensures email content and attachments are fully searchable and ready for downstream workflows.