RetrieverPipeline

Description: Haystack pipeline for retrieving documents from an Elasticsearch document store.

Overview

The Retriever Pipeline is a Haystack-based retrieval pipeline designed to fetch relevant documents from an Elasticsearch-backed Document Store.
It performs semantic similarity search using embeddings and acts as the retrieval component of Retrieval-Augmented Generation (RAG) workflows, supplying contextually relevant documents for a user's natural language query.

In addition to semantic retrieval, the pipeline supports dataset-level filtering to enforce data security, tenant isolation, and role-based access control.


Core Responsibilities

  • Interpret user search queries
  • Perform embedding-based similarity search
  • Retrieve the most relevant documents from Elasticsearch
  • Apply dataset-level and user-selected filters
  • Return contextually relevant and security-compliant results
  • Act as the retrieval layer for RAG workflows

Configure Retriever Pipeline

  • name – Logical name used to identify the Retriever Pipeline configuration.
  • description – Human-readable description explaining the purpose of the pipeline.
  • top_k – Number of top-ranked documents returned after retrieval and reranking.
  • num_candidates – Number of initial candidate documents fetched before reranking; if null, a default value is used.
  • filter_policy – Determines how filters are applied during retrieval (REPLACE, APPEND, or REMOVE).
  • embedding_model – Model used to generate vector embeddings for semantic similarity search.
  • ranking_model – Reranker model used to re-score and reorder retrieved documents for improved relevance.

This configuration provides fine-grained control over semantic retrieval, filtering behavior, and result quality within the Retriever Pipeline.
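
As an illustration only, the settings above could be modelled as a small validated structure. The class name, the placeholder model names, and the fallback used when num_candidates is null are assumptions made for this sketch; the actual default value is not stated on this page.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_FILTER_POLICIES = {"REPLACE", "APPEND", "REMOVE"}


# Hypothetical container for the settings listed above; not the product's own class.
@dataclass
class RetrieverPipelineConfig:
    name: str
    description: str
    embedding_model: str
    ranking_model: str
    top_k: int = 5
    num_candidates: Optional[int] = None  # None (null) means "use a default"
    filter_policy: str = "REPLACE"

    def __post_init__(self) -> None:
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if self.filter_policy not in ALLOWED_FILTER_POLICIES:
            raise ValueError(f"unknown filter_policy: {self.filter_policy}")
        if self.num_candidates is not None and self.num_candidates < self.top_k:
            raise ValueError("num_candidates must be at least top_k")

    def effective_num_candidates(self) -> int:
        # ASSUMPTION: top_k * 10 is a placeholder fallback, not the documented default.
        if self.num_candidates is not None:
            return self.num_candidates
        return self.top_k * 10


config = RetrieverPipelineConfig(
    name="support-docs-retriever",
    description="Retrieves support articles for RAG answers",
    embedding_model="example-embedding-model",  # placeholder model name
    ranking_model="example-reranker-model",     # placeholder model name
    top_k=5,
)
print(config.effective_num_candidates())
```

Keeping num_candidates larger than top_k gives the reranker a wider pool to reorder before the final top_k documents are returned.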


Retrieval Capabilities

The Retriever Pipeline uses vector embeddings to:

  • Convert user queries into embeddings
  • Compare query embeddings with document embeddings
  • Rank documents based on semantic similarity
  • Retrieve top-matching documents even when keywords differ

This enables context-aware retrieval beyond exact keyword matching.
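
The ranking idea can be shown with a toy example. This sketch assumes cosine similarity as the metric (the metric actually used depends on how the Elasticsearch index is configured) and uses hand-written three-dimensional vectors in place of real model embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_embedding: List[float],
    doc_embeddings: Dict[str, List[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the top_k best matches."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (illustrative values only).
doc_embeddings = {
    "reset-password": [0.9, 0.1, 0.0],
    "billing-faq": [0.1, 0.9, 0.2],
    "login-troubleshooting": [0.8, 0.2, 0.1],
}
query_embedding = [1.0, 0.0, 0.0]

top = rank_by_similarity(query_embedding, doc_embeddings, top_k=2)
print([doc_id for doc_id, _ in top])
```

Documents are ordered by how closely their vectors point in the same direction as the query vector, which is why a document can rank highly without sharing any keywords with the query.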


Dataset-Level Filters & Secure Retrieval

The Retriever Pipeline supports dataset-level filters to ensure secure and context-aware document retrieval.
If the selected Vector Store dataset defines filterable metadata fields such as:

  • group
  • user
  • tenant_id
  • role

these fields are automatically detected from the dataset schema and made available in the UI; no manual pipeline configuration is required.

How Filters Work in the UI

  • Each filterable field appears as a checkbox list
  • Users can:
    • Select one or more values within a filter
    • Combine filters across multiple fields
  • Applying filters is optional and fully user-controlled

What Happens During Retrieval

  • If no filters are selected
    Documents are retrieved purely based on semantic similarity, using vector embeddings to find the most relevant content.

  • If one or more filters are selected
    The system adds security filters to the search query so that only documents matching the selected criteria are considered.

This approach ensures:

  • Data-level security
  • Tenant and role-based isolation
  • Retrieval of only relevant and authorized documents

In short, filters let users narrow results safely, while semantic search ensures the results remain contextually relevant.
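
A minimal sketch of how selected filters might be turned into an Elasticsearch kNN search body follows. The helper name, the vector field name "embedding", and the metadata field paths are assumptions; the sketch also assumes that values within one field are alternatives (a terms clause) while different fields must all match, which this page does not specify.

```python
from typing import Any, Dict, List, Optional


def build_knn_search_body(
    query_vector: List[float],
    top_k: int,
    num_candidates: int,
    selected_filters: Optional[Dict[str, List[str]]] = None,
    vector_field: str = "embedding",  # ASSUMPTION: name of the dense vector field
) -> Dict[str, Any]:
    """Build a kNN search body, adding bool.filter clauses only when filters are selected."""
    knn: Dict[str, Any] = {
        "field": vector_field,
        "query_vector": query_vector,
        "k": top_k,
        "num_candidates": num_candidates,
    }
    # One terms clause per field that has at least one selected value.
    clauses = [
        {"terms": {field: values}}
        for field, values in (selected_filters or {}).items()
        if values
    ]
    if clauses:
        knn["filter"] = {"bool": {"filter": clauses}}
    return {"knn": knn}


# No filters selected: pure semantic similarity search.
plain_body = build_knn_search_body([0.1, 0.2], top_k=5, num_candidates=50)

# Filters selected: only matching documents are considered as candidates.
filtered_body = build_knn_search_body(
    [0.1, 0.2],
    top_k=5,
    num_candidates=50,
    selected_filters={"tenant_id": ["acme"], "role": ["admin", "analyst"], "group": []},
)
print(filtered_body["knn"]["filter"])
```

Placing the filter inside the knn clause restricts the candidate set during the vector search itself, so excluded documents never take up slots among the top results.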


Processing Flow

  1. User submits a natural language query
  2. Query is converted into an embedding
  3. Dataset filter schema is evaluated
  4. User-selected filters (if any) are applied
  5. Elasticsearch executes:
    • Vector similarity search
    • Optional bool.filter constraints
  6. Relevant documents are retrieved and ranked
  7. Results are returned to downstream agents or pipelines
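
The steps above can be sketched as a single orchestration function. Everything here is hypothetical: run_retrieval, fake_embed, fake_search, and the in-memory DOCS list are stand-ins for the embedding model and Elasticsearch, used only to make the order of operations concrete.

```python
from typing import Callable, Dict, List, Sequence

Filters = Dict[str, List[str]]


def run_retrieval(
    query: str,
    filterable_fields: Sequence[str],
    selected_filters: Filters,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], Filters, int], List[dict]],
    top_k: int = 5,
) -> List[dict]:
    # Step 2: convert the query into an embedding.
    query_embedding = embed(query)
    # Step 3: evaluate the dataset filter schema; reject fields it does not expose.
    unknown = set(selected_filters) - set(filterable_fields)
    if unknown:
        raise ValueError(f"not filterable in this dataset: {sorted(unknown)}")
    # Step 4: keep only filters for which the user actually selected values.
    active_filters = {field: vals for field, vals in selected_filters.items() if vals}
    # Steps 5-6: the search backend runs the vector search with optional filters and ranks hits.
    hits = search(query_embedding, active_filters, top_k)
    # Step 7: hand the results to downstream agents or pipelines.
    return hits[:top_k]


# Stand-ins for the embedding model and Elasticsearch (illustrative data only).
DOCS = [
    {"id": "a", "tenant_id": "acme", "score": 0.91},
    {"id": "b", "tenant_id": "globex", "score": 0.88},
    {"id": "c", "tenant_id": "acme", "score": 0.42},
]


def fake_embed(text: str) -> List[float]:
    return [float(len(text))]


def fake_search(query_embedding: List[float], filters: Filters, top_k: int) -> List[dict]:
    matches = [d for d in DOCS if all(d.get(f) in vals for f, vals in filters.items())]
    return sorted(matches, key=lambda d: d["score"], reverse=True)[:top_k]


results = run_retrieval(
    "how do I reset my password?",
    filterable_fields=["tenant_id", "role"],
    selected_filters={"tenant_id": ["acme"]},
    embed=fake_embed,
    search=fake_search,
    top_k=2,
)
print([doc["id"] for doc in results])
```

Passing the embedding and search steps in as callables keeps the flow testable without a running Elasticsearch cluster.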

Key Features

  • Haystack-based retrieval architecture
  • Elasticsearch document store
  • Vector embedding similarity search
  • RAG-compatible context retrieval
  • Automatic dataset-level filter detection
  • UI-driven filter selection
  • Secure and isolated data access

When to Use the Retriever Pipeline

Use the Retriever Pipeline when:

  • Documents are stored in Elasticsearch with embeddings
  • Semantic relevance is more important than keyword matching
  • Retrieval is part of a RAG or question-answering workflow
  • Dataset-level security or tenant isolation is required
  • Users need controlled, filter-based access to documents

Summary

The Retriever Pipeline is a Haystack-based semantic retrieval pipeline that fetches relevant documents from an Elasticsearch document store using embedding-based similarity search. It serves as the retrieval layer for RAG workflows by supplying contextually relevant documents based on user queries. The pipeline automatically detects filterable dataset fields such as tenant_id, user, or group and exposes them in the UI for secure, user-driven filtering. When filters are applied, Elasticsearch queries are augmented with bool.filter clauses to enforce data-level security, tenant isolation, and role-based access, so that only contextually relevant and authorized documents are retrieved.