RetrieverPipeline

Description: Haystack pipeline for retrieving documents from an Elasticsearch document store.

Overview

The Retriever Pipeline is a Haystack-based retrieval pipeline designed to fetch relevant documents from an Elasticsearch-backed Document Store.
It performs semantic similarity search using embeddings and acts as the retrieval component of Retrieval-Augmented Generation (RAG) workflows, supplying contextually relevant documents for a user’s natural language query.

In addition to semantic retrieval, the pipeline supports dataset-level filtering to enforce data security, tenant isolation, and role-based access control.

Core Responsibilities

Interpret user search queries
Perform embedding-based similarity search
Retrieve the most relevant documents from Elasticsearch
Apply dataset-level and user-selected filters
Return contextually relevant and security-compliant results
Act as the retrieval layer for RAG workflows

Configure Retriever Pipeline

name – Logical name used to identify the Retriever Pipeline configuration.
description – Human-readable description explaining the purpose of the pipeline.
top_k – Number of top-ranked documents returned after retrieval and reranking.
num_candidates – Number of initial candidate documents fetched before reranking; if null, a default value is used.
filter_policy – Determines how filters are applied during retrieval (REPLACE, APPEND, or REMOVE).
embedding_model – Model used to generate vector embeddings for semantic similarity search.
ranking_model – Reranker model used to re-score and reorder retrieved documents for improved relevance.

This configuration provides fine-grained control over semantic retrieval, filtering behavior, and result quality within the Retriever Pipeline.
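
As an illustration only, the fields above could be modeled as a small Python data class. The class names, the default values, and the rule that num_candidates must not be smaller than top_k are assumptions made for this sketch; they are not the pipeline's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FilterPolicy(Enum):
    """The three filter policies named in the configuration list."""
    REPLACE = "REPLACE"
    APPEND = "APPEND"
    REMOVE = "REMOVE"


@dataclass
class RetrieverPipelineConfig:
    """Hypothetical container mirroring the documented configuration fields."""
    name: str
    description: str
    embedding_model: str
    ranking_model: str
    top_k: int = 5                        # assumed default
    num_candidates: Optional[int] = None  # None means "use the default value"
    filter_policy: FilterPolicy = FilterPolicy.REPLACE  # assumed default

    def __post_init__(self) -> None:
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        # Assumed sanity check: the candidate pool should cover top_k.
        if self.num_candidates is not None and self.num_candidates < self.top_k:
            raise ValueError("num_candidates must be >= top_k")


# Placeholder model names; substitute the models available in your deployment.
config = RetrieverPipelineConfig(
    name="support-docs-retriever",
    description="Retrieves support articles for RAG answers.",
    embedding_model="example-embedding-model",
    ranking_model="example-ranking-model",
    top_k=5,
    num_candidates=50,
    filter_policy=FilterPolicy.APPEND,
)
```

Validating the values when the object is created surfaces an inconsistent configuration before any query is run.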

Retrieval Capabilities

Semantic & Similarity Search

The Retriever Pipeline uses vector embeddings to:

Convert user queries into embeddings
Compare query embeddings with document embeddings
Rank documents based on semantic similarity
Retrieve top-matching documents even when keywords differ

This enables context-aware retrieval beyond exact keyword matching.
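
The ranking step can be sketched in a few lines of plain Python. This example assumes cosine similarity as the similarity measure and uses tiny hand-written vectors in place of real model embeddings; the measure actually used depends on how the Elasticsearch index is configured.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_embedding: List[float],
    doc_embeddings: Dict[str, List[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the top_k best."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
doc_embeddings = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "returns-faq": [0.8, 0.2, 0.1],
}
query_embedding = [1.0, 0.0, 0.0]

top = rank_by_similarity(query_embedding, doc_embeddings, top_k=2)
print([doc_id for doc_id, _ in top])  # ['refund-policy', 'returns-faq']
```

Documents are ordered by how closely their vectors point in the same direction as the query vector, which is why a match does not depend on shared keywords.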

Dataset-Level Filters & Secure Retrieval

The Retriever Pipeline supports dataset-level filters to ensure secure and context-aware document retrieval.
If the selected Vector Store dataset defines filterable metadata fields such as:

group
user
tenant_id
role

these fields are automatically detected from the dataset schema and made available in the UI, so no manual pipeline configuration is required.

How Filters Work in the UI

Each filterable field appears as a checkbox list
Users can:
- Select one or more values within a filter
- Combine filters across multiple fields
Applying filters is optional and fully user-controlled

What Happens During Retrieval

If no filters are selected:
- Documents are retrieved purely based on semantic similarity, using vector embeddings to find the most relevant content.
If one or more filters are selected:
- The system adds security filters to the search query so that only documents matching the selected criteria are considered.

This approach ensures:

Data-level security
Tenant and role-based isolation
Retrieval of only relevant and authorized documents

In short, filters let users narrow results safely, while semantic search ensures the results remain contextually relevant.
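
One way to picture how UI selections become query constraints is the sketch below. It assumes that multiple values selected within one field are alternatives (an Elasticsearch terms clause) and that different fields must all match (clauses listed under bool.filter). The function name and the flat field names are hypothetical; the real mapping depends on how metadata is indexed.

```python
from typing import Dict, List, Optional


def build_bool_filter(selections: Dict[str, List[str]]) -> Optional[dict]:
    """Turn {field: [selected values]} into an Elasticsearch bool.filter clause.

    Returns None when nothing is selected, so the caller can run a pure
    similarity search without any filter.
    """
    clauses = [
        {"terms": {field: sorted(values)}}  # any of the selected values may match
        for field, values in sorted(selections.items())
        if values  # skip fields with no checked values
    ]
    if not clauses:
        return None
    # Every clause under bool.filter must match, without affecting the score.
    return {"bool": {"filter": clauses}}


# Example: one tenant selected, two roles selected, nothing selected for group.
selected = {"tenant_id": ["acme"], "role": ["analyst", "admin"], "group": []}
print(build_bool_filter(selected))
```

Filter clauses in Elasticsearch restrict which documents are eligible but do not contribute to relevance scores, so the ordering of the remaining documents still comes from semantic similarity.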

Processing Flow

User submits a natural language query
Query is converted into an embedding
Dataset filter schema is evaluated
User-selected filters (if any) are applied
Elasticsearch executes:
- Vector similarity search
- Optional bool.filter constraints
Relevant documents are retrieved and ranked
Results are returned to downstream agents or pipelines
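
The Elasticsearch step of this flow can be sketched as a request body for a kNN search. The vector field name "embedding", the fallback of ten times top_k for num_candidates, and the function name are assumptions for illustration; the query the pipeline actually sends may differ.

```python
from typing import List, Optional


def build_knn_search_body(
    query_vector: List[float],
    top_k: int,
    num_candidates: Optional[int] = None,
    bool_filter: Optional[dict] = None,
    vector_field: str = "embedding",  # assumed name of the dense_vector field
) -> dict:
    """Assemble an Elasticsearch kNN search body with an optional filter."""
    knn = {
        "field": vector_field,
        "query_vector": query_vector,
        "k": top_k,
        # Assumed fallback when num_candidates is null in the configuration.
        "num_candidates": num_candidates if num_candidates is not None else top_k * 10,
    }
    if bool_filter is not None:
        # Restricts the vector search to documents that satisfy the filter.
        knn["filter"] = bool_filter
    return {"knn": knn, "size": top_k}


# No filters selected: pure similarity search.
unfiltered = build_knn_search_body([0.1, 0.2, 0.3], top_k=5)

# Filters selected: same search, limited to one tenant.
tenant_filter = {"bool": {"filter": [{"terms": {"tenant_id": ["acme"]}}]}}
filtered = build_knn_search_body(
    [0.1, 0.2, 0.3], top_k=5, num_candidates=50, bool_filter=tenant_filter
)
```

Placing the filter inside the kNN clause means it is applied while candidates are collected, so the search can still return up to top_k authorized documents instead of discarding unauthorized hits afterwards.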

Key Features

Haystack-based retrieval architecture
Elasticsearch document store
Vector embedding similarity search
RAG-compatible context retrieval
Automatic dataset-level filter detection
UI-driven filter selection
Secure and isolated data access

When to Use the Retriever Pipeline

Use the Retriever Pipeline when:

Documents are stored in Elasticsearch with embeddings
Semantic relevance is more important than keyword matching
Retrieval is part of a RAG or question-answering workflow
Dataset-level security or tenant isolation is required
Users need controlled, filter-based access to documents

Summary

The Retriever Pipeline is a Haystack-based semantic retrieval pipeline that fetches relevant documents from an Elasticsearch document store using embedding-based similarity search. It serves as the retrieval layer for RAG workflows by supplying contextually relevant documents based on user queries. The pipeline automatically detects filterable dataset fields such as tenant, user, or group and exposes them in the UI for secure, user-driven filtering. When filters are applied, Elasticsearch queries are augmented with bool.filter clauses to enforce data-level security, tenant isolation, and role-based access, ensuring context-aware and authorized document retrieval.
