ADR-007: Cross-Encoder Re-Ranking for RAG Pipeline¶
Status: Accepted
Date: 2026-04-02
Decision makers: Kamerplanter Development Team
Context¶
The RAG pipeline (ADR-006) uses Hybrid Search (vector + BM25 full-text with Reciprocal Rank Fusion). The RAG benchmark shows a 29% overall score with the following error distribution:
- 57% GENERATION_MISS — LLM receives irrelevant chunks and hallucinates
- 11% RETRIEVAL_MISS — Hybrid Search fails to find the correct chunk in Top-K
- 21% SYNONYM_GAP — evaluation patterns do not recognise correct answers (independent issue)
Problem¶
The Bi-Encoder (E5-base) and BM25 rank independently. Their fusion via RRF is effective for recall but not optimal for precision: irrelevant chunks with high BM25 scores (keyword matches without semantic relevance) pollute the LLM context. This causes the dominant error class GENERATION_MISS.
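To make the precision problem concrete, here is a minimal sketch of Reciprocal Rank Fusion (k=60 is the commonly used constant; the document IDs and rankings below are hypothetical). A chunk that BM25 ranks first on keyword overlap alone still fuses near the top, even when the semantic ranking places it last:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked result lists into one list ordered by RRF score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) per document.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "c1" is a keyword match without semantic relevance: last in the
# vector ranking, first in BM25 -- yet it fuses into second place
# and ends up in the LLM context.
vector_hits = ["c3", "c7", "c9", "c1"]   # semantic ranking
bm25_hits = ["c1", "c2", "c3", "c5"]     # keyword ranking
fused = rrf_fuse([vector_hits, bm25_hits])
```

Because RRF only sees ranks, not relevance, it cannot demote such chunks; that is the gap the cross-encoder closes.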
Decision¶
Cross-Encoder Re-Ranking as an optional pipeline stage between retrieval and LLM generation.
Architecture¶
- Separate microservice (`reranker-service`), analogous to the embedding service (ONNX Runtime + FastAPI)
- Model: `BAAI/bge-reranker-v2-m3` (multilingual, 568M parameters, Apache-2.0)
- Graceful degradation: without a reranker URL the pipeline operates as before (hybrid search only)
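The pipeline stage itself can be sketched as follows. This is an illustrative outline, not the service's actual code: `rerank` and `score_fn` are hypothetical names, and `score_fn` stands in for the HTTP call to the reranker-service that would score each (query, chunk) pair.

```python
import os

def rerank(query, chunks, score_fn, initial_k=20, top_k=5):
    """Cross-encoder re-ranking stage between retrieval and generation.

    Scores (query, chunk) pairs jointly and keeps only the best chunks
    for the LLM context. Graceful degradation: without RERANKER_URL the
    pipeline passes hybrid-search results through unchanged.
    """
    if not os.environ.get("RERANKER_URL"):
        return chunks[:top_k]            # behave exactly as before
    candidates = chunks[:initial_k]      # re-rank only the hybrid Top-K
    scores = score_fn(query, candidates) # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

In the real service, `score_fn` would POST the pairs to the reranker-service; here a stub suffices to show the control flow.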
Why a separate service instead of in-process?¶
- Cross-encoder inference is CPU-intensive (~500ms for 20 pairs) — requires its own memory/CPU budget
- Existing architecture (embedding service) has proven itself
- Optionally deployable — no additional Python dependency in the knowledge service
Why bge-reranker-v2-m3?¶
| Criterion | bge-reranker-v2-m3 | ms-marco-MiniLM-L-12-v2 |
|---|---|---|
| Languages | Multilingual (DE/EN) | English-focused |
| Parameters | 568M | 33M |
| BEIR benchmark | State-of-the-art | Good, but EN-only |
| ONNX export | Via optimum | Via optimum |
| Licence | Apache-2.0 | Apache-2.0 |
| Kamerplanter use case | Suitable (DE knowledge base) | Unsuitable |
Rejected alternatives¶
- RRF weight tuning only: Improves precision minimally; does not solve the root problem (keyword matches without semantic relevance)
- ColBERT / Late Interaction: Higher recall, but recall is not the bottleneck (RETRIEVAL_MISS accounts for only 11% of errors)
- LLM-based re-ranking: Too slow (>5s), too expensive for local inference
Consequences¶
Positive¶
- More precise LLM context → fewer GENERATION_MISS errors
- Expected benchmark improvement: 10–20 percentage points
- No breaking changes — graceful degradation when reranker is absent
- Same deployment patterns as the embedding service (known and tested)
Negative¶
- Additional microservice (+1.5–4 GB RAM, +500ms latency per query)
- ONNX export via `optimum` during Docker build (longer build time)
- First Docker build takes ~10–15 minutes (model download + export)
Neutral¶
- Knowledge service config: 3 new environment variables (`RERANKER_URL`, `RERANKER_INITIAL_K`, `RERANKER_TOP_K`)
- Helm/Skaffold: new controller, service, NetworkPolicy (analogous to the embedding service)
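The knowledge-service side of the configuration might look like the sketch below. The variable names come from this ADR; the default values shown are illustrative assumptions, not the service's documented defaults.

```python
import os

# Reranker configuration read from the environment.
# An unset/empty RERANKER_URL triggers graceful degradation:
# the pipeline runs hybrid search only, as before.
RERANKER_URL = os.environ.get("RERANKER_URL", "")
RERANKER_INITIAL_K = int(os.environ.get("RERANKER_INITIAL_K", "20"))  # candidates to re-rank
RERANKER_TOP_K = int(os.environ.get("RERANKER_TOP_K", "5"))           # chunks kept for the LLM

reranking_enabled = bool(RERANKER_URL)
```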