Leveraging Approximate Caching for Faster Retrieval-Augmented Generation

  • Shai Bergman
  • Anne Marie Kermarrec
  • Diana Petrescu
  • Rafael Pires
  • Mathis Randl*
  • Martijn De Vos
  • Ji Zhang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since retrieving relevant documents from large vector databases is computationally expensive. To address this, we introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear, substantially reducing the reliance on expensive vector database lookups. To scale efficiently, Proximity employs a locality-sensitive hashing (LSH) scheme that enables fast cache lookups while preserving retrieval accuracy. We evaluate Proximity using the MMLU and MedRAG question-answering benchmarks. Our experiments demonstrate that Proximity with our LSH scheme and a realistically skewed MedRAG workload reduces database calls by 77.2% while maintaining database recall and test accuracy. We experiment with different similarity tolerances and cache capacities, and show that the time spent within the Proximity cache remains low and constant (4.8 μs) even as the cache grows substantially in size. Our results demonstrate that approximate caching is a practical and effective strategy for optimizing RAG-based systems.
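The core idea — hashing query embeddings with random hyperplanes so that similar queries land in the same cache bucket and reuse previously retrieved documents — can be sketched as follows. This is a minimal illustration of the general LSH-caching pattern, not the paper's actual implementation; all names and parameters (`ApproximateCache`, `num_planes`, `vector_db_lookup`) are hypothetical.

```python
import random

class ApproximateCache:
    """Illustrative LSH-based approximate cache for RAG retrieval results."""

    def __init__(self, dim, num_planes=16, seed=42):
        rng = random.Random(seed)
        # Random hyperplanes: queries falling on the same side of every
        # plane share a bucket and are treated as similar enough to reuse
        # each other's retrieved documents.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_planes)]
        self.buckets = {}  # LSH signature -> previously retrieved documents

    def _signature(self, embedding):
        # One bit per hyperplane: the sign of the dot product.
        return tuple(
            sum(p * x for p, x in zip(plane, embedding)) >= 0
            for plane in self.planes
        )

    def get(self, embedding):
        return self.buckets.get(self._signature(embedding))

    def put(self, embedding, documents):
        self.buckets[self._signature(embedding)] = documents

def retrieve(query_embedding, cache, vector_db_lookup):
    """Consult the cache first; fall back to the vector DB on a miss."""
    docs = cache.get(query_embedding)
    if docs is None:
        docs = vector_db_lookup(query_embedding)  # the expensive path
        cache.put(query_embedding, docs)
    return docs
```

Raising `num_planes` makes buckets narrower (fewer false reuses, fewer cache hits); lowering it does the opposite, which mirrors the similarity-tolerance trade-off the abstract describes. A repeated or sufficiently similar query then skips the vector database entirely:

```python
cache = ApproximateCache(dim=3)
db_calls = []
def fake_db(q):
    db_calls.append(q)
    return ["doc_a", "doc_b"]

q = [1.0, 0.2, -0.5]
retrieve(q, cache, fake_db)   # miss: hits the vector DB
retrieve(q, cache, fake_db)   # hit: served from the cache
# db_calls now contains a single entry
```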

Original language: English
Title of host publication: Middleware 2025 - Proceedings of the 26th ACM International Middleware Conference
Publisher: Association for Computing Machinery, Inc
Pages: 340-353
Number of pages: 14
ISBN (Electronic): 9798400715549
DOIs
Publication status: Published - 14 Dec 2025
Externally published: Yes
Event: 26th ACM International Middleware Conference, Middleware 2025 - Nashville, United States
Duration: 15 Dec 2025 - 19 Dec 2025

Publication series

Name: Middleware 2025 - Proceedings of the 26th ACM International Middleware Conference

Conference

Conference: 26th ACM International Middleware Conference, Middleware 2025
Country/Territory: United States
City: Nashville
Period: 15/12/25 - 19/12/25

Keywords

  • approximate caching
  • large language models
  • latency reduction
  • machine learning systems
  • neural information retrieval
  • query optimization
  • retrieval-augmented generation
  • vector databases

