Caching in LLMs - Quality Score Eviction Policy

Extension Design

Quality Score Eviction Policy design and algorithm implementation

While traditional eviction strategies such as LRU and LFU are effective in general-purpose systems, they overlook the semantic relevance of cached content. This can reduce caching effectiveness in LLM applications, where users often submit similar or paraphrased prompts that could be served more efficiently by reusing existing responses.

To address this issue, we introduce the Quality Score Eviction Policy, an extension to the GPTCache framework that incorporates a learned notion of quality into eviction decisions. Our policy aims to retain entries that are not only frequently used or recently accessed but also consistently show high semantic similarity to user queries over time. This change seeks to improve hit rates and lower latency for repeated or paraphrased queries while adapting dynamically to usage patterns.

Algorithm Implementation

The Quality Score Eviction Policy uses a composite score to decide which cache items to evict. This score is a weighted combination of:

  • Quality Score: An exponentially-smoothed average of similarity scores from past cache hits. This tracks how well an item matches new queries over time.
  • Recency: Based on the timestamp of last access.
  • Frequency: Count of how often the entry was accessed.

Eviction proceeds by removing the entry with the lowest composite score, prioritizing retention of items that are useful in practice, not just frequently or recently accessed.
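As a concrete illustration, the sketch below shows one way the composite score and victim selection could be implemented in Python. The CacheEntry structure, the weights W_QUALITY, W_RECENCY, and W_FREQUENCY, and the normalization of recency and frequency are assumptions made for this example, not part of the GPTCache API.

    from dataclasses import dataclass
    import time

    @dataclass
    class CacheEntry:
        key: str
        quality: float      # exponentially smoothed similarity, in [0, 1]
        last_access: float  # UNIX timestamp of the most recent access
        frequency: int      # total number of accesses

    # Illustrative weights; a real deployment would tune these as hyperparameters.
    W_QUALITY, W_RECENCY, W_FREQUENCY = 0.5, 0.3, 0.2

    def composite_score(entry: CacheEntry, now: float, max_frequency: int) -> float:
        """Weighted combination of quality, recency, and frequency.

        Recency and frequency are normalized to [0, 1] so the three terms
        remain comparable under a single set of weights.
        """
        recency = 1.0 / (1.0 + (now - entry.last_access))    # newer -> closer to 1
        frequency = entry.frequency / max(1, max_frequency)  # relative to the hottest entry
        return (W_QUALITY * entry.quality
                + W_RECENCY * recency
                + W_FREQUENCY * frequency)

    def select_victim(entries: list[CacheEntry]) -> CacheEntry:
        """Return the entry with the lowest composite score for eviction."""
        now = time.time()
        max_freq = max(e.frequency for e in entries)
        return min(entries, key=lambda e: composite_score(e, now, max_freq))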

The core update rule for quality is:

quality_t = (1 - alpha) * quality_{t-1} + alpha * similarity

where quality_t is the quality score at time t, similarity is the similarity of the incoming query to the matched entry (i.e., the highest similarity between the query and any item in the cache), and alpha is a learning-rate hyperparameter that controls how quickly the quality score adapts to new evidence.
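
Continuing the sketch above, a cache hit would apply this update to the matched entry; the default alpha = 0.2 here is an arbitrary example value, not a recommendation.

    def update_quality(entry: CacheEntry, similarity: float, alpha: float = 0.2) -> None:
        """Exponential-smoothing update applied on each cache hit.

        `similarity` is the similarity of the matched item, i.e. the highest
        similarity between the incoming query and any entry in the cache.
        Larger alpha values adapt faster but are noisier.
        """
        entry.quality = (1 - alpha) * entry.quality + alpha * similarity
        entry.frequency += 1
        entry.last_access = time.time()

For example, with alpha = 0.2, a hit with similarity 0.9 moves a stored quality of 0.5 to (0.8)(0.5) + (0.2)(0.9) = 0.58, so the score drifts gradually toward recent evidence rather than jumping on a single hit.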