Caching in LLMs - Quality Score Eviction Policy

Conclusion & Future Work

Summary of findings and grounded next steps for quality-score eviction

Conclusion

Our experiments show that a Quality Score–based eviction policy can improve cache effectiveness for LLM applications, especially when capacity is constrained. In the Mixed-500 setting with a small cache (max size = 10), the policy with learning rate 0.5 and weights (quality, recency, frequency) = (0.8, 0.1, 0.1) achieved a hit rate of 0.428, outperforming LRU (0.31), LFU (0.37), FIFO (0.298), and RR (0.31). As cache size grows, all policies converge to high hit rates, but the Quality Score policy remains competitive.
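To make the weighted score concrete, here is a minimal sketch of how quality, recency, and frequency could be combined into a single eviction score, with quality updated online via the learning rate. The class and method names (QualityScorePolicy, Entry, record_feedback, evict_candidate) and the normalization choices are illustrative assumptions, not the exact implementation measured above.

```python
# Minimal sketch of a quality-score eviction policy (illustrative, not the measured code).
import time
from dataclasses import dataclass, field


@dataclass
class Entry:
    value: str
    quality: float = 0.5                      # running quality estimate in [0, 1]
    hits: int = 0
    last_access: float = field(default_factory=time.monotonic)


class QualityScorePolicy:
    def __init__(self, w_quality=0.8, w_recency=0.1, w_frequency=0.1, learning_rate=0.5):
        self.w = (w_quality, w_recency, w_frequency)
        self.lr = learning_rate

    def record_feedback(self, entry: Entry, reward: float) -> None:
        # Simple online update: nudge the quality estimate toward the observed reward.
        entry.quality += self.lr * (reward - entry.quality)

    def score(self, entry: Entry, now: float, max_hits: int) -> float:
        # Normalize recency and frequency to [0, 1] so the weights are comparable.
        recency = 1.0 / (1.0 + (now - entry.last_access))
        frequency = entry.hits / max(1, max_hits)
        w_q, w_r, w_f = self.w
        return w_q * entry.quality + w_r * recency + w_f * frequency

    def evict_candidate(self, entries: dict) -> str:
        # Evict the key whose combined score is lowest.
        now = time.monotonic()
        max_hits = max((e.hits for e in entries.values()), default=1)
        return min(entries, key=lambda k: self.score(entries[k], now, max_hits))
```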

Throughput results place Quality Score between simple policies (FIFO/RR) and heavier baselines (LRU/LFU). CPU usage was similar across policies (~55% on average); the dominant compute cost comes from the similarity lookup used for cache key matching, not from the eviction mechanism itself. Practically, this means the choice of eviction policy should be driven by hit-rate gains and serving cost rather than by small differences in CPU overhead.
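For context on why lookup dominates, here is a sketch of what a semantic cache hit test typically looks like; the brute-force cosine-similarity scan and the 0.9 threshold are assumptions for illustration, not the similarity backend used in the experiments.

```python
# Illustrative semantic cache lookup: the scan over all cached key embeddings is the
# kind of work that dominates per-request CPU time, independent of eviction policy.
import numpy as np


def lookup(query_embedding: np.ndarray,
           cached_embeddings: dict[str, np.ndarray],
           threshold: float = 0.9) -> str | None:
    """Return the key of the closest cached entry above the similarity threshold."""
    best_key, best_sim = None, threshold
    q_norm = np.linalg.norm(query_embedding) + 1e-12
    for key, emb in cached_embeddings.items():
        # Cosine similarity; this O(n * d) loop is the dominant per-request cost.
        sim = float(np.dot(query_embedding, emb) / (q_norm * (np.linalg.norm(emb) + 1e-12)))
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return best_key
```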

Unexpected learnings

  • FIFO and RR lead in raw operations/sec, but they can lag Quality Score in hit rate when the cache is small.
  • A high quality weight (0.8) combined with a moderate learning rate (0.5) consistently helped when capacity was tight.
  • Policy overhead is not the primary bottleneck; semantic matching dominates CPU time.

Limitations

  • Throughput was measured with a mock, zero-latency LLM to isolate policy overhead; production inference latency may change the relative rankings.
  • Datasets and workloads were synthetic; real traffic can have different temporal and semantic patterns.
  • We used static weights and a simple online update; there was no automatic tuning in response to workload drift.
  • Similarity backend choices (embedding model, ANN index) were not explored and can materially affect results.

Future Work

  • Adaptive weighting and tuning: learn (quality, recency, frequency) and learning rate online (e.g., bandits or Bayesian optimization) based on observed hit rate and cost.
  • Cost-aware objective: incorporate token and dollar savings, latency targets, and SLA penalties directly into the eviction score.
  • Time decay and freshness: add exponential decay to quality and recency to better handle content drift (see the sketch after this list).
  • Segment-aware caches: maintain per-domain or per-intent segments to prevent popular segments from monopolizing capacity.
  • Warm starts and prefetching: seed cache with high-quality answers for common intents; prefetch likely follow-ups.
  • Faster similarity: explore ANN indexes and embedding cache to reduce the dominant CPU cost of lookups.
  • Distributed coordination: study eviction and synchronization across multi-node caches under skewed traffic.
  • Human and feedback signals: integrate user feedback and downstream task success to refine the quality estimate.
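As one concrete form of the time-decay idea above, the stored quality estimate could decay exponentially with age; the half-life parameter and function shape are assumptions, not part of the evaluated policy.

```python
# Sketch of exponential time decay for the quality term (an assumed form, not the
# evaluated policy): an entry's effective quality halves every `half_life_s` seconds,
# so stale high-quality answers gradually become eviction candidates.
import math
import time


def decayed_quality(quality: float, last_update: float, half_life_s: float = 3600.0) -> float:
    age = max(0.0, time.monotonic() - last_update)
    return quality * math.exp(-math.log(2.0) * age / half_life_s)
```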

Practical guidance

  • When to use: capacity-constrained caches, cost-sensitive workloads, or traffic with semantic repetition where quality varies across entries.
  • Sensible defaults: start with (0.8, 0.1, 0.1) and learning_rate = 0.5, then tune for your workload (see the configuration sketch after this list).
  • Measure what matters: monitor hit rate, effective cost per request, and p95 latency alongside throughput.
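The defaults above can be captured in a small starting configuration; the key names below are placeholders, since the exact schema depends on your cache implementation.

```python
# Illustrative starting configuration mirroring the defaults above; key names are
# placeholders, not a published API. Tune against hit rate, cost per request, and p95 latency.
cache_policy_config = {
    "weights": {"quality": 0.8, "recency": 0.1, "frequency": 0.1},
    "learning_rate": 0.5,
    "max_size": 10,  # the small-cache setting where Quality Score helped most
}
```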

Overall, a quality-aware eviction policy is a simple, low-overhead change that can deliver meaningful savings and stability for LLM systems, particularly before scaling out infrastructure.