Caching in LLMs - Quality Score Eviction Policy

Experimental Setup

Detailed experimental methodology for evaluating the Quality Score Eviction Policy against baseline memory policies in GPTCache.

This section details the experimental setup used to evaluate the proposed Quality Score Eviction Policy against the baseline 'memory' eviction policies in GPTCache: LRU, LFU, FIFO, and RR.

Workload Generation

  • Dataset: We used the "paraphrased_questions.db" dataset, which contains 3200 questions. This dataset was created by taking 100 similar question pairs from the Quora Question Pairs dataset and then running a script to generate 30 additional similar questions for each pair, resulting in 100 clusters of 32 semantically similar questions. This simulates a realistic workload of semantically similar queries.

  • Workload Scenarios:

    • High Repetition: A sequence of questions with a high degree of repetition, designed to stress the cache's hit rate.
    • Low Repetition: A sequence of questions with a low degree of repetition, designed to measure the cache's overhead.
    • Mixed Repetition: A sequence mixing highly repeated and rarely repeated questions, simulating more realistic user behavior. (A sketch of one way such sequences could be sampled from the question clusters follows this list.)
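
The exact generation script is part of our automation code; the sketch below illustrates one way such sequences could be sampled from the paraphrase clusters. The cluster shape (100 clusters of 32 paraphrases) matches the dataset described above, but the function and parameter names are illustrative rather than the actual script.

```python
import random

def make_workload(clusters, n_queries, hot_fraction):
    """Sample a query sequence from paraphrase clusters.

    clusters     : list of lists, each holding semantically similar questions
    n_queries    : length of the generated workload
    hot_fraction : probability of drawing from a small, frequently revisited
                   subset of clusters (high repetition ~0.8, low ~0.2; the
                   mixed scenario alternates between the two)
    """
    hot = clusters[: max(1, len(clusters) // 10)]  # small "hot" subset of clusters
    workload = []
    for _ in range(n_queries):
        pool = hot if random.random() < hot_fraction else clusters
        cluster = random.choice(pool)
        workload.append(random.choice(cluster))    # any paraphrase within the cluster
    return workload

# Illustrative use with the 100 x 32 cluster structure of paraphrased_questions.db
clusters = [[f"cluster {c} question {i}" for i in range(32)] for c in range(100)]
high_repetition = make_workload(clusters, n_queries=1000, hot_fraction=0.8)
low_repetition = make_workload(clusters, n_queries=1000, hot_fraction=0.2)
```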

Metrics

The following metrics were collected to evaluate the performance of the caching policies (a sketch of how they can be measured follows the list):

  • Cache hit rate: The percentage of requests served from the cache.
  • CPU utilization: The percentage of CPU resources used during the experiment.
  • Throughput: The number of queries processed per second.
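
How these metrics are gathered depends on the benchmarking harness; the sketch below shows one plausible way to collect them around a run, assuming the harness can report whether each request was a cache hit. The class and field names are illustrative; psutil is used for CPU sampling.

```python
import time
import psutil

class MetricsCollector:
    """Minimal metrics wrapper: cache hit rate, CPU utilization, throughput."""

    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.cpu_samples = []
        self.start = time.perf_counter()
        psutil.cpu_percent(interval=None)  # prime the CPU counter

    def record(self, was_hit: bool):
        if was_hit:
            self.hits += 1
        else:
            self.misses += 1
        self.cpu_samples.append(psutil.cpu_percent(interval=None))

    def summary(self):
        total = self.hits + self.misses
        elapsed = time.perf_counter() - self.start
        return {
            "hit_rate": self.hits / total if total else 0.0,       # fraction served from cache
            "cpu_percent": (sum(self.cpu_samples) / len(self.cpu_samples)
                            if self.cpu_samples else 0.0),
            "throughput_qps": total / elapsed if elapsed else 0.0,  # queries per second
        }
```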

Hyperparameter Tuning

To find the optimal configuration for our proposed eviction policy and compare it against baselines, we performed a grid search over the following parameters. In total, this grid search resulted in 270 unique configurations: the four baseline policies were crossed with 3 repetition scenarios × 3 dataset sizes × 3 cache sizes (108 configurations), while the Quality Score policy additionally swept 3 alpha values × 2 weight combinations (162 configurations).

  • Eviction Policies: We evaluated our Quality Score Eviction Policy against standard policies: LRU, LFU, FIFO, and RR.

  • Workload Configurations:

    • Repetition Scenarios: High, Low, and Mixed.
    • Dataset Sizes: 500, 1000, and 3000 questions.
  • Cache Size: The cache sizes were chosen relative to the number of questions in the workload:

    • For 500 questions: [5, 10, 20]
    • For 1000 questions: [10, 50, 100]
    • For 3000 questions: [30, 150, 300]
  • Alpha (Learning Rate for Quality Score): [0.3, 0.5, 0.7]

  • Composite Score Weights (Quality, Recency, Frequency): We tested two combinations for our policy: (0.6, 0.3, 0.1) and (0.8, 0.1, 0.1). A sketch of how alpha and these weights enter the eviction score follows this list.
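
To make the role of these hyperparameters concrete, the sketch below shows one way they can be interpreted: alpha as an exponential-moving-average update rate for an entry's quality score, and the three weights combining quality, recency, and frequency into a single eviction score (lowest score evicted first). This is a simplified illustration under our own assumptions about normalization, not the policy's actual implementation inside GPTCache.

```python
import time

def update_quality(old_quality: float, feedback: float, alpha: float = 0.5) -> float:
    """EMA update of an entry's quality score; alpha is 0.3, 0.5, or 0.7 in the grid search."""
    return alpha * feedback + (1.0 - alpha) * old_quality

def composite_score(entry, now: float, max_frequency: int,
                    weights=(0.6, 0.3, 0.1)) -> float:
    """Combine quality, recency, and frequency into one score.

    weights = (w_quality, w_recency, w_frequency), e.g. (0.6, 0.3, 0.1) or (0.8, 0.1, 0.1).
    Recency and frequency are normalized to [0, 1] here purely for illustration.
    """
    w_q, w_r, w_f = weights
    recency = 1.0 / (1.0 + (now - entry["last_access"]))       # newer accesses score higher
    frequency = entry["access_count"] / max(1, max_frequency)  # relative access frequency
    return w_q * entry["quality"] + w_r * recency + w_f * frequency

# Illustrative eviction decision over a toy cache: the lowest-scoring entry is the victim
now = time.time()
cache = {
    "q1": {"quality": 0.9, "last_access": now - 5, "access_count": 3},
    "q2": {"quality": 0.2, "last_access": now - 1, "access_count": 1},
}
max_freq = max(e["access_count"] for e in cache.values())
victim = min(cache, key=lambda k: composite_score(cache[k], now, max_freq))
```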

Benchmarking Environment

All experiments were conducted on a DigitalOcean droplet (s-2vcpu-4gb-intel) running Ubuntu. The droplet was equipped with 2 Intel vCPUs, 4 GB of RAM, and a 120 GB SSD, and was located in the Frankfurt (fra1) datacenter.
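
Droplet images can drift between runs, so it can be useful to record the actual host specification next to each results file. The snippet below is one illustrative way to do this; it is not part of the published scripts.

```python
import json
import platform
import psutil

def environment_snapshot(path="environment.json"):
    """Record basic host facts (should match the s-2vcpu-4gb-intel droplet)."""
    info = {
        "platform": platform.platform(),
        "cpu_count": psutil.cpu_count(logical=True),                   # expected: 2 vCPUs
        "memory_gb": round(psutil.virtual_memory().total / 1e9, 1),    # expected: ~4 GB
    }
    with open(path, "w") as fh:
        json.dump(info, fh, indent=2)
    return info
```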

Automation

  • The experiments were automated using custom Python scripts.
  • The results were collected and stored in CSV files for further analysis (a sketch of this logging step follows).
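
The collection scripts themselves are not reproduced here; the following is a minimal sketch of the CSV logging they perform. The column names are illustrative and may differ from the actual field set.

```python
import csv
import os

FIELDS = ["policy", "scenario", "dataset_size", "cache_size",
          "alpha", "weights", "hit_rate", "cpu_percent", "throughput_qps"]

def append_result(row: dict, path: str = "results.csv"):
    """Append one experiment's summary row, writing the header on first use."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```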