1. Automatic Prefix Caching

1.1. Introduction

Automatic Prefix Caching (APC) caches the KV cache of existing queries, so that a new query that shares a prefix with an existing query can directly reuse that KV cache and skip the computation of the shared part.

from vllm import LLM

# set enable_prefix_caching=True to enable APC
llm = LLM(
    model='lmsys/longchat-13b-16k',
    enable_prefix_caching=True,
)

1.2. Example Workloads and Limitations

We describe two example workloads where APC can provide a substantial performance benefit:

  • Long document query, where the user repeatedly queries the same long document (e.g. software manual or annual report) with different queries.

  • Multi-round conversation, where the user may chat with the application multiple times in the same chatting session.

APC does not bring a performance gain when vLLM spends most of its time generating answers to the queries (e.g. when the answers are long), or when new queries do not share a prefix with any of the existing queries (so that no computation can be reused).
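As an illustration of the long-document workload, the sketch below (with a placeholder document and hypothetical questions) sends several queries that share the same long prefix; only the first call pays the full cost of computing the document's KV cache.

from vllm import LLM, SamplingParams

# Placeholder for a long document (e.g. a software manual); in practice this
# would be thousands of tokens long.
long_document = "..."

llm = LLM(model='lmsys/longchat-13b-16k', enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

questions = [
    "Summarize the document above in three sentences.",
    "What configuration options does the document mention?",
]

# All prompts share the long document as a prefix, so after the first call the
# cached KV blocks of the document are reused and only the question tokens
# (plus the generated answer) require new computation.
for question in questions:
    prompt = long_document + "\n\n" + question
    output = llm.generate([prompt], sampling_params)[0]
    print(output.outputs[0].text)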

1.3. Implementation

                    Block 1                  Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|

In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:

hash(prefix tokens + block tokens) <--> KV Block
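A minimal Python sketch of this mapping is shown below. The names (KVBlock, get_or_allocate, cached_blocks) are illustrative and are not vLLM's actual classes; the point is only that the hash of the prefix tokens plus the block tokens serves as a content-based key for the KV block.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class KVBlock:
    token_ids: Tuple[int, ...]   # tokens whose KV entries live in this block
    # ... the actual KV tensors would live here

# hash(prefix tokens + block tokens) <--> KV Block
cached_blocks: Dict[int, KVBlock] = {}

def block_hash(prefix_tokens: List[int], block_tokens: List[int]) -> int:
    # Content-based key: identical (prefix, block) pairs map to the same hash.
    return hash((tuple(prefix_tokens), tuple(block_tokens)))

def get_or_allocate(prefix_tokens: List[int], block_tokens: List[int]) -> KVBlock:
    key = block_hash(prefix_tokens, block_tokens)
    if key in cached_blocks:                  # cache hit: reuse the KV block
        return cached_blocks[key]
    block = KVBlock(tuple(block_tokens))      # cache miss: allocate a new block
    cached_blocks[key] = block
    return block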

In “vllm/core/block_manager.py”, the hash is computed right before the sequence is allocated to the block table.

This design achieves automatic prefix caching without needing to maintain a tree structure among the KV blocks. More specifically, each block is independent of the others and can be allocated and freed on its own, which lets us manage the KV cache like an ordinary cache in an operating system.
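Because each block stands alone, the cache can be managed with ordinary eviction policies. The sketch below uses a simple LRU policy over hashed blocks; vLLM's real eviction logic is more involved (it also tracks block reference counts, for example), so this is only meant to illustrate the "ordinary cache" idea.

from collections import OrderedDict
from typing import Optional

class BlockCache:
    """A toy LRU cache over KV blocks keyed by their content hash."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks: "OrderedDict[int, object]" = OrderedDict()

    def get(self, key: int) -> Optional[object]:
        if key not in self.blocks:
            return None                       # cache miss
        self.blocks.move_to_end(key)          # mark as most recently used
        return self.blocks[key]

    def put(self, key: int, block: object) -> None:
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block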

