VectorCache: Streamlining LLM Query Performance
Optimising AI efficiency: a new way to enhance LLM performance and reduce inference costs with smart semantic caching
In the burgeoning field of artificial intelligence, the cost and speed of processing large language model (LLM) queries are critical considerations for developers and businesses alike.
This is where vector-cache, a Python library designed for semantic caching, comes into play. It enhances LLM query performance, making responses faster and more cost-effective: a boon for AI-driven applications, especially retrieval-augmented generation (RAG).
What is VectorCache?
VectorCache is a solution aimed at optimising the use of large language models by caching responses based on semantic similarity. This approach helps in significantly reducing the latency and operational costs associated with LLMs.
Key Features and Benefits
Faster Responses
VectorCache decreases response latency by storing previously computed answers from LLMs. When a new request comes in, the system checks whether a semantically similar request has already been answered and, if so, serves the cached response, allowing for much quicker feedback.
Reduced Costs
By minimising the number of direct calls to LLMs, VectorCache helps save on usage costs. This is particularly advantageous for applications that make frequent, often similar, LLM queries.
Enhanced Efficiency
Think of VectorCache as a more nuanced version of Redis; it recognises not just exact matches but also queries that are semantically similar. This feature is especially beneficial in domains where queries tend to revolve around specific topics or fields.
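To make the Redis analogy concrete, here is a minimal, purely illustrative sketch (not vector-cache's actual API) contrasting an exact-key lookup with a semantic lookup based on embeddings and cosine similarity; embed stands in for whatever embedding model you use.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors (lists or numpy arrays).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Exact-key cache (Redis-style): only a character-for-character match is a hit.
exact_cache = {"what is the capital of france?": "Paris"}
print(exact_cache.get("capital of France?"))  # prints None: no exact match, so a cache miss

# Semantic cache: a paraphrased query can still be a hit.
def semantic_lookup(query, cache, embed, threshold=0.9):
    # cache maps a stored query string to (its embedding, its cached response).
    q_vec = embed(query)
    best_key, best_score = None, 0.0
    for key, (key_vec, response) in cache.items():
        score = cosine_similarity(q_vec, key_vec)
        if score > best_score:
            best_key, best_score = key, score
    if best_score >= threshold:
        return cache[best_key][1]  # cache hit on a semantically similar query
    return None                    # cache miss: fall back to the LLM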
Application Layer Integration
This is a very intentional design decision I took: to make vector-cache easy to integrate at the application layer, it does not extend or wrap existing LLM library functions and can instead simply work off the 'query' string only.
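Because the cache keys off the query string alone, integration can live entirely in your own application code, without touching the LLM client. Below is a hedged sketch of what that might look like; SemanticCache-style names such as get_similar and store are hypothetical and used only for illustration, not vector-cache's documented API, and call_llm stands in for whatever client you already use.
def answer(query: str, cache, call_llm) -> str:
    # 1. Check the semantic cache using only the query string.
    cached = cache.get_similar(query)        # hypothetical lookup method
    if cached is not None:
        return cached                        # cache hit: skip the LLM call

    # 2. Cache miss: call the LLM exactly as you normally would.
    response = call_llm(query)               # your existing OpenAI/Anthropic/... call

    # 3. Store the new pair so future similar queries hit the cache.
    cache.store(query, response)             # hypothetical store method
    return response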
Getting Started with VectorCache
Prerequisites
Before diving into VectorCache, ensure that your system has Python version 3.9.1 or higher. You can check your Python version by running python --version in your command line.
Installation
The easiest way to get started is by using the official pip package. To avoid any hiccups during installation, it's a good idea to upgrade pip first with the following command:
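python -m pip install --upgrade pip
Then install the package: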
pip install vector-cache
Or you can clone the git repository onto your machine and build around it:
git clone git@github.com:shivendrasoni/vector-cache.git
cd vector-cache
python -m venv venv #(optional)
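source venv/bin/activate #(optional; on Windows: venv\Scripts\activate)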
python -m pip install -r requirements.txt
For a hands-on introduction to VectorCache, explore the examples provided in the 'examples' folder of the repository.
Components of VectorCache
Embedding Models: Convert queries into vector embeddings, via various embedding APIs, to enable similarity search.
Cache Storage: Stores LLM responses for future retrieval based on semantic matches.
Vector Store: Identifies similar requests using the input request's embeddings.
Cache Manager: Manages cache storage and vector store operations, including eviction policies.
Similarity Evaluator: Determines the similarity between requests to ensure accurate cache matches.
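The sketch below shows how components like these could be expressed as minimal Python interfaces. The class and method names here are assumptions made for illustration; the actual interfaces in vector-cache live in the individual component directories and may differ.
from __future__ import annotations
from abc import ABC, abstractmethod

class EmbeddingModel(ABC):
    @abstractmethod
    def embed(self, text: str) -> list[float]:
        """Turn a piece of text (the query) into a vector."""

class VectorStore(ABC):
    @abstractmethod
    def add(self, doc_id: str, vector: list[float]) -> None:
        """Store a query's embedding under a document ID."""

    @abstractmethod
    def search(self, vector: list[float], top_k: int = 1) -> list[tuple[str, float]]:
        """Return (doc_id, similarity) pairs for the closest stored vectors."""

class CacheStorage(ABC):
    @abstractmethod
    def get(self, doc_id: str) -> str | None:
        """Fetch a cached LLM response by document ID, if present."""

    @abstractmethod
    def set(self, doc_id: str, response: str) -> None:
        """Store an LLM response under a document ID."""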
How It Works
The idea is very simple: when a request for the LLM comes in, we check whether a similar query was made previously. If yes, we respond with the previous response; if not, we send the request to the LLM and then capture its response for future use.
To break it down a bit more:
A query needs to be sent to an LLM for inference.
We vectorise the query and try to find the most similar previously seen query in our vector database using cosine similarity (up to a threshold).
To vectorise the query, we use one of the embedding models (e.g. OpenAI embeddings, SentenceBERT, Anthropic, etc.).
A vector store is a special kind of database that can store and search vector representations of text chunks.
The cache is any fast-access key-value store. This can be an in-memory DB like Redis or a persistent store like MongoDB, Couchbase, etc.
If a similar query is found, we fetch the relevant document from our cache (or database); the document's ID is stored as the value of the vector record matched above, so it serves as the lookup key into the cache.
If a response is found in the cache, we return it; otherwise, we send the query to the LLM and then store its response (along with the query's embedding) in the cache for future requests.
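Putting the steps above together, here is a minimal sketch of the whole loop in plain Python. The names (embed, vector_store, kv_cache, call_llm) and the 0.9 threshold are illustrative assumptions that mirror the flow described above, not vector-cache's actual API.
import uuid

SIMILARITY_THRESHOLD = 0.9  # assumed value; tune per use case

def cached_llm_call(query, embed, vector_store, kv_cache, call_llm):
    # 1. Vectorise the incoming query.
    q_vec = embed(query)

    # 2. Look for the most similar previously seen query.
    matches = vector_store.search(q_vec, top_k=1)   # [(doc_id, similarity)]
    if matches:
        doc_id, score = matches[0]
        if score >= SIMILARITY_THRESHOLD:
            cached = kv_cache.get(doc_id)           # 3. Fetch the response by document ID.
            if cached is not None:
                return cached                       # Cache hit: no LLM call needed.

    # 4. Cache miss: call the LLM, then store both the vector and the response.
    response = call_llm(query)
    doc_id = str(uuid.uuid4())
    vector_store.add(doc_id, q_vec)
    kv_cache.set(doc_id, response)
    return response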
How You Can Contribute
Interested in shaping the future of vector-cache? Contributions are welcome! Check the contribution guidelines on the GitHub repository to get started.
All the major components of the package are defined behind an interface, which must be implemented for the library to work; these interfaces can be found in the individual component directories.
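As an example of what a contribution might look like, here is a hedged sketch of implementing one such component: a toy in-memory vector store using cosine similarity. The method names follow the hypothetical interface sketched earlier, not the exact signatures in the repository.
import numpy as np

class InMemoryVectorStore:
    """Toy vector store: fine for experiments, not for production workloads."""

    def __init__(self):
        self._vectors = {}  # maps doc_id -> numpy vector

    def add(self, doc_id, vector):
        self._vectors[doc_id] = np.asarray(vector, dtype=float)

    def search(self, vector, top_k=1):
        # Rank all stored vectors by cosine similarity to the query vector.
        q = np.asarray(vector, dtype=float)
        scored = []
        for doc_id, v in self._vectors.items():
            sim = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append((doc_id, sim))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]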
Future Scope
The work here is far from over and there is a lot of scope for improvement. A few ideas I have for improving it:
Add cache invalidation and TTL support
Add more embedding models (Anthropic, Gemini, Nomic, etc.)
Add more vector store support (Qdrant, MongoDB, etc.)
Account for various LLM params (like temperature) in cache strategy.
Add async caching strategy.
Conclusion
VectorCache represents a significant advancement in the field of AI technology by enhancing the efficiency and effectiveness of LLM applications. It stands out by not only reducing operational costs and latency but also by providing a sophisticated mechanism to handle semantically similar queries. For developers and enterprises leveraging LLMs, VectorCache offers a compelling solution that marries cost-efficiency with high performance.
Embrace the future of AI with VectorCache—where speed meets efficiency!