How to cache chat model responses
LangChain provides an optional caching layer for chat models. This is useful for two main reasons:
- It can save you money by reducing the number of API calls you make to the LLM provider if you often request the same completion. This is especially useful during app development.
- It can speed up your application by reducing the number of API calls you make to the LLM provider.
This guide will walk you through how to enable this in your apps.
Prerequisites
This guide assumes familiarity with chat models.
The examples below work with any chat model integration. Pick a provider and run its setup snippet:
- OpenAI
- Anthropic
- Google VertexAI
- Cohere
- FireworksAI
- MistralAI
- TogetherAI
pip install -qU langchain-openai
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
pip install -qU langchain-anthropic
import getpass
import os
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass()
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-3-sonnet-20240229")
pip install -qU langchain-google-vertexai
import getpass
import os
os.environ["GOOGLE_API_KEY"] = getpass.getpass()
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model="gemini-pro")
pip install -qU langchain-cohere
import getpass
import os
os.environ["COHERE_API_KEY"] = getpass.getpass()
from langchain_cohere import ChatCohere
llm = ChatCohere(model="command-r")
pip install -qU langchain-fireworks
import getpass
import os
os.environ["FIREWORKS_API_KEY"] = getpass.getpass()
from langchain_fireworks import ChatFireworks
llm = ChatFireworks(model="accounts/fireworks/models/mixtral-8x7b-instruct")
pip install -qU langchain-mistralai
import getpass
import os
os.environ["MISTRAL_API_KEY"] = getpass.getpass()
from langchain_mistralai import ChatMistralAI
llm = ChatMistralAI(model="mistral-large-latest")
pip install -qU langchain-openai
import getpass
import os
os.environ["TOGETHER_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
Caching is enabled globally through the set_llm_cache function:
from langchain.globals import set_llm_cache
In Memory Cache
This is an ephemeral cache that stores model calls in memory. It will be wiped when your environment restarts, and is not shared across processes.
%%time
from langchain.cache import InMemoryCache
set_llm_cache(InMemoryCache())
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 645 ms, sys: 214 ms, total: 859 ms
Wall time: 829 ms
AIMessage(content="Why don't scientists trust atoms?\n\nBecause they make up everything!", response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 11, 'total_tokens': 24}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-b6836bdd-8c30-436b-828f-0ac5fc9ab50e-0')
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 822 µs, sys: 288 µs, total: 1.11 ms
Wall time: 1.06 ms
AIMessage(content="Why don't scientists trust atoms?\n\nBecause they make up everything!", response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 11, 'total_tokens': 24}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-b6836bdd-8c30-436b-828f-0ac5fc9ab50e-0')
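A cached response is only reused when a new request matches the cached one exactly: the lookup key includes the prompt and the model's configuration. The sketch below (assuming the InMemoryCache configured above is still active) shows that a different prompt results in a cache miss and a fresh API call.
# Assumes the InMemoryCache set above is still the global cache.
# Lookups are keyed on the exact prompt plus the model configuration,
# so only an identical request is served from the cache.
llm.invoke("Tell me a joke")  # same prompt and model as before: served from the cache
llm.invoke("Tell me a different joke")  # different prompt: cache miss, calls the API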
SQLite Cache
This cache implementation uses a SQLite database to store responses, so cached results persist across process restarts.
!rm .langchain.db  # remove any existing cache database so the demo starts fresh
# We can do the same thing with a SQLite cache
from langchain.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 9.91 ms, sys: 7.68 ms, total: 17.6 ms
Wall time: 657 ms
AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field!', response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 11, 'total_tokens': 28}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-39d9e1e8-7766-4970-b1d8-f50213fd94c5-0')
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 52.2 ms, sys: 60.5 ms, total: 113 ms
Wall time: 127 ms
AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field!', id='run-39d9e1e8-7766-4970-b1d8-f50213fd94c5-0')
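To turn caching back off, clear the global cache. This is a minimal sketch, assuming that passing None to set_llm_cache removes the global cache and that chat models accept a cache parameter for opting out individually (both hold in recent LangChain releases).
set_llm_cache(None)  # no global cache: every call goes to the provider again

# A model can also opt out of a configured global cache via its `cache` parameter
uncached_llm = ChatOpenAI(model="gpt-3.5-turbo-0125", cache=False)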
Next steps
You've now learned how to cache model responses to save time and money.
Next, check out the other how-to guides on chat models in this section, such as how to get a model to return structured output or how to create your own custom chat model.