Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu
We introduce LongMemEval, a comprehensive, challenging, and scalable benchmark for testing the long-term memory of chat assistants.
We release 500 high-quality questions that test five core long-term memory abilities:
- Information Extraction
- Multi-Session Reasoning
- Knowledge Updates
- Temporal Reasoning
- Abstention
Inspired by the "needle-in-a-haystack" test, we design an attribute-controlled pipeline to compile a coherent, extensible, and timestamped chat history for each question. LongMemEval requires chat systems to parse and memorize the dynamic interactions online, and to answer the question after all the interaction sessions.
The LongMemEval dataset is officially released here. Please download and uncompress the data into the data/ folder. You may also download it from Hugging Face.
mkdir -p data/
mv longmemeval_data.tar.gz data/
cd data ; tar -xzvf longmemeval_data.tar.gz ; cd ..
We recommend using a conda environment for the project. You may follow the steps below to set up.
If you only need to calculate the metrics on the outputs produced by your own system, you can install this minimal requirement set, which allows you to run src/evaluation/evaluate_qa.py.
conda create -n longmemeval-lite python=3.9
conda activate longmemeval-lite
pip install -r requirements-lite.txt
If you would also like to run the memory systems introduced in the paper, please set up this environment instead.
conda create -n longmemeval python=3.9
conda activate longmemeval
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements-full.txt
We have tested this environment on a Linux machine with CUDA 12.1. If you use a different platform, you may need to modify the requirements.
Three files are included in the data package:
- longmemeval_s.json: the LongMemEval_S setting introduced in the paper. Concatenating all the chat history roughly consumes 115k tokens (~40 history sessions) for Llama 3.
- longmemeval_m.json: the LongMemEval_M setting introduced in the paper. Each chat history contains roughly 500 sessions.
- longmemeval_oracle.json: LongMemEval with oracle retrieval. Only the evidence sessions are included in the history.
Within each file, there are 500 evaluation instances, each of which contains the following fields:
- question_id: the unique id of each question.
- question_type: one of single-session-user, single-session-assistant, single-session-preference, temporal-reasoning, knowledge-update, and multi-session. If question_id ends with _abs, the question is an abstention question.
- question: the question content.
- answer: the expected answer from the model.
- question_date: the date of the question.
- haystack_session_ids: a list of the ids of the history sessions (sorted by timestamp for longmemeval_s.json and longmemeval_m.json; not sorted for longmemeval_oracle.json).
- haystack_dates: a list of the timestamps of the history sessions.
- haystack_sessions: a list of the actual contents of the user-assistant chat history sessions. Each session is a list of turns, and each turn is a dict with the format {"role": user/assistant, "content": message content}. The turns that contain the required evidence carry an additional field has_answer: true, which is used for turn-level memory recall accuracy evaluation.
- answer_session_ids: a list of session ids that identify the evidence sessions. This is used for session-level memory recall accuracy evaluation.
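To make the schema concrete, here is a minimal Python sketch (assuming the data has been uncompressed into data/ as above) that loads longmemeval_oracle.json and prints the evidence turns of the first instance:

import json

# Load one of the released files; all three share the same schema.
with open("data/longmemeval_oracle.json") as f:
    instances = json.load(f)

example = instances[0]
print(example["question_id"], example["question_type"])
print(example["question_date"], "-", example["question"])
print("Expected answer:", example["answer"])

# Walk the timestamped history and print the turns labeled as evidence.
for session_id, date, session in zip(example["haystack_session_ids"],
                                     example["haystack_dates"],
                                     example["haystack_sessions"]):
    for turn in session:
        if turn.get("has_answer"):
            print(f"[{date}] ({session_id}) {turn['role']}: {turn['content']}")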
To test on LongMemEval, you may directly feed the timestamped history to your own chat system, collect the output, and evaluate it with the evaluation script we provide. To do so, save the outputs in jsonl format with each line containing two fields: question_id and hypothesis (a minimal sketch of producing this file is given below, after the evaluation commands). Then, you may run the evaluation script through the following command:
export OPENAI_API_KEY=YOUR_API_KEY
export OPENAI_ORGANIZATION=YOUR_ORGANIZATION # may be omitted if your key belongs to only one organization
cd src/evaluation
python3 evaluate_qa.py gpt-4o your_hypothesis_file ../../data/longmemeval_oracle.json
Running this script will save the evaluation logs into a file called [your_hypothesis_file].log. In this file, each line will contain a new field called autoeval_label. While evaluate_qa.py already reports the averaged scores, you can also aggregate the scores from the log using the following command (assuming you are in the src/evaluation folder):
python3 print_qa_metrics.py gpt-4o your_hypothesis_file.log ../../data/longmemeval_oracle.json
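For reference, here is a minimal sketch of producing the hypothesis file in the required jsonl format; answer_question is a hypothetical placeholder for your own chat system, which should consume the timestamped history and the question:

import json

def answer_question(dates, sessions, question_date, question):
    # Placeholder: replace with a call into your own chat system.
    return "TODO: model answer"

with open("data/longmemeval_s.json") as f:
    instances = json.load(f)

with open("my_hypotheses.jsonl", "w") as out:
    for inst in instances:
        hypothesis = answer_question(inst["haystack_dates"],
                                     inst["haystack_sessions"],
                                     inst["question_date"],
                                     inst["question"])
        out.write(json.dumps({"question_id": inst["question_id"],
                              "hypothesis": hypothesis}) + "\n")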
LongMemEval supports compiling a chat history of arbitrary length for a question instance, so you can easily scale the difficulty beyond LongMemEval_M. We will release the code and data for this feature soon.
Please download the compressed data from this link and uncompress it under data/custom_history. The released data contains the following parts:
- 1_attr_bg/data_1_attr_bg.json: user attributes and backgrounds.
- 2_questions: questions, answers, evidence statements, as well as the evidence sessions.
- 5_filler_sess/data_5_filler_sess.json: filler sessions sourced from ShareGPT and UltraChat.
- 6_session_cache/data_6_session_cache.json: simulated user sessions based on facts extracted from the backgrounds.
You may run the following command to reproduce the history compilation in LongMemEval:
python sample_haystack_and_timestamp.py task n_questions min_n_haystack_filler max_n_haystack_filler enforce_json_length
- task is the name of the task. In the released data, we used a different naming compared to the paper. Here is the mapping:

  Name in Data               Official Name
  single_hop                 single-session-user
  implicit_preference_v2     single-session-preference
  assistant_previnfo         single-session-assistant
  two_hop                    multi-session
  multi_session_synthesis    multi-session
  temp_reasoning_implicit    temporal-reasoning
  temp_reasoning_explicit    temporal-reasoning
  knowledge_update           knowledge-update

- n_questions is the maximum number of questions used.
- min_n_haystack_filler and max_n_haystack_filler set limits on the number of sessions included in the history. 80 is used for longmemeval_s and 500 is used for longmemeval_m.
- enforce_json_length is used to limit the length of the chat history. 115000 is used for longmemeval_s. For longmemeval_m, this criterion is not used, and you can set it to a large number.
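For example, a hypothetical invocation that compiles LongMemEval_S-style histories for the single-session-user task could look like the following; the argument values are only illustrative and should be adjusted per the descriptions above:
python sample_haystack_and_timestamp.py single_hop 500 80 80 115000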
To construct your own chat history, you may follow the format in 2_questions and 6_session_cache to create the questions and evidence sessions. Then, you may run sample_haystack_and_timestamp.py with a similar command.
We provide the experiment code for memory retrieval and retrieval-augmented question answering under the folders src/retrieval and src/generation.
If you would like to test OpenAI models as the reader, please provide your OpenAI organization ID and key on lines 33 and 34 of src/generation/run_generation.sh.
If you want to test an open-weight reader LLM, we support it through a locally served OpenAI-compatible API via vllm. To start the server, use the following command:
cd src/utils
bash serve_vllm.sh GPU MODEL PORT TP_SIZE
- GPU is a comma-separated list of the GPUs you want to use.
- MODEL is the alias of the model you want to use. You can view or configure it in serve_vllm.sh.
- PORT is the port the server will listen on. It defaults to 8001. If you change the port, make sure the change is reflected in src/generation/run_generation.sh line 38.
- TP_SIZE is the tensor parallel size. It must be smaller than or equal to the number of GPUs specified in GPU.
- If you need to limit the maximum number of tokens due to memory constraints, you can use the following command:
bash serve_vllm_with_maxlen.sh GPU MODEL MAXLEN PORT TP_SIZE
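For example, a hypothetical launch of the server on two GPUs with tensor parallel size 2 and the default port (the model alias llama-3.1-8b-instruct is only illustrative; use an alias defined in serve_vllm.sh):
cd src/utils
bash serve_vllm.sh 0,1 llama-3.1-8b-instruct 8001 2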
To run the long-context generation baseline where the model is provided with the full history, you can use the following commands:
cd src/generation
bash run_generation.sh DATA_FILE MODEL full-history-session TOPK [HISTORY_FORMAT] [USERONLY] [READING_METHOD]
- DATA_FILE is the path to one of the released json files. Note that longmemeval_s.json and longmemeval_oracle.json are designed to fit into a model with a 128k context, but longmemeval_m.json is too long for long-context testing.
- MODEL is the alias of the model you want to use. You can view or configure it in run_generation.sh.
- TOPK is the maximum number of history sessions provided to the reader. We recommend setting it to a large number (e.g., 1000) to ensure all the history sessions are included.
- HISTORY_FORMAT is the format used to present the history. It can take either json or nl. We recommend using json.
- USERONLY removes the assistant-side messages from the prompt. We recommend setting it to false.
- READING_METHOD is the reading style. It can take the value direct, con, or con-separate. We recommend con, which instructs the model to first extract useful information and then reason over it.
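For example, a hypothetical full-history run on LongMemEval_S with an OpenAI reader (gpt-4o is assumed to be an alias configured in run_generation.sh; the data path assumes you run from src/generation):
cd src/generation
bash run_generation.sh ../../data/longmemeval_s.json gpt-4o full-history-session 1000 json false con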
A log file will be generated under the folder generation_logs/. You can then follow the instructions above in "Testing Your System" to evaluate the QA correctness.
To run memory indexing and retrieval on LongMemEval, follow the instructions below.
You may use the following command to run the baseline retrieval:
cd src/retrieval
bash run_retrieval.sh IN_FILE RETRIEVER GRANULARITY
- IN_FILE: the path to the input data file (one of the released json files).
- RETRIEVER: flat-bm25, flat-contriever, flat-stella (Stella V5 1.5B), or flat-gte (gte-Qwen2-7B-instruct). For flat-stella, our code requires downloading the model manually from the original repository.
- GRANULARITY: the granularity of the memory index; we support turn or session.
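For example, a hypothetical run that builds a session-level index with Contriever over LongMemEval_S (the data path assumes you run from src/retrieval):
cd src/retrieval
bash run_retrieval.sh ../../data/longmemeval_s.json flat-contriever session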
Note that for the dense embedding models, we support multi-GPU retrieval. By default, the code will utilize all the available GPUs.
The script will output the retrieval results under retrieval_logs/ and print out the evaluation metrics. You may also print the metrics from the log by running:
python3 src/evaluation/print_retrieval_metrics.py log_file
Also note that when evaluating retrieval, we always skip the 30 abstention instances, as these instances generally refer to non-existent events and thus have no ground-truth answer location.
To run the experiments with key expansion, download the released key expansion outputs from this link and place the data under the directory LongMemEval/index_expansion_logs/. Then, you may run the following command:
cd src/retrieval
bash run_retrieval.sh IN_FILE RETRIEVER GRANULARITY EXPANSION_TYPE JOIN_MODE CACHE
- EXPANSION_TYPE: we support session-summ, session-keyphrase, session-userfact, turn-keyphrase, and turn-userfact.
- JOIN_MODE: we support three modes:
  - separate: add a new (key, value) pair with the expansion as the key.
  - merge: merge the expansion and the original key to get the new key.
  - replace: replace the original key with the expansion content.
- CACHE: the path to the cache file corresponding to EXPANSION_TYPE.
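For example, a hypothetical run that adds session summaries as separate keys (the cache path below is only illustrative; point it to the corresponding file from the downloaded expansion outputs):
cd src/retrieval
bash run_retrieval.sh ../../data/longmemeval_s.json flat-contriever session session-summ separate ../../index_expansion_logs/session-summ_cache.json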
We release the code for generating the key expansion outputs offline under src/index_expansion for your reference.
We provide the implementation for pruning the search space by extracting timestamped events from the sessions, inferring a time range from the query, and using the range to narrow down the search space. First, download the extracted timestamped events here and unzip the data under LongMemEval/index_expansion_logs/. Next, run the following command:
cd src/index_expansion
python3 temp_query_search_pruning.py TIMESTAMP_EVENT_FILE RETRIEVAL_LOG GRANULARITY
- TIMESTAMP_EVENT_FILE is the extracted timestamped events file you downloaded.
- RETRIEVAL_LOG is the output from any retrieval experiment.
- GRANULARITY is session or turn and should be consistent with the other two arguments. For the paper, we used session as the granularity.
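For example (both paths are only illustrative; substitute the event file you downloaded and your own retrieval log):
cd src/index_expansion
python3 temp_query_search_pruning.py ../../index_expansion_logs/timestamped_events.json ../retrieval/retrieval_logs/your_retrieval_log.json session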
You may use the following command for question answering with the retrieved memory.
cd src/generation
bash run_generation.sh RETRIEVAL_LOG_FILE MODEL EXP TOPK [HISTORY_FORMAT] [USERONLY] [READING_METHOD]
- RETRIEVAL_LOG_FILE should be the output from the retrieval step. Specifically, this step relies on the retrieval_results field added in the retrieval step.
- EXP should be in the form [RETRIEVER]-[GRANULARITY], such as flat-stella-session.
- The other parameters are the same as introduced above.
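For example, a hypothetical run that reads the top 10 retrieved sessions with an OpenAI reader (the retrieval log path is illustrative; gpt-4o is assumed to be a configured alias in run_generation.sh):
cd src/generation
bash run_generation.sh ../retrieval/retrieval_logs/your_retrieval_log.json gpt-4o flat-contriever-session 10 json false con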
If you find the work useful, please cite:
@article{wu2024longmemeval,
title={LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
author={Di Wu and Hongwei Wang and Wenhao Yu and Yuwei Zhang and Kai-Wei Chang and Dong Yu},
year={2024},
eprint={2410.10813},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.10813},
}