We aim to generate synthetic user data that mimics real-world interactions so that we can experiment with recommender systems. Given the last interaction time t_i and the interaction r_i, we generate a prediction for the next time the user visits the site/app: t_{i+1} ~ f_θ(· | t_i, r_i). Our idea is that each user has a hidden state that evolves over time and dictates when the user interacts with the system.
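Concretely, the generative loop we have in mind looks roughly like the following sketch (`sample_wait_time`, `update_state`, and the interaction outcome are hypothetical placeholders, not functions from this repo):

```python
import numpy as np

def generate_visits(state, t0, horizon, sample_wait_time, update_state, rng):
    """Roll out synthetic visit times from a hidden-state model.

    sample_wait_time(state, rng) -> waiting time until the next visit
    update_state(state, t, r)    -> new hidden state after interaction r at time t
    """
    t, events = t0, []
    while t < horizon:
        t = t + sample_wait_time(state, rng)
        if t >= horizon:
            break
        r = rng.integers(0, 2)          # placeholder interaction outcome
        state = update_state(state, t, r)
        events.append((t, r))
    return events
```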
After following the instructions below (from Accordion: A Trainable Simulator for Long-Term Interactive Systems), use the notebook notebooks/preprocess_for_new.ipynb to continue processing the data.
For a more detailed explanation of the models, I refer to the poster.
The PDMP approach models user behavior by combining deterministic and stochastic elements. The user's hidden state evolves continuously over time through a flow function, modeled as an ODE. An intensity function determines when the next interaction occurs by defining the waiting-time distribution. After each interaction, the user's state undergoes a jump, influenced by the interaction's outcome. While this model captures complex dynamics, it suffers from numerical instability, high computational cost, and difficulties with differentiability.
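A minimal simulation sketch of this structure (not the repo's implementation: `flow`, `intensity`, and `jump` are assumed callables, the flow is integrated with forward Euler, and event times are sampled by thinning against an assumed bound `lambda_max`):

```python
import numpy as np

def simulate_pdmp(state, t0, horizon, flow, intensity, jump, rng,
                  dt=0.01, lambda_max=10.0):
    """Sketch of a PDMP rollout: deterministic flow between events,
    event times via thinning, and a stochastic jump at each event.

    flow(state)       -> ds/dt, the ODE right-hand side
    intensity(state)  -> event rate lambda(state), assumed <= lambda_max
    jump(state, rng)  -> post-interaction state
    """
    t, events = t0, []
    while t < horizon:
        # propose a candidate event time from a dominating Poisson process
        t_prop = t + rng.exponential(1.0 / lambda_max)
        # integrate the flow ODE up to the proposal with forward Euler
        while t < min(t_prop, horizon):
            h = min(dt, t_prop - t)
            state = state + h * flow(state)
            t += h
        if t >= horizon:
            break
        # accept the proposal with probability lambda(state) / lambda_max
        if rng.uniform() < intensity(state) / lambda_max:
            state = jump(state, rng)
            events.append(t)
    return events
```

The nested Euler loop is where the costs mentioned above show up: accuracy demands a small `dt`, and backpropagating through the accept/reject step is awkward.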
The FA approach approximates user behavior using neural networks, creating a simpler version of the previous idea. It consists of three models: one to predict waiting times between interactions (given the current state), another to update the user's state over the waiting time, and a third to model the jumps in state caused by user interactions. The approach is more straightforward and faster to train than the PDMP. However, it tends to overestimate the time intervals between events and fails to terminate the generation of arrival times (it keeps generating arrivals).
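A rollout sketch of the three-model decomposition (the three callables stand in for the trained networks and their names are hypothetical; the event cap is a workaround for the non-termination issue noted above):

```python
import numpy as np

def simulate_fa(state, t0, horizon, wait_model, drift_model, jump_model, rng,
                max_events=1000):
    """Sketch of an FA-style rollout with three learned models.

    wait_model(state, rng)    -> sampled waiting time until the next interaction
    drift_model(state, dt)    -> state after evolving for dt without events
    jump_model(state, r, rng) -> state after an interaction with outcome r
    """
    t, events = t0, []
    # cap the number of events, since the raw model tends not to terminate
    for _ in range(max_events):
        dt = wait_model(state, rng)
        t += dt
        if t >= horizon:
            break
        state = drift_model(state, dt)
        r = rng.integers(0, 2)          # placeholder interaction outcome
        state = jump_model(state, r, rng)
        events.append((t, r))
    return events
```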
The DE approach models the probability of user interactions occurring at specific times by estimating the intensity function directly. It uses a neural network to predict the intensity of events, trained against a kernel density estimate of the true event times. Another network produces the state at time t. This method is more computationally efficient and easier to parallelize. It successfully avoids generating unnecessary samples, but it is less clear how to incorporate user interactions into the model.
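The core of the idea can be sketched with off-the-shelf pieces (a minimal example, not the repo's code: scipy's `gaussian_kde` supplies the target and an sklearn MLP stands in for the intensity network):

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neural_network import MLPRegressor

def fit_intensity(event_times, grid):
    """Sketch of direct estimation: regress a network onto a kernel
    density estimate of the observed event times.

    The KDE integrates to 1, so we scale by the event count to target
    an (approximate) intensity rather than a density.
    """
    kde = gaussian_kde(event_times)
    target = kde(grid) * len(event_times)   # approximate intensity on the grid
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    net.fit(grid.reshape(-1, 1), target)
    return net

# usage sketch with synthetic timestamps
rng = np.random.default_rng(0)
times = np.sort(rng.exponential(1.0, size=200).cumsum())
grid = np.linspace(times[0], times[-1], 256)
intensity_net = fit_intensity(times, grid)
```

Because every grid point is an independent regression target, this training step parallelizes trivially, which is the efficiency advantage mentioned above.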
Use this Notebook to test the models on a single user. You can use either real data (prepared by the multi-user notebook) or data generated by another neural network.
Use this Notebook to train the model on several users, employing a Variational Auto-Decoder. The PDMP isn't used here, since it is too inefficient.
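To illustrate the auto-decoder idea in miniature (a point-estimate simplification, assuming a toy setup that is not the notebook's model): each user gets a latent code that a shared decoder maps to that user's event rate, and latents and decoder are fit jointly.

```python
import numpy as np

def train_vad(event_counts, latent_dim=2, steps=500, lr=0.01, seed=0):
    """Toy auto-decoder sketch: per-user latents z_u, shared linear decoder
    mapping z_u to the log-rate of a Poisson model of that user's event
    count. Fit jointly by gradient ascent on the log-likelihood, with a
    standard-normal prior on z acting as a regularizer. event_counts is a
    1-D numpy array with one count per user.
    """
    rng = np.random.default_rng(seed)
    n_users = len(event_counts)
    z = rng.normal(0, 0.1, size=(n_users, latent_dim))  # per-user latents
    w = rng.normal(0, 0.1, size=latent_dim)             # decoder weights
    b = 0.0
    for _ in range(steps):
        rate = np.exp(z @ w + b)
        err = event_counts - rate          # d log-lik / d log-rate (Poisson)
        z += lr * (np.outer(err, w) - z)   # likelihood grad + N(0, I) prior
        w += lr * (z.T @ err) / n_users
        b += lr * err.mean()
    return z, w, b

z, w, b = train_vad(np.array([3.0, 10.0, 4.0, 8.0]))
```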
The starting point of this repo is the GitHub repository. The following text is the content of its README file.
This is prototype code for research purposes only.
The code implements a trainable Poisson process simulator for interactive systems. The purpose of this repository is to provide the implementation used in the public experiments described in the RecSys '21 publication. See that paper for more details on the modeling and inference steps.
There are two modules:
simtrain
-> contains core logic for building the simulator and training on observed data
experiment
-> enables experiments using a trained simulator
There are three notebooks:
notebooks/process_contentwise_data.ipynb
-> notebook to process the ContentWise dataset into a format that can be used to train the simulator
notebooks/train_simulator.ipynb
-> notebook demonstrating how the simulator is trained
notebooks/boltzmann_study.ipynb
-> notebook to make a Boltzmann exploration hyperparameter sweep using simulation
- To download the ContentWise dataset, please use the CSV data from their folder
ContentWiseImpressions/data/ContentWiseImpressions/CW10M-CSV/
- Set your base paths in the file
notebooks/paths.py
to point to where you downloaded the data and where intermediary and processed files will live
- Adjust the
simtrain/SETTINGS.py
file according to the available computational resources (i.e. N_SUBSAMPLE_USERS, NUMEXPR_MAX_THREADS); an illustrative example is shown after this list. Please be aware that the initial loading and first-step processing of the data in process_contentwise_data.ipynb takes a lot of time and memory (~4 hours on a large single instance). Beyond this point, user subsampling yields a smaller system of users and items that helps speed up the code for the proof of concept.
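For example, on a smaller machine the relevant settings might look like this (the constant names come from the instructions above; the values are purely illustrative):

```python
# simtrain/SETTINGS.py (illustrative values; tune to your hardware)
N_SUBSAMPLE_USERS = 1000     # fewer users -> smaller system, faster runs
NUMEXPR_MAX_THREADS = 8      # cap worker threads to the available cores
```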
Requires the following packages:
numpy>=1.18.2
pandas>=0.25.3
matplotlib>=3.2.1
scipy>=1.4.1
scikit-learn>=0.22.2
tensorflow>=1.15.0
grpcio>=1.24.3
tqdm>4.62.2
Cite as:
@inproceedings{mcinerney2021accordion,
  title={Accordion: A Trainable Simulator for Long-Term Interactive Systems},
  author={McInerney, James and Elahi, Ehtsham and Basilico, Justin and Raimond, Yves and Jebara, Tony},
  booktitle={Fifteenth ACM Conference on Recommender Systems},
  pages={102--113},
  year={2021}
}