
Thahit/Recommender_Sim


Exploring the Modelling of User Arrival Times of Online Streaming Sites

We aim to generate synthetic user data that mimics real-world interactions, in order to experiment with recommender systems. Given the last interaction time t_i and interaction outcome r_i, we generate a prediction for the next time the user visits the site/app: t_{i+1} ~ f_θ(· | t_i, r_i). Our idea is that users have a hidden state that evolves over time and dictates when they interact with the system.
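As a toy illustration of this generative view (the functional forms below are assumptions for the sketch, not the repo's actual models), the next arrival time can be sampled by drawing a waiting time whose distribution depends on the hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_arrival(t_i, hidden_state, rng):
    """Toy stand-in for f_theta(. | t_i, r_i): the hidden state sets the
    rate of an exponential waiting-time distribution (assumed form)."""
    rate = np.exp(-np.linalg.norm(hidden_state))  # hypothetical state-to-rate map
    wait = rng.exponential(1.0 / rate)
    return t_i + wait

state = np.array([0.5, -0.2])  # hypothetical hidden state
t_next = sample_next_arrival(10.0, state, rng)
```

In all three models below, the interesting part is how the hidden state evolves between and across such draws.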

Data processing

After following the instructions below (see "Accordion: a Trainable Simulator for Long-Term Interactive Systems"), use the notebook notebooks/preprocess_for_new.ipynb to continue processing the data.

Models

For a more detailed explanation of the models, see the poster.

Model as Piecewise Deterministic Markov Process (PDMP)

The PDMP approach models user behavior by combining deterministic and stochastic elements. The user's hidden state evolves continuously over time through a flow function, modeled as an ODE. An intensity function determines when the next interaction occurs by defining the waiting-time distribution. After each interaction, the user's state undergoes a jump influenced by the interaction's outcome. While this model captures complex dynamics, it suffers from numerical instability, high computational cost, and difficulties with differentiability.
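A minimal sketch of the PDMP simulation loop, with toy stand-ins for the flow, intensity, and jump functions (all three forms are assumptions; the actual models are learned):

```python
import numpy as np

def flow(state, dt):
    # deterministic drift of the hidden state between events (toy ODE: decay)
    return state + dt * (-0.5 * state)

def intensity(state):
    # instantaneous event rate as a function of the state (assumed form)
    return float(np.exp(state.sum()))

def jump(state, reward):
    # discrete state update triggered by the interaction's outcome
    return state + 0.1 * reward

def sample_event(state, rng, dt=0.01, max_t=100.0):
    """Sample the next event time by Euler-integrating the flow ODE and
    accepting an event with probability intensity * dt at each step."""
    t = 0.0
    while t < max_t:
        state = flow(state, dt)
        t += dt
        if rng.random() < intensity(state) * dt:
            return t, state
    return max_t, state

rng = np.random.default_rng(1)
t, state = sample_event(np.array([0.0, 0.0]), rng)
state = jump(state, reward=1.0)  # the state jumps after the interaction
```

The fine time discretization inside the sampling loop hints at the cost and stability issues mentioned above: the ODE and the intensity must be integrated step by step between every pair of events.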

Alternative Model 1: Function Approximation

The FA approach approximates user behavior with neural networks (a simplified version of the PDMP idea). It consists of three models: one to predict the waiting time until the next interaction (given the current state), another to update the user's state after the waiting time has elapsed, and a third to model the jumps in state caused by user interactions. This approach is more straightforward and faster to train than the PDMP. However, it tends to overestimate the time intervals between events and fails to terminate the generation of arrival times (it keeps generating arrivals).
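How the three models compose into a generation loop can be sketched as follows; the random linear maps below are placeholders for the three trained networks, and the activation choices are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4  # hypothetical hidden-state dimension
# stand-ins for the three trained networks (random linear maps here)
W_wait = rng.normal(size=D)
W_evolve = rng.normal(size=(D, D)) * 0.1
W_jump = rng.normal(size=D) * 0.1

def predict_wait(state):
    # model 1: waiting time from the current state (softplus keeps it positive)
    return float(np.log1p(np.exp(W_wait @ state)))

def evolve_state(state, wait):
    # model 2: the state after `wait` time has elapsed
    return state + wait * np.tanh(W_evolve @ state)

def apply_jump(state, reward):
    # model 3: jump in state caused by the interaction's outcome
    return state + reward * W_jump

state = np.zeros(D)
t = 0.0
arrivals = []
for reward in [1.0, 0.0, 1.0]:  # hypothetical interaction outcomes
    wait = predict_wait(state)
    state = evolve_state(state, wait)
    state = apply_jump(state, reward)
    t += wait
    arrivals.append(t)
```

Note that nothing in this loop can decide to stop: each step always produces a positive waiting time, which illustrates the non-termination issue mentioned above.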

Alternative Model 2: Density Estimation

The DE approach models the probability of user interactions occurring at specific times by estimating the intensity function directly. A neural network predicts the intensity of events and is trained against a kernel density estimate of the true event times; another network produces the state at time t. This method is more computationally efficient and easier to parallelize, and it avoids generating unnecessary samples, but it is less clear how to incorporate user interactions into the model.
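The kernel density target that the intensity network is trained against can be sketched like this (the event times and bandwidth are illustrative assumptions):

```python
import numpy as np

def kde_intensity(event_times, grid, bandwidth=0.5):
    """Gaussian kernel density estimate of the event intensity on a time grid.
    Summing (rather than averaging) the kernels keeps the total mass equal to
    the number of events, so this behaves like an unnormalized intensity."""
    diffs = grid[:, None] - np.asarray(event_times)[None, :]
    kernels = np.exp(-0.5 * (diffs / bandwidth) ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.sum(axis=1)

events = [1.0, 1.2, 3.5, 3.7, 3.9]  # hypothetical observed arrival times
grid = np.linspace(0.0, 5.0, 101)
target = kde_intensity(events, grid)
# a network's predicted intensity on `grid` would be trained against `target`,
# e.g. with a squared-error loss evaluated at the grid points
```

Because the target is evaluated on a fixed grid rather than sampled event by event, the loss for all time points can be computed in one batch, which is where the parallelization advantage comes from.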

Experiments

Single User

Use this Notebook to test the models on a single user. You can use either real data (prepared by the multi-user notebook) or data generated by another neural network.

Multiple Users

Use this Notebook to train the model on several users, employing a Variational Auto-Decoder. The PDMP is not used here, since it is too inefficient.

Accordion: a Trainable Simulator for Long-Term Interactive Systems

The starting point of this repo is the original Accordion GitHub repository. The following text is the content of its README file.

This is prototype code for research purposes only.

The code implements a trainable Poisson process simulator for interactive systems. The purpose of this repository is to provide the implementation used in the public experiments described in the RecSys '21 publication. See that paper for more details on the modeling and inference steps.

There are two modules:

  • simtrain -> contains core logic for building the simulator and training on observed data
  • experiment -> enables experiments using a trained simulator

There are three notebooks:

  • notebooks/process_contentwise_data.ipynb -> notebook to process the ContentWise dataset into a format that can be used to train the simulator
  • notebooks/train_simulator.ipynb -> notebook demonstrating how the simulator is trained
  • notebooks/boltzmann_study.ipynb -> notebook to make a Boltzmann exploration hyperparameter sweep using simulation

Setup

  1. Download the ContentWise dataset; use the CSV data from its folder ContentWiseImpressions/data/ContentWiseImpressions/CW10M-CSV/
  2. Set your base paths in the file notebooks/paths.py to point to where you downloaded the data and where intermediate and processed files will live
  3. Adjust the simtrain/SETTINGS.py file according to the available computational resources (e.g. N_SUBSAMPLE_USERS, NUMEXPR_MAX_THREADS). Be aware that the initial loading and first-pass processing of the data in process_contentwise_data.ipynb takes a lot of time and memory (~4 hours on a large single instance). Beyond this point, user subsampling yields a smaller system of users and items that speeds up the code for the proof of concept.
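For step 2, a minimal sketch of what notebooks/paths.py might look like — the variable names here are hypothetical; use whatever names the repo's notebooks actually import:

```python
# hypothetical contents of notebooks/paths.py — the variable names are
# assumptions; adjust them to match what the notebooks actually import
DATA_DIR = "/path/to/ContentWiseImpressions/data/ContentWiseImpressions/CW10M-CSV/"
WORK_DIR = "/path/to/intermediate_and_processed_files/"
```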

Requirements

Requires the following packages:

```
numpy>=1.18.2
pandas>=0.25.3
matplotlib>=3.2.1
scipy>=1.4.1
scikit-learn>=0.22.2
tensorflow>=1.15.0
grpcio>=1.24.3
tqdm>=4.62.2
```

Bibtex paper reference

Cite as:

```bibtex
@inproceedings{mcinerney2021accordion,
  title={Accordion: A Trainable Simulator for Long-Term Interactive Systems},
  author={McInerney, James and Elahi, Ehtsham and Basilico, Justin and Raimond, Yves and Jebara, Tony},
  booktitle={Fifteenth ACM Conference on Recommender Systems},
  pages={102--113},
  year={2021}
}
```