180321 --> 250321
- CLEAN up the MDP files: separate out the classes and use abstract base classes (sketch below)
- INVESTIGATE rewriting the agent and estimator classes to fully exploit numpy (not urgent; sketch below)
- UNIFY estimators and approximators in my code
- ADJUST the learning plots so that there is a fixed maximum number of steps
- ADJUST the learning plots so that the greedy evaluation starts from the initial state (both plot adjustments are sketched below)
- FRESH agent or estimator function to calculate the bias with respect to the action-value function (sketch below)
- FRESH agent train method so that estimator training is separate from action selection (15 min; sketch below)
- TRY learning plots with LambChop using RMax estimators
- FIX a bug where the agent takes only one step when exploring and the episode terminates... (the bug seems to have fixed itself after a restart)
- TRY learning plots with LambChop using novelty estimators (only)
- THEORY can reactive Turing machines help in the analysis of reinforcement learning?
- LINK for custom keyboard shortcuts in JupyterLab: https://towardsdatascience.com/how-to-customize-jupyterlab-keyboard-shortcuts-72321f73753d
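
A minimal sketch of the abstract-base-class clean-up for the MDP files. `MDP`, `GridWorld`, and the method names are hypothetical stand-ins rather than the repo's actual classes; the point is that `abc` turns a missing method into a `TypeError` at instantiation instead of a silent gap.

```python
from abc import ABC, abstractmethod


class MDP(ABC):
    """Abstract interface that every concrete MDP class implements."""

    @abstractmethod
    def initial_state(self):
        """Return the initial state of the MDP."""

    @abstractmethod
    def actions(self, state):
        """Return the actions available in `state`."""

    @abstractmethod
    def step(self, state, action):
        """Return (next_state, reward, done) for taking `action` in `state`."""


class GridWorld(MDP):
    """Toy concrete MDP; omitting any abstract method makes instantiation fail."""

    def initial_state(self):
        return (0, 0)

    def actions(self, state):
        return ["up", "down", "left", "right"]

    def step(self, state, action):
        # Real environment dynamics would go here; this stub stays put.
        return state, 0.0, False
```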
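
A small illustration of the numpy rewrite idea: the same greedy-action computation done with a per-state Python loop and with one vectorized `argmax`. The array `q` is a hypothetical tabular action-value estimate, not the actual estimator class.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 100, 4
q = rng.normal(size=(n_states, n_actions))  # hypothetical tabular estimates

# Loop version: pick the greedy action one state at a time.
greedy_loop = [int(np.argmax(q[s])) for s in range(n_states)]

# Vectorized version: a single argmax over the action axis.
greedy_vec = q.argmax(axis=1)

assert (greedy_vec == np.array(greedy_loop)).all()
```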
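
A sketch of the two plot adjustments on a toy tabular setup: every run shares one fixed step budget, and each greedy evaluation episode is rolled out from the initial state. The update rule, dynamics, and rewards below are placeholders, not the real agent or MDPs.

```python
import numpy as np

rng = np.random.default_rng(0)
MAX_STEPS = 10_000   # fixed maximum number of training steps, shared by all runs
EVAL_EVERY = 1_000   # evaluate the greedy policy at a fixed cadence


def run_greedy_episode(q, initial_state=0, horizon=50):
    """Roll out the greedy policy from the initial state and sum the rewards."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        action = int(q[state].argmax())        # greedy action selection
        state = (state + action) % q.shape[0]  # toy deterministic dynamics
        total += 1.0 if state == 0 else 0.0    # toy reward signal
    return total


q = np.zeros((10, 2))  # toy tabular action-value estimates
curve = []             # (step, greedy return) points for the learning plot
for step in range(1, MAX_STEPS + 1):
    s, a = rng.integers(10), rng.integers(2)
    q[s, a] += 0.1 * (1.0 - q[s, a])  # stand-in update rule
    if step % EVAL_EVERY == 0:
        curve.append((step, run_greedy_episode(q)))
```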
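
One way the bias calculation could look: compare an estimator's action values against a reference action-value function (the reference would come from, e.g., dynamic programming on the MDP). `action_value_bias` and both arrays are hypothetical.

```python
import numpy as np


def action_value_bias(q_estimate, q_true):
    """Mean signed deviation of the estimates from the reference action values."""
    return float(np.mean(np.asarray(q_estimate) - np.asarray(q_true)))


# Example: a uniformly optimistic estimator has positive bias.
q_true = np.zeros((5, 2))
q_estimate = q_true + 0.5
print(action_value_bias(q_estimate, q_true))  # 0.5
```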
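
A sketch of the train-method split: estimator training and action selection live in separate methods, so each can be tested or swapped on its own. `Agent`, `env.step`, and the estimator/policy interfaces are hypothetical placeholders.

```python
class Agent:
    def __init__(self, estimator, policy):
        self.estimator = estimator  # e.g. an RMax or novelty estimator
        self.policy = policy        # e.g. epsilon-greedy over the estimates

    def select_action(self, state):
        """Action selection only; no learning happens here."""
        return self.policy(self.estimator, state)

    def update(self, state, action, reward, next_state, done):
        """Estimator training only; no action is chosen here."""
        self.estimator.update(state, action, reward, next_state, done)

    def train_step(self, env, state):
        """One interaction step composed from the two separate pieces."""
        action = self.select_action(state)
        next_state, reward, done = env.step(state, action)
        self.update(state, action, reward, next_state, done)
        return next_state, done
```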