HokageM
diff --git a/README.md b/README.md (+49 -96)
diff --git a/demo/heatmaps/learner_1999_deep_RL.png b/demo/heatmaps/learner_1999_deep_RL.png (19.2 KB)
diff --git a/demo/heatmaps/learner_4999_deep_RL.png b/demo/heatmaps/learner_4999_deep_RL.png (21.2 KB)
diff --git a/demo/heatmaps/learner_999_deep_RL.png b/demo/heatmaps/learner_999_deep_RL.png (21 KB)
diff --git a/demo/learning_curves/maxentdeep_4999_RL.png b/demo/learning_curves/maxentdeep_4999_RL.png (22.2 KB)
diff --git a/demo/learning_curves/maxentdeep_999_RL.png b/demo/learning_curves/maxentdeep_999_RL.png (27.1 KB)
diff --git a/demo/test_results/test_maxentropydeep_best_model_RL_results.png b/demo/test_results/test_maxentropydeep_best_model_RL_results.png (42.1 KB)
diff --git a/demo/trained_models/maxentropydeep_3697_best_network_w_-83.0_RL.pth b/demo/trained_models/maxentropydeep_3697_best_network_w_-83.0_RL.pth (12.3 KB)
diff --git a/setup.cfg b/setup.cfg (+1 -1)
diff --git a/src/irlwpython/MaxEntropyDeepRL.py b/src/irlwpython/MaxEntropyDeepRL.py (+197)
@@ -6,119 +6,72 @@ Inverse Reinforcement Learning Algorithm implementation with python.
 
 # Implemented Algorithms
 
-## Maximum Entropy IRL: [1]
+## Maximum Entropy IRL:
 
-## Maximum Entropy Deep IRL
+An implementation of the Maximum Entropy inverse reinforcement learning algorithm from [1], based on the implementation
+in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
+It is an IRL algorithm that uses Q-Learning with a Maximum Entropy update function.
 
-# Experiments
+## Maximum Entropy Deep IRL:
 
-## Mountaincar-v0
-[gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/)
-
-The expert demonstrations for the Mountaincar-v0 are the same as used in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
-
-*Heatmap of Expert demonstrations with 400 states*:
-
- <img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">
-
-### Maximum Entropy Inverse Reinforcement Learning
-
-IRL using Q-Learning with a Maximum Entropy update function.
-
-#### Training
-
-*Learner training for 1000 episodes*:
-
-<img src="demo/learning_curves/maxent_999_flat.png">
-
-*Learner training for 4000 episodes*:
-
-<img src="demo/learning_curves/maxent_4999_flat.png">
-
-#### Heatmaps
-
-*Learner state frequencies after 1000 episodes*:
-
-<img src="demo/heatmaps/learner_999_flat.png">
-
-*Learner state frequencies after 2000 episodes*:
-
-<img src="demo/heatmaps/learner_1999_flat.png">
-
-*Learner state frequencies after 5000 episodes*:
-
-<img src="demo/heatmaps/learner_4999_flat.png">
-
-<img src="demo/heatmaps/theta_999_flat.png">
-
-*State rewards heatmap after 5000 episodes*:
-
-<img src="demo/heatmaps/theta_4999_flat.png">
+An implementation of the Maximum Entropy inverse reinforcement learning algorithm that uses a neural network for the
+actor.
+The estimated IRL reward is learned in a similar way to Maximum Entropy IRL.
+It is an IRL algorithm that uses Deep Q-Learning with a Maximum Entropy update function.
 
-*State rewards heatmap after 14000 episodes*:
+## Maximum Entropy Deep RL:
 
-<img src="demo/heatmaps/theta_13999_flat.png">
+An implementation of the Maximum Entropy reinforcement learning algorithm.
+It serves as an RL baseline for comparison with the IRL algorithms.
 
-#### Testing
+# Experiment
 
-*Testing results of the model after 29000 episodes*:
-
-<img src="demo/test_results/test_maxentropy_flat.png">
-
-
-### Deep Maximum Entropy Inverse Reinforcement Learning
-
-IRL using Deep Q-Learning with a Maximum Entropy update function.
-
-#### Training
-
-*Learner training for 1000 episodes*:
-
-<img src="demo/learning_curves/maxentdeep_999_w_reset_10.png">
-
-*Learner training for 5000 episodes*:
-
-<img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png">
-
-#### Heatmaps
-
-*Learner state frequencies after 1000 episodes*:
-
-<img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png">
-
-*Learner state frequencies after 2000 episodes*:
-
-<img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png">
-
-*Learner state frequencies after 5000 episodes*:
-
-<img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png">
-
-*State rewards heatmap after 1000 episodes*:
-
-<img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png">
+## Mountaincar-v0
 
-*State rewards heatmap after 2000 episodes*:
+The algorithms are evaluated on the Mountaincar-v0 environment,
+using the MDP implementation of Mountaincar
+from [gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/).
 
-<img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png">
+The expert demonstrations for the Mountaincar-v0 are the same as used
+in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
 
-*State rewards heatmap after 5000 episodes*:
+*Heatmap of Expert demonstrations with 400 states*:
 
-<img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png">
+ <img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">
 
+### Comparing the algorithms
 
-#### Testing
+The following tables compare the training and testing results of the two IRL algorithms, Maximum Entropy and
+Maximum Entropy Deep. Results for the RL algorithm Maximum Entropy Deep RL are also shown to
+highlight the differences between IRL and RL.
 
-*Testing results of the best model after 5000 episodes*:
+| Algorithm                | Training Curve after 1000 Episodes                                         | Training Curve after 5000 Episodes                                          |
+|--------------------------|----------------------------------------------------------------------------|-----------------------------------------------------------------------------|
+| Maximum Entropy IRL      | <img src="demo/learning_curves/maxent_999_flat.png" width="400">           | <img src="demo/learning_curves/maxent_4999_flat.png" width="400">           |
+| Maximum Entropy Deep IRL | <img src="demo/learning_curves/maxentdeep_999_w_reset_10.png" width="400"> | <img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png" width="400"> |
+| Maximum Entropy Deep RL  | <img src="demo/learning_curves/maxentdeep_999_RL.png" width="400">         | <img src="demo/learning_curves/maxentdeep_4999_RL.png" width="400">         |
 
-<img src="demo/test_results/test_maxentropydeep_best_model_results.png">
+| Algorithm                | State Frequencies Learner: 1000 Episodes                                    | State Frequencies Learner: 2000 Episodes                                     | State Frequencies Learner: 5000 Episodes                                     |
+|--------------------------|-----------------------------------------------------------------------------|------------------------------------------------------------------------------|------------------------------------------------------------------------------|
+| Maximum Entropy IRL      | <img src="demo/heatmaps/learner_999_flat.png" width="400">                  | <img src="demo/heatmaps/learner_1999_flat.png" width="400">                  | <img src="demo/heatmaps/learner_4999_flat.png" width="400">                  |
+| Maximum Entropy Deep IRL | <img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png" width="400"> |
+| Maximum Entropy Deep RL  | <img src="demo/heatmaps/learner_999_deep_RL.png" width="400">               | <img src="demo/heatmaps/learner_1999_deep_RL.png" width="400">               | <img src="demo/heatmaps/learner_4999_deep_RL.png" width="400">               |
 
-### Deep Maximum Entropy Inverse Reinforcement Learning with Critic
+| Algorithm                | IRL Rewards: 1000 Episodes                                                | IRL Rewards: 2000 Episodes                                                 | IRL Rewards: 5000 Episodes                                                 | IRL Rewards: 14000 Episodes                                |
+|--------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------------|------------------------------------------------------------|
+| Maximum Entropy IRL      | <img src="demo/heatmaps/theta_999_flat.png" width="400">                  | None                                                                       | <img src="demo/heatmaps/theta_4999_flat.png" width="400">                  | <img src="demo/heatmaps/theta_13999_flat.png" width="400"> |
+| Maximum Entropy Deep IRL | <img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png" width="400"> | None                                                       |
+| Maximum Entropy Deep RL  | None                                                                      | None                                                                       | None                                                                       | None                                                       |
 
-Coming soon...
+| Algorithm                | Testing Results: 100 Runs                                                               |
+|--------------------------|-----------------------------------------------------------------------------------------|
+| Maximum Entropy IRL      | <img src="demo/test_results/test_maxentropy_flat.png" width="400">                      |
+| Maximum Entropy Deep IRL | <img src="demo/test_results/test_maxentropydeep_best_model_results.png" width="400">    |
+| Maximum Entropy Deep RL  | <img src="demo/test_results/test_maxentropydeep_best_model_RL_results.png" width="400"> |
 
 # References
-The implementation of MaxEntropyIRL and MountainCar is based on the implementation of: 
+
+The implementation of MaxEntropyIRL and MountainCar is based on the implementation of:
 [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent)
 
 [1] [BD. Ziebart, et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008](https://cdn.aaai.org/AAAI/2008/AAAI08-227.pdf).
@@ -133,12 +86,12 @@ pip install .
 # Usage
 
 ```commandline
-usage: irl [-h] [--version] [--training] [--testing] [--render] ALGORITHM
+usage: irl-runner [-h] [--version] [--training] [--testing] [--render] ALGORITHM
 
 Implementation of IRL algorithms
 
 positional arguments:
-  ALGORITHM   Currently supported training algorithm: [max-entropy, max-entropy-deep]
+  ALGORITHM   Currently supported training algorithm: [max-entropy, max-entropy-deep, max-entropy-deep-rl]
 
 options:
   -h, --help  show this help message and exit
 
@@ -78,7 +78,7 @@ testing =
 #     script_name = irlwpython.module:function
 # For example:
 console_scripts =
-     irl = irlwpython.main:run
+     irl-runner = irlwpython.main:run
 # And any other entry points, for example:
 # pyscaffold.cli =
 #     awesome = pyscaffoldext.awesome.extension:AwesomeExtension
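
Review note: the usage text in the README hunk and the renamed console script above suggest a command-line interface along the following lines. This is only a minimal sketch, not the code in `irlwpython.main` (which is not part of this diff). The function name `build_parser`, the help strings for the flags, and the version string are assumptions made for illustration.

```python
import argparse

# Algorithm names as listed in the updated README usage text.
ALGORITHMS = ("max-entropy", "max-entropy-deep", "max-entropy-deep-rl")


def build_parser():
    """Builds a parser mirroring the documented `irl-runner` usage (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="irl-runner",
        description="Implementation of IRL algorithms",
    )
    parser.add_argument(
        "algorithm",
        metavar="ALGORITHM",
        choices=ALGORITHMS,
        help="Currently supported training algorithm: "
             "[max-entropy, max-entropy-deep, max-entropy-deep-rl]",
    )
    # Placeholder version string; the real value is not shown in this diff.
    parser.add_argument("--version", action="version", version="irl-runner (placeholder)")
    # The meaning of these flags is assumed from their names.
    parser.add_argument("--training", action="store_true", help="run training (assumed)")
    parser.add_argument("--testing", action="store_true", help="run testing (assumed)")
    parser.add_argument("--render", action="store_true", help="render the environment (assumed)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["max-entropy-deep-rl", "--training"])
    print(args.algorithm, args.training, args.testing, args.render)
```

With `choices` set, an unsupported algorithm name is rejected by `argparse` itself, so the dispatch code would only need to handle the three listed names.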
 
@@ -0,0 +1,197 @@
+import numpy as np
+import math
+
+import torch
+import torch.optim as optim
+import torch.nn as nn
+
+from irlwpython.FigurePrinter import FigurePrinter
+
+
+class QNetwork(nn.Module):
+    def __init__(self, input_size, output_size):
+        super(QNetwork, self).__init__()
+        self.fc1 = nn.Linear(input_size, 64)
+        self.relu1 = nn.ReLU()
+        self.fc2 = nn.Linear(64, 32)
+        self.relu2 = nn.ReLU()
+        self.output_layer = nn.Linear(32, output_size)
+
+        self.printer = FigurePrinter()
+
+    def forward(self, state):
+        x = self.fc1(state)
+        x = self.relu1(x)
+        x = self.fc2(x)
+        x = self.relu2(x)
+        q_values = self.output_layer(x)
+        return q_values
+
+
+class MaxEntropyDeepRL:
+    def __init__(self, target, state_dim, action_size, feature_matrix, one_feature, learning_rate=0.001, gamma=0.99):
+        self.feature_matrix = feature_matrix
+        self.one_feature = one_feature
+
+        self.target = target
+
+        self.q_network = QNetwork(state_dim, action_size)
+        self.target_q_network = QNetwork(state_dim, action_size)
+        self.target_q_network.load_state_dict(self.q_network.state_dict())
+        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
+
+        self.gamma = gamma
+
+        self.printer = FigurePrinter()
+
+    def select_action(self, state, epsilon):
+        """
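+        Selects an action epsilon-greedily based on the Q-values from the network.
+        :param state: Current environment state (observation)
+        :param epsilon: Probability of choosing a random action instead of the greedy one
+        :return: Index of the selected action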
+        """
+        if np.random.rand() < epsilon:
+            return np.random.choice(self.q_network.output_layer.out_features)
+        else:
+            with torch.no_grad():
+                q_values = self.q_network(torch.FloatTensor(state))
+                return torch.argmax(q_values).item()
+
+    def update_q_network(self, state, action, reward, next_state, done):
+        """
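+        Performs one gradient step on the Q-network towards the one-step TD target.
+        :param state: State in which the action was taken
+        :param action: Index of the action taken
+        :param reward: Reward received for the transition
+        :param next_state: State reached after the action
+        :param done: Whether the episode ended with this transition
+        :return: None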
+        """
+        state = torch.FloatTensor(state)
+        next_state = torch.FloatTensor(next_state)
+        q_values = self.q_network(state)
+        next_q_values = self.target_q_network(next_state)
+
+        target = q_values.clone()
+        if not done:
+            target[action] = reward + self.gamma * torch.max(next_q_values).item()
+        else:
+            target[action] = reward
+
+        loss = nn.MSELoss()(q_values, target.detach())
+        self.optimizer.zero_grad()
+        loss.backward()
+        self.optimizer.step()
+
+    def update_target_network(self):
+        """
+        Copies the current Q-network weights into the target network.
+        :return:
+        """
+        self.target_q_network.load_state_dict(self.q_network.state_dict())
+
+    def train(self, n_states, episodes=30000, max_steps=200,
+              epsilon_start=1.0,
+              epsilon_decay=0.995, epsilon_min=0.01):
+        """
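+        Trains the network using the Maximum Entropy Deep reinforcement learning algorithm.
+        :param n_states: Number of discretized states (length of the feature-expectation vector)
+        :param episodes: Number of training episodes
+        :param max_steps: Maximum number of steps per episode
+        :param epsilon_start: Initial exploration rate
+        :param epsilon_decay: Factor by which epsilon is multiplied after each episode
+        :param epsilon_min: Lower bound for epsilon
+        :return: None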
+        """
+        learner_feature_expectations = np.zeros(n_states)
+
+        epsilon = epsilon_start
+        episode_arr, scores = [], []
+
+        best_reward = -math.inf
+        for episode in range(episodes):
+            state, info = self.target.env_reset()
+            total_reward = 0
+
+            for step in range(max_steps):
+                action = self.select_action(state, epsilon)
+
+                next_state, reward, done, _, _ = self.target.env_step(action)
+                total_reward += reward
+
+                self.update_q_network(state, action, reward, next_state, done)
+                self.update_target_network()
+
+                # State counting for density
+                state_idx = self.target.state_to_idx(state)
+                learner_feature_expectations += self.feature_matrix[int(state_idx)]
+
+                state = next_state
+                if done:
+                    break
+
+            # Keep track of best performing network
+            if total_reward > best_reward:
+                best_reward = total_reward
+                torch.save(self.q_network.state_dict(),
+                           f"../results/maxentropydeep_{episode}_best_network_w_{total_reward}_RL.pth")
+
+            if (episode + 1) % 10 == 0:
+                # calculate density
+                learner = learner_feature_expectations / episode
+                learner_feature_expectations = np.zeros(n_states)
+
+            scores.append(total_reward)
+            episode_arr.append(episode)
+            epsilon = max(epsilon * epsilon_decay, epsilon_min)
+            print(f"Episode: {episode + 1}, Total Reward: {total_reward}, Epsilon: {epsilon}")
+
+            if (episode + 1) % 1000 == 0:
+                score_avg = np.mean(scores)
+                print('{} episode average score is {:.2f}'.format(episode, score_avg))
+                self.printer.save_plot_as_png(episode_arr, scores,
+                                              f"../learning_curves/maxent_{episodes}_{episode}_qnetwork_RL.png")
+                self.printer.save_heatmap_as_png(learner.reshape((20, 20)), f"../heatmap/learner_{episode}_deep_RL.png")
+                # No reward heatmap is saved here: plain RL trains on the environment reward,
+                # so this class has no learned reward parameters (no `self.theta`) to plot.
+
+                torch.save(self.q_network.state_dict(), f"../results/maxent_{episodes}_{episode}_network_main.pth")
+
+            if episode == episodes - 1:
+                self.printer.save_plot_as_png(episode_arr, scores,
+                                              f"../learning_curves/maxentdeep_{episodes}_qdeep_RL.png")
+
+        torch.save(self.q_network.state_dict(), f"src/irlwpython/results/maxentdeep_{episodes}_q_network_RL.pth")
+
+    def test(self, model_path, epsilon=0.01, repeats=100):
+        """
+        Tests a previously trained model by running it for `repeats` episodes.
+        :return:
+        """
+        self.q_network.load_state_dict(torch.load(model_path))
+        episodes, scores = [], []
+
+        for episode in range(repeats):
+            state, info = self.target.env_reset()
+            score = 0
+
+            while True:
+                self.target.env_render()
+                action = self.select_action(state, epsilon)
+                next_state, reward, done, _, _ = self.target.env_step(action)
+
+                score += reward
+                state = next_state
+
+                if done:
+                    scores.append(score)
+                    episodes.append(episode)
+                    break
+
+            # Report the score of every test episode
+            print('{} episode score is {:.2f}'.format(episode, score))
+
+        self.printer.save_plot_as_png(episodes, scores,
+                                      "src/irlwpython/learning_curves"
+                                      "/test_maxentropydeep_best_model_RL_results.png")
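
Review note: as a reading aid for `update_q_network` and the epsilon schedule in `train` above, here is a minimal pure-Python sketch of the two calculations involved. It mirrors the arithmetic in the diff (one-step target `reward + gamma * max(next_q_values)`, and `epsilon = max(epsilon * epsilon_decay, epsilon_min)`), but it is not the author's code. The function names `td_target` and `episodes_until_epsilon_floor` are hypothetical, and plain lists stand in for the network outputs.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target for the action that was taken.

    Terminal transitions use the reward alone; otherwise the discounted
    maximum of the target network's Q-values for the next state is added.
    """
    if done:
        return float(reward)
    return float(reward) + gamma * max(next_q_values)


def episodes_until_epsilon_floor(epsilon_start=1.0, epsilon_decay=0.995, epsilon_min=0.01):
    """Counts per-episode decay steps until epsilon reaches epsilon_min."""
    if not 0.0 < epsilon_decay < 1.0:
        raise ValueError("epsilon_decay must lie strictly between 0 and 1")
    epsilon, episodes = epsilon_start, 0
    while epsilon > epsilon_min:
        # Same update rule as at the end of each training episode in the diff.
        epsilon = max(epsilon * epsilon_decay, epsilon_min)
        episodes += 1
    return episodes


if __name__ == "__main__":
    # Non-terminal step with reward -1 and illustrative next-state Q-values.
    print(td_target(-1.0, [-10.0, -8.0, -9.0]))
    # Terminal step: the target is the reward itself.
    print(td_target(-1.0, [-10.0, -8.0, -9.0], done=True))
    # Default schedule from train(): 1.0 start, 0.995 decay, 0.01 floor.
    print(episodes_until_epsilon_floor())
```

With the defaults used in `train`, epsilon reaches its floor of 0.01 after 919 episodes, so most of the default 30000 training episodes run with almost purely greedy action selection.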