Commit f492208

Add results for RL algorithm (#12)
1 parent a346382 commit f492208

13 files changed (+507, -99 lines changed)

README.md

+49-96
Original file line number | Diff line number | Diff line change
@@ -6,119 +6,72 @@ Inverse Reinforcement Learning Algorithm implementation with python.
66

77
# Implemented Algorithms
88

9-
## Maximum Entropy IRL: [1]
9+
## Maximum Entropy IRL:
1010

11-
## Maximum Entropy Deep IRL
11+
An implementation of the Maximum Entropy inverse reinforcement learning algorithm from [1], based on the implementation
12+
in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
13+
It is an IRL algorithm that uses Q-Learning with a Maximum Entropy update function.
1214

13-
# Experiments
15+
## Maximum Entropy Deep IRL:
1416

15-
## Mountaincar-v0
16-
[gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/)
17-
18-
The expert demonstrations for the Mountaincar-v0 are the same as used in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
19-
20-
*Heatmap of Expert demonstrations with 400 states*:
21-
22-
<img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">
23-
24-
### Maximum Entropy Inverse Reinforcement Learning
25-
26-
IRL using Q-Learning with a Maximum Entropy update function.
27-
28-
#### Training
29-
30-
*Learner training for 1000 episodes*:
31-
32-
<img src="demo/learning_curves/maxent_999_flat.png">
33-
34-
*Learner training for 4000 episodes*:
35-
36-
<img src="demo/learning_curves/maxent_4999_flat.png">
37-
38-
#### Heatmaps
39-
40-
*Learner state frequencies after 1000 episodes*:
41-
42-
<img src="demo/heatmaps/learner_999_flat.png">
43-
44-
*Learner state frequencies after 2000 episodes*:
45-
46-
<img src="demo/heatmaps/learner_1999_flat.png">
47-
48-
*Learner state frequencies after 5000 episodes*:
49-
50-
<img src="demo/heatmaps/learner_4999_flat.png">
51-
52-
<img src="demo/heatmaps/theta_999_flat.png">
53-
54-
*State rewards heatmap after 5000 episodes*:
55-
56-
<img src="demo/heatmaps/theta_4999_flat.png">
17+
An implementation of the Maximum Entropy inverse reinforcement learning algorithm that uses a neural network for the
18+
actor.
19+
The estimated IRL reward is learned in the same way as in Maximum Entropy IRL.
20+
It is an IRL algorithm that uses Deep Q-Learning with a Maximum Entropy update function.
5721

58-
*State rewards heatmap after 14000 episodes*:
22+
## Maximum Entropy Deep RL:
5923

60-
<img src="demo/heatmaps/theta_13999_flat.png">
24+
An implementation of the Maximum Entropy reinforcement learning algorithm.
25+
It serves as an RL baseline against which the IRL algorithms are compared.
6126

62-
#### Testing
27+
# Experiment
6328

64-
*Testing results of the model after 29000 episodes*:
65-
66-
<img src="demo/test_results/test_maxentropy_flat.png">
67-
68-
69-
### Deep Maximum Entropy Inverse Reinforcement Learning
70-
71-
IRL using Deep Q-Learning with a Maximum Entropy update function.
72-
73-
#### Training
74-
75-
*Learner training for 1000 episodes*:
76-
77-
<img src="demo/learning_curves/maxentdeep_999_w_reset_10.png">
78-
79-
*Learner training for 5000 episodes*:
80-
81-
<img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png">
82-
83-
#### Heatmaps
84-
85-
*Learner state frequencies after 1000 episodes*:
86-
87-
<img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png">
88-
89-
*Learner state frequencies after 2000 episodes*:
90-
91-
<img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png">
92-
93-
*Learner state frequencies after 5000 episodes*:
94-
95-
<img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png">
96-
97-
*State rewards heatmap after 1000 episodes*:
98-
99-
<img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png">
29+
## Mountaincar-v0
10030

101-
*State rewards heatmap after 2000 episodes*:
31+
The Mountaincar-v0 environment is used to evaluate the different algorithms.
32+
The implementation of the Mountaincar MDP
33+
from [gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/) is used.
10234

103-
<img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png">
35+
The expert demonstrations for the Mountaincar-v0 are the same as used
36+
in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
10437

105-
*State rewards heatmap after 5000 episodes*:
38+
*Heatmap of Expert demonstrations with 400 states*:
10639

107-
<img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png">
40+
<img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">
10841

42+
### Comparing the algorithms
10943

110-
#### Testing
44+
The following tables compare the results of training and testing the two IRL algorithms, Maximum Entropy and
45+
Maximum Entropy Deep. Results for the Maximum Entropy Deep RL algorithm are also shown, to
46+
highlight the differences between IRL and RL.
11147

112-
*Testing results of the best model after 5000 episodes*:
48+
| Algorithm | Training Curve after 1000 Episodes | Training Curve after 5000 Episodes |
49+
|--------------------------|----------------------------------------------------------------------------|-----------------------------------------------------------------------------|
50+
| Maximum Entropy IRL | <img src="demo/learning_curves/maxent_999_flat.png" width="400"> | <img src="demo/learning_curves/maxent_4999_flat.png" width="400"> |
51+
| Maximum Entropy Deep IRL | <img src="demo/learning_curves/maxentdeep_999_w_reset_10.png" width="400"> | <img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png" width="400"> |
52+
| Maximum Entropy Deep RL | <img src="demo/learning_curves/maxentdeep_999_RL.png" width="400"> | <img src="demo/learning_curves/maxentdeep_4999_RL.png" width="400"> |
11353

114-
<img src="demo/test_results/test_maxentropydeep_best_model_results.png">
54+
| Algorithm | State Frequencies Learner: 1000 Episodes | State Frequencies Learner: 2000 Episodes | State Frequencies Learner: 5000 Episodes |
55+
|--------------------------|-----------------------------------------------------------------------------|------------------------------------------------------------------------------|------------------------------------------------------------------------------|
56+
| Maximum Entropy IRL | <img src="demo/heatmaps/learner_999_flat.png" width="400"> | <img src="demo/heatmaps/learner_1999_flat.png" width="400"> | <img src="demo/heatmaps/learner_4999_flat.png" width="400"> |
57+
| Maximum Entropy Deep IRL | <img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png" width="400"> |
58+
| Maximum Entropy Deep RL | <img src="demo/heatmaps/learner_999_deep_RL.png" width="400"> | <img src="demo/heatmaps/learner_1999_deep_RL.png" width="400"> | <img src="demo/heatmaps/learner_4999_deep_RL.png" width="400"> |
11559

116-
### Deep Maximum Entropy Inverse Reinforcement Learning with Critic
60+
| Algorithm | IRL Rewards: 1000 Episodes | IRL Rewards: 2000 Episodes | IRL Rewards: 5000 Episodes | IRL Rewards: 14000 Episodes |
61+
|--------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------------|------------------------------------------------------------|
62+
| Maximum Entropy IRL | <img src="demo/heatmaps/theta_999_flat.png" width="400"> | None | <img src="demo/heatmaps/theta_4999_flat.png" width="400"> | <img src="demo/heatmaps/theta_13999_flat.png" width="400"> |
63+
| Maximum Entropy Deep IRL | <img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png" width="400"> | None |
64+
| Maximum Entropy Deep RL | None | None | None | None |
11765

118-
Coming soon...
66+
| Algorithm | Testing Results: 100 Runs |
67+
|--------------------------|-----------------------------------------------------------------------------------------|
68+
| Maximum Entropy IRL | <img src="demo/test_results/test_maxentropy_flat.png" width="400"> |
69+
| Maximum Entropy Deep IRL | <img src="demo/test_results/test_maxentropydeep_best_model_results.png" width="400"> |
70+
| Maximum Entropy Deep RL | <img src="demo/test_results/test_maxentropydeep_best_model_RL_results.png" width="400"> |
11971

12072
# References
121-
The implementation of MaxEntropyIRL and MountainCar is based on the implementation of:
73+
74+
The implementation of MaxEntropyIRL and MountainCar is based on the implementation of:
12275
[lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent)
12376

12477
[1] [BD. Ziebart, et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008](https://cdn.aaai.org/AAAI/2008/AAAI08-227.pdf).
@@ -133,12 +86,12 @@ pip install .
13386
# Usage
13487

13588
```commandline
136-
usage: irl [-h] [--version] [--training] [--testing] [--render] ALGORITHM
89+
usage: irl-runner [-h] [--version] [--training] [--testing] [--render] ALGORITHM
13790
13891
Implementation of IRL algorithms
13992
14093
positional arguments:
141-
ALGORITHM Currently supported training algorithm: [max-entropy, max-entropy-deep]
94+
ALGORITHM Currently supported training algorithm: [max-entropy, max-entropy-deep, max-entropy-deep-rl]
14295
14396
options:
14497
-h, --help show this help message and exit

demo/heatmaps/learner_999_deep_RL.png

Binary file not shown.

setup.cfg

+1-1
@@ -78,7 +78,7 @@ testing =
7878
# script_name = irlwpython.module:function
7979
# For example:
8080
console_scripts =
81-
irl = irlwpython.main:run
81+
irl-runner = irlwpython.main:run
8282
# And any other entry points, for example:
8383
# pyscaffold.cli =
8484
# awesome = pyscaffoldext.awesome.extension:AwesomeExtension
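
The renamed `irl-runner` console script dispatches to `irlwpython.main:run`. As a rough sketch of what such an entry point could look like, the parser below is reconstructed only from the usage text in the README hunk. The helper `build_parser`, the help strings, the version string, and the tuple returned by `run` are assumptions for illustration; the repository's actual `main` module may differ.

```python
import argparse

# Algorithm names as listed in the updated README usage text.
ALGORITHMS = ["max-entropy", "max-entropy-deep", "max-entropy-deep-rl"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented `irl-runner` usage line."""
    parser = argparse.ArgumentParser(
        prog="irl-runner", description="Implementation of IRL algorithms"
    )
    # The version string is a placeholder, not the package's real version.
    parser.add_argument("--version", action="version", version="irl-runner (sketch)")
    parser.add_argument("--training", action="store_true", help="run training")
    parser.add_argument("--testing", action="store_true", help="run testing")
    parser.add_argument("--render", action="store_true", help="render the environment")
    parser.add_argument(
        "algorithm",
        metavar="ALGORITHM",
        choices=ALGORITHMS,
        help="Currently supported training algorithm",
    )
    return parser


def run(argv=None):
    """Parse the arguments and return them; a real entry point would start a run."""
    args = build_parser().parse_args(argv)
    return args.algorithm, args.training, args.testing, args.render


if __name__ == "__main__":
    print(run(["--training", "max-entropy-deep-rl"]))
```

Using `choices` makes `argparse` reject unknown algorithm names before any training code runs.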

src/irlwpython/MaxEntropyDeepRL.py

+197
@@ -0,0 +1,197 @@
import numpy as np
import math

import torch
import torch.optim as optim
import torch.nn as nn

from irlwpython.FigurePrinter import FigurePrinter


class QNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(64, 32)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(32, output_size)

        self.printer = FigurePrinter()

    def forward(self, state):
        x = self.fc1(state)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        q_values = self.output_layer(x)
        return q_values


class MaxEntropyDeepRL:
    def __init__(self, target, state_dim, action_size, feature_matrix, one_feature, learning_rate=0.001, gamma=0.99):
        self.feature_matrix = feature_matrix
        self.one_feature = one_feature

        self.target = target

        self.q_network = QNetwork(state_dim, action_size)
        self.target_q_network = QNetwork(state_dim, action_size)
        self.target_q_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)

        self.gamma = gamma

        self.printer = FigurePrinter()

    def select_action(self, state, epsilon):
        """
        Selects an action epsilon-greedily based on the Q-values from the network.
        :param state: Current environment state
        :param epsilon: Probability of choosing a random action
        :return: Index of the selected action
        """
        if np.random.rand() < epsilon:
            # Random exploration; the action count 3 is hard-coded here
            return np.random.choice(3)
        else:
            with torch.no_grad():
                q_values = self.q_network(torch.FloatTensor(state))
                return torch.argmax(q_values).item()

    def update_q_network(self, state, action, reward, next_state, done):
        """
        Updates the Q-network with a one-step Q-learning target built from the reward.
        :param state: State in which the action was taken
        :param action: Index of the action taken
        :param reward: Reward received for the transition
        :param next_state: State reached after the action
        :param done: Whether the episode ended with this transition
        :return:
        """
        state = torch.FloatTensor(state)
        next_state = torch.FloatTensor(next_state)
        q_values = self.q_network(state)
        next_q_values = self.target_q_network(next_state)

        # Only the entry of the taken action differs from the current prediction
        target = q_values.clone()
        if not done:
            target[action] = reward + self.gamma * torch.max(next_q_values).item()
        else:
            target[action] = reward

        loss = nn.MSELoss()(q_values, target.detach())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def update_target_network(self):
        """
        Copies the weights of the Q-network into the target network.
        :return:
        """
        self.target_q_network.load_state_dict(self.q_network.state_dict())

    def train(self, n_states, episodes=30000, max_steps=200,
              epsilon_start=1.0,
              epsilon_decay=0.995, epsilon_min=0.01):
        """
        Trains the network using the maximum entropy deep reinforcement learning algorithm.
        :param n_states: Number of discrete states used for the state-frequency statistics
        :param episodes: Count of training episodes
        :param max_steps: Max steps per episode
        :param epsilon_start: Initial exploration rate
        :param epsilon_decay: Factor applied to epsilon after every episode
        :param epsilon_min: Lower bound for epsilon
        :return:
        """
        learner_feature_expectations = np.zeros(n_states)

        epsilon = epsilon_start
        episode_arr, scores = [], []

        best_reward = -math.inf
        for episode in range(episodes):
            state, info = self.target.env_reset()
            total_reward = 0

            for step in range(max_steps):
                action = self.select_action(state, epsilon)

                next_state, reward, done, _, _ = self.target.env_step(action)
                total_reward += reward

                self.update_q_network(state, action, reward, next_state, done)
                self.update_target_network()

                # State counting for density
                state_idx = self.target.state_to_idx(state)
                learner_feature_expectations += self.feature_matrix[int(state_idx)]

                state = next_state
                if done:
                    break

            # Keep track of best performing network
            if total_reward > best_reward:
                best_reward = total_reward
                torch.save(self.q_network.state_dict(),
                           f"../results/maxentropydeep_{episode}_best_network_w_{total_reward}_RL.pth")

            if (episode + 1) % 10 == 0:
                # calculate density
                learner = learner_feature_expectations / episode
                learner_feature_expectations = np.zeros(n_states)

            scores.append(total_reward)
            episode_arr.append(episode)
            epsilon = max(epsilon * epsilon_decay, epsilon_min)
            print(f"Episode: {episode + 1}, Total Reward: {total_reward}, Epsilon: {epsilon}")

            if (episode + 1) % 1000 == 0:
                score_avg = np.mean(scores)
                print('{} episode average score is {:.2f}'.format(episode, score_avg))
                self.printer.save_plot_as_png(episode_arr, scores,
                                              f"../learning_curves/maxent_{episodes}_{episode}_qnetwork_RL.png")
                self.printer.save_heatmap_as_png(learner.reshape((20, 20)), f"../heatmap/learner_{episode}_deep_RL.png")
                # No theta heatmap here: this class never defines self.theta, because the
                # RL baseline trains on the environment reward instead of a learned reward.

                torch.save(self.q_network.state_dict(), f"../results/maxent_{episodes}_{episode}_network_main.pth")

            if episode == episodes - 1:
                self.printer.save_plot_as_png(episode_arr, scores,
                                              f"../learning_curves/maxentdeep_{episodes}_qdeep_RL.png")

        torch.save(self.q_network.state_dict(), f"src/irlwpython/results/maxentdeep_{episodes}_q_network_RL.pth")

    def test(self, model_path, epsilon=0.01, repeats=100):
        """
        Tests a previously trained model.
        :param model_path: Path to the saved state dict of the Q-network
        :param epsilon: Exploration rate used during testing
        :param repeats: Number of test episodes
        :return:
        """
        self.q_network.load_state_dict(torch.load(model_path))
        episodes, scores = [], []

        for episode in range(repeats):
            state, info = self.target.env_reset()
            score = 0

            while True:
                self.target.env_render()
                action = self.select_action(state, epsilon)
                next_state, reward, done, _, _ = self.target.env_step(action)

                score += reward
                state = next_state

                if done:
                    scores.append(score)
                    episodes.append(episode)
                    break

            print('{} episode score is {:.2f}'.format(episode, score))

        self.printer.save_plot_as_png(episodes, scores,
                                      "src/irlwpython/learning_curves"
                                      "/test_maxentropydeep_best_model_RL_results.png")

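To make the update in `update_q_network` concrete, here is a small worked example of the one-step Q-learning target in plain Python, without torch. The function names `td_target` and `mse_for_single_action` are hypothetical and the numbers are made up for illustration; the sketch only mirrors the arithmetic of the method, assuming three actions as in `select_action`.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target for the action that was taken."""
    if done:
        # Terminal transition: no bootstrapping from the next state.
        return reward
    return reward + gamma * max(next_q_values)


def mse_for_single_action(q_values, action, target_value):
    """Mean squared error when only the taken action's entry differs from the prediction.

    This matches a mean-reduced MSE between q_values and a copy of q_values
    whose entry at `action` was replaced by `target_value`.
    """
    squared_error = (q_values[action] - target_value) ** 2
    return squared_error / len(q_values)


if __name__ == "__main__":
    q_values = [-9.0, -8.5, -9.5]        # illustrative predictions for the current state
    next_q_values = [-10.0, -8.0, -9.0]  # illustrative target-network outputs
    target = td_target(-1.0, next_q_values, done=False)
    print(target, mse_for_single_action(q_values, 1, target))
```

Because every other entry of the target equals the prediction, the gradient only flows through the Q-value of the action that was actually taken.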
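The training loop calls `self.target.state_to_idx(state)` and reshapes the state frequencies to `(20, 20)`, which matches the 400 states mentioned in the README. That method is not part of this diff, so the following is only a sketch of one way such a discretization could work. The bounds are the observation limits of gym's MountainCar-v0; the bin layout and the index order `position + velocity * bins` are assumptions, and the repository's own `state_to_idx` may differ.

```python
# Observation bounds of gym's MountainCar-v0: (position, velocity).
LOW = (-1.2, -0.07)
HIGH = (0.6, 0.07)
BINS = 20  # 20 x 20 grid = 400 discrete states


def state_to_idx(state, low=LOW, high=HIGH, bins=BINS):
    """Map a continuous (position, velocity) state to an index in [0, bins * bins)."""
    indices = []
    for value, lo, hi in zip(state, low, high):
        width = (hi - lo) / bins
        i = int((value - lo) / width)
        # Clip so values on or beyond the bounds fall into the outermost bins.
        indices.append(min(max(i, 0), bins - 1))
    position_idx, velocity_idx = indices
    # Assumed layout: position varies fastest, velocity selects the row.
    return position_idx + velocity_idx * bins


if __name__ == "__main__":
    print(state_to_idx((-0.5, 0.01)))
```

Clipping matters for the upper bound: without it, a state exactly at the maximum position or velocity would produce an index of 20 and fall outside the 20 x 20 grid.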