TD3 DDPG action bound fix #211
Conversation
cleanrl/td3_continuous_action.py
Outdated
def to(self, device):
    self.action_scale = self.action_scale.to(device)
    self.action_bias = self.action_bias.to(device)
    return super().to(device)
You don't have to do this. You should do the following in the `__init__` function:

self.register_buffer("action_scale", xxx)
self.register_buffer("action_bias", xxx)
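For illustration, a sketch of how those buffers might be filled in; the tensor construction from the env's action-space bounds is an assumption standing in for the `xxx` placeholders above:

```python
# inside Actor.__init__ -- registered buffers move with .to(device) / .cuda(),
# are saved in state_dict, but are not returned by .parameters()
self.register_buffer(
    "action_scale",
    torch.tensor((env.action_space.high - env.action_space.low) / 2.0, dtype=torch.float32),
)
self.register_buffer(
    "action_bias",
    torch.tensor((env.action_space.high + env.action_space.low) / 2.0, dtype=torch.float32),
)
```

With the buffers registered this way, the custom `to()` override above becomes unnecessary, since `nn.Module.to()` already moves registered buffers along with the parameters.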
Will do. This is actually more to my liking haha.
I will take this opportunity to update SAC accordingly then. Or maybe in another PR?
In this PR is fine :)
…based on action_bias and action_scale
Btw, do we need to do anything about this? The `max_action` still only works in the symmetric case.
actions = np.array(
    [
        (
            actions.tolist()[0]
            + np.random.normal(0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0])
        ).clip(envs.single_action_space.low, envs.single_action_space.high)
    ]
)
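(For context: `max_action` here is presumably a single scalar derived from the action space's upper bound, something along these lines, which is only meaningful when `low == -high`:)

```python
# a single scalar bound; this implicitly assumes a symmetric action space (low == -high)
max_action = float(envs.single_action_space.high[0])
```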
Please don't worry about re-running the experiments. All MuJoCo envs have symmetric low and high values for the action space, so we don't need to re-run the experiments as long as the changes don't impact the benchmark results in any way.
Hey @dosssman, I am going to put the old DDPG experiments (Hopper-v2, Walker-v2, HalfCheetah-v2) back to the …
Pretty sure they do. Here is the GPU usage from the latest DDPG continuous runs: https://wandb.ai/openrlbenchmark/cleanrl/runs/38a4fiaq/system?workspace=user-dosssman
Yes, this was updated to center the distribution from which the exploration noise is sampled within the low-high range of the action space, as in `cleanrl/cleanrl/td3_continuous_action.py`, lines 183 to 190 in db6ff29.
EDIT: My bad, it was not done in DDPG yet.
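The embedded snippet isn't rendered above; roughly, the idea is that the actor output is already mapped into [low, high] via `action_scale`/`action_bias`, and the exploration noise is scaled per action dimension rather than by a scalar `max_action`. A sketch of the approach, not the exact excerpt (`obs`, `actor`, and `device` are assumed from the rollout loop):

```python
with torch.no_grad():
    actions = actor(torch.Tensor(obs).to(device))
    # zero-mean noise whose std is proportional to the per-dimension action range,
    # so asymmetric low/high bounds are handled correctly
    actions += torch.normal(0, actor.action_scale * args.exploration_noise)
    actions = actions.cpu().numpy().clip(envs.single_action_space.low, envs.single_action_space.high)
```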
docs/rl-algorithms/td3.md
Outdated
@@ -64,6 +64,7 @@ Our [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/

1. [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) uses two separate objects `qf1` and `qf2` to represent the two Q functions in the Clipped Double Q-learning architecture, whereas [`TD3.py`](https://github.com/sfujim/TD3/blob/master/TD3.py) (Fujimoto et al., 2018)[^2] uses a single `Critic` class that contains both Q networks. That said, these two implementations are virtually the same.

1. [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) properly handles action space bounds ... TODO: fill this in.
@dosssman could you fill this in, please?
docs/rl-algorithms/ddpg.md
Outdated
1. Overall Loss and Entropy Bonus (:material-github: [ppo2/model.py#L91](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L91))
1. Global Gradient Clipping (:material-github: [ppo2/model.py#L102-L108](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L102-L108)) -->

1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) properly handles action space bounds ... TODO: fill this in.
@dosssman could you fill this in (the same as td3), please?
Got it. Thank you! Could you update it in DDPG and update the docs? Then everything looks good to me to be merged :)
…m distribution that is centered at the mean of the action space boundaries
On it.
…an/cleanrl into td3_ddpg_action_bound_fix
Docs updated.
Hey, sorry, there is one more thing: could you make the action scale and bias more efficient by not having to transfer them back to the CPU every time there is an update? Maybe store the action scale and bias as NumPy arrays on the host as well.
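One way to do that (a sketch; the variable names are hypothetical and it assumes `actions` is already a NumPy array on the host) is to precompute NumPy copies of the scale and bias once and use those during rollouts, so nothing has to be copied off the GPU per step:

```python
import numpy as np

# host-side NumPy copies, computed once next to the registered buffers (hypothetical names)
action_scale_np = (envs.single_action_space.high - envs.single_action_space.low) / 2.0
action_bias_np = (envs.single_action_space.high + envs.single_action_space.low) / 2.0

# during rollouts, the exploration noise is built entirely on the CPU
noise = np.random.normal(0, action_scale_np * args.exploration_noise, size=envs.single_action_space.shape)
actions = (actions + noise).clip(envs.single_action_space.low, envs.single_action_space.high)
```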
Queued up some runs with the latest changes as a quick check.
Cool, thank you! If there is no regression in performance, let's merge the PR.
Experiment results do not differ much from the previous version.
(Squashed commit message from a JAX DDPG prototype commit that references this PR:)

* prototype jax with ddpg
* Quick fix
* quick fix
* Commit changes - successful prototype
* Remove scripts
* Simplify the implementation: careful with shape
* Format
* Remove code
* formatting changes
* formatting change
* bug fix
* correctly implementing keys
* these two lines are not necessary; target_params are initialized with the same RNG key
* Adapting to the `TrainState` API
* Simplify code
* use `optax.incremental_update`
* Also log q values
* Addresses #211
* update docs
* Add jax benchmark experiments
* remove old files
* update benchmark scripts
* update lock files
* Handle action space bounds
* Add docs
* Typo
* update CI
* bug fix and add docs link
* Add a note explaining the speed
* Update ddpg docs
Description
Adds action scaling to the Actor components of DDPG and TD3 to make sure the action boundaries of the environment are respected when sampling actions to compute the target for the Q-learning update.

Closes #196.
Preliminary experiments tracked here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3-Action-Bound-Fix-Check--VmlldzoyMjAwMjM5
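For context, a minimal sketch of what the bounded Actor looks like; layer sizes and the exact tensor construction are illustrative assumptions, not necessarily the exact CleanRL code:

```python
import numpy as np
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        obs_dim = int(np.prod(env.single_observation_space.shape))
        act_dim = int(np.prod(env.single_action_space.shape))
        self.fc1 = nn.Linear(obs_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_mu = nn.Linear(256, act_dim)
        # buffers follow .to(device) and are excluded from the optimizer
        self.register_buffer(
            "action_scale",
            torch.tensor((env.single_action_space.high - env.single_action_space.low) / 2.0, dtype=torch.float32),
        )
        self.register_buffer(
            "action_bias",
            torch.tensor((env.single_action_space.high + env.single_action_space.low) / 2.0, dtype=torch.float32),
        )

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        # tanh squashes to [-1, 1]; scale/bias map it into [low, high], so the
        # actions used to form the Q-learning target stay within the env bounds
        return torch.tanh(self.fc_mu(x)) * self.action_scale + self.action_bias
```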
EDIT 1: Updated SAC continuous to use `self.register_buffer` for `action_scale` and `action_bias`.

Types of changes
Checklist:

- `pre-commit run --all-files` passes (required).
- I have updated the documentation and previewed the changes via `mkdocs serve`.

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

- Tracked experiments run with the `--capture-video` flag toggled on (required).
- I have added additional documentation and previewed the changes via `mkdocs serve`.