
Td3 ddpg action bound fix #211

Merged: 22 commits into vwxyzjn:master on Jun 30, 2022
Conversation

@dosssman (Collaborator) commented Jun 21, 2022

Description

Adds action scaling to the Actor components of DDPG and TD3 to ensure the environment's action boundaries are respected when sampling actions to compute the target for the Q-learning update.
Closes #196.

Preliminary experiments tracked here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3-Action-Bound-Fix-Check--VmlldzoyMjAwMjM5

EDIT 1: Updated SAC continuous to use self.register_buffer for action_scale and action_bias.
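
For reference, a minimal sketch of what this scaling looks like (illustrative only: the layer sizes and names mimic CleanRL's style, and the authoritative code lives in `td3_continuous_action.py` / `ddpg_continuous_action.py`). A tanh-squashed output in [-1, 1] is mapped affinely onto the environment's [low, high] box:

    import numpy as np
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        def __init__(self, env):
            super().__init__()
            obs_dim = int(np.prod(env.single_observation_space.shape))
            act_dim = int(np.prod(env.single_action_space.shape))
            self.fc1 = nn.Linear(obs_dim, 256)
            self.fc2 = nn.Linear(256, 256)
            self.fc_mu = nn.Linear(256, act_dim)
            # Affine map from tanh's [-1, 1] onto the env's [low, high] box.
            high, low = env.single_action_space.high, env.single_action_space.low
            self.register_buffer("action_scale", torch.tensor((high - low) / 2.0, dtype=torch.float32))
            self.register_buffer("action_bias", torch.tensor((high + low) / 2.0, dtype=torch.float32))

        def forward(self, x):
            x = torch.relu(self.fc1(x))
            x = torch.relu(self.fc2(x))
            return torch.tanh(self.fc_mu(x)) * self.action_scale + self.action_bias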

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.


dosssman added 2 commits June 21, 2022 12:01

    def to(self, device):
        self.action_scale = self.action_scale.to(device)
        self.action_bias = self.action_bias.to(device)
        return super().to(device)
@vwxyzjn (Owner) commented on this diff:

You don't have to do this. You should do the following in the __init__ function.

    self.register_buffer("action_scale", xxx)
    self.register_buffer("action_bias", xxx)
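
(For context, a toy sketch of why `register_buffer` is enough: buffers are non-trainable module state, so they follow the module across `.to(device)` calls and are saved in `state_dict()`. The `Toy` module below is purely illustrative:)

    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            # A buffer is non-trainable state that still moves with .to(device).
            self.register_buffer("action_scale", torch.ones(3))

    m = Toy().to("cuda" if torch.cuda.is_available() else "cpu")
    print(m.action_scale.device)             # follows the module's device
    print("action_scale" in m.state_dict())  # True: saved with the model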

@dosssman (Collaborator, Author) replied:

Will do. This is actually more to my liking, haha.
I will take this opportunity to update SAC accordingly, then. Or maybe in another PR?

@vwxyzjn (Owner) replied:

In this PR is fine :)

@vwxyzjn (Owner) left a comment:

Btw, do we need to do anything about this? `max_action` still only works in the symmetric case.

actions = np.array(
    [
        (
            actions.tolist()[0]
            + np.random.normal(0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0])
        ).clip(envs.single_action_space.low, envs.single_action_space.high)
    ]
)

Please don't worry about re-running the experiments: all MuJoCo envs have symmetric low and high values for the action space, and we don't need to re-run as long as the changes don't impact the benchmark results in any way.
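
(A quick way to check that symmetry claim for a given env; assumes a working `gym` + MuJoCo install, and the env id is only an example:)

    import gym

    env = gym.make("Hopper-v2")
    # For the MuJoCo tasks, low == -high elementwise, i.e. the box is symmetric.
    print(env.action_space.low, env.action_space.high)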

@vwxyzjn (Owner) commented Jun 29, 2022

For some reason, the new ddpg runs don't utilize the GPU...?

[screenshot: GPU utilization]

@vwxyzjn (Owner) commented Jun 29, 2022

Hey @dosssman, I am going to put the old ddpg experiments (Hopper-v2, Walker2d-v2, HalfCheetah-v2) back into openrlbenchmark/cleanrl since they should not be affected by this PR (they have the symmetric [-1, 1] action space).

@dosssman (Collaborator, Author) replied:

> Hey @dosssman, I am going to put the old ddpg experiments (Hopper-v2, Walker2d-v2, HalfCheetah-v2) back into openrlbenchmark/cleanrl since they should not be affected by this PR (they have the symmetric [-1, 1] action space).

Pretty sure they do use the GPU. Here is the GPU usage from the latest DDPG continuous runs: https://wandb.ai/openrlbenchmark/cleanrl/runs/38a4fiaq/system?workspace=user-dosssman
[screenshot: GPU utilization]

@dosssman (Collaborator, Author) commented Jun 29, 2022

> Btw, do we need to do anything about this? `max_action` still only works in the symmetric case. [...]

Yes, this was updated so that the exploration-noise distribution is centered within the low-high range of the action space, as in:

    (
        actions.tolist()[0]
        + np.random.normal(
            actor.action_bias[0].cpu().numpy(),
            actor.action_scale[0].cpu().numpy() * args.exploration_noise,
            size=envs.single_action_space.shape[0],
        )
    ).clip(envs.single_action_space.low, envs.single_action_space.high)

EDIT: My bad, it was not done in DDPG yet.

@@ -64,6 +64,7 @@ Our [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/

1. [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) uses two separate objects `qf1` and `qf2` to represent the two Q functions in the Clipped Double Q-learning architecture, whereas [`TD3.py`](https://github.com/sfujim/TD3/blob/master/TD3.py) (Fujimoto et al., 2018)[^2] uses a single `Critic` class that contains both Q networks. That said, these two implementations are virtually the same.

1. [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) properly handles action space bounds ... TODO: fill this in.
@vwxyzjn (Owner) commented on this diff:

@dosssman could you fill this in, please?


1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) properly handles action space bounds ... TODO: fill this in.
@vwxyzjn (Owner) commented on this diff:

@dosssman could you fill this in (the same as td3), please?

@vwxyzjn (Owner) commented Jun 29, 2022

> Yes, this was updated so that the exploration-noise distribution is centered within the low-high range of the action space [...] EDIT: My bad, it was not done in DDPG yet.

Got it. Thank you! Could you update it in DDPG and update the docs? Then everything looks good to me to be merged :)

…m distribution that is centered at the mean of the action space boundaries
@dosssman (Collaborator, Author) replied:

On it.

@dosssman (Collaborator, Author) commented:

Docs updated.

@vwxyzjn (Owner) left a comment:

Hey, sorry, there is one more thing: could you make the action scale and bias more efficient by not transferring them back to the CPU every time there is an update? Maybe also store the action scale and bias as NumPy arrays on the host.
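
(One way this could look, sketched under the assumption that `actor`, `envs`, and `args` are the existing script variables; the `_np` names are hypothetical:)

    import numpy as np

    # Computed once after the actor is built, so the rollout loop
    # never calls .cpu() on the GPU buffers.
    action_scale_np = actor.action_scale.cpu().numpy()
    action_bias_np = actor.action_bias.cpu().numpy()

    # Inside the rollout loop, mirroring the snippet quoted above:
    actions = (
        actions
        + np.random.normal(
            action_bias_np[0],
            action_scale_np[0] * args.exploration_noise,
            size=envs.single_action_space.shape[0],
        )
    ).clip(envs.single_action_space.low, envs.single_action_space.high)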

@dosssman (Collaborator, Author) replied:

Queued up some runs with the latest changes as a quick check.

@vwxyzjn (Owner) commented Jun 30, 2022

Cool, thank you! If there is no regression in performance, let's merge the PR.

@dosssman (Collaborator, Author) commented:

The experiment results do not differ much from the previous version.
This PR should be ready for merge.
Thanks in advance.

@vwxyzjn (Owner) left a comment:

LGTM. Thanks @dosssman and @huxiao09

@vwxyzjn vwxyzjn merged commit 15df5c0 into vwxyzjn:master Jun 30, 2022
vwxyzjn added a commit that referenced this pull request Jul 12, 2022
* prototype jax with ddpg

* Quick fix

* quick fix

* Commit changes - successful prototype

* Remove scripts

* Simplify the implementation: careful with shape

* Format

* Remove code

* formatting changes

* formatting change

* bug fix

* correctly implementing keys

* these two lines are not necessary

target_params are initialized with the same RNG key

* Adapting to the `TrainState` API

* Simplify code

* use `optax.incremental_update`

* Also log q values

* Addresses #211

* update docs

* Add jax benchmark experiments

* remove old files

* update benchmark scripts

* update lock files

* Handle action space bounds

* Add docs

* Typo

* update CI

* bug fix and add docs link

* Add a note explaining the speed

* Update ddpg docs
@vwxyzjn vwxyzjn mentioned this pull request Oct 19, 2022
@dosssman dosssman deleted the td3_ddpg_action_bound_fix branch March 3, 2025 12:07
Development

Successfully merging this pull request may close these issues.

DDPG/TD3 target_actor output clip