
Td3 ddpg action bound fix #211

Merged: 22 commits into vwxyzjn:master on Jun 30, 2022
Conversation

@dosssman (Collaborator) commented Jun 21, 2022

Description

Adds action scaling to the Actor components of DDPG and TD3 to ensure the environment's action boundaries are respected when sampling actions to compute the target for the Q-learning update.
Closes #196.

Preliminary experiments tracked here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3-Action-Bound-Fix-Check--VmlldzoyMjAwMjM5

EDIT 1: Updated SAC continuous to use self.register_buffer for action_scale and action_bias.
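
For reference, a minimal sketch of what this scaling looks like (illustrative only: the layer sizes and names mimic CleanRL's style, and the authoritative code lives in `td3_continuous_action.py` / `ddpg_continuous_action.py`). A tanh-squashed output in [-1, 1] is mapped affinely onto the environment's [low, high] box:

    import numpy as np
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        def __init__(self, env):
            super().__init__()
            obs_dim = int(np.prod(env.single_observation_space.shape))
            act_dim = int(np.prod(env.single_action_space.shape))
            self.fc1 = nn.Linear(obs_dim, 256)
            self.fc2 = nn.Linear(256, 256)
            self.fc_mu = nn.Linear(256, act_dim)
            # Affine map from tanh's [-1, 1] onto the env's [low, high] box.
            high, low = env.single_action_space.high, env.single_action_space.low
            self.register_buffer("action_scale", torch.tensor((high - low) / 2.0, dtype=torch.float32))
            self.register_buffer("action_bias", torch.tensor((high + low) / 2.0, dtype=torch.float32))

        def forward(self, x):
            x = torch.relu(self.fc1(x))
            x = torch.relu(self.fc2(x))
            return torch.tanh(self.fc_mu(x)) * self.action_scale + self.action_bias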

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.


dosssman added 2 commits June 21, 2022 12:01

    def to(self, device):
        self.action_scale = self.action_scale.to(device)
        self.action_bias = self.action_bias.to(device)
        return super().to(device)
@vwxyzjn (Owner) commented on this diff:

You don't have to do this. You should do the following in the __init__ function.

    self.register_buffer("action_scale", xxx)
    self.register_buffer("action_bias", xxx)
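
(For context, a toy sketch of why `register_buffer` is enough: buffers are non-trainable module state, so they follow the module across `.to(device)` calls and are saved in `state_dict()`. The `Toy` module below is purely illustrative:)

    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            # A buffer is non-trainable state that still moves with .to(device).
            self.register_buffer("action_scale", torch.ones(3))

    m = Toy().to("cuda" if torch.cuda.is_available() else "cpu")
    print(m.action_scale.device)             # follows the module's device
    print("action_scale" in m.state_dict())  # True: saved with the model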

@dosssman (Collaborator, Author) replied:

Will do. This is actually more to my liking, haha.
I will take this opportunity to update SAC accordingly, then. Or maybe in another PR?

@vwxyzjn (Owner) replied:

In this PR is fine :)

@vwxyzjn (Owner) left a comment:

Btw, do we need to do anything about this? `max_action` still only works in the symmetric case.

actions = np.array(
    [
        (
            actions.tolist()[0]
            + np.random.normal(0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0])
        ).clip(envs.single_action_space.low, envs.single_action_space.high)
    ]
)

Please don't worry about re-running the experiments: all MuJoCo envs have symmetric low and high values for the action space, and we don't need to re-run as long as the changes don't impact the benchmark results in any way.
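
(A quick way to check that symmetry claim for a given env; assumes a working `gym` + MuJoCo install, and the env id is only an example:)

    import gym

    env = gym.make("Hopper-v2")
    # For the MuJoCo tasks, low == -high elementwise, i.e. the box is symmetric.
    print(env.action_space.low, env.action_space.high)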

@vwxyzjn (Owner) commented Jun 29, 2022

For some reason, the new ddpg runs don't utilize the GPU...?

[screenshot: GPU utilization]

@vwxyzjn (Owner) commented Jun 29, 2022

Hey @dosssman, I am going to put the old ddpg experiments (Hopper-v2, Walker2d-v2, HalfCheetah-v2) back into openrlbenchmark/cleanrl since they should not be affected by this PR (they have the symmetric [-1, 1] action space).

@dosssman (Collaborator, Author) replied:

> Hey @dosssman, I am going to put the old ddpg experiments (Hopper-v2, Walker2d-v2, HalfCheetah-v2) back into openrlbenchmark/cleanrl since they should not be affected by this PR (they have the symmetric [-1, 1] action space).

Pretty sure they do use the GPU. Here is the GPU usage from the latest DDPG continuous runs: https://wandb.ai/openrlbenchmark/cleanrl/runs/38a4fiaq/system?workspace=user-dosssman
[screenshot: GPU utilization]

@dosssman (Collaborator, Author) commented Jun 29, 2022

> Btw, do we need to do anything about this? `max_action` still only works in the symmetric case. [...]

Yes, this was updated so that the exploration-noise distribution is centered within the low-high range of the action space, as in:

    (
        actions.tolist()[0]
        + np.random.normal(
            actor.action_bias[0].cpu().numpy(),
            actor.action_scale[0].cpu().numpy() * args.exploration_noise,
            size=envs.single_action_space.shape[0],
        )
    ).clip(envs.single_action_space.low, envs.single_action_space.high)

EDIT: My bad, it was not done in DDPG yet.

@@ -64,6 +64,7 @@ Our [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/

1. [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) uses two separate objects `qf1` and `qf2` to represent the two Q functions in the Clipped Double Q-learning architecture, whereas [`TD3.py`](https://github.com/sfujim/TD3/blob/master/TD3.py) (Fujimoto et al., 2018)[^2] uses a single `Critic` class that contains both Q networks. That said, these two implementations are virtually the same.

1. [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) properly handles action space bounds ... TODO: fill this in.
@vwxyzjn (Owner) commented on this diff:

@dosssman could you fill this in, please?


1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) properly handles action space bounds ... TODO: fill this in.
@vwxyzjn (Owner) commented on this diff:

@dosssman could you fill this in (the same as td3), please?

@vwxyzjn (Owner) commented Jun 29, 2022

> Yes, this was updated so that the exploration-noise distribution is centered within the low-high range of the action space [...] EDIT: My bad, it was not done in DDPG yet.

Got it. Thank you! Could you update it in DDPG and update the docs? Then everything looks good to me to be merged :)

…m distribution that is centered at the mean of the action space boundaries
@dosssman (Collaborator, Author) replied:

On it.

@dosssman (Collaborator, Author) commented:

Docs updated.

@vwxyzjn (Owner) left a comment:

Hey, sorry, there is one more thing: could you make the action scale and bias more efficient by not transferring them back to the CPU every time there is an update? Maybe also store the action scale and bias as NumPy arrays on the host.
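
(One way this could look, sketched under the assumption that `actor`, `envs`, and `args` are the existing script variables; the `_np` names are hypothetical:)

    import numpy as np

    # Computed once after the actor is built, so the rollout loop
    # never calls .cpu() on the GPU buffers.
    action_scale_np = actor.action_scale.cpu().numpy()
    action_bias_np = actor.action_bias.cpu().numpy()

    # Inside the rollout loop, mirroring the snippet quoted above:
    actions = (
        actions
        + np.random.normal(
            action_bias_np[0],
            action_scale_np[0] * args.exploration_noise,
            size=envs.single_action_space.shape[0],
        )
    ).clip(envs.single_action_space.low, envs.single_action_space.high)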

@dosssman (Collaborator, Author) replied:

Queued up some runs with the latest changes as a quick check.

@vwxyzjn (Owner) commented Jun 30, 2022

Cool, thank you! If there is no regression in performance, let's merge the PR.

@dosssman (Collaborator, Author) commented:

The experiment results do not differ much from the previous version.
This PR should be ready for merge.
Thanks in advance.

@vwxyzjn (Owner) left a comment:

LGTM. Thanks @dosssman and @huxiao09

@vwxyzjn vwxyzjn merged commit 15df5c0 into vwxyzjn:master Jun 30, 2022
vwxyzjn added a commit that referenced this pull request Jul 12, 2022
* prototype jax with ddpg

* Quick fix

* quick fix

* Commit changes - successful prototype

* Remove scripts

* Simplify the implementation: careful with shape

* Format

* Remove code

* formatting changes

* formatting change

* bug fix

* correctly implementing keys

* these two lines are not necessary

target_params are initialized with the same RNG key

* Adapting to the `TrainState` API

* Simplify code

* use `optax.incremental_update`

* Also log q values

* Addresses #211

* update docs

* Add jax benchmark experiments

* remove old files

* update benchmark scripts

* update lock files

* Handle action space bounds

* Add docs

* Typo

* update CI

* bug fix and add docs link

* Add a note explaining the speed

* Update ddpg docs
@vwxyzjn vwxyzjn mentioned this pull request Oct 19, 2022
@dosssman dosssman deleted the td3_ddpg_action_bound_fix branch March 3, 2025 12:07
Development

Successfully merging this pull request may close these issues.

DDPG/TD3 target_actor output clip