
Pass a random seed number in the function that generates random numbers #316

Open
szy21 opened this issue Sep 20, 2022 · 11 comments · Fixed by #542
Labels: enhancement (New feature or request)
szy21 (Member) commented Sep 20, 2022

No description provided.

charleskawczynski (Member) commented May 27, 2023

Just to add some details / context to this issue:

RRTMGP is not reproducible w.r.t. its random number generation for unthreaded runs because of this line: `Random.rand!(local_rand)`.

For threaded runs, different threads will call `rand!`, mutating the global RNG state shared between threads, and since the order in which threads pick up columns is non-deterministic, so are the sampled random numbers. To make this reproducible for threaded runs, we'll need to pass in a seed per column.
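To illustrate the per-column-seed idea, here is a generic sketch in Python's stdlib `random` (not RRTMGP's actual API; `sample_column` and the seed-mixing constant are made up for illustration): each column draws from its own generator derived from a base seed and the column index, so the result no longer depends on which thread processes which column first.

```python
import random

def sample_column(base_seed, col, n=4):
    # Independent generator per column, derived from the base seed and
    # the column index, instead of a shared global RNG.
    rng = random.Random(base_seed * 1_000_003 + col)
    return [rng.random() for _ in range(n)]

base_seed = 42
ncol = 8

# "Thread" order 1: columns processed in ascending order.
run1 = {col: sample_column(base_seed, col) for col in range(ncol)}

# "Thread" order 2: columns processed in a shuffled order.
order = list(range(ncol))
random.shuffle(order)
run2 = {col: sample_column(base_seed, col) for col in order}

assert run1 == run2  # identical results regardless of processing order
```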

Sbozzolo (Member) commented

More generally, it would be very useful if we could support reconstructing precisely the state of the random number generator upon restarts, so that the stream of random numbers is the same as if we hadn't restarted the simulation. If this is not possible, we won't be able to use restarts to debug broken builds.

szy21 (Member, Author) commented Sep 23, 2024

> More generally, it would be very useful if we could support reconstructing precisely the state of the random number generator upon restarts, so that the stream of random numbers is the same as if we hadn't restarted the simulation. If this is not possible, we won't be able to use restarts to debug broken builds.

For this, @sriharshakandala mentioned we will need to store the random number, which will increase the memory. Would that be ok?

sriharshakandala (Member) commented Sep 23, 2024

Storing the random numbers will increase the memory footprint by about 2 to 3 orders of magnitude.
We can pass in a seed for each column if that helps! Is this preferable?
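A back-of-the-envelope check of that estimate (the dimensions below are illustrative assumptions, not RRTMGP's actual sizes): if one random number is drawn per g-point per layer, storing them all costs roughly `ngpt` times as much as a single (column × layer) field, which for a few hundred g-points lands in the 2-3 orders of magnitude range.

```python
# Illustrative sizes only (assumptions, not RRTMGP's actual dimensions).
ncol = 1000   # horizontal columns
nlay = 60     # vertical layers
ngpt = 256    # spectral g-points

floats_for_samples = ncol * nlay * ngpt   # one number per (col, lay, g-point)
floats_per_3d_field = ncol * nlay         # a typical 3D model variable

ratio = floats_for_samples / floats_per_3d_field
print(ratio)  # 256.0 -> between 10**2 and 10**3, i.e. 2-3 orders of magnitude
```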

szy21 (Member, Author) commented Sep 23, 2024

2 orders of magnitude sounds large, and I would rather avoid that. What do others think?

Sbozzolo (Member) commented

Our goal is to be able to run two identical runs. This requires thread safety (different threads must not change each other's RNG state), so the first step would be to understand the CUDA RNG scheme.

This conversation seems to indicate that the RNG is warp-safe out of the box:
https://discourse.julialang.org/t/kernel-random-numbers-generation-entropy-randomness-issues/105637

> We use overlay method tables during GPU compilation to replace `Random.default_rng()` with a custom, GPU-friendly RNG: https://github.com/JuliaGPU/CUDA.jl/blob/2ae53761a6a254b98a6689ed0d39781176b245cf/src/device/random.jl#L97. Similarly, just calling `rand()` in a kernel just works and uses the correct RNG.
>
> Specifically, we use Philox2x32 (Switch to Philox2x32 for device-side RNG by maleadt · Pull Request #882 · JuliaGPU/CUDA.jl), a counter-based PRNG. The seed is passed from the host, and the counters are maintained per-warp and initialized at the start of each kernel that uses the RNG (rand: seed kernels from the host. by maleadt · Pull Request #2035 · JuliaGPU/CUDA.jl). The implementation isn’t fully generic, e.g. you can’t have multiple RNG objects, but it’s pretty close to how Random.jl works.

We should understand this scheme. Maybe all we have to do is worry about warp vs. thread granularity.

Second, it would be good to be able to save the RNG state and recover it so that we can support restarts. The details of this will depend on the RNG used.
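As a generic sketch of that idea (Python's stdlib `random` here, purely to illustrate; the real mechanics depend on the RNG CUDA.jl uses): checkpoint the generator's state, and a restarted run that restores it continues the exact same stream.

```python
import random

rng = random.Random(2022)
_ = [rng.random() for _ in range(5)]   # stream consumed before the checkpoint

state = rng.getstate()                 # would be written to the checkpoint file
uninterrupted = [rng.random() for _ in range(5)]

# After a restart: restore the saved state into a fresh generator.
restored = random.Random()
restored.setstate(state)
after_restart = [restored.random() for _ in range(5)]

assert after_restart == uninterrupted  # stream continues as if never restarted
```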

szy21 (Member, Author) commented Sep 23, 2024

@sriharshakandala Let's fix the reproducibility issue when running two identical runs first, which shouldn't require storing the random numbers. We can talk about restarts after the first issue is fixed.

sriharshakandala (Member) commented Sep 23, 2024

From the conversation, it looks like passing in a single seed might work! Though the results could still differ from those of the CPU simulation.

Quoting maleadt (Julia Discourse, Nov 2023, replying to danielwe):

> I think it can be different per warp, but IIRC (it’s been a while since I wrote that code) the idea was to use a single seed for all warps, as we offset it using a counter that’s based on the global ID of the thread. That’s also what happens by default: a single seed is passed from the host and applied from every thread.
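The scheme described there can be sketched generically (Python stdlib; the `mix` function below is a splitmix64-style integer hash chosen for illustration, not CUDA.jl's actual Philox2x32): a single host seed, offset per thread by its global ID, yields an independent yet fully reproducible stream per thread.

```python
import random

MASK64 = (1 << 64) - 1

def mix(seed, global_id):
    # splitmix64-style finalizer applied to the seed offset by the
    # thread's global ID (illustrative stand-in for a counter-based PRNG).
    z = (seed + global_id * 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def thread_stream(seed, global_id, n=3):
    # One independent generator per (host seed, global ID) pair.
    rng = random.Random(mix(seed, global_id))
    return [rng.random() for _ in range(n)]

host_seed = 1234
streams = [thread_stream(host_seed, tid) for tid in range(4)]

# Same host seed and thread IDs always reproduce the same streams.
assert streams == [thread_stream(host_seed, tid) for tid in range(4)]
# Different global IDs give different streams.
assert streams[0] != streams[1]
```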

szy21 (Member, Author) commented Sep 24, 2024

I just discussed this with Sriharsha. We will modify the code to ensure reproducibility when running two identical runs, without worrying about the restart. After that is done we can explore whether it is feasible to support restart without increasing the memory footprint by too much. @Sbozzolo What do you think?

Sbozzolo (Member) commented

> I just discussed this with Sriharsha. We will modify the code to ensure reproducibility when running two identical runs, without worrying about the restart. After that is done we can explore whether it is feasible to support restart without increasing the memory footprint by too much. @Sbozzolo What do you think?

Yes, this is a good start, but I would like us to think about supporting restarts as well.

I don't think it makes sense for the memory footprint to increase by orders of magnitude: even if we saved one element per point on the domain, it would only be the same size as any other 3D variable. Also, the state has to be saved only when we produce a checkpoint.

sriharshakandala (Member) commented

Add back PR #542


Successfully merging a pull request may close this issue.

4 participants