Add AMDGPU.@sync macro #454

Merged
pxl-th merged 9 commits into master from lr/sync on Aug 2, 2023
Conversation

@luraess (Contributor) commented on Aug 2, 2023

I am trying to add a macro to AMDGPU similar to the one that exists in CUDA.jl, CUDA.@sync.

One reason is that AMDGPU currently returns the following for AMDGPU.@sync:

help?> AMDGPU.@sync
  @sync


  Wait until all lexically-enclosed uses of @async, @spawn,
  @spawnat and @distributed are complete. All exceptions thrown
  by enclosed async operations are collected and thrown as a
  CompositeException.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> Threads.nthreads()
  4
  
  julia> @sync begin
             Threads.@spawn println("Thread-id $(Threads.threadid()), task 1")
             Threads.@spawn println("Thread-id $(Threads.threadid()), task 2")
         end;
  Thread-id 3, task 1
  Thread-id 1, task 2


julia> 

which is the docstring of Base.@sync, i.e. a different behaviour than expected (it does not synchronize the GPU).
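Presumably this happens because AMDGPU does not (yet) define its own @sync, so the qualified name simply resolves to the macro inherited from Base; a quick check of that assumption:

julia> AMDGPU.var"@sync" === Base.var"@sync"
true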

Trying to implement the macro, taking inspiration from CUDA.jl, I am facing some issues, namely:

  • The implementation does not sync the current stream (see the perf output from the MWE below)
  • One needs to import Base: @sync, which does not seem to be needed in CUDA.jl
  • I am wondering if the quote should not rather be ordered as follows, such that one synchronizes after evaluating the expression while still returning its value (see the full macro sketch right after this list):
    quote
        local ret = $(esc(code))
        synchronize()
        ret
    end
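For reference, a minimal sketch of the macro I am aiming for, modeled on CUDA.@sync. This is my assumption of the shape, written as if it lived inside the AMDGPU module so that the unescaped synchronize() hygienically resolves to AMDGPU.synchronize:

macro sync(code)
    quote
        # Evaluate the user expression first (e.g. an asynchronous @roc launch)...
        local ret = $(esc(code))
        # ...then block until work queued on the current stream has completed.
        # `synchronize` is not escaped, so macro hygiene resolves it in the
        # defining module (AMDGPU), not in the caller's scope.
        synchronize()
        # Return the value of the user expression, not that of synchronize().
        ret
    end
end

With such a definition, AMDGPU.@sync @roc ... lapl!(...) should block until the kernel has finished, mirroring CUDA.@sync.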

The MWE I am testing the synchronization on is the following:

using Printf
using AMDGPU

macro inn(A) esc(:( $A[ix+1, iy+1, iz+1] )) end
macro d2_xi(A) esc(:( ($A[ix+2, iy+1, iz+1] - $A[ix+1, iy+1, iz+1]) - ($A[ix+1, iy+1, iz+1] - $A[ix, iy+1, iz+1]) )) end
macro d2_yi(A) esc(:( ($A[ix+1, iy+2, iz+1] - $A[ix+1, iy+1, iz+1]) - ($A[ix+1, iy+1, iz+1] - $A[ix+1, iy, iz+1]) )) end
macro d2_zi(A) esc(:( ($A[ix+1, iy+1, iz+2] - $A[ix+1, iy+1, iz+1]) - ($A[ix+1, iy+1, iz+1] - $A[ix+1, iy+1, iz]) )) end

function lapl!(A2, A, D, dt_dx, dt_dy, dt_dz)
    ix = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    iy = (workgroupIdx().y - UInt32(1)) * workgroupDim().y + workitemIdx().y
    iz = (workgroupIdx().z - UInt32(1)) * workgroupDim().z + workitemIdx().z
    if (ix < size(A, 1) - 2 && iy < size(A, 2) - 2 && iz < size(A, 3) - 2)
        @inbounds @inn(A2) = @inn(A) + @inn(D) * (dt_dx * @d2_xi(A) + dt_dy * @d2_yi(A) + dt_dz * @d2_zi(A))
    end
    return
end

function compute_roc(A2, A, D, dt_dx, dt_dy, dt_dz, iters, nblocks, nthreads, stream)
    print("Starting the time loop 🚀...")
    tic = time_ns()
    for it = 1:iters
        AMDGPU.@sync @roc gridsize=nblocks groupsize=nthreads #= stream=stream =# lapl!(A2, A, D, dt_dx, dt_dy, dt_dz)
        # AMDGPU.synchronize(stream)
        # AMDGPU.synchronize()
        A, A2 = A2, A
    end
    wtime = (time_ns() - tic) * 1e-9
    println("done")
    return wtime
end

function main(; resol=128)
    # physics
    lx = ly = lz = 10.0
    d0 = 1.0
    # numerics
    nx = ny = nz = resol
    nthreads = (128, 2, 1)
    iters, warmup = 33, 3
    println("Process selecting device $(AMDGPU.device_id(AMDGPU.device()))")
    # preprocessing
    nblocks = cld.((nx, ny, nz), nthreads)
    dx, dy, dz = 1.0 / nx, 1.0 / ny, 1.0 / nz
    dt = min(dx*dx, dy*dy, dz*dz)*d0 / 6.1
    dt_dx, dt_dy, dt_dz = dt / dx, dt / dy, dt / dz
    # init
    A = AMDGPU.rand(Float64, nx, ny, nz)
    D = d0 .* AMDGPU.ones(Float64, nx, ny, nz)
    A2 = copy(A)
    stream = nothing #AMDGPU.HIPStream(:high)
    # action
    # warmup
    compute_roc(A2, A, D, dt_dx, dt_dy, dt_dz, warmup, nblocks, nthreads, stream)
    # time
    wtime = compute_roc(A2, A, D, dt_dx, dt_dy, dt_dz, (iters - warmup), nblocks, nthreads, stream)
    # perf
    A_eff = 3 / 2^30 * nx * ny * nz * sizeof(Float64)
    wtime_it = wtime / (iters - warmup)
    T_eff = A_eff / wtime_it
    @printf("Executed %d steps in = %1.3e sec (@ T_eff = %1.2f GB/s) \n", (iters - warmup), wtime, round(T_eff, sigdigits=3))
    return
end

main(resol=512)

Note that the correct behaviour can be obtained by commenting out the AMDGPU.@sync macro and synchronizing after the kernel with AMDGPU.synchronize (or synchronizing the specific stream). The expected performance on MI250X is:

Executed 30 steps in = 1.435e-01 sec (@ T_eff = 627.00 GB/s) 

while the erroneous measurement would return, e.g.:

Executed 30 steps in = 2.051e-04 sec (@ T_eff = 439000.00 GB/s) 
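For context, a back-of-the-envelope check of these numbers (my own arithmetic, using the A_eff formula from the MWE with resol = 512):

# Effective memory throughput for resol = 512:
A_eff    = 3 / 2^30 * 512^3 * sizeof(Float64)  # 3 arrays touched per iteration = 3.0 GiB
wtime_it = 1.435e-1 / 30                       # measured time per iteration ≈ 4.78e-3 s
T_eff    = A_eff / wtime_it                    # ≈ 627 GB/s, matching the correct run

The ~439000 GB/s figure is unphysical: @roc launches are asynchronous, so without an effective synchronization the loop only times the kernel launches, not their execution.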

@luraess luraess requested a review from pxl-th August 2, 2023 07:23
@pxl-th pxl-th marked this pull request as ready for review August 2, 2023 15:33
@pxl-th pxl-th merged commit 125e21b into master Aug 2, 2023
@pxl-th pxl-th deleted the lr/sync branch August 2, 2023 15:34