gfx1031: initial import #1251

Closed
wants to merge 1 commit

Conversation

@littlewu2508

These logic files are benchmarked using Tensile-rocm-5.0.1 on my desktop with a Radeon RX 6700 XT. Bias may exist, but they at least work and provide decent performance when running rocblas-bench.

I've also created a PR for Tensile which enables benchmarking and compiling kernels for gfx1031.

Summary of proposed changes:

  • Enables gfx1031
  • Self-benchmarked, preliminary

@yves-renier

Thanks for this! You could also add the target in CMakeLists.txt:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 46510141..4dfb561b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -192,13 +192,13 @@ set(CMAKE_INSTALL_LIBDIR "lib" CACHE INTERNAL "Installation directory for librar
 
 # gpu arch configuration
 set( AMDGPU_TARGETS "all" CACHE STRING "Compile for which gpu architectures?")
-set_property( CACHE AMDGPU_TARGETS PROPERTY STRINGS all gfx803 gfx900 gfx906:xnack- gfx908:xnack- gfx90a:xnack+ gfx90a:xnack- gfx1010 gfx1011 gfx1012 gfx1030 )
+set_property( CACHE AMDGPU_TARGETS PROPERTY STRINGS all gfx803 gfx900 gfx906:xnack- gfx908:xnack- gfx90a:xnack+ gfx90a:xnack- gfx1010 gfx1011 gfx1012 gfx1030 gfx1031)
 
 # Detect if target ID syntax if supported for default all AMDGPU_TARGETS list
 # Sets the AMDGPU_TARGETS with backward compatiblity
 if(COMMAND rocm_check_target_ids)
   rocm_check_target_ids(target_list
-      TARGETS "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030"
+      TARGETS "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030;gfx1031"
   )
 else()
   # This section is deprecated. Please use rocm_check_target_ids for future use.
@@ -213,9 +213,9 @@ else()
     if(Tensile_LOGIC STREQUAL "aldebaran")
       # Temporary bypass: if logic aldebaran logic is selected, assume aldebaran compiler support exists
       # To be removed for ROCm 4.4
-      set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030")
+      set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030;gfx1031")
     else()
-      set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx1010;gfx1011;gfx1012;gfx1030")
+      set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx1010;gfx1011;gfx1012;gfx1030;gfx1031")
     endif()
   else()
     set(target_list "gfx803;gfx900;gfx906;gfx908")

@cgmb
Contributor

cgmb commented May 30, 2022

Taking off my AMD hat for a moment and speaking just for myself, I'm thrilled to see a community contribution like this one. Please do bear with the rocBLAS team while they sort out their policies and procedures for handling it. To my knowledge, your contribution is the first of its kind that rocBLAS has ever received. There's lots of questions we need to work out, not the least of which is, "how does someone even review a million line pull request?"

I've always pushed to expand the ROCm userbase by bringing ROCm to more hardware and software platforms. I could hardly imagine anything more exciting than a patch to enable the RX 6700 XT getting upstreamed from a Gentoo maintainer. There are, however, some technical challenges for expanding the targeted platforms. The most significant issue is that of binary size. For that reason, I think it's unlikely that gfx1031 will be added to the default list of target architectures. Nevertheless, I can see the value in having it available for users building rocBLAS from source.

I have a technical question about this tuning, though. Have you compared the performance using these gfx1031 kernels on the RX 6700 XT vs. the performance using the existing gfx1030 kernels on the RX 6700 XT with export HSA_OVERRIDE_GFX_VERSION=10.3.0? The gfx1030 and gfx1031 processors share the same ISA and it's possible to run code built for one on the other. However, I'm unclear on what caveats are associated with doing so. It would be interesting to know if rocBLAS can actually pass its test suite when using the gfx version override and if the performance is any good.
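For anyone wanting to try this, a minimal sketch of such a check, assuming a rocBLAS client build with the gtest binary in the usual staging directory:

# Run the rocBLAS test suite on a gfx1031 card while reporting gfx1030 to the
# ROCm userspace. The binary path assumes a standard client build tree.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
./build/release/clients/staging/rocblas-test --gtest_filter='*quick*'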

@littlewu2508
Author

> I'm thrilled to see a community contribution like this one.

I have to thank @benjaminulmer who helped me perform the benchmark successfully in ROCm/Tensile#1410.

> There's lots of questions we need to work out, not the least of which is, "how does someone even review a million line pull request?"

Couldn't agree more. Those million lines are generated by Tensile, and every time I run the Tensile scripts the output differs slightly, due to performance fluctuation I guess. I only have one 6700 XT, and the benchmark was not performed in a professional lab, so thermal and power conditions are not ideal, meaning my results may be biased.

> There are, however, some technical challenges for expanding the targeted platforms. The most significant issue is that of binary size. For that reason, I think it's unlikely that gfx1031 will be added to the default list of target architectures.

I understand. I have also deployed ROCm from AMD's official repo, and those binaries are indeed large.

> Nevertheless, I can see the value in having it available for users building rocBLAS from source.

Right, the targets to build can be set when configuring the build. I've already applied this patch in the rocBLAS Gentoo package, and a comment in ROCm/ROCm#1714 reminded me to upstream this.

I also wonder whether binary distribution could avoid this problem. The code object files (TensileLibrary_*.co, Kernels.so-000-*.hsaco) are separated by GPU arch, so maybe users could choose which architecture to download and install to shrink the binary size?
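A quick way to see that split on disk (a sketch, assuming a standard /opt/rocm install; paths vary by distro):

# Tensile code objects are already separated per architecture, so shipping
# per-arch packages is at least mechanically plausible.
ls /opt/rocm/lib/rocblas/library/ | grep gfx1030
# e.g. TensileLibrary_gfx1030.co, Kernels.so-000-gfx1030.hsaco, ...
du -ch /opt/rocm/lib/rocblas/library/*gfx1030* | tail -n1   # one arch's footprint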

> Have you compared the performance using these gfx1031 kernels on the RX 6700 XT vs. the performance using the existing gfx1030 kernels on the RX 6700 XT with export HSA_OVERRIDE_GFX_VERSION=10.3.0? The gfx1030 and gfx1031 processors share the same ISA and it's possible to run code built for one on the other. However, I'm unclear on what caveats are associated with doing so. It would be interesting to know if rocBLAS can actually pass its test suite when using the gfx version override and if the performance is any good.

No, I didn't know about the trick before. I'll try that and compare the performance.

It is true that I take advantage of the similarity between gfx1030 and gfx1031 -- in ROCm/Tensile#1410 you can see I reuse the gfx1030 Tensile configurations, which makes things much easier.

@benjaminulmer
Contributor

+1 to everything @cgmb said. I'm going to try and get the ball rolling on this and Tensile #1511 this week.

> I have a technical question about this tuning, though. Have you compared the performance using these gfx1031 kernels on the RX 6700 XT vs. the performance using the existing gfx1030 kernels on the RX 6700 XT with export HSA_OVERRIDE_GFX_VERSION=10.3.0? The gfx1030 and gfx1031 processors share the same ISA and it's possible to run code built for one on the other. However, I'm unclear on what caveats are associated with doing so. It would be interesting to know if rocBLAS can actually pass its test suite when using the gfx version override and if the performance is any good.

I'm also very curious about this. If this does in fact work (and have decent performance), then there may be a way to use the existing navi21 logic files and kernels for navi22, potentially sidestepping the binary size issue.

@cgmb
Contributor

cgmb commented May 30, 2022

> I also wonder whether binary distribution could avoid this problem. The code object files (TensileLibrary_*.co, Kernels.so-000-*.hsaco) are separated by GPU arch, so maybe users could choose which architecture to download and install to shrink the binary size?

My comment on binary size was more in relation to the change proposed by @yves-renier. The problem is not the size of the package or Tensile files, but the librocblas.so file. Once the shared library exceeds 2 GiB, 32-bit relative offsets are no longer sufficient and linking may fail. In certain configurations, the rocBLAS library has already been pushing up against those limits.
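A rough way to see how close a given build is to that limit (the path is an assumption for a standard install):

# Print librocblas.so's size in GiB; link-time trouble looms as this nears 2.0.
stat -c %s /opt/rocm/lib/librocblas.so.* | awk '{printf "%.2f GiB\n", $1 / 1024^3}'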

You're right that this problem could be worked around by providing multiple packages that each contain rocBLAS built for a subset of architectures. I don't think we've come to a conclusion on how we're going to address this long-term, but that is one option under consideration.

> I'm also very curious about this. If this does in fact work (and have decent performance), then there may be a way to use the existing navi21 logic files and kernels for navi22, potentially sidestepping the binary size issue.

It would also be nice to check gfx1032 through gfx1036. Unfortunately, I don't have easy access to hardware to test them myself since those processors are not officially supported.

@littlewu2508
Author

> I have a technical question about this tuning, though. Have you compared the performance using these gfx1031 kernels on the RX 6700 XT vs. the performance using the existing gfx1030 kernels on the RX 6700 XT with export HSA_OVERRIDE_GFX_VERSION=10.3.0? The gfx1030 and gfx1031 processors share the same ISA and it's possible to run code built for one on the other. However, I'm unclear on what caveats are associated with doing so. It would be interesting to know if rocBLAS can actually pass its test suite when using the gfx version override and if the performance is any good.

Sorry for the late reply. I just tried export HSA_OVERRIDE_GFX_VERSION=10.3.0 to compare the performance. Below are some results:

classic FP32 GEMM:

HSA_OVERRIDE_GFX_VERSION  transA  transB  M     N     K     alpha  lda   beta  ldb   ldc   rocblas-Gflops  us
NULL                      N       N       4096  4096  4096  1      4096  0     4096  4096  12054.4         11401.6
10.3.0                    N       N       4096  4096  4096  1      4096  0     4096  4096  12054.3         11401.7

Some oddly shaped matrices I encountered in real problems, with batched FP32 GEMM:

HSA_OVERRIDE_GFX_VERSION  transA  transB  M  N  K    alpha  lda  beta  ldb  ldc  batch_count  rocblas-Gflops  us
NULL                      N       N       2  2  896  1      2    0     896  2    512          20.8405         176.1
10.3.0                    N       N       2  2  896  1      2    0     896  2    512          34.9525         105

It shows that the performance with export HSA_OVERRIDE_GFX_VERSION=10.3.0 is no worse than my gfx1031 enablement. Some cases even outperformed it (maybe my benchmark results are a bit rough; perhaps I should perform a better benchmark and update the configs). And the norm error is OK.

I'll run the full test suite in the future.
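For reference, a sketch of the rocblas-bench invocations behind rows like those above (the binary path assumes a client build; the batched case is shown as strided-batched, which is an assumption since the table does not say which batched variant was used):

B=./build/release/clients/staging/rocblas-bench

# Classic FP32 GEMM, native gfx1031 kernels vs. the gfx1030 override:
$B -f gemm -r f32_r --transposeA N --transposeB N \
   -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --beta 0 --ldb 4096 --ldc 4096
HSA_OVERRIDE_GFX_VERSION=10.3.0 \
$B -f gemm -r f32_r --transposeA N --transposeB N \
   -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --beta 0 --ldb 4096 --ldc 4096

# The odd-shape batched case (batch_count 512):
$B -f gemm_strided_batched -r f32_r --transposeA N --transposeB N \
   -m 2 -n 2 -k 896 --alpha 1 --lda 2 --beta 0 --ldb 896 --ldc 2 --batch_count 512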

@cgmb
Contributor

cgmb commented Aug 15, 2022

Thanks for the information! I did a bit more research as well, and it appears that many chips that are ISA-compatible with each other in theory are being treated as distinct by the compiler and runtime mostly just in case there are unexpected incompatibilities. I'm currently involved in some discussions about how we might enable the runtime to load compatible kernels without requiring users to set any magic environment variables. This is, however, a surprisingly complex topic.

It seems that clang has some documentation on which processors have identical instruction sets (1, 2, 3, 4, 5, 6, 7). That's probably a good starting point for understanding which gfx ISAs can likely be substituted for each other using HSA_OVERRIDE_GFX_VERSION. That said, my understanding is that those groupings are merely a useful heuristic, because supporting the same set of instructions is neither necessary nor sufficient to determine the compatibility of a binary built for Processor A on Processor B.
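To see which gfx processors your local compiler knows about, one option (assuming clang's --print-supported-cpus flag works for the AMDGPU target on your toolchain) is:

# List AMDGPU processors known to the local clang; the identical-ISA groupings
# themselves come from the linked clang docs, not from this output.
clang --target=amdgcn-amd-amdhsa --print-supported-cpus 2>&1 | grep gfx10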

@cgmb
Contributor

cgmb commented Mar 4, 2023

I appreciate the work you've done to enable gfx1031, but this is not the approach we're going to take for adding support. It just doesn't scale very well if we want to enable gfx1030, gfx1031, gfx1032, gfx1033, gfx1034, and gfx1035. Instead, we are working on a solution in the lower levels of the ROCm stack so that code built for gfx1030 will run on all of those GPUs. In the meantime, users on gfx1031, gfx1032, gfx1033, gfx1034, or gfx1035 can set export HSA_OVERRIDE_GFX_VERSION=10.3.0 as a workaround.
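For Linux users applying that workaround, a minimal sketch (rocminfo shows what your GPU actually reports):

# Confirm the GPU's real ISA first, then report it as gfx1030 to ROCm.
rocminfo | grep -m1 -o 'gfx10[0-9][0-9]'
export HSA_OVERRIDE_GFX_VERSION=10.3.0   # add to ~/.profile to persist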

@cgmb cgmb closed this Mar 4, 2023
@cgmb
Contributor

cgmb commented May 2, 2023

@littlewu2508, I've prepared some (experimental) patches that may help. There would also need to be a patch for rocBLAS to treat these ISAs as being equivalent (due to Tensile having its own loading logic), but I have not gotten to preparing that yet. Please see ROCm Experimental ISA Compatibility on the Debian AI mailing list.

@littlewu2508
Author

> @littlewu2508, I've prepared some (experimental) patches that may help. There would also need to be a patch for rocBLAS to treat these ISAs as being equivalent (due to Tensile having its own loading logic), but I have not gotten to preparing that yet. Please see ROCm Experimental ISA Compatibility on the Debian AI mailing list.

Thanks! I think that would be a better solution for common roc* libraries. For rocBLAS with Tensile it still needs some investigation, but judging from the experience with HSA_OVERRIDE_GFX_VERSION, it seems to be possible as well.

@cgmb
Contributor

cgmb commented May 20, 2023

@littlewu2508, I've improved the logic to handle ISAs that are supersets of other ISAs (e.g. gfx1011 is gfx1010 with additional instructions) and tested the patch set with gfx1030 code objects on a gfx1031 GPU (RX 6750 XT) as well as gfx1010 code objects on a gfx1011 GPU (Radeon Pro V520). I'm feeling pretty good about those patches.

I still need to extend Tensile to include similar logic when loading code modules in order to make this work for rocBLAS, but it shouldn't be too hard once I figure out where that is done.

By the way, it's worth noting that this technique doesn't extend well to GFX11. The gfx1100, gfx1101 and gfx1102 ISAs are not supersets of each other (even if they might look like they are in a naive read of the LLVM documentation).

@cgmb
Contributor

cgmb commented May 31, 2023

I've put together a patch implementing a workaround that enables rocBLAS on more GPUs. This one is a bit uglier than the patches for the runtime. It is essentially the equivalent of hardcoding an appropriate value for HSA_OVERRIDE_GFX_VERSION right into rocBLAS/Tensile. This seems to work as expected when combined with the workaround in the runtime that I previously provided. However, you do need to be careful to ensure that you are building rocBLAS for the correct architectures, since you'll need to build for a different architecture than is reported for your GPU.
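Concretely, a build on a gfx1031 machine would have to name gfx1030 explicitly rather than relying on autodetection. A sketch, assuming the -a/--architecture flag of rocBLAS's install.sh:

# On a gfx1031 machine, build rocBLAS and its Tensile kernels for gfx1030,
# the architecture the patched runtime/Tensile will actually load.
./install.sh -dc -a gfx1030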

The LLVM compiler will eventually be the source of truth for code object compatibility and there will be a much cleaner solution provided in ROCm. Nevertheless, I hope that these workarounds are helpful in the meantime.

@cgmb cgmb reopened this Aug 31, 2023
@cgmb cgmb requested a review from babakpst as a code owner August 31, 2023 03:02
@cgmb cgmb requested a review from nakajee as a code owner August 31, 2023 03:02
@cgmb
Contributor

cgmb commented Aug 31, 2023

I'm reopening this PR for two reasons:

  1. The compiler feature to enable reusing gfx1030 code objects on gfx1031 has been delayed.
  2. The HSA_OVERRIDE_GFX_VERSION workaround does not work on Windows.

This PR is not necessarily required for enabling gfx1031 support on Linux, as shown by the Debian rocblas package. However, it would be quite useful to users wishing to build rocBLAS from source on Windows for their GPU.

@cgmb
Contributor

cgmb commented Jan 11, 2024

> [The approach in this pull request] just doesn't scale very well if we want to enable gfx1030, gfx1031, gfx1032, gfx1033, gfx1034, and gfx1035. Instead, we are working on a solution in the lower levels of the ROCm stack so that code built for gfx1030 will run on all of those GPUs.

The alternative approach I mentioned evolved into the 'generic' AMDGPU ISAs proposal. There is a pull request open for their introduction into LLVM: llvm/llvm-project#76955. I also wrote a brief description of these proposed ISAs on the Debian AI mailing list, which might be a useful summary. However, the PR on LLVM now has documentation with significantly more detail than anything I wrote, so I would encourage you to go straight to the LLVM PR for information.

This feature is still very early in its development cycle. It's merely a proposal for the compiler at this point. Even after it is included in the compiler, it's not going to be as simple as 'just recompile the libraries for the new target'. The feature will need logic added to the runtime to map the generic ISAs to specific hardware, and the libraries will also need to be updated (particularly for cases where libraries are loading kernels from disk dynamically).

That is to say, it's going to take a while for this proposal to become an actual solution to the problem of running ROCm libraries on a wide variety of GPUs. Nevertheless, @littlewu2508, I wanted to bring it to your attention, as I believe it is an important proposal for users such as yourself.

@userbox020

Sup bros, you guys rock. I'm just a hobbyist and you guys are way out of my league. I have an RX 6700 (gfx1031) that I wanted to use with llama.cpp, and after lots of searching and chatting in posts I got a solution to make it work:

  1. Install ROCm 5.6
  2. Compile llama.cpp with make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 (note: gfx1030, not gfx1031)
  3. Add the following env variables:
export ROCM_PATH=/opt/rocm
export HCC_AMDGPU_TARGET=gfx1030
export HSA_OVERRIDE_GFX_VERSION=10.3.0

And now my GPU is working beautifully; it can load any 7B model and some 13B ones. I'd also like to ask for your expertise on how to add a second card: I have an old RX 5700 but don't know how to compile, or which env variables to set, to make llama.cpp work on multi-GPU. My first thought is to add export HCC_AMDGPU_TARGET=gfx1030,gfx1010 but I don't think it will be that easy.

If someone can give me a hint or a hand I would really appreciate it.

@littlewu2508
Author

littlewu2508 commented Feb 8, 2024

> I have an old RX 5700 but don't know how to compile, or which env variables to set, to make llama.cpp work on multi-GPU

export HCC_AMDGPU_TARGET=gfx1030,gfx1010 should be fine. The export HSA_OVERRIDE_GFX_VERSION=10.3.0 can be problematic, because RDNA and RDNA2 are different; I'm afraid this environment variable treats both your cards as gfx1030.

One solution is to adopt the patch mentioned in ROCm Experimental ISA Compatibility on the Debian AI mailing list. It only recognizes RDNA2 cards and treats them as gfx1030.
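Since HSA_OVERRIDE_GFX_VERSION is process-wide, another untested workaround sketch for the mixed RDNA/RDNA2 setup is to run one process per GPU and set the override only for the RDNA2 one (device indices and model path here are assumptions; check rocminfo for yours):

# One llama.cpp process per card: the gfx1031 card gets the override, the
# gfx1010 card does not. Hypothetical device order and model path.
HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=10.3.0 ./main -ngl 32 -m model.gguf -p "hi" &
HIP_VISIBLE_DEVICES=1 ./main -ngl 32 -m model.gguf -p "hi" &
wait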

@userbox020

> > I have an old RX 5700 but don't know how to compile, or which env variables to set, to make llama.cpp work on multi-GPU

> export HCC_AMDGPU_TARGET=gfx1030,gfx1010 should be fine. The export HSA_OVERRIDE_GFX_VERSION=10.3.0 can be problematic, because RDNA and RDNA2 are different; I'm afraid this environment variable treats both your cards as gfx1030.

> One solution is to adopt the patch mentioned in ROCm Experimental ISA Compatibility on the Debian AI mailing list. It only recognizes RDNA2 cards and treats them as gfx1030.

I was digging deeper into gfx1010 compatibility. As far as I've investigated, it's way more complicated than expected: the ROCm Tensile libraries have been missing for that GPU for years, and some kernel debugging is also needed.

@littlewu2508
Author

> I was digging deeper into gfx1010 compatibility. As far as I've investigated, it's way more complicated than expected: the ROCm Tensile libraries have been missing for that GPU for years, and some kernel debugging is also needed.

Yes, although I guess it only hurts performance.

With Tensile, you can benchmark your own GPU and generate the winning GEMM kernels. That's how this PR was created: I copied the gfx1030 configurations, then renamed and adjusted some parameters (ROCm/Tensile#1410).

For Gentoo I packaged dev-util/Tensile with USE=client, and used the commands documented in https://wiki.gentoo.org/wiki/Tensile to generate the Tensile kernel YAML files.
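Roughly, that flow looks like this (the file names and the ArchitectureName key are assumptions based on Tensile's benchmark config format; see the wiki page above for the exact commands):

# Start from a copied navi21 benchmark config, retarget it, and let Tensile
# benchmark and emit the winning logic YAMLs under 3_LibraryLogic/.
cp navi21_config.yaml navi22_config.yaml
sed -i 's/ArchitectureName:.*/ArchitectureName: "gfx1031"/' navi22_config.yaml
Tensile navi22_config.yaml ./tensile_out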

@cgmb
Contributor

cgmb commented Feb 8, 2024

> I was digging deeper into gfx1010 compatibility. As far as I've investigated, it's way more complicated than expected: the ROCm Tensile libraries have been missing for that GPU for years, and some kernel debugging is also needed.

I haven't tested the latest versions of rocBLAS, but the gfx1010 architecture works fine in rocBLAS from ROCm 5.5 if you disable Tensile "separate architectures" and "lazy loading". The performance is probably quite poor, though, as there's no tuning.
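For anyone trying that, a sketch of the configure step (the option names are assumptions based on ROCm 5.5-era rocBLAS CMake; verify them against your checkout):

# Build rocBLAS with Tensile's per-arch file splitting and lazy loading off,
# so kernels remain loadable under a gfx version override.
cmake -S . -B build \
      -DAMDGPU_TARGETS="gfx1010;gfx1030" \
      -DTensile_SEPARATE_ARCHITECTURES=OFF \
      -DTensile_LAZY_LIBRARY_LOADING=OFF
cmake --build build -j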

rocBLAS on Debian Trixie has both those architectures enabled and should 'just work'. I've tested them myself (though not together on the same system).

@userbox020

Interesting, going to take a look at ROCm 5.5. However, I want to run llama.cpp with my RX 5700, and I think it only supports ROCm 5.6 and 5.7.

@cgmb
Contributor

cgmb commented Feb 9, 2024

> Interesting, going to take a look at ROCm 5.5. However, I want to run llama.cpp with my RX 5700, and I think it only supports ROCm 5.6 and 5.7.

I don't think so. It works for me with rocBLAS 5.5.1 on Debian Trixie and Ubuntu 23.10 with my Radeon VII. You'd need to put "gfx1010;gfx1030" instead of "gfx906" in the commands below:

apt -y update
apt -y upgrade
apt -y install git hipcc libhipblas-dev librocblas-dev cmake build-essential
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b2110
CC=clang-15 CXX=clang++-15 cmake -H. -Bbuild -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS="gfx906" -DCMAKE_BUILD_TYPE=Release
make -j16 -C build
build/bin/main -ngl 32 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -m ~/Downloads/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf --prompt "Once upon a time"

@userbox020

@cgmb I'm using Ubuntu 22.04 and ROCm 6, do you think it can work? Going to try to compare with Vulkan performance. llama.cpp already has Vulkan support, and now I can run all my old and new GPUs together to have more VRAM available.

@cgmb
Contributor

cgmb commented Feb 16, 2024

@littlewu2508, I'm closing because llvm/llvm-project#76955 has been accepted into LLVM. The best path forward for enabling gfx1031 and other RDNA 2 architectures will be to target the gfx10.3-generic ISA in all places where gfx1030 is currently being used. The gfx10.3-generic and gfx1030 ISAs are identical in all but name, so there will be no performance cost despite the increase in compatibility.
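Once that lands in a ROCm release, the swap might be as simple as changing the target string wherever gfx1030 appears today (the spelling below follows the LLVM pull request and is an assumption until it actually ships):

# Compile HIP code for the proposed RDNA2-wide generic target instead of
# a specific chip; flag support in hipcc is assumed, not yet verified.
hipcc --offload-arch=gfx10-3-generic -c kernels.hip -o kernels.o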

I had hoped that this PR could have been merged while we waited for the generic ISAs to arrive, but it got stuck waiting for approval because rocBLAS does not have a defined set of acceptance criteria for pull requests. That will be changing soon, although I don't have an exact timeline.

@cgmb cgmb closed this Feb 16, 2024
@cgmb
Contributor

cgmb commented Feb 16, 2024

> @cgmb I'm using Ubuntu 22.04 and ROCm 6, do you think it can work? Going to try to compare with Vulkan performance.

No. Your driver will work, but ROCm/Tensile#1757 will prevent it from working with the packages distributed by AMD. The Debian packages are built with different options and are therefore not affected. You can use the docker command from the gist I linked to use the Debian package in a container while running on Ubuntu 22.04.

The performance on gfx1010 hardware will be poor, but it should be possible to improve by doing tuning for gfx1010 like @littlewu2508 has done here for gfx1031. The gfx1010 tuning would not be redundant even with the introduction of the generic targets, as there has been no meaningful tuning done for RDNA 1 GPUs.

However, this is a lot of off-topic chatter for this issue. If you would like to share the results of your experiments, I encourage you to post a message to the Debian AI mailing list. I need to learn about Tensile tuning myself, so I would be happy to work with you further in that forum.

@userbox020

Thanks bro @cgmb, going to join the Debian AI mailing list. Also, I'm running mixed RDNA 1 and 2 with Vulkan on PCIe x1 gen1, because it's the only thing my mobo has. Going to try the gist in a Docker container. Thanks bro, see you around.
