gfx1031: initial import #1251
Conversation
These logic files are benchmarked using Tensile-rocm-5.0.1 on my desktop with a Radeon RX 6700 XT. Bias may exist, but they at least work and provide decent performance when running rocblas-bench.
thanks for this!
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 46510141..4dfb561b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -192,13 +192,13 @@ set(CMAKE_INSTALL_LIBDIR "lib" CACHE INTERNAL "Installation directory for librar
# gpu arch configuration
set( AMDGPU_TARGETS "all" CACHE STRING "Compile for which gpu architectures?")
-set_property( CACHE AMDGPU_TARGETS PROPERTY STRINGS all gfx803 gfx900 gfx906:xnack- gfx908:xnack- gfx90a:xnack+ gfx90a:xnack- gfx1010 gfx1011 gfx1012 gfx1030 )
+set_property( CACHE AMDGPU_TARGETS PROPERTY STRINGS all gfx803 gfx900 gfx906:xnack- gfx908:xnack- gfx90a:xnack+ gfx90a:xnack- gfx1010 gfx1011 gfx1012 gfx1030 gfx1031)
# Detect if target ID syntax if supported for default all AMDGPU_TARGETS list
# Sets the AMDGPU_TARGETS with backward compatiblity
if(COMMAND rocm_check_target_ids)
rocm_check_target_ids(target_list
- TARGETS "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030"
+ TARGETS "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030;gfx1031"
)
else()
# This section is deprecated. Please use rocm_check_target_ids for future use.
@@ -213,9 +213,9 @@ else()
if(Tensile_LOGIC STREQUAL "aldebaran")
# Temporary bypass: if logic aldebaran logic is selected, assume aldebaran compiler support exists
# To be removed for ROCm 4.4
- set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030")
+ set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1011;gfx1012;gfx1030;gfx1031")
else()
- set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx1010;gfx1011;gfx1012;gfx1030")
+ set(target_list "gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx1010;gfx1011;gfx1012;gfx1030;gfx1031")
endif()
else()
set(target_list "gfx803;gfx900;gfx906;gfx908")
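With a change like this applied, a from-source build can then be pointed at the new target when configuring. A minimal sketch (the compiler path and extra options here are assumptions, not part of the diff above):

# Configure rocBLAS for gfx1030 and gfx1031 only, building the Tensile kernels.
cmake -S . -B build \
      -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
      -DAMDGPU_TARGETS="gfx1030;gfx1031" \
      -DBUILD_WITH_TENSILE=ON
cmake --build build --parallel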
Taking off my AMD hat for a moment and speaking just for myself, I'm thrilled to see a community contribution like this one. Please do bear with the rocBLAS team while they sort out their policies and procedures for handling it. To my knowledge, your contribution is the first of its kind that rocBLAS has ever received. There are lots of questions we need to work out, not the least of which is, "how does someone even review a million-line pull request?"

I've always pushed to expand the ROCm userbase by bringing ROCm to more hardware and software platforms. I could hardly imagine anything more exciting than a patch to enable the RX 6700 XT getting upstreamed from a Gentoo maintainer. There are, however, some technical challenges to expanding the targeted platforms. The most significant issue is binary size. For that reason, I think it's unlikely that gfx1031 will be added to the default list of target architectures. Nevertheless, I can see the value in having it available for users building rocBLAS from source.

I have a technical question about this tuning, though. Have you compared the performance using these gfx1031 kernels on the RX 6700 XT vs. the performance using the existing gfx1030 kernels on the RX 6700 XT with HSA_OVERRIDE_GFX_VERSION=10.3.0?
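A rough sketch of how such a comparison might be run, assuming a build that contains code objects for both gfx1030 and gfx1031 (the GEMM sizes and paths are only illustrative):

# On the RX 6700 XT, run an SGEMM with the new gfx1031-tuned kernels...
/opt/rocm/bin/rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096
# ...then repeat while forcing the runtime to report gfx1030, so the existing
# gfx1030 kernels are loaded instead, and compare the reported GFLOPS.
HSA_OVERRIDE_GFX_VERSION=10.3.0 /opt/rocm/bin/rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096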
I have to thank @benjaminulmer who helped me perform the benchmark successfully in ROCm/Tensile#1410.
Can't agree more. Those million lines are generated by Tensile, and every time I run the Tensile scripts the output differs slightly, due to performance fluctuation I guess. I only have one 6700 XT, and the benchmark was not performed in a professional lab, so thermal and power conditions are not ideal; my results may therefore be biased.
I understand. I have also deployed ROCm from AMD's official repo, and those binaries are indeed large.
Right, which targets to build can be set when configuring the compilation. I've already applied this patch in the rocBLAS Gentoo package, and a comment in ROCm/ROCm#1714 reminded me to upstream it. I also wonder whether binary distribution could avoid such problems. The code object files (
No, I didn't know about that trick before. I'll try it and compare the performance. It is true that I take advantage of the similarity between gfx1030 and gfx1031 -- in ROCm/Tensile#1410 you can see I reuse the gfx1030 Tensile configurations, which makes things much easier.
+1 to everything @cgmb said. I'm going to try and get the ball rolling on this and Tensile #1511 this week.
I'm also very curious about this. If this does in fact work (and has decent performance), then there may be a way to use the existing navi21 logic files and kernels for navi22, potentially sidestepping the binary size issue.
My comment on binary size was more in relation to the change proposed by @yves-renier. The problem is not the size of the package or the Tensile files, but the librocblas.so file. Once the shared library exceeds 2 GiB, 32-bit relative offsets are no longer sufficient and linking may fail. In certain configurations, the rocBLAS library has already been pushing up against those limits. You're right that this problem could be worked around by providing multiple packages that each contain rocBLAS built for a subset of architectures. I don't think we've come to a conclusion on how we're going to address this long-term, but that is one option under consideration.
It would also be nice to check gfx1032 through gfx1036. Unfortunately, I don't have easy access to hardware to test them myself, since those processors are not officially supported.
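For anyone curious how close a given build is to that limit, a rough check might look like this (the paths assume a default /opt/rocm layout, and roc-obj-ls ships with the HIP tools but may not be present in every ROCm version):

# Size of the shared library itself; trouble starts as this approaches 2 GiB.
ls -lh /opt/rocm/lib/librocblas.so*
# The Tensile code objects live beside the library and also grow with each architecture.
du -sh /opt/rocm/lib/rocblas/library
# List which GPU ISAs are bundled into the binary.
roc-obj-ls -v /opt/rocm/lib/librocblas.so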
Sorry for the late reply. I just tried classic FP32 GEMM:
Some oddly-shaped matrices I encountered in real problems:
It shows that the performance of
I'll run the full test suite in the future.
Thanks for the information! I did a bit more research as well, and it appears that many chips that are in theory ISA-compatible with each other are being treated as distinct by the compiler and runtime, mostly just in case there are unexpected incompatibilities. I'm currently involved in some discussions about how we might enable the runtime to load compatible kernels without requiring users to set any magic environment variables. This is, however, a surprisingly complex topic. It seems that clang has some documentation on which processors have identical instruction sets (1, 2, 3, 4, 5, 6, 7). That's probably a good starting point for understanding which gfx ISAs can likely be substituted for each other using HSA_OVERRIDE_GFX_VERSION.
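Before trying any such substitution, it helps to know exactly which ISA your own GPU reports. One way to check, assuming a standard install under /opt/rocm:

# Print the gfx ISA names the ROCm runtime reports for each agent.
/opt/rocm/bin/rocminfo | grep -i gfx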
I appreciate the work you've done to enable gfx1031, but this is not the approach we're going to take for adding support. It just doesn't scale very well if we want to enable gfx1030, gfx1031, gfx1032, gfx1033, gfx1034, and gfx1035. Instead, we are working on a solution in the lower levels of the ROCm stack so that code built for gfx1030 will run on all of those GPUs. In the meantime, users on gfx1031, gfx1032, gfx1033, gfx1034, or gfx1035 can set HSA_OVERRIDE_GFX_VERSION=10.3.0.
@littlewu2508, I've prepared some (experimental) patches that may help. There would also need to be a patch for rocBLAS to treat these ISAs as being equivalent (due to Tensile having its own loading logic), but I have not gotten to preparing that yet. Please see ROCm Experimental ISA Compatibility on the Debian AI mailing list.
Thanks! I think that would be a better solution for common roc* libraries. For rocBLAS with Tensile, it still needs some investigation, but as experiences of using
@littlewu2508, I've improved the logic to handle ISAs that are supersets of other ISAs (e.g. gfx1011 is gfx1010 with additional instructions) and tested the patch set with gfx1030 code objects on a gfx1031 GPU (RX 6750 XT) as well as gfx1010 code objects on a gfx1011 GPU (Radeon Pro V520). I'm feeling pretty good about those patches. I still need to extend Tensile to include similar logic when loading code modules in order to make this work for rocBLAS, but it shouldn't be too hard once I figure out where that is done.

By the way, it's worth noting that this technique doesn't extend well to GFX11. The gfx1100, gfx1101 and gfx1102 ISAs are not supersets of each other (even if they might look like they are in a naive read of the LLVM documentation).
I've put together a patch implementing a workaround that enables rocBLAS on more GPUs. This one is a bit uglier than the patches for the runtime. It is essentially the equivalent of hardcoding an appropriate value for HSA_OVERRIDE_GFX_VERSION. The LLVM compiler will eventually be the source of truth for code object compatibility, and there will be a much cleaner solution provided in ROCm. Nevertheless, I hope that these workarounds are helpful in the meantime.
I'm reopening this PR for two reasons:
This PR is not necessarily required for enabling gfx1031 support on Linux, as shown by the Debian rocblas package. However, it would be quite useful to users wishing to build rocBLAS from source on Windows for their GPU.
The alternative approach I mentioned evolved into the 'generic' AMDGPU ISAs proposal. There is a pull request open for their introduction into LLVM: llvm/llvm-project#76955. I also wrote a brief description of these proposed ISAs on the Debian AI mailing list, which might be a useful summary. However, the PR on LLVM now has documentation with significantly more detail than anything I wrote, so I would encourage you to go straight to the LLVM PR for information.

This feature is still very early in its development cycle. It's merely a proposal for the compiler at this point. Even after it is included in the compiler, it's not going to be as simple as 'just recompile the libraries for the new target'. The feature will need logic added to the runtime to map the generic ISAs to specific hardware, and the libraries will also need to be updated (particularly for cases where libraries load kernels from disk dynamically). That is to say, it's going to take a while for this proposal to become an actual solution to the problem of running ROCm libraries on a wide variety of GPUs. Nevertheless, @littlewu2508, I wanted to bring it to your attention, as I believe it is an important proposal for users such as yourself.
Sup bros, you guys rock. I'm just a hobbyist and you guys are way out of my league. I have an RX 6700 (gfx1031) that I wanted to use with llama.cpp, and after lots of searching and chatting in posts I got a solution to make it work with llama.cpp.
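A commonly used recipe for RDNA2 cards at the time looked roughly like the sketch below; option and binary names varied between llama.cpp versions, so treat this as an illustration rather than the exact commands used:

# Build llama.cpp against ROCm, compiling kernels for gfx1030 only
# (LLAMA_HIPBLAS was the HIP switch in llama.cpp builds of that era).
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
    cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build --parallel
# At runtime, make the gfx1031 card report itself as gfx1030 and offload all layers.
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./build/bin/main -m ./models/some-7b-model.gguf -ngl 99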
And now my GPU works beautifully; it can load any 7B model and some 13B ones. I'd also like to ask for your expertise on how to add a second card. I have an old RX 5700, but I don't know how to compile for it, or which environment variables to set, to make llama.cpp work on multi-GPU. My first thought was to add
If someone can give me a hint or a hand I would really appreciate it.
One solution is to adopt the patch mentioned in ROCm Experimental ISA Compatibility on the Debian AI mailing list. It only recognizes RDNA2 cards and treats them as gfx1030.
I was digging deeper into gfx1010 compatibility. As far as I can tell, it's way more complicated than expected: the ROCm Tensile libraries have been missing for that GPU for years, and some kernel debugging is also needed.
Yes, although I guess it only hurts performance. With Tensile, you can benchmark your own GPU and generate the winning GEMM Tensile kernels. That's how this PR was created: I copied the gfx1030 configurations, renamed them, and adjusted some parameters (ROCm/Tensile#1410). For Gentoo I packaged
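For anyone wanting to reproduce that kind of tuning, a Tensile benchmarking run has roughly the following shape (the config file name below is a placeholder; the actual configurations used for this PR are the ones in ROCm/Tensile#1410):

# Check out Tensile at the release used for the tuning (rocm-5.0.1 here, per the PR description).
git clone -b rocm-5.0.1 https://github.com/ROCm/Tensile.git
cd Tensile
# Run the benchmark protocol on the local GPU; tuned logic files are written
# under the given working directory. navi22_sgemm.yaml is a hypothetical name.
Tensile/bin/Tensile Tensile/Configs/navi22_sgemm.yaml ./tensile-workdir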
I haven't tested the latest versions of rocBLAS, but the gfx1010 architecture works fine in rocBLAS from ROCm 5.5 if you disable Tensile's "separate architectures" and "lazy loading" options. The performance is probably quite poor, though, as there's no tuning. rocBLAS on Debian Trixie has both those architectures enabled and should 'just work'. I've tested them myself (though not together on the same system).
Interesting, I'm going to take a look at ROCm 5.5. However, I want to run llama.cpp with my RX 5700, and I think it only supports ROCm 5.6 and 5.7.
I don't think so. It works for me with rocBLAS 5.5.1 on Debian Trixie and Ubuntu 23.10 with my Radeon VII. You'd need to put
@cgmb I'm using Ubuntu 22.04 and ROCm 6, do you think that can work? I'm going to try to compare with Vulkan performance.
@littlewu2508, I'm closing because llvm/llvm-project#76955 has been accepted into LLVM. The best path forward for enabling gfx1031 and other RDNA 2 architectures will be to target the gfx10.3-generic ISA in all places where gfx1030 is currently being used. The gfx10.3-generic and gfx1030 ISAs are identical in all but name, so there will be no performance cost despite the increase in compatibility.

I had hoped that this PR could have been merged while we waited for the generic ISAs to arrive, but it got stuck waiting for approval because rocBLAS does not have a defined set of acceptance criteria for pull requests. That will be changing soon, although I don't have an exact timeline.
No. Your driver will work, but ROCm/Tensile#1757 will prevent it from working with the packages distributed by AMD. The Debian packages are built with different options and are therefore not affected. You can use the docker command from the gist I linked to use the Debian package in a container while running on Ubuntu 22.04. The performance on gfx1010 hardware will be poor, but it should be possible to improve it by doing tuning for gfx1010 like @littlewu2508 has done here for gfx1031. The gfx1010 tuning would not be redundant even with the introduction of the generic targets, as there has been no meaningful tuning done for RDNA 1 GPUs.

However, this is a lot of off-topic chatter for this issue. If you would like to share the results of your experiments, I encourage you to post a message to the Debian AI mailing list. I need to learn about Tensile tuning myself, so I would be happy to work with you further in that forum.
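For reference, a container invocation of that general shape might look like the following; this is a generic sketch rather than the exact command from the gist, and package names inside the container may differ:

# Pass the GPU device nodes through to a Debian container.
docker run -it --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined --group-add video \
    debian:trixie /bin/bash
# Inside the container, the Debian-packaged library can then be installed,
# e.g. with apt-get install librocblas0 (exact package names may vary).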
Thanks bro @cgmb, I'm going to join the Debian forum. Also, I'm running mixed RDNA 1 and 2 with Vulkan on PCIe x1 gen1, because that's the only slot my mobo has.
I've also created a PR for Tensile which enables benchmarking and compiling kernels for gfx1031.
Summary of proposed changes: