-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Windows platform support bringup #36
Comments
Got a build to succeed (run to completion with exit code 0) after disabling many subprojects and writing some local patches.
|
For 3: git diff master Branch1 > ../patchfile and check that in should be fine. apply patches functionality seems pretty straightforward: https://github.com/nod-ai/TheRock/blob/main/build_tools/fetch_sources.py |
FYI - for managing the patches directory, I've been making commits to the repo checked out git dirs. Then you dump them all with
Then if you re-run I was doing it the manual way via |
Sorry - I could have documented my flow on this better: key is that |
Thanks, the |
Progress on #36. This makes it easier to get set up on Windows, which is missing a package manager distribution of the repo tool (https://gerrit.googlesource.com/git-repo/). The tool is a standalone python script that can be run portably via `python repo`. Linux users will probably be more comfortable running it as an executable as just `repo`, but that isn't possible on Windows without some tricks. ## Sample log snippets Download (first run): ``` D:\projects\TheRock (windows-repo-download) λ python ./build_tools/fetch_sources.py Unable to find 'repo', downloading into script dir at D:\projects\TheRock\build_tools\repo Setting up repo in D:\projects\TheRock\sources ++ Exec [D:\projects\TheRock\sources]$ 'C:\Program Files\Python312\python.exe' 'D:\projects\TheRock\build_tools\repo' init -v -u https://github.com/ROCm/ROCm.git -m tools/rocm-build/rocm-6.3.1.xml -b roc-6.3.x repo: reusing existing repo client checkout in D:\projects\TheRock\sources ``` Reuse downloaded tool (future runs): ``` D:\projects\TheRock (windows-repo-download) λ python ./build_tools/fetch_sources.py Found 'repo' in script dir at D:\projects\TheRock\build_tools\repo, using it Setting up repo in D:\projects\TheRock\sources ++ Exec [D:\projects\TheRock\sources]$ 'C:\Program Files\Python312\python.exe' 'D:\projects\TheRock\build_tools\repo' init -v -u https://github.com/ROCm/ROCm.git -m tools/rocm-build/rocm-6.3.1.xml -b roc-6.3.x repo: reusing existing repo client checkout in D:\projects\TheRock\sources ```
Made some progress on pushing patches needed for a minimal local build. Now trying to get that wrapped into a GitHub Actions workflow, using this dev branch: https://github.com/ScottTodd/TheRock/tree/windows-ci-setup (workflow file is here: https://github.com/ScottTodd/TheRock/blob/windows-ci-setup/.github/workflows/build_windows_packages.yml). Here are some logs with a bunch of debug steps included: https://github.com/ScottTodd/TheRock/actions/runs/13082919002/job/36509861561. Stuck for today on
I saw that locally when symlinks were not enabled in the git repository or the fetch_sources.py file was not run. I think the right files are on the runner, but I'm still suspicious of the symlink / hardlink configuration. |
That is a strange error: there should not be a direct dependency between those two files (the dependency crosses a couple of other commands). But I think this is cmake/ninja wonkiness where for custom targets, it makes them depend on the transitive set of input files. So there is probably a lot wrong, and you're just getting that one error out of the pile randomly. One thing you could quickly try: on line 418 of therock_subproject.cmake, comment out the `"${_cmake_source_dir}/CMakeLists.txt" line. This will mean it won't retrigger on sub-project changes, but that is not a problem on CI. Still feeling like there is something else wrong and this is just the error you are getting. I can pull it down on my windows machine and try. |
https://github.com/ScottTodd/TheRock/actions/runs/13122880778/job/36612781302#step:18:60
Similar error... |
It seems to be saying pretty insistently that the file is not there. I've got nothing... The file must be not really there somehow |
🤦 okay, I can repro those eigen sources missing locally now. Probably some artifact of disabling subprojects that failed to build and that resulting in an incomplete build directory. As I was testing locally I configured a few times with different settings. The dependency that you pointed me to might still need some tweaks: TheRock/cmake/therock_subproject.cmake Lines 430 to 431 in b3307ed
At least now I can debug a bit locally. |
Double 🤦 , the eigen error was caused by removing that dependency link. It is load bearing. So I have local builds working reliably, but the GitHub-hosted runners are still having trouble seeing files, perhaps due to symlinks. Might be time to try using our runner cluster instead. |
One clue with the GitHub hosted runners is that there are two drives in use,
This happens in other projects too. Maybe our code in https://github.com/nod-ai/TheRock/blob/main/cmake/therock_subproject.cmake is somehow not handling that on Windows? I can test that theory locally... thus far I've only been using a single |
Something is not right about that log: the PROVIDE paths should be absolute, not relative. Depending on what the current drive is, that could absolutely be a problem. (edit: scratch that. PROVIDE should be relative. INJECT should be absolute) |
A few updates:
|
Progress on #36, making a larger portion of https://github.com/ROCm/rocm-core/ compile on Windows. ## About the changes * The `link.h` and `dlfcn.h` headers do not exist on Windows and are only used when `BUILD_SHARED_LIBS` is set. * The `PATH_MAX` value, defined in `limits.h`, does not exist on Windows. I opted to use a fixed constant value of `4096`, but `FILENAME_MAX` is also an option (see https://stackoverflow.com/a/65174437). * Attributes like `__attribute__((visibility("default")))` do not exist on all compilers. Added some boilerplate cross platform versions (different approaches are possible too, this is just what I use on other projects). ## How I generated the patch 1. Made changes in the source folder 2. Committed to a branch (ScottTodd/rocm-core@0dd798f) 3. Ran ``` bash .\build_tools\save_patches.sh rocm-6.3.1 rocm-core ```
Progress on #36. Together with some other pending PRs, this gets me enough to 1. Fetch sources 2. Configure with CMake 3. Build without errors Not much is actually getting built, since this disables nearly all projects on Windows, but this is enough to start running CI builds and working on components incrementally.
Progress on #36, documenting the current dependency requirements, configure options, and some troubleshooting tips.
Progress on #36. This dependency which was added in #89 contains file paths that are not valid on Windows. Full error logs: https://gist.github.com/ScottTodd/0304483dca863cf6b8b74c15e14dbd85.
Progress on #36. This adds a new workflow, triggered as part of ci.yml, that configures and builds a subset of the project. Future changes will: * Switch to a more powerful self-hosted runner instead of using standard GitHub-hosted runners (we have some runners but they need some configuration to be able to run the `repo` tool without `SSL: CERTIFICATE_VERIFY_FAILED` errors) * Build more subprojects (this will patches or upstream changes to each subproject to improve portability)
I'm working on getting I'm hitting a fatal I think the source is here https://github.com/ROCm/llvm-project |
Ah, found the answer myself 😅 . This hook is forcing the value:
|
Progress on #36. This is gross, and the odd [`.clang-format`](https://github.com/ROCm/rocprofiler-register/blob/amd-staging/.clang-format) file in the rocprofiler-register repo isn't helping. There are probably better design patterns for no-oping the entire project or adding proper support for multiple platforms. I hacked around enough to get a CMake configure + build that produces a 50KB `rocprofiler-register.dll` on Windows using MSVC and a default CMake build on the default branch from GitHub (standalone, outside of TheRock). The DLL is not expected to work since it is missing function definitions and I did not implement Windows-specific versions of the functions, but this unblocks more parts of build in this project that depend on rocprofiler-register. I did no testing on Linux, relying on our CI builds here for that. Hopefully I didn't break anything.
Making a note for later: ccache doesn't seem to be working, at least on my dev machine? That will matter more with large subprojects like amd-llvm.
CI also has a high ratio of "Uncacheable calls" to "Cacheable calls":
Debugging tips: https://ccache.dev/manual/4.10.2.html#_cache_debugging. Could also try sccache. |
Progress on #36. We'll want to switch to self-hosted runners and tune the cache to make the build faster soon. This is still under 20 minutes on standard runners though: https://github.com/nod-ai/TheRock/actions/runs/13339298280/job/37260996314.
From looking at the ccache debug logs, it seems due to at least |
Progress on #36. These runners are more powerful than the standard GitHub-hosted runners so we can more quickly build large projects like LLVM. * This needed credential changes that took a bit of trial and error, but the key thing wasn't git not working correctly but the nested `urllib.request` in `fetch_sources.py` not pulling in the right cert even after installing certifi. I fixed this by manually exporting the cert into the environment based on a suggestion from here: https://stackoverflow.com/questions/77442172/ssl-certificate-verify-failed-certificate-verify-failed-unable-to-get-local-is. * The self-hosted runners have some storage space issues that we are working around with an auxiliary disk mount and some powershell commands. Sample runs: * https://github.com/nod-ai/TheRock/actions/runs/13409402263/job/37455854065 * https://github.com/nod-ai/TheRock/actions/runs/13424413395/job/37504333090?pr=106 * https://github.com/nod-ai/TheRock/actions/runs/13424413395/job/37506514489?pr=106 --------- Co-authored-by: Scott <scott.todd0@gmail.com>
Progress on #36. Putting current status in the docs. Status could also go in the tracking issue but I like the version history in a markdown file and this helps set expectations when following the instructions. I'll need to look closer at existing ROCm support on Windows to form a strategy for expanding support here. A quick scan through the projects suggests that some math-libs and ml-libs projects can run on Windows already, though they may need carve-outs for other parts of the dep tree that are solidly Linux-only at the moment.
Progress on #36. This fixes ccache when compiling with `-DCMAKE_BUILD_TYPE=RelWithDebInfo` and MSVC on Windows. Before: ``` λ ccache --show-stats Cacheable calls: 0 / 4309 ( 0.00%) Hits: 0 Direct: 0 Preprocessed: 0 Misses: 0 Uncacheable calls: 4309 / 4309 (100.0%) Local storage: Cache size (GB): 0.0 / 50.0 ( 0.00%) ``` After: ``` λ ccache --show-stats Cacheable calls: 4099 / 4309 (95.13%) Hits: 4054 / 4099 (98.90%) Direct: 3482 / 4054 (85.89%) Preprocessed: 572 / 4054 (14.11%) Misses: 45 / 4099 ( 1.10%) Uncacheable calls: 210 / 4309 ( 4.87%) Local storage: Cache size (GB): 4.2 / 50.0 ( 8.34%) Hits: 4054 / 4099 (98.90%) Misses: 45 / 4099 ( 1.10%) ``` We had a different workaround in IREE: iree-org/iree#11841. I tried a solution like that one in this project but the meta-project nature of TheRock discourages setting implicit top level options. Instead, I followed how other options like `CMAKE_CXX_COMPILER_LAUNCHER` are forwarded to subprojects explicitly. ### Debugging notes I followed https://ccache.dev/manual/4.10.2.html#_cache_debugging: ```bash ccache --set-config debug=true ccache --set-config debug_dir=D:\cache\ccache_debug ccache --show-config # confirm that the options are set ``` Then I looked in files like `D:\cache\ccache_debug\projects\TheRock\build\compiler\amd-llvm\build\lib\Support\CMakeFiles\LLVMSupport.dir\Allocator.cpp.obj.20250219_112722_308872.ccache-log` and found logs like these: ``` [2025-02-19T11:27:22.309172 28480] Command line: ccache C:\\PROGRA~2\\MICROS~2\\2022\\BUILDT~1\\VC\\Tools\\MSVC\\1442~1.344\\bin\\Hostx64\\x64\\cl.exe /nologo /TP -DGTEST_HAS_RTTI=0 -DUNICODE -D_CRT_NONSTDC_NO_DEPRECATE -D_CRT_NONSTDC_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE -D_CRT_SECURE_NO_WARNINGS -D_HAS_EXCEPTIONS=0 -D_SCL_SECURE_NO_DEPRECATE -D_SCL_SECURE_NO_WARNINGS -D_UNICODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -ID:\\projects\\TheRock\\build\\compiler\\amd-llvm\\build\\lib\\Support -ID:\\projects\\TheRock\\compiler\\amd-llvm\\llvm\\lib\\Support -ID:\\projects\\TheRock\\build\\compiler\\amd-llvm\\build\\include -ID:\\projects\\TheRock\\compiler\\amd-llvm\\llvm\\include /DWIN32;/D_WINDOWS;/EHsc /DWIN32 /D_WINDOWS /Zc:inline /Zc:preprocessor /Zc:__cplusplus /Oi /bigobj /permissive- /W4 -wd4141 -wd4146 -wd4244 -wd4267 -wd4291 -wd4351 -wd4456 -wd4457 -wd4458 -wd4459 -wd4503 -wd4624 -wd4722 -wd4100 -wd4127 -wd4512 -wd4505 -wd4610 -wd4510 -wd4702 -wd4245 -wd4706 -wd4310 -wd4701 -wd4703 -wd4389 -wd4611 -wd4805 -wd4204 -wd4577 -wd4091 -wd4592 -wd4319 -wd4709 -wd5105 -wd4324 -w14062 -we4238 /Gw /O2 /Ob1 /DNDEBUG -std:c++17 -MD -Zi /EHs-c- /GR- /showIncludes /Folib\\Support\\CMakeFiles\\LLVMSupport.dir\\Allocator.cpp.obj /Fdlib\\Support\\CMakeFiles\\LLVMSupport.dir\\LLVMSupport.pdb /FS -c D:\\projects\\TheRock\\compiler\\amd-llvm\\llvm\\lib\\Support\\Allocator.cpp [2025-02-19T11:27:22.312459 28480] Hostname: DESKTOP-HTCE3IS [2025-02-19T11:27:22.312465 28480] Working directory: D:\projects\TheRock\build\compiler\amd-llvm\build [2025-02-19T11:27:22.312475 28480] Compiler: C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\MSVC\1442~1.344\bin\Hostx64\x64\cl.exe [2025-02-19T11:27:22.312476 28480] Compiler type: msvc [2025-02-19T11:27:22.312846 28480] Detected input file: D:\projects\TheRock\compiler\amd-llvm\llvm\lib\Support\Allocator.cpp [2025-02-19T11:27:22.312852 28480] Compiler option -Zi is unsupported [2025-02-19T11:27:22.312857 28480] Failed; falling back to running the real compiler ``` From there it was whack-a-mole trying to find where the `-Zi` flag came from, checking source code and eventually finding https://github.com/llvm/llvm-project/blob/c0c42c8b3213520700f15587ab8aa4477a286ff9/cmake/Modules/CMakePolicy.cmake#L9-L20. Reading the docs informed me that my prior workaround of manipulating `CMAKE_C_FLAGS_RELWITHDEBINFO` is outdated and actually won't work when the new policy is enabled. With the right fix, I deleted my build directory, re-ran a full build, and checked the ccache stats and log files to ensure that ccache was operating as expected again.
Progress on #36. This code looks quite brittle: https://github.com/ROCm/HIPIFY/blob/b803a5270bd9a63da25407181ca0346d472415e1/CMakeLists.txt#L90-L93. I would expect any override of the compiler toolchain like that to be performed via a different mechanism like a check and assert on the existing value (not overwriting), a toolchain file, `add_custom_command`, or `execute_process`... not overwriting the compiler entirely like that.
Progress on #36, following up on #79 (comment). Need to be careful with some of these deps since `therock_finalize_features` can enable a feature implicitly if an explicitly enabled feature depends on it.
Overview
We're aiming to support as much of the toolkit as possible on Windows.
Current status
build_windows_packages.yml
workflow runs as part ofci.yml
on every commit. The workflow builds currently supported subprojects but does not yet build or publish any artifacts.Initial experiments 2025-01-30
Expand to see details
Setup notes
git from https://git-scm.com/downloads (latest version
2.45.1
) then enabled symlinks withBuild Tools for Visual Studio 2022 from https://visualstudio.microsoft.com/downloads/ (latest version
17.12.4
)CMake from https://cmake.org/download/ (latest, version
3.31.5
)Ninja as a CMake generator from https://ninja-build.org/ (latest, version
1.12.1
)ccache from https://ccache.dev/ (latest, version
4.10.2
)Fetching sources
After cloning TheRock, I tried to checkout sources using
python ./build_tools/fetch_sources.py
. That needed the repo tool (https://gerrit.googlesource.com/git-repo/+/HEAD/README.md, docs at https://source.android.com/docs/setup/reference/repo), which is not distributed on Windows, so I downloaded the script manually. I did then have to editbuild_tools/fetch_sources.py
to execpython D:/path/to/repo
instead of justrepo
. We can teach that script how to download the file on its own and run it in a portable way.Configure and build
After fetching sources, I tried configuring and building with CMake (under VSCode):
[proc] Executing command: "C:\Program Files\CMake\bin\cmake.EXE" -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTHEROCK_ENABLE_RCCL=OFF -DTHEROCK_ENABLE_MATH_LIBS=OFF -DTHEROCK_ENABLE_ML_FRAMEWORKS=OFF -DTHEROCK_AMDGPU_FAMILIES=gfx110X-dgpu -UTHEROCK_AMDGPU_TARGETS -UTHEROCK_AMDGPU_DIST_BUNDLE_NAME -DLLVM_BUILD_LLVM_DYLIB=OFF --no-warn-unused-cli -SD:/projects/TheRock -Bd:/projects/TheRock/build -G Ninja
I found a few Windows issues specific to TheRock:
Next, I'm finding issues in the subprojects that TheRock includes. Sample logs: https://gist.github.com/ScottTodd/0a6625c4502169c5d54535bf122fc7fd. I'm not sure how deep these issues go, and if some of them are fixed on development branches. Here are a few I've found so far:
rocm-core
__attribute__
is not portable to all compilers (specifically MSVC) in https://github.com/ROCm/rocm-core/blob/master/rocm_version.h.in, do something likeNeed a similar fix for
__attribute__((nonnull))
, maybe_Ret_nonnull
(https://stackoverflow.com/a/78737973)rocm-core
link.h
anddlfcn.h
headers are Unix-only in https://github.com/ROCm/rocm-core/blob/master/rocm_getpath.cpp (use something likeLoadLibraryA()
on a.dll
instead ofdlopen()
on a.so
). The headers should also only be included ifBUILD_SHARED_LIBS
is defined (see next bullet)-DBUILD_SHARED_LIBS=OFF
here helps a bit:TheRock/base/CMakeLists.txt
Line 22 in 96ebf46
rocm-core in the same file uses
PATH_MAX
which is nonstandard. There is aFILENAME_MAX
that is standard but https://stackoverflow.com/a/65174437 says it is not to be trusted for memory allocation. Could do something likemin(FILENAME_MAX, 4096)
(choosing some sensible value that is notINT_MAX
)llvm warns about
Generating libLLVM is not supported on MSVC
in https://github.com/ROCm/llvm-project/blob/amd-staging/llvm/tools/llvm-shlib/CMakeLists.txt so I set-DLLVM_BUILD_LLVM_DYLIB=OFF
in my CMake configure command. Not sure if that was load bearing for anything else yet. Could auto-detect and flip that off by default in TheRock or one of the subprojects that includes it.rocm_smi_lib sets
CMAKE_CXX_FLAGS
that assume gcc/clang: https://github.com/ROCm/rocm_smi_lib/blob/e5c5a1d5b7cba41ab801a987fb6c466503411804/CMakeLists.txt#L79-L85. It seems like https://github.com/ROCm/amdsmi is the successor to rocm_smi, but it has the same issue: https://github.com/ROCm/amdsmi/blob/9872083b491c9b136496fd493414aca5c6c144cb/CMakeLists.txt#L99-L104. Both projects also only claim to support Linux.The text was updated successfully, but these errors were encountered: