Skip to content

Commit

Permalink
[Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit r…
Browse files Browse the repository at this point in the history
…elease (#91862)

### Context

We have a longstanding performance issue on Windows, where to this day,
the default heap allocator is still lockfull. With the number of cores
increasing, building and using LLVM with the default Windows heap
allocator is sub-optimal. Notably, the ThinLTO link times with LLD are
extremely long, and increase proportionally with the number of cores in
the machine.

In
llvm/llvm-project@a6a37a2,
I introduced the ability build LLVM with several popular lock-free
allocators. Downstream users however have to build their own toolchain
with this option, and building an optimal toolchain is a bit tedious and
long. Additionally, LLVM is now integrated into Visual Studio, which
AFAIK re-distributes the vanilla LLVM binaries/installer. The point
being that many users are impacted and might not be aware of this
problem, or are unable to build a more optimal version of the toolchain.

The symptom before this PR is that most of the CPU time goes to the
kernel (darker blue) when linking with ThinLTO:

![16c_ryzen9_windows_heap](https://github.com/llvm/llvm-project/assets/37383324/86c3f6b9-6028-4c1a-ba60-a2fa3876fba7)

With this PR, most time is spent in user space (light blue):

![16c_ryzen9_rpmalloc](https://github.com/llvm/llvm-project/assets/37383324/646b88f3-5b6d-485d-a2e4-15b520bdaf5b)

On higher core count machines, before this PR, the CPU usage becomes
pretty much flat because of contention:

<img width="549" alt="VM_176_windows_heap"
src="https://github.com/llvm/llvm-project/assets/37383324/f27d5800-ee02-496d-a4e7-88177e0727f0">

With this PR, similarily most CPU time is now used:

<img width="549" alt="VM_176_with_rpmalloc"
src="https://github.com/llvm/llvm-project/assets/37383324/7d4785dd-94a7-4f06-9b16-aaa4e2e505c8">

### Changes in this PR

The avenue I've taken here is to vendor/re-licence rpmalloc in-tree, and
use it when building the Windows 64-bit release. Given the permissive
rpmalloc licence, prior discussions with the LLVM foundation and
@lattner suggested this vendoring. Rpmalloc's author (@mjansson) kindly
agreed to ~~donate~~ re-licence the rpmalloc code in LLVM (please do
correct me if I misinterpreted our past communications).

I've chosen rpmalloc because it's small and gives the best value
overall. The source code is only 4 .c files. Rpmalloc is statically
replacing the weak CRT alloc symbols at link time, and has no dynamic
patching like mimalloc. As an alternative, there were several
unsuccessfull attempts made by Russell Gallop to use SCUDO in the past,
please see thread in https://reviews.llvm.org/D86694. If later someone
comes up with a PR of similar performance that uses SCUDO, we could then
delete this vendored rpmalloc folder.

I've added a new cmake flag `LLVM_ENABLE_RPMALLOC` which essentialy sets
`LLVM_INTEGRATED_CRT_ALLOC` to the in-tree rpmalloc source.

### Performance

The most obvious test is profling a ThinLTO linking step with LLD. I've
used a Clang compilation as a testbed, ie.
```
set OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 -Xclang -O3 -fstrict-aliasing -march=native -flto=thin -fwhole-program-vtables -fuse-ld=lld
cmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS="%OPTS%" -DCMAKE_C_FLAGS="%OPTS%" -DLLVM_ENABLE_LTO=THIN
```
I've profiled the linking step with no LTO cache, with Powershell, such
as:
```
Measure-Command { lld-link /nologo @CMakeFiles\clang.rsp /out:bin\clang.exe /implib:lib\clang.lib /pdb:bin\clang.pdb /version:0.0 /machine:x64 /STACK:10000000 /DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /MANIFEST:EMBED,ID=1 }`
```

Timings:

| Machine | Allocator | Time to link |
|--------|--------|--------|
| 16c/32t AMD Ryzen 9 5950X | Windows Heap | 10 min 38 sec |
|  | **Rpmalloc** | **4 min 11 sec** |
| 32c/64t AMD Ryzen Threadripper PRO 3975WX | Windows Heap | 23 min 29
sec |
|  | **Rpmalloc** | **2 min 11 sec** |
|  | **Rpmalloc + /threads:64** | **1 min 50 sec** |
| 176 vCPU (2 socket) Intel Xeon Platinium 8481C (fixed clock 2.7 GHz) |
Windows Heap | 43 min 40 sec |
|  | **Rpmalloc** | **1 min 45 sec** |

This also improves the overall performance when building with clang-cl.
I've profiled a regular compilation of clang itself, ie:
```
set OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 /arch:AVX -fuse-ld=lld
cmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS="%OPTS%" -DCMAKE_C_FLAGS="%OPTS%"
```
This saves approx. 30 sec when building on the Threadripper PRO 3975WX:
```
(default Windows Heap)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     392.716 s ±  3.830 s    [User: 17734.025 s, System: 1078.674 s]
  Range (min … max):   390.127 s … 399.449 s    5 runs

(rpmalloc)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     360.824 s ±  1.162 s    [User: 15148.637 s, System: 905.175 s]
  Range (min … max):   359.208 s … 362.288 s    5 runs
```
# Conflicts:
#	llvm/docs/ReleaseNotes.rst
#	llvm/utils/release/build_llvm_release.bat
  • Loading branch information
aganea authored and gmh5225 committed Sep 27, 2024
1 parent 88dc495 commit cf9b063
Show file tree
Hide file tree
Showing 11 changed files with 5,797 additions and 227 deletions.
30 changes: 29 additions & 1 deletion llvm/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -718,7 +718,35 @@ if( WIN32 AND NOT CYGWIN )
endif()
set(LLVM_NATIVE_TOOL_DIR "" CACHE PATH "Path to a directory containing prebuilt matching native tools (such as llvm-tblgen)")

set(LLVM_INTEGRATED_CRT_ALLOC "" CACHE PATH "Replace the Windows CRT allocator with any of {rpmalloc|mimalloc|snmalloc}. Only works with CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.")
set(LLVM_ENABLE_RPMALLOC "" CACHE BOOL "Replace the CRT allocator with rpmalloc.")
if(LLVM_ENABLE_RPMALLOC)
if(NOT (CMAKE_SYSTEM_NAME MATCHES "Windows|Linux"))
message(FATAL_ERROR "LLVM_ENABLE_RPMALLOC is only supported on Windows and Linux.")
endif()
if(LLVM_USE_SANITIZER)
message(FATAL_ERROR "LLVM_ENABLE_RPMALLOC cannot be used along with LLVM_USE_SANITIZER!")
endif()
if(WIN32)
if(CMAKE_CONFIGURATION_TYPES)
foreach(BUILD_MODE ${CMAKE_CONFIGURATION_TYPES})
string(TOUPPER "${BUILD_MODE}" uppercase_BUILD_MODE)
if(uppercase_BUILD_MODE STREQUAL "DEBUG")
message(WARNING "The Debug target isn't supported along with LLVM_ENABLE_RPMALLOC!")
endif()
endforeach()
else()
if(CMAKE_BUILD_TYPE AND uppercase_CMAKE_BUILD_TYPE STREQUAL "DEBUG")
message(FATAL_ERROR "The Debug target isn't supported along with LLVM_ENABLE_RPMALLOC!")
endif()
endif()
endif()

# Override the C runtime allocator with the in-tree rpmalloc
set(LLVM_INTEGRATED_CRT_ALLOC "${CMAKE_CURRENT_SOURCE_DIR}/lib/Support")
set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded")
endif()

set(LLVM_INTEGRATED_CRT_ALLOC "${LLVM_INTEGRATED_CRT_ALLOC}" CACHE PATH "Replace the Windows CRT allocator with any of {rpmalloc|mimalloc|snmalloc}. Only works with CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.")
if(LLVM_INTEGRATED_CRT_ALLOC)
if(NOT WIN32)
message(FATAL_ERROR "LLVM_INTEGRATED_CRT_ALLOC is only supported on Windows.")
Expand Down
9 changes: 8 additions & 1 deletion llvm/docs/CMake.rst
Original file line number Diff line number Diff line change
Expand Up @@ -708,8 +708,15 @@ enabled sub-projects. Nearly all of these variable names begin with
$ D:\git> git clone https://github.com/mjansson/rpmalloc
$ D:\llvm-project> cmake ... -DLLVM_INTEGRATED_CRT_ALLOC=D:\git\rpmalloc
This flag needs to be used along with the static CRT, ie. if building the
This option needs to be used along with the static CRT, ie. if building the
Release target, add -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.
Note that rpmalloc is also supported natively in-tree, see option below.

**LLVM_ENABLE_RPMALLOC**:BOOL
Similar to LLVM_INTEGRATED_CRT_ALLOC, embeds the in-tree rpmalloc into the
host toolchain as a C runtime allocator. The version currently used is
rpmalloc 1.4.5. This option also implies linking with the static CRT, there's
no need to provide CMAKE_MSVC_RUNTIME_LIBRARY.

**LLVM_INSTALL_DOXYGEN_HTML_DIR**:STRING
The path to install Doxygen-generated HTML documentation to. This path can
Expand Down
Loading

0 comments on commit cf9b063

Please sign in to comment.