Align data in Buffer.Memmove for arm64 #93214
Conversation
Tagging subscribers to this area: @dotnet/area-system-buffers

Issue Details

Currently, we never switch to the native memmove on ARM64 for Mac/Linux. Instead of enabling it back, I tested manual alignment for dest, and for most cases saw small wins or the same performance as the native memmove on Apple M2 (macOS) and Ampere (Linux).

More importantly, it shows nice wins over the current implementation when the data is not 16-byte aligned (even just 8-byte alignment is expensive), which is quite a common case since the GC only offers 8-byte alignment and, e.g., String's data is always 4-byte aligned, etc. A benchmark on Apple M2:

static byte[] _src = new byte[10000];
static byte[] _dst = new byte[10000];

[Benchmark]
public void CopyTo_align4() => _src.AsSpan(4).CopyTo(_dst.AsSpan(4));

Presumably, it should also help with the noise in microbenchmarks, since the GC gives us 8-byte alignment that may or may not happen to also be 16-byte aligned, depending on the Moon's phase.
|
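(For context, a minimal sketch of the alignment idea described above: pre-copy just enough bytes to bring dest to a 16-byte boundary, then do the bulk copy from there. The helper names and structure here are illustrative, not the PR's actual Buffer.Memmove changes.)

using System;

static unsafe class AlignedCopySketch
{
    // Illustrative only: align dest to 16 bytes before the main copy.
    static void Copy(byte* dest, byte* src, nuint len)
    {
        // Bytes needed to reach the next 16-byte boundary of dest (0..15).
        nuint padding = (nuint)(-(nint)dest) & 15;
        if (padding > len)
            padding = len;

        // Copy the misaligned head (simplified to a byte loop here).
        for (nuint i = 0; i < padding; i++)
            dest[i] = src[i];

        dest += padding;
        src += padding;
        len -= padding;

        // Bulk copy: dest is now 16-byte aligned; src may still be unaligned.
        Buffer.MemoryCopy(src, dest, len, len);
    }
}

A common reason to align dest rather than src is that misaligned stores are usually more costly than misaligned loads.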
@jkotas PTAL |
Would it be better to fix this TODO instead? What is the perf of your microbenchmark with |
Do you mean to jump into native for len > 512 unconditionally? I assume it was disabled for a reason and on some platforms it's slow? |
Pretty much the same; sometimes the managed impl is a nanosec or two faster. But that's on Apple M2, where memmove is quite optimized. I am not so sure about other platforms/distros, e.g. I've seen a glibc version that didn't try to align data. |
Yes, it was slow during the initial Arm64 bring-up, when we ran on glibc without any Arm64-specific optimizations. I would expect that memcpy is optimized on all current Arm64 systems we run on. That is what the TODO alludes to. |
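(For readers who haven't seen the TODO in question: the idea is to hand large copies to the platform's optimized native memmove rather than staying in the managed loop. A hedged sketch follows; the threshold value and the interop shape are illustrative, not the actual CoreLib internals.)

using System.Runtime.InteropServices;

static unsafe class MemmoveDispatchSketch
{
    const uint NativeThreshold = 512; // illustrative, not the real constant

    // libc's memmove: void *memmove(void *dest, const void *src, size_t n);
    [DllImport("libc", EntryPoint = "memmove")]
    static extern void* NativeMemmove(void* dest, void* src, nuint len);

    static void Memmove(byte* dest, byte* src, nuint len)
    {
        if (len > NativeThreshold)
        {
            // Let glibc/libSystem do the heavy lifting for large copies.
            NativeMemmove(dest, src, len);
            return;
        }

        // ... managed SIMD copy for small/medium lengths ...
    }
}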
Enabled it for XARCH as well, since the suggested way to align is cheaper. I am seeing wins even on XArch now for misaligned access:
Len = 4096 is done in native. |
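(A hedged guess at what a cheaper alignment step can look like: issue one unaligned 16-byte copy for the head, then round dest up to the next 16-byte boundary and let the aligned loop overlap a few already-written bytes. Purely illustrative; not claimed to be this PR's exact code.)

using System.Runtime.Intrinsics;

static unsafe class OverlapAlignSketch
{
    // Requires len >= 16; illustrative only. Vector128.Load/Store are .NET 7+.
    static void Copy(byte* dest, byte* src, nuint len)
    {
        // 1. One unaligned 16-byte copy covers whatever head is misaligned.
        Vector128.Load(src).Store(dest);

        // 2. Round dest up to the next 16-byte boundary; the first aligned
        //    iteration may rewrite a few bytes from step 1, which is harmless.
        nuint skip = 16 - ((nuint)dest & 15);
        dest += skip;
        src += skip;
        len -= skip;

        // ... main loop with 16-byte-aligned stores over the remaining bytes ...
    }
}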
For reference, the codegen of the block copy:

[StructLayout(LayoutKind.Sequential, Size = 64)]
private struct Block64 { }

static void CopyBlock64(ref byte dest, ref byte src)
{
    Unsafe.As<byte, Block64>(ref dest) = Unsafe.As<byte, Block64>(ref src);
}

AVX512 CPU:

vmovdqu32 zmm0, zmmword ptr [rdx]
vmovdqu32 zmmword ptr [rcx], zmm0

AVX CPU:

vmovdqu ymm0, ymmword ptr [rdx]
vmovdqu ymmword ptr [rcx], ymm0
vmovdqu ymm0, ymmword ptr [rdx+0x20]
vmovdqu ymmword ptr [rcx+0x20], ymm0

SSE2 CPU:

movups xmm0, xmmword ptr [rdx]
movups xmmword ptr [rcx], xmm0
movups xmm0, xmmword ptr [rdx+0x10]
movups xmmword ptr [rcx+0x10], xmm0
movups xmm0, xmmword ptr [rdx+0x20]
movups xmmword ptr [rcx+0x20], xmm0
movups xmm0, xmmword ptr [rdx+0x30]
movups xmmword ptr [rcx+0x30], xmm0

ARM64 NEON CPU:

ldp q16, q17, [x1]
stp q16, q17, [x0]
ldp q16, q17, [x1, #0x20]
stp q16, q17, [x0, #0x20]

Two notes:
|
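(To connect the codegen above with how it gets used: a hedged sketch of driving CopyBlock64 from an unrolled loop, with the tail handled separately. Illustrative only, not the PR's actual loop.)

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class BlockCopySketch
{
    // Copies len bytes, 64 at a time, via the Block64 struct copy shown above.
    static void CopyBulk(ref byte dest, ref byte src, nuint len)
    {
        nuint offset = 0;
        while (len - offset >= 64)
        {
            CopyBlock64(ref Unsafe.Add(ref dest, (nint)offset),
                        ref Unsafe.Add(ref src, (nint)offset));
            offset += 64;
        }
        // ... copy the remaining (len - offset) tail bytes ...
    }

    static void CopyBlock64(ref byte dest, ref byte src) =>
        Unsafe.As<byte, Block64>(ref dest) = Unsafe.As<byte, Block64>(ref src);

    [StructLayout(LayoutKind.Sequential, Size = 64)]
    private struct Block64 { }
}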
Thanks
I was using the following benchmark to get exact alignment in my tests: