Align data in Buffer.Memmove for arm64 #93214

Merged: 6 commits from align16-memmove merged into dotnet:main on Oct 9, 2023

Conversation

@EgorBo (Member) commented Oct 9, 2023

Currently, we never switch to the native memmove on ARM64 for Mac/Linux. Instead of re-enabling that path, I tried manually aligning dest; in most cases this showed small wins or the same performance as the native memmove on Apple M2 (macOS) and Ampere (Linux).
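
In rough terms the trick looks like this (a minimal sketch, not the actual Buffer.Memmove code; it assumes a 16-byte SIMD width, len >= 16 and non-overlapping buffers):

using System.Runtime.CompilerServices;

static unsafe void CopyWithAlignedDest(byte* dest, byte* src, nuint len)
{
    // Copy the first 16 bytes with an unaligned store.
    Unsafe.CopyBlockUnaligned(dest, src, 16);

    // Round dest up to the next 16-byte boundary; a few bytes near the start may get
    // copied twice, which is harmless for a forward copy of non-overlapping buffers.
    uint skip = 16 - (uint)((nuint)dest & 15);
    dest += skip; src += skip; len -= skip;

    // Main loop: the destination address of every store is now 16-byte aligned.
    while (len >= 16)
    {
        Unsafe.CopyBlockUnaligned(dest, src, 16);
        dest += 16; src += 16; len -= 16;
    }

    // Cover the remaining tail with one unaligned 16-byte block ending at the original end.
    if (len != 0)
        Unsafe.CopyBlockUnaligned(dest + len - 16, src + len - 16, 16);
}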

What's more important, it shows nice wins compared to the current impl when data is not 16-byte aligned (even 8-byte-aligned data pays a penalty), which is quite a common case: the GC only guarantees 8-byte alignment, and e.g. a String's character data is only ever 4-byte aligned. A benchmark on Apple M2:

static byte[] _src = new byte[10000];
static byte[] _dst = new byte[10000];

[Benchmark]
public void CopyTo_align4() => _src.AsSpan(4).CopyTo(_dst.AsSpan(4));

Presumably, it should also help with noise in microbenchmarks: the GC gives us 8-byte alignment, which may or may not happen to also be 16-byte aligned from run to run.

Method         Toolchain                Mean      Error    StdDev   Ratio
CopyTo_align4  /Core_Root_base/corerun  195.1 ns  0.75 ns  0.70 ns  1.68
CopyTo_align4  /Core_Root_pr/corerun    116.2 ns  0.54 ns  0.50 ns  1.00
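
To see that variability directly, one can print the 16-byte offset of the array data (an illustrative snippet, not part of the PR):

unsafe
{
    fixed (byte* p = _src)
    {
        // The GC only guarantees 8-byte alignment for objects on 64-bit, so for a byte[]
        // this typically prints 0 or 8 and can change from run to run.
        Console.WriteLine((nuint)p % 16);
    }
}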

I was using the following benchmark to get exact alignment in my tests:

// Pointers that are exactly 4 bytes past a 64-byte boundary, i.e. only 4-byte aligned.
static unsafe byte* _align4_src;
static unsafe byte* _align4_dst;

static unsafe Benchmarks()
{
    // Over-align with AlignedAlloc, then offset by 4 to get a known 4-byte alignment.
    byte* align64_src = (byte*)NativeMemory.AlignedAlloc(1_000_000, 64);
    byte* align64_dst = (byte*)NativeMemory.AlignedAlloc(1_000_000, 64);
    _align4_src = align64_src + 4;
    _align4_dst = align64_dst + 4;
}

[Benchmark]
[Arguments(...)]
public unsafe void CopyTo_align4(int len) =>
    new Span<byte>(_align4_src, len).CopyTo(new Span<byte>(_align4_dst, len));

@ghost assigned EgorBo Oct 9, 2023
@dotnet-issue-labeler bot added the needs-area-label label Oct 9, 2023
@EgorBo added the area-System.Buffers label and removed the needs-area-label label Oct 9, 2023
@ghost commented Oct 9, 2023

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

Author: EgorBo
Assignees: EgorBo
Labels: area-System.Buffers
Milestone: -

@EgorBo (Member, Author) commented Oct 9, 2023

@jkotas PTAL

@jkotas (Member) commented Oct 9, 2023

// TODO-ARM64-UNIX-OPT revisit when glibc optimized memmove is in Linux distros

Would it be better to fix this TODO instead? What is the perf of your microbenchmark with MemmoveNativeThreshold = 2048?
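
For context, the threshold gates when the managed implementation falls back to the platform memmove; roughly like the following simplified sketch (not the exact runtime code; NativeMemmove is a hypothetical stand-in for the real native-call path):

using System.Runtime.CompilerServices;

// Simplified sketch (not the exact runtime code) of how the threshold gates the native fallback.
private const nuint MemmoveNativeThreshold = 2048;   // the value suggested above

internal static void Memmove(ref byte dest, ref byte src, nuint len)
{
    if (len > MemmoveNativeThreshold)
    {
        // Large copies: defer to the platform's optimized memmove via a native call.
        NativeMemmove(ref dest, ref src, len);
        return;
    }

    // Smaller copies fall through to the managed SIMD copy loops (omitted here).
}

// Hypothetical stand-in for the real native-call path, just to keep the sketch compilable.
private static void NativeMemmove(ref byte dest, ref byte src, nuint len) =>
    Unsafe.CopyBlockUnaligned(ref dest, ref src, (uint)len);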

@EgorBo (Member, Author) commented Oct 9, 2023

// TODO-ARM64-UNIX-OPT revisit when glibc optimized memmove is in Linux distros

Would it be better to fix this TODO instead? What is the perf of your microbenchmark with MemmoveNativeThreshold = 2048?

Do you mean jumping into native for len > 512 unconditionally? I assume it was disabled for a reason, and on some platforms it's slow?

@EgorBo (Member, Author) commented Oct 9, 2023

What is the perf of your microbenchmark with MemmoveNativeThreshold = 2048

Pretty much the same; sometimes the managed impl is a nanosecond or two faster. But that's on Apple M2, where memmove is quite optimized; I am not so sure about other platforms/distros, e.g. I've seen a glibc version that didn't try to align data.

@jkotas (Member) commented Oct 9, 2023

I assume it was disabled for a reason and on some platforms it's slow?

Yes, it was slow during the initial Arm64 bring-up, when we ran on glibc without any Arm64-specific optimizations. I would expect memcpy to be optimized on all current Arm64 systems that we run on. That is what the TODO alludes to.

@EgorBo (Member, Author) commented Oct 9, 2023

Enabled this for XARCH as well, since the suggested way to align is cheaper. I am seeing wins even on XARCH now for misaligned access:

Method         Job         Toolchain                    len   Mean
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       256   2.284 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  256   2.299 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       512   2.986 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  512   5.534 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       1024  5.717 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  1024  8.299 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       2048  11.466 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  2048  16.798 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       4096  26.465 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  4096  27.076 ns

Len = 4096 is handled by the native memmove.

@EgorBo (Member, Author) commented Oct 9, 2023

For reference, here is the codegen of a 64-byte block copy:

// Empty struct whose declared size forces a 64-byte block copy on assignment.
[StructLayout(LayoutKind.Sequential, Size = 64)]
private struct Block64 { }

static void CopyBlock64(ref byte dest, ref byte src)
{
    // The JIT unrolls this single struct assignment into the SIMD moves shown below.
    Unsafe.As<byte, Block64>(ref dest) = Unsafe.As<byte, Block64>(ref src);
}

AVX512 CPU:

       vmovdqu32 zmm0, zmmword ptr [rdx]
       vmovdqu32 zmmword ptr [rcx], zmm0

AVX CPU:

       vmovdqu  ymm0, ymmword ptr [rdx]
       vmovdqu  ymmword ptr [rcx], ymm0
       vmovdqu  ymm0, ymmword ptr [rdx+0x20]
       vmovdqu  ymmword ptr [rcx+0x20], ymm0

SSE2 CPU:

       movups   xmm0, xmmword ptr [rdx]
       movups   xmmword ptr [rcx], xmm0
       movups   xmm0, xmmword ptr [rdx+0x10]
       movups   xmmword ptr [rcx+0x10], xmm0
       movups   xmm0, xmmword ptr [rdx+0x20]
       movups   xmmword ptr [rcx+0x20], xmm0
       movups   xmm0, xmmword ptr [rdx+0x30]
       movups   xmmword ptr [rcx+0x30], xmm0

ARM64 NEON CPU:

       ldp     q16, q17, [x1]
       stp     q16, q17, [x0]
       ldp     q16, q17, [x1, #0x20]
       stp     q16, q17, [x0, #0x20]

Two notes:

  1. It probably makes sense for the JIT to use at least 2 registers for BLK unrolling for better pipelining (or even 4 on some x-archs).
     Although I wasn't able to detect wins from that on XARCH, there were wins on ARM64.
  2. AVX-512 CPUs might benefit from a Block128 copy or even Block256 (my Ryzen with AVX-512 won't benefit because its AVX-512 is basically 2x AVX2); see the sketch below.
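
A possible shape for note 2, following the same pattern as Block64 above (Block128 is hypothetical and does not exist in the runtime today):

[StructLayout(LayoutKind.Sequential, Size = 128)]
private struct Block128 { }   // hypothetical wider block, mirroring Block64

static void CopyBlock128(ref byte dest, ref byte src)
{
    // On a full-width AVX-512 CPU this could lower to a pair of zmm load/store moves.
    Unsafe.As<byte, Block128>(ref dest) = Unsafe.As<byte, Block128>(ref src);
}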

@jkotas (Member) left a comment

Thanks

@EgorBo (Member, Author) commented Oct 9, 2023

Hopefully, this will make benchmarks like this more stable:

[image: benchmark results chart]

@EgorBo EgorBo merged commit f20b2b5 into dotnet:main Oct 9, 2023
@EgorBo EgorBo deleted the align16-memmove branch October 9, 2023 21:53