s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code #137372

folkertdev · 2025-02-21T13:54:50Z

I'm trying to add an implementation of vec_extract for the s390x-unknown-linux-gnu target in stdarch:

#include <vecintrin.h>

unsigned int foo(vector unsigned int a, signed int b) { 
    return vec_extract(a, b);
}

Turns into this LLVM

define dso_local zeroext i32 @foo(<4 x i32> noundef %a, i32 noundef signext %b) local_unnamed_addr {
entry:
  %and.i = and i32 %b, 3
  %vecext.i = extractelement <4 x i32> %a, i32 %and.i
  ret i32 %vecext.i
}

And generates the following assembly

foo:
        nilf    %r2, 3              # bitwise and with 0b11 to prevent UB due to index out of bounds
        vlgvf   %r0, %v24, 0(%r2)   # put the element at index %r2 from the vector %24 into %r0
        llgfr   %r2, %r0            # zero-extend that value, put it into %r2
        br      %r14                # ret

in particular, the vlgvf (f for fullword, there are variations for other widths) is the relevant instruction here. It extracts the value at the given index.

Contrary to most other targets, the index argument to a vec_extract does not need to be const. The std::intrinsics::simd::simd_extract function does need its index argument to be const, and therefore can't straightforwardly be used to implement vec_extract.

Attempt 1

I tried simple field extraction:

https://godbolt.org/z/sbhYj316x

#![feature(core_intrinsics, repr_simd, s390x_target_feature)]

#[repr(simd)]
struct vector_unsigned_int([u32; 4]);

#[no_mangle]
unsafe extern "C" fn extract_element_bad(v: vector_unsigned_int, idx: u32) -> u32 {
    v.0[idx as usize % 4]
}

(extern "C" is used so that the vector is passed by-value, but we see the same assembly when the vector is created within the function)

Indexing into the underlying array in this way may soon be banned, though there is an alternative approach that does the same thing. Unfortunately, this version does not optimize well:

define noundef zeroext i32 @extract_element_bad(<4 x i32> %0, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %v = alloca [16 x i8], align 8
  store <4 x i32> %0, ptr %v, align 8
  %1 = and i32 %idx, 3
  %_3 = zext nneg i32 %1 to i64
  %2 = getelementptr inbounds nuw i32, ptr %v, i64 %_3
  %_0 = load i32, ptr %2, align 4
  ret i32 %_0
}

The portable-simd implementation of Index appears to be doing the same thing, and generates the same code https://godbolt.org/z/eecM6qdbW. That totally makes sense for most targets, because a pointer load is the best you can do.

Attempt 2

I did find that this version does optimize well

#[no_mangle]
unsafe extern "C" fn extract_element_good(v: vector_unsigned_int, idx: u32) -> u32 {
    match idx % 4 {
        0 => simd_extract(v, 0),
        1 => simd_extract(v, 1),
        2 => simd_extract(v, 2),
        3 => simd_extract(v, 3),
        _ => unreachable!(),
    }
}

define noundef zeroext i32 @extract_element_good(<4 x i32> %v, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %_3 = and i32 %idx, 3
  %switch.idx.cast = zext nneg i32 %_3 to i64
  %0 = extractelement <4 x i32> %v, i64 %switch.idx.cast
  ret i32 %0
}

but that is unwieldy kind of unwieldy, and while I can make it work for stdarch it won't work for portable_simd.

Solutions

I think there should be a way of indexing into a vector that emits an extractelement rather than a getelementptr. Semantically that seems more accurate (and might optimize better in some cases?), even though on most targets the generated assembly will be the same.

Some things I'm not sure about

how big a deal this is for s390x (cc @uweigand do you know how important generating vlgvf is?)
maybe adding the extra intrinsic does still run into the complexity that Ban projecting into repr(simd) types compiler-team#838 wants to fix
maybe extracting with a non-const value is a terrible idea on some/most targets and unimplemented by design

The text was updated successfully, but these errors were encountered:

uweigand · 2025-02-21T19:17:34Z

It does indeed appear to be the case that Intel and Arm don't have instructions to extract vector lanes with a variable index (except for some special cases with SVE it seems). Not sure why this is the case. But s390x is certainly not the only architecture to provide this support; PowerPC has basically the same capability, and from what I can see in the LLVM sources at least RISC-V, Hexagon, and LoongArch also support this.

Of course even on s390x vec_extract is much more likely to be used with a constant index than a variable one. So as long as that case is supported well, I guess the solution would be acceptable. I'm still not quite sure why Rust decided to disable variable indexes with simd_extract when the underlying LLVM extractelement does support it ...

workingjubilee · 2025-02-21T19:46:12Z

Because when we use it in our intrinsic implementations, if we permit the index to be variable, we do not generate the assembly we want. Simple as.

rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Feb 21, 2025

folkertdev linked a pull request Feb 22, 2025 that will close this issue

add core::intrinsics::simd::{simd_extract_dyn, simd_insert_dyn} #137447

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code #137372

s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code #137372

folkertdev commented Feb 21, 2025

uweigand commented Feb 21, 2025

workingjubilee commented Feb 21, 2025

s390x: extracting an element at a non-const index from a SIMD vector generates bad code #137372

s390x: extracting an element at a non-const index from a SIMD vector generates bad code #137372

Comments

folkertdev commented Feb 21, 2025

Attempt 1

Attempt 2

Solutions

uweigand commented Feb 21, 2025

workingjubilee commented Feb 21, 2025

s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code #137372

s390x: extracting an element at a non-`const` index from a SIMD vector generates bad code #137372