Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s390x: extracting an element at a non-const index from a SIMD vector generates bad code #137372

Open
folkertdev opened this issue Feb 21, 2025 · 2 comments · May be fixed by #137447
Open

s390x: extracting an element at a non-const index from a SIMD vector generates bad code #137372

folkertdev opened this issue Feb 21, 2025 · 2 comments · May be fixed by #137447
Labels
needs-triage This issue may need triage. Remove it if it has been sufficiently triaged.

Comments

@folkertdev
Copy link
Contributor

I'm trying to add an implementation of vec_extract for the s390x-unknown-linux-gnu target in stdarch:

https://godbolt.org/z/e65Mvf5vM

#include <vecintrin.h>

unsigned int foo(vector unsigned int a, signed int b) { 
    return vec_extract(a, b);
}

Turns into this LLVM

define dso_local zeroext i32 @foo(<4 x i32> noundef %a, i32 noundef signext %b) local_unnamed_addr {
entry:
  %and.i = and i32 %b, 3
  %vecext.i = extractelement <4 x i32> %a, i32 %and.i
  ret i32 %vecext.i
}

And generates the following assembly

foo:
        nilf    %r2, 3              # bitwise and with 0b11 to prevent UB due to index out of bounds
        vlgvf   %r0, %v24, 0(%r2)   # put the element at index %r2 from the vector %24 into %r0
        llgfr   %r2, %r0            # zero-extend that value, put it into %r2
        br      %r14                # ret

in particular, the vlgvf (f for fullword, there are variations for other widths) is the relevant instruction here. It extracts the value at the given index.

Contrary to most other targets, the index argument to a vec_extract does not need to be const. The std::intrinsics::simd::simd_extract function does need its index argument to be const, and therefore can't straightforwardly be used to implement vec_extract.

Attempt 1

I tried simple field extraction:

https://godbolt.org/z/sbhYj316x

#![feature(core_intrinsics, repr_simd, s390x_target_feature)]

#[repr(simd)]
struct vector_unsigned_int([u32; 4]);

#[no_mangle]
unsafe extern "C" fn extract_element_bad(v: vector_unsigned_int, idx: u32) -> u32 {
    v.0[idx as usize % 4]
}

(extern "C" is used so that the vector is passed by-value, but we see the same assembly when the vector is created within the function)

Indexing into the underlying array in this way may soon be banned, though there is an alternative approach that does the same thing. Unfortunately, this version does not optimize well:

define noundef zeroext i32 @extract_element_bad(<4 x i32> %0, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %v = alloca [16 x i8], align 8
  store <4 x i32> %0, ptr %v, align 8
  %1 = and i32 %idx, 3
  %_3 = zext nneg i32 %1 to i64
  %2 = getelementptr inbounds nuw i32, ptr %v, i64 %_3
  %_0 = load i32, ptr %2, align 4
  ret i32 %_0
}

The portable-simd implementation of Index appears to be doing the same thing, and generates the same code https://godbolt.org/z/eecM6qdbW. That totally makes sense for most targets, because a pointer load is the best you can do.

Attempt 2

I did find that this version does optimize well

#[no_mangle]
unsafe extern "C" fn extract_element_good(v: vector_unsigned_int, idx: u32) -> u32 {
    match idx % 4 {
        0 => simd_extract(v, 0),
        1 => simd_extract(v, 1),
        2 => simd_extract(v, 2),
        3 => simd_extract(v, 3),
        _ => unreachable!(),
    }
}
define noundef zeroext i32 @extract_element_good(<4 x i32> %v, i32 noundef zeroext %idx) unnamed_addr personality ptr @rust_eh_personality {
start:
  %_3 = and i32 %idx, 3
  %switch.idx.cast = zext nneg i32 %_3 to i64
  %0 = extractelement <4 x i32> %v, i64 %switch.idx.cast
  ret i32 %0
}

but that is unwieldy kind of unwieldy, and while I can make it work for stdarch it won't work for portable_simd.

Solutions

I think there should be a way of indexing into a vector that emits an extractelement rather than a getelementptr. Semantically that seems more accurate (and might optimize better in some cases?), even though on most targets the generated assembly will be the same.

Some things I'm not sure about

  • how big a deal this is for s390x (cc @uweigand do you know how important generating vlgvf is?)
  • maybe adding the extra intrinsic does still run into the complexity that Ban projecting into repr(simd) types compiler-team#838 wants to fix
  • maybe extracting with a non-const value is a terrible idea on some/most targets and unimplemented by design
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Feb 21, 2025
@uweigand
Copy link
Contributor

It does indeed appear to be the case that Intel and Arm don't have instructions to extract vector lanes with a variable index (except for some special cases with SVE it seems). Not sure why this is the case. But s390x is certainly not the only architecture to provide this support; PowerPC has basically the same capability, and from what I can see in the LLVM sources at least RISC-V, Hexagon, and LoongArch also support this.

Of course even on s390x vec_extract is much more likely to be used with a constant index than a variable one. So as long as that case is supported well, I guess the solution would be acceptable. I'm still not quite sure why Rust decided to disable variable indexes with simd_extract when the underlying LLVM extractelement does support it ...

@workingjubilee
Copy link
Member

Because when we use it in our intrinsic implementations, if we permit the index to be variable, we do not generate the assembly we want. Simple as.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-triage This issue may need triage. Remove it if it has been sufficiently triaged.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants