Commit 2065087

Author: Carl Love
Unroll loop in lookup_2_lanes
The current loop runs j from 0 to 31 and contains an if statement that performs one assignment for j < 16 and a different assignment for j >= 16. Handling the j < 16 and j >= 16 cases in the same iteration eliminates the if (j < 16) test and halves the number of loop iterations. The loop is then unrolled to a depth of 2 for both the j < 16 and the j >= 16 cases.

This change results in approximately a 55% reduction in the execution time of the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release.

Removing the if (j < 16) statement and unrolling the loop removes branch stall cycles and register dependencies at instruction issue. As a result, the unrolled code is able to issue instructions earlier, which reduces the total number of cycles required to execute the function.
1 parent 252ae16 commit 2065087

1 file changed: +25 -6 lines changed


faiss/utils/simdlib_emulated.h

@@ -532,16 +532,35 @@ struct simd32uint8 : simd256bit {
     // The very important operation that everything relies on
     simd32uint8 lookup_2_lanes(const simd32uint8& idx) const {
         simd32uint8 c;
-        for (int j = 0; j < 32; j++) {
+        for (int j = 0; j < 16; j = j + 2) {
+            // j < 16, unrolled to depth of 2
             if (idx.u8[j] & 0x80) {
                 c.u8[j] = 0;
             } else {
                 uint8_t i = idx.u8[j] & 15;
-                if (j < 16) {
-                    c.u8[j] = u8[i];
-                } else {
-                    c.u8[j] = u8[16 + i];
-                }
+                c.u8[j] = u8[i];
+            }
+
+            if (idx.u8[j + 1] & 0x80) {
+                c.u8[j + 1] = 0;
+            } else {
+                uint8_t i = idx.u8[j + 1] & 15;
+                c.u8[j + 1] = u8[i];
+            }
+
+            // j >= 16, unrolled to depth of 2
+            if (idx.u8[j + 16] & 0x80) {
+                c.u8[j + 16] = 0;
+            } else {
+                uint8_t i = idx.u8[j + 16] & 15;
+                c.u8[j + 16] = u8[i + 16];
+            }
+
+            if (idx.u8[j + 17] & 0x80) {
+                c.u8[j + 17] = 0;
+            } else {
+                uint8_t i = idx.u8[j + 17] & 15;
+                c.u8[j + 17] = u8[i + 16];
             }
         }
         return c;
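
The hunk above replaces a 32-iteration loop with an 8-iteration loop that writes four bytes per pass (j, j + 1, j + 16, j + 17). A minimal standalone sketch of how the two loop shapes could be compared outside of faiss follows. Bytes32, lookup_rolled, lookup_unrolled, lookup_one, variants_agree and self_check are hypothetical names introduced for illustration only; they are not part of simdlib_emulated.h, and the per-byte logic is factored into a helper rather than written out four times as in the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for the emulated simd32uint8: 32 raw bytes.
struct Bytes32 {
    uint8_t u8[32];
};

// Loop shape before the commit: one pass over all 32 bytes, with a
// test on j inside the loop to select the 16-byte lane.
Bytes32 lookup_rolled(const Bytes32& table, const Bytes32& idx) {
    Bytes32 c;
    for (int j = 0; j < 32; j++) {
        if (idx.u8[j] & 0x80) {
            c.u8[j] = 0;
        } else {
            uint8_t i = idx.u8[j] & 15;
            c.u8[j] = (j < 16) ? table.u8[i] : table.u8[16 + i];
        }
    }
    return c;
}

// One output byte. lane_base is 0 for the low lane and 16 for the high lane.
static inline uint8_t lookup_one(
        const Bytes32& table, uint8_t index, int lane_base) {
    return (index & 0x80) ? 0 : table.u8[lane_base + (index & 15)];
}

// Loop shape after the commit: 8 iterations, each producing two bytes of
// the low lane and two bytes of the high lane, with no test on j.
Bytes32 lookup_unrolled(const Bytes32& table, const Bytes32& idx) {
    Bytes32 c;
    for (int j = 0; j < 16; j += 2) {
        c.u8[j] = lookup_one(table, idx.u8[j], 0);
        c.u8[j + 1] = lookup_one(table, idx.u8[j + 1], 0);
        c.u8[j + 16] = lookup_one(table, idx.u8[j + 16], 16);
        c.u8[j + 17] = lookup_one(table, idx.u8[j + 17], 16);
    }
    return c;
}

// Returns true if both variants produce identical output on `trials`
// pseudo-random table/index pairs generated from `seed`.
bool variants_agree(unsigned seed, int trials) {
    std::srand(seed);
    for (int t = 0; t < trials; t++) {
        Bytes32 table, idx;
        for (int k = 0; k < 32; k++) {
            table.u8[k] = static_cast<uint8_t>(std::rand() & 0xff);
            idx.u8[k] = static_cast<uint8_t>(std::rand() & 0xff);
        }
        Bytes32 a = lookup_rolled(table, idx);
        Bytes32 b = lookup_unrolled(table, idx);
        for (int k = 0; k < 32; k++) {
            if (a.u8[k] != b.u8[k]) {
                return false;
            }
        }
    }
    return true;
}

// Convenience wrapper: aborts if the two variants ever disagree.
void self_check() {
    assert(variants_agree(1, 1000));
}
```

Calling self_check() from a test driver exercises both variants on the same inputs. This only checks functional equivalence; it says nothing about the timing figures quoted in the commit message, which depend on the compiler, flags and target CPU.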
