Optimize `is_permutation` for reversed sequences #2043

AdamBucior · 2021-07-06T17:01:28Z

Fixes #279.
Also fixes P0202R3_constexpr_algorithm_and_exchange test for next_permutation and prev_permutation (it was broken and never called) and adds is_permutation coverage to this test (which was apparently missed).

miscco

I have to say that I find this overly complicated for a rather simple matching algorithm.

From what I can tell, we should be able to putt all in the first loop at 1005 without any issue

            if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                // determine final iterator values
                auto _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
                auto _Final2 = _Find_last_iterator(_First2, _Last2, _Count);

                for (;;--_Count) { // trim
                    if (_Count == 0) { // everything matched
                        return true;
                    }

                    // Matching prefix
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                        ++_First1;
                        ++_First2;
                        continue;
                    }

                    // Matching suffix
                    --_Final1;
                    --_Final2;
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                        continue;
                    }

                    // Matching reversed
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_Final2))) {
                        ++_First1;
                        ++_Final1;
                        continue;
                    }

                    // Matching reversed 
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_First2))) {
                        ++_First2;
                        ++_Final2;
                        continue;
                    }

                    break;
                }
          }

Note this assume that we completely split into bidirectional / forward ranges

We could also do the dance and optimize bidirectional + forward range that way

stl/inc/algorithm

miscco · 2021-07-06T18:57:21Z

I think we could merge everything without loosing efficiency into

                _It1 _Final1;
                if constexpr (bidirectional_iterator<_It1>) {
                    _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
                }

                _It2 _Final2;
                if constexpr (bidirectional_iterator<_It2>) {
                    _Final2 = _Find_last_iterator(_First2, _Last2, _Count);
                }

                for (;;) { // trim
                    // Matching suffix
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                        ++_First1;
                        ++_First2;
                        continue;
                    }

                    // Matching reversed
                    if constexpr (bidirectional_iterator<_It2>) {
                        if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *--_Final2))) {
                            ++_First1;
                            continue;
                        }
                    }

                    // Matching reversed
                    if constexpr (bidirectional_iterator<_It1>) {
                        if (_STD invoke(_Pred, _STD invoke(_Proj1, *--_Final1), _STD invoke(_Proj2, *_First2))) {
                            ++_First2;
                            if constexpr (bidirectional_iterator<_It2>) {
                                ++_Final2;
                            }
                            continue;
                        }
                    }

                    // Matching suffix
                    if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                        if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                            continue;
                        }
                    }

                    if (--_Count == 0) { // everything matched
                        return true;
                    }

                    break;
                }

AdamBucior · 2021-07-06T19:16:56Z

I have to say that I find this overly complicated for a rather simple matching algorithm.

From what I can tell, we should be able to putt all in the first loop at 1005 without any issue

            if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                // determine final iterator values
                auto _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
                auto _Final2 = _Find_last_iterator(_First2, _Last2, _Count);

                for (;;--_Count) { // trim
                    if (_Count == 0) { // everything matched
                        return true;
                    }

                    // Matching prefix
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                        ++_First1;
                        ++_First2;
                        continue;
                    }

                    // Matching suffix
                    --_Final1;
                    --_Final2;
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                        continue;
                    }

                    // Matching reversed
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_Final2))) {
                        ++_First1;
                        ++_Final1;
                        continue;
                    }

                    // Matching reversed 
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_First2))) {
                        ++_First2;
                        ++_Final2;
                        continue;
                    }

                    break;
                }
          }

Note this assume that we completely split into bidirectional / forward ranges

We could also do the dance and optimize bidirectional + forward range that way

That loop makes a lot of unnecessary comparisons. Consider ranges like { 1, 2, 3, 4, 5 } and { 2, 1, 3, 4, 5 }:
It firstly compares prefixes: no luck, 1 and 2 are different.
Then goes to compare suffixes: hooray, 5 and 5 are equal, we trim those and start from the beginning.
We again compare prefixes: again no luck, 1 and 2 are still different.
Then compare suffixes: hooray, 4 and 4 are equal, we trim those and start from the beginning.
We compare the same prefixes for the third time: again no luck, 1 and 2 are still different.
And this goes on and on.

stl/inc/xutility

miscco · 2021-07-07T08:49:35Z

I believe that the same thing happens with your loop with slightly changed inputs. That said you can easily work around this

        template <class _It1, class _Se1, class _It2, class _Se2, class _Pr, class _Pj1, class _Pj2>
        _NODISCARD static constexpr bool _Is_permutation_sized(_It1 _First1, _Se1 _Last1, _It2 _First2, _Se2 _Last2,
            iter_difference_t<_It1> _Count, _Pr _Pred, _Pj1 _Proj1, _Pj2 _Proj2) {
            _STL_INTERNAL_STATIC_ASSERT(forward_iterator<_It1>);
            _STL_INTERNAL_STATIC_ASSERT(sentinel_for<_Se1, _It1>);
            _STL_INTERNAL_STATIC_ASSERT(forward_iterator<_It2>);
            _STL_INTERNAL_STATIC_ASSERT(sentinel_for<_Se2, _It2>);
            _STL_INTERNAL_STATIC_ASSERT(
                indirect_equivalence_relation<_Pr, projected<_It1, _Pj1>, projected<_It2, _Pj2>>);
            _STL_INTERNAL_CHECK(_RANGES distance(_First1, _Last1) == _Count);
            _STL_INTERNAL_CHECK(_RANGES distance(_First2, _Last2) == _Count);

            _It1 _Final1;
            if constexpr (bidirectional_iterator<_It1>) {
                _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
            }

            _It2 _Final2;
            if constexpr (bidirectional_iterator<_It2>) {
                _Final2 = _Find_last_iterator(_First2, _Last2, _Count);
            }

            _Trim_checked _Checked = _Trim_checked::_Nothing_checked;
            for (;; --_Count) { // trim
                if (_Count == 0) { // everything matched
                    return true;
                }

                // Matching prefix
                if (!(_Checked & _Trim_checked::_Prefix_checked)
                    && _STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                    ++_First1;
                    ++_First2;
                    _Checked &= _Trim_checked::_Suffix_checked;
                    continue;
                } else {
                    _Checked |= _Trim_checked::_Prefix_checked;
                }

                // Matching reversed
                if constexpr (bidirectional_iterator<_It2>) {
                    --_Final2;
                    if (!(_Checked & _Trim_checked::_Reversed12_checked)
                        && _STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_Final2))) {
                        ++_First1;
                        _Checked &= _Trim_checked::_Reversed21_checked;
                        continue;
                    } else {
                        _Checked |= _Trim_checked::_Reversed12_checked;
                    }
                }

                // Matching reversed
                if constexpr (bidirectional_iterator<_It1>) {
                    --_Final1;
                    if (!(_Checked & _Trim_checked::_Reversed21_checked)
                        && _STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_First2))) {
                        ++_First2;
                        if constexpr (bidirectional_iterator<_It2>) {
                            ++_Final2;
                        }
                        _Checked &= _Trim_checked::_Reversed12_checked;
                        continue;
                    } else {
                        _Checked |= _Trim_checked::_Reversed21_checked;
                    }
                }

                // Matching suffix
                if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                    if (!(_Checked & _Trim_checked::_Suffix_checked)
                        && _STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                        _Checked &= _Trim_checked::_Prefix_checked;
                        continue;
                    } else {
                        _Checked |= _Trim_checked::_Suffix_checked;
                    }
                }

                if constexpr (bidirectional_iterator<_It1>) {
                    ++_Final1;
                }

                if constexpr (bidirectional_iterator<_It2>) {
                    ++_Final2;
                }

                break;
            }

            if (_Count == 1) { // single non-matching elements remain; not a permutation
                return false;
            }

            // If we get here, _Count > 1, initial elements do not match, and final elements do not match.
            if constexpr (bidirectional_iterator<_It1>) {
                return _Match_counts(_STD move(_First1), _STD move(_Final1), _STD move(_First2), _STD move(_Last2),
                    _Pred, _Proj1, _Proj2);
            } else if constexpr (bidirectional_iterator<_It2>) {
                return _Match_counts(_STD move(_First1), _STD move(_Last1), _STD move(_First2), _STD move(_Final2),
                    _Pred, _Proj1, _Proj2);
            } else if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                return _Match_counts(_STD move(_First1), _STD move(_Final1), _STD move(_First2), _STD move(_Final2),
                    _Pred, _Proj1, _Proj2);
            } else {
                return _Match_counts(_STD move(_First1), _STD move(_Last1), _STD move(_First2), _STD move(_Last2),
                    _Pred, _Proj1, _Proj2);
            }
        }

EDIT fixed a bug

AdamBucior · 2021-07-07T08:57:11Z

I believe that the same thing happens with your loop with slightly changed inputs.

Could you elaborate on this?

miscco · 2021-07-07T10:25:22Z

FYI, I put an POC of my proposal (only ranges edition) here https://github.com/miscco/STL/tree/is_permutation

Happy to discuss alternatives / performance

AdamBucior · 2021-07-07T12:28:16Z

I made some benchmarks:

AdamBucior/is_permutation-optmization

arr1 + arr1: 87867000ns
arr1 + arr2: 112476700ns
arr1 + arr3: 125705600ns
arr1 + arr4: 163001700ns
arr1 + arr5: 447000700ns
arr6 + arr7: 167218500ns

miscco/is_permutation

arr1 + arr1: 234621600ns ~167% increase
arr1 + arr2: 311507800ns ~176% increase
arr1 + arr3: 261688100ns ~108% increase
arr1 + arr4: 290552500ns ~78% increase
arr1 + arr5: 668636900ns ~49% increase
arr6 + arr7: 307597600ns ~84% increase

Your implementation introduces a lot of overhead even for equal arrays.

Benchmark code:

#include <algorithm>
#include <chrono>
#include <cassert>
#include <format>
#include <iostream>

using namespace std;
using namespace chrono;

template <size_t Iterations, typename Rng>
auto benchmark(const Rng& r1, const Rng& r2, bool expected) {
    steady_clock::time_point start, end;
    start = steady_clock::now();

    for (size_t i = 0; i < Iterations; ++i) {
        assert(ranges::is_permutation(r1, r2) == expected);
    }

    end = steady_clock::now();

    return end - start;
}


int main() {
    int arr1[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int arr2[] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
    int arr3[] = {9, 8, 7, 3, 4, 5, 6, 2, 1, 0};
    int arr4[] = {9, 1, 7, 3, 5, 4, 6, 2, 8, 0};
    int arr5[] = {9, 1, 7, 5, 6, 3, 4, 2, 8, 0};
    int arr6[] = {0, 1, 2, 3, 4, 10, 5, 6, 7, 8, 9};
    int arr7[] = {9, 1, 7, 3, 5, 11, 4, 6, 2, 8, 0};
    cout << format("arr1 + arr1: {}\n", benchmark<10000000>(arr1, arr1, true));
    cout << format("arr1 + arr2: {}\n", benchmark<10000000>(arr1, arr2, true));
    cout << format("arr1 + arr3: {}\n", benchmark<10000000>(arr1, arr3, true));
    cout << format("arr1 + arr4: {}\n", benchmark<10000000>(arr1, arr4, true));
    cout << format("arr1 + arr5: {}\n", benchmark<10000000>(arr1, arr5, true));
    cout << format("arr6 + arr7: {}\n", benchmark<10000000>(arr6, arr7, false));

    system("pause");
}

miscco · 2021-07-07T13:37:32Z

I am getting slightly different results

2021-07-07T15:48:55+02:00
Running C:\Users\mschellenbergercosta\source\repos\Benchmark_is_permutation\out\build\x64-Release\Benchmark_is_permutation\Benchmark_is_permutation.exe
Run on (8 X 1992 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_same/8                     22.9 ns         19.1 ns     49777778
BM_same/64                     127 ns          123 ns      5600000
BM_same/512                    870 ns          816 ns       746667
BM_same/1024                  1719 ns         1650 ns       407273
BM_same/8192                 10274 ns        10463 ns        74667
BM_same_miscco/8              20.0 ns         18.6 ns     34461538
BM_same_miscco/64              131 ns          129 ns      4977778
BM_same_miscco/512            1042 ns         1046 ns       746667
BM_same_miscco/1024           1990 ns         1995 ns       344615
BM_same_miscco/8192          17498 ns        17648 ns        40727
BM_same_adam/8                13.1 ns         12.8 ns     56000000
BM_same_adam/64               49.7 ns         49.2 ns     14933333
BM_same_adam/512               426 ns          424 ns      1659259
BM_same_adam/1024              861 ns          854 ns       896000
BM_same_adam/8192             7860 ns         8022 ns        89600
BM_reversed/8                  149 ns          148 ns      4977778
BM_reversed/64               10182 ns        10010 ns        64000
BM_reversed/512             592309 ns       585938 ns         1120
BM_reversed/1024           2345326 ns      2351589 ns          299
BM_reversed/8192         171859100 ns    171875000 ns            4
BM_reversed_miscco/8          21.1 ns         21.0 ns     32000000
BM_reversed_miscco/64          134 ns          132 ns      4977778
BM_reversed_miscco/512        1000 ns         1004 ns       746667
BM_reversed_miscco/1024       1998 ns         1950 ns       344615
BM_reversed_miscco/8192      17849 ns        17264 ns        40727
BM_reversed_adam/8            15.4 ns         15.7 ns     40727273
BM_reversed_adam/64           63.2 ns         64.2 ns     11200000
BM_reversed_adam/512           493 ns          496 ns      1544828
BM_reversed_adam/1024          984 ns          984 ns       746667
BM_reversed_adam/8192         9283 ns         9417 ns        89600
BM_shuffled/8                  190 ns          185 ns     13937778
BM_shuffled/64               11226 ns        11161 ns        56000
BM_shuffled/512             597549 ns       599888 ns         1120
BM_shuffled/1024           2398843 ns      2351589 ns          299
BM_shuffled/8192         173120725 ns    171875000 ns            4
BM_shuffled_miscco/8           155 ns          157 ns      4480000
BM_shuffled_miscco/64        10392 ns        10498 ns        64000
BM_shuffled_miscco/512      601231 ns       599888 ns         1120
BM_shuffled_miscco/1024    2422690 ns      2403846 ns          299
BM_shuffled_miscco/8192  182383850 ns    183593750 ns            4
BM_shuffled_adam/8             161 ns          160 ns      4480000
BM_shuffled_adam/64          10105 ns        10045 ns        74667
BM_shuffled_adam/512        591250 ns       599888 ns         1120
BM_shuffled_adam/1024      2377680 ns      2351589 ns          299
BM_shuffled_adam/8192    181883750 ns    183593750 ns            4

So essentially your algorithm obliterates mine in sorted case and is better in the pure reversal case. Otherwise it performs roughly equal

stl/inc/xutility

…mization

StephanTLavavej

Thanks, this looks good! (And a video review of this will be available in the near future.) Just a few issues, nothing major.

stl/inc/xutility

stl/inc/algorithm

stl/inc/xutility

stl/inc/algorithm

tests/std/tests/P0896R4_ranges_alg_is_permutation/test.cpp

mnatsuhara

This looks awesome! Thanks for this great work :)

I've made a few comments, but they're for small comment nitpicks and one suggested restructuring (hence, no changes requested).

If you don't want to reset testing for these comments, perhaps @StephanTLavavej can add them to his list of cleanups for the next big cleanup PR! :)

stl/inc/xutility

stl/inc/algorithm

StephanTLavavej · 2021-11-11T23:52:58Z

Thanks! Since this code is fresh in our minds, I'll validate and push changes instead of deferring cleanups until later. 🧠

Co-authored-by: Miya Natsuhara <46756417+mnatsuhara@users.noreply.github.com>

StephanTLavavej · 2021-11-12T02:24:59Z

@AdamBucior @mnatsuhara I've pushed changes, plus a merge with main since it had been a while.

StephanTLavavej · 2021-11-12T05:22:32Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2021-11-13T03:22:23Z

Thanks for implementing this major optimization in the STL's highest-complexity algorithm! 🚀 😻 🎉

Optimize is_permutation for reversed sequences

96bcfdf

AdamBucior requested a review from a team as a code owner July 6, 2021 17:01

clang-format

ca09bc0

miscco reviewed Jul 6, 2021

View reviewed changes

stl/inc/algorithm Show resolved Hide resolved

stl/inc/algorithm Outdated Show resolved Hide resolved

StephanTLavavej added the performance Must go faster label Jul 6, 2021

miscco reviewed Jul 7, 2021

View reviewed changes

stl/inc/xutility Show resolved Hide resolved

miscco reviewed Jul 7, 2021

View reviewed changes

stl/inc/xutility Outdated Show resolved Hide resolved

StephanTLavavej assigned mnatsuhara Jul 14, 2021

CaseyCarter changed the title ~~Optimize is_permutation for reversed sequences~~ Optimize is_permutation for reversed sequences Jul 14, 2021

AdamBucior added 2 commits July 20, 2021 10:38

Merge remote-tracking branch 'upstream/main' into is_permutation-opti…

e6cd63a

…mization

Merge remote-tracking branch 'upstream/main' into is_permutation-opti…

8345f3b

…mization

StephanTLavavej mentioned this pull request Aug 18, 2021

Fix test_permutations() in P0202R3_constexpr_algorithm_and_exchange #2137

Merged

Rename _Last to _Back

982f0a7

StephanTLavavej self-assigned this Aug 25, 2021

AdamBucior added 2 commits August 27, 2021 12:23

Merge remote-tracking branch 'upstream/main' into is_permutation-opti…

7bbdb02

…mization

workaround

c9cbb51

StephanTLavavej requested changes Sep 24, 2021

View reviewed changes

StephanTLavavej removed their assignment Sep 24, 2021

Code review

6ed3f38

StephanTLavavej self-assigned this Sep 28, 2021

StephanTLavavej approved these changes Sep 28, 2021

View reviewed changes

StephanTLavavej removed their assignment Sep 28, 2021

mnatsuhara approved these changes Nov 11, 2021

View reviewed changes

stl/inc/xutility Outdated Show resolved Hide resolved

stl/inc/xutility Outdated Show resolved Hide resolved

stl/inc/algorithm Outdated Show resolved Hide resolved

stl/inc/algorithm Show resolved Hide resolved

mnatsuhara removed their assignment Nov 11, 2021

StephanTLavavej and others added 4 commits November 11, 2021 16:54

Code review feedback: Comments.

9f1ba80

Co-authored-by: Miya Natsuhara <46756417+mnatsuhara@users.noreply.github.com>

Code review feedback: Rework control flow.

c4b6674

Co-authored-by: Miya Natsuhara <46756417+mnatsuhara@users.noreply.github.com>

Rename _Break to _HaveWorkAfterTrimming.

6b85a3b

Merge branch 'main' into is_permutation-optimization

6f8f871

StephanTLavavej approved these changes Nov 12, 2021

View reviewed changes

StephanTLavavej self-assigned this Nov 12, 2021

mnatsuhara approved these changes Nov 12, 2021

View reviewed changes

StephanTLavavej merged commit 254fca2 into microsoft:main Nov 13, 2021

AdamBucior deleted the is_permutation-optimization branch November 13, 2021 11:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `is_permutation` for reversed sequences #2043

Optimize `is_permutation` for reversed sequences #2043

AdamBucior commented Jul 6, 2021

miscco left a comment

miscco commented Jul 6, 2021 •

edited

Loading

AdamBucior commented Jul 6, 2021

miscco commented Jul 7, 2021 •

edited

Loading

AdamBucior commented Jul 7, 2021

miscco commented Jul 7, 2021

AdamBucior commented Jul 7, 2021

miscco commented Jul 7, 2021 •

edited

Loading

StephanTLavavej left a comment

mnatsuhara left a comment

StephanTLavavej commented Nov 11, 2021

StephanTLavavej commented Nov 12, 2021

StephanTLavavej commented Nov 12, 2021

StephanTLavavej commented Nov 13, 2021

Optimize is_permutation for reversed sequences #2043

Optimize is_permutation for reversed sequences #2043

Conversation

AdamBucior commented Jul 6, 2021

miscco left a comment

Choose a reason for hiding this comment

miscco commented Jul 6, 2021 • edited Loading

AdamBucior commented Jul 6, 2021

miscco commented Jul 7, 2021 • edited Loading

AdamBucior commented Jul 7, 2021

miscco commented Jul 7, 2021

AdamBucior commented Jul 7, 2021

miscco commented Jul 7, 2021 • edited Loading

StephanTLavavej left a comment

Choose a reason for hiding this comment

mnatsuhara left a comment

Choose a reason for hiding this comment

StephanTLavavej commented Nov 11, 2021

StephanTLavavej commented Nov 12, 2021

StephanTLavavej commented Nov 12, 2021

StephanTLavavej commented Nov 13, 2021

Optimize `is_permutation` for reversed sequences #2043

Optimize `is_permutation` for reversed sequences #2043

miscco commented Jul 6, 2021 •

edited

Loading

miscco commented Jul 7, 2021 •

edited

Loading

miscco commented Jul 7, 2021 •

edited

Loading