Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize is_permutation for reversed sequences #2043

Merged
merged 12 commits into from
Nov 13, 2021

Conversation

AdamBucior
Copy link
Contributor

Fixes #279.
Also fixes P0202R3_constexpr_algorithm_and_exchange test for next_permutation and prev_permutation (it was broken and never called) and adds is_permutation coverage to this test (which was apparently missed).

@AdamBucior AdamBucior requested a review from a team as a code owner July 6, 2021 17:01
Copy link
Contributor

@miscco miscco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say that I find this overly complicated for a rather simple matching algorithm.

From what I can tell, we should be able to putt all in the first loop at 1005 without any issue

            if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                // determine final iterator values
                auto _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
                auto _Final2 = _Find_last_iterator(_First2, _Last2, _Count);

                for (;;--_Count) { // trim
                    if (_Count == 0) { // everything matched
                        return true;
                    }

                    // Matching prefix
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                        ++_First1;
                        ++_First2;
                        continue;
                    }

                    // Matching suffix
                    --_Final1;
                    --_Final2;
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                        continue;
                    }

                    // Matching reversed
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_Final2))) {
                        ++_First1;
                        ++_Final1;
                        continue;
                    }

                    // Matching reversed 
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_First2))) {
                        ++_First2;
                        ++_Final2;
                        continue;
                    }

                    break;
                }
          }

Note this assume that we completely split into bidirectional / forward ranges

We could also do the dance and optimize bidirectional + forward range that way

@miscco
Copy link
Contributor

miscco commented Jul 6, 2021

I think we could merge everything without loosing efficiency into

                _It1 _Final1;
                if constexpr (bidirectional_iterator<_It1>) {
                    _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
                }

                _It2 _Final2;
                if constexpr (bidirectional_iterator<_It2>) {
                    _Final2 = _Find_last_iterator(_First2, _Last2, _Count);
                }

                for (;;) { // trim
                    // Matching suffix
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                        ++_First1;
                        ++_First2;
                        continue;
                    }

                    // Matching reversed
                    if constexpr (bidirectional_iterator<_It2>) {
                        if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *--_Final2))) {
                            ++_First1;
                            continue;
                        }
                    }

                    // Matching reversed
                    if constexpr (bidirectional_iterator<_It1>) {
                        if (_STD invoke(_Pred, _STD invoke(_Proj1, *--_Final1), _STD invoke(_Proj2, *_First2))) {
                            ++_First2;
                            if constexpr (bidirectional_iterator<_It2>) {
                                ++_Final2;
                            }
                            continue;
                        }
                    }

                    // Matching suffix
                    if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                        if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                            continue;
                        }
                    }

                    if (--_Count == 0) { // everything matched
                        return true;
                    }

                    break;
                }

@AdamBucior
Copy link
Contributor Author

I have to say that I find this overly complicated for a rather simple matching algorithm.

From what I can tell, we should be able to putt all in the first loop at 1005 without any issue

            if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                // determine final iterator values
                auto _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
                auto _Final2 = _Find_last_iterator(_First2, _Last2, _Count);

                for (;;--_Count) { // trim
                    if (_Count == 0) { // everything matched
                        return true;
                    }

                    // Matching prefix
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                        ++_First1;
                        ++_First2;
                        continue;
                    }

                    // Matching suffix
                    --_Final1;
                    --_Final2;
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                        continue;
                    }

                    // Matching reversed
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_Final2))) {
                        ++_First1;
                        ++_Final1;
                        continue;
                    }

                    // Matching reversed 
                    if (_STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_First2))) {
                        ++_First2;
                        ++_Final2;
                        continue;
                    }

                    break;
                }
          }

Note this assume that we completely split into bidirectional / forward ranges

We could also do the dance and optimize bidirectional + forward range that way

That loop makes a lot of unnecessary comparisons. Consider ranges like { 1, 2, 3, 4, 5 } and { 2, 1, 3, 4, 5 }:
It firstly compares prefixes: no luck, 1 and 2 are different.
Then goes to compare suffixes: hooray, 5 and 5 are equal, we trim those and start from the beginning.
We again compare prefixes: again no luck, 1 and 2 are still different.
Then compare suffixes: hooray, 4 and 4 are equal, we trim those and start from the beginning.
We compare the same prefixes for the third time: again no luck, 1 and 2 are still different.
And this goes on and on.

@StephanTLavavej StephanTLavavej added the performance Must go faster label Jul 6, 2021
@miscco
Copy link
Contributor

miscco commented Jul 7, 2021

I believe that the same thing happens with your loop with slightly changed inputs. That said you can easily work around this

        template <class _It1, class _Se1, class _It2, class _Se2, class _Pr, class _Pj1, class _Pj2>
        _NODISCARD static constexpr bool _Is_permutation_sized(_It1 _First1, _Se1 _Last1, _It2 _First2, _Se2 _Last2,
            iter_difference_t<_It1> _Count, _Pr _Pred, _Pj1 _Proj1, _Pj2 _Proj2) {
            _STL_INTERNAL_STATIC_ASSERT(forward_iterator<_It1>);
            _STL_INTERNAL_STATIC_ASSERT(sentinel_for<_Se1, _It1>);
            _STL_INTERNAL_STATIC_ASSERT(forward_iterator<_It2>);
            _STL_INTERNAL_STATIC_ASSERT(sentinel_for<_Se2, _It2>);
            _STL_INTERNAL_STATIC_ASSERT(
                indirect_equivalence_relation<_Pr, projected<_It1, _Pj1>, projected<_It2, _Pj2>>);
            _STL_INTERNAL_CHECK(_RANGES distance(_First1, _Last1) == _Count);
            _STL_INTERNAL_CHECK(_RANGES distance(_First2, _Last2) == _Count);

            _It1 _Final1;
            if constexpr (bidirectional_iterator<_It1>) {
                _Final1 = _Find_last_iterator(_First1, _Last1, _Count);
            }

            _It2 _Final2;
            if constexpr (bidirectional_iterator<_It2>) {
                _Final2 = _Find_last_iterator(_First2, _Last2, _Count);
            }

            _Trim_checked _Checked = _Trim_checked::_Nothing_checked;
            for (;; --_Count) { // trim
                if (_Count == 0) { // everything matched
                    return true;
                }

                // Matching prefix
                if (!(_Checked & _Trim_checked::_Prefix_checked)
                    && _STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_First2))) {
                    ++_First1;
                    ++_First2;
                    _Checked &= _Trim_checked::_Suffix_checked;
                    continue;
                } else {
                    _Checked |= _Trim_checked::_Prefix_checked;
                }

                // Matching reversed
                if constexpr (bidirectional_iterator<_It2>) {
                    --_Final2;
                    if (!(_Checked & _Trim_checked::_Reversed12_checked)
                        && _STD invoke(_Pred, _STD invoke(_Proj1, *_First1), _STD invoke(_Proj2, *_Final2))) {
                        ++_First1;
                        _Checked &= _Trim_checked::_Reversed21_checked;
                        continue;
                    } else {
                        _Checked |= _Trim_checked::_Reversed12_checked;
                    }
                }

                // Matching reversed
                if constexpr (bidirectional_iterator<_It1>) {
                    --_Final1;
                    if (!(_Checked & _Trim_checked::_Reversed21_checked)
                        && _STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_First2))) {
                        ++_First2;
                        if constexpr (bidirectional_iterator<_It2>) {
                            ++_Final2;
                        }
                        _Checked &= _Trim_checked::_Reversed12_checked;
                        continue;
                    } else {
                        _Checked |= _Trim_checked::_Reversed21_checked;
                    }
                }

                // Matching suffix
                if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                    if (!(_Checked & _Trim_checked::_Suffix_checked)
                        && _STD invoke(_Pred, _STD invoke(_Proj1, *_Final1), _STD invoke(_Proj2, *_Final2))) {
                        _Checked &= _Trim_checked::_Prefix_checked;
                        continue;
                    } else {
                        _Checked |= _Trim_checked::_Suffix_checked;
                    }
                }

                if constexpr (bidirectional_iterator<_It1>) {
                    ++_Final1;
                }

                if constexpr (bidirectional_iterator<_It2>) {
                    ++_Final2;
                }

                break;
            }

            if (_Count == 1) { // single non-matching elements remain; not a permutation
                return false;
            }

            // If we get here, _Count > 1, initial elements do not match, and final elements do not match.
            if constexpr (bidirectional_iterator<_It1>) {
                return _Match_counts(_STD move(_First1), _STD move(_Final1), _STD move(_First2), _STD move(_Last2),
                    _Pred, _Proj1, _Proj2);
            } else if constexpr (bidirectional_iterator<_It2>) {
                return _Match_counts(_STD move(_First1), _STD move(_Last1), _STD move(_First2), _STD move(_Final2),
                    _Pred, _Proj1, _Proj2);
            } else if constexpr (bidirectional_iterator<_It1> && bidirectional_iterator<_It2>) {
                return _Match_counts(_STD move(_First1), _STD move(_Final1), _STD move(_First2), _STD move(_Final2),
                    _Pred, _Proj1, _Proj2);
            } else {
                return _Match_counts(_STD move(_First1), _STD move(_Last1), _STD move(_First2), _STD move(_Last2),
                    _Pred, _Proj1, _Proj2);
            }
        }

EDIT fixed a bug

@AdamBucior
Copy link
Contributor Author

I believe that the same thing happens with your loop with slightly changed inputs.

Could you elaborate on this?

@miscco
Copy link
Contributor

miscco commented Jul 7, 2021

FYI, I put an POC of my proposal (only ranges edition) here https://github.com/miscco/STL/tree/is_permutation

Happy to discuss alternatives / performance

@AdamBucior
Copy link
Contributor Author

I made some benchmarks:

AdamBucior/is_permutation-optmization

arr1 + arr1: 87867000ns
arr1 + arr2: 112476700ns
arr1 + arr3: 125705600ns
arr1 + arr4: 163001700ns
arr1 + arr5: 447000700ns
arr6 + arr7: 167218500ns

miscco/is_permutation

arr1 + arr1: 234621600ns ~167% increase
arr1 + arr2: 311507800ns ~176% increase
arr1 + arr3: 261688100ns ~108% increase
arr1 + arr4: 290552500ns ~78% increase
arr1 + arr5: 668636900ns ~49% increase
arr6 + arr7: 307597600ns ~84% increase

Your implementation introduces a lot of overhead even for equal arrays.

Benchmark code:

#include <algorithm>
#include <chrono>
#include <cassert>
#include <format>
#include <iostream>

using namespace std;
using namespace chrono;

template <size_t Iterations, typename Rng>
auto benchmark(const Rng& r1, const Rng& r2, bool expected) {
    steady_clock::time_point start, end;
    start = steady_clock::now();

    for (size_t i = 0; i < Iterations; ++i) {
        assert(ranges::is_permutation(r1, r2) == expected);
    }

    end = steady_clock::now();

    return end - start;
}


int main() {
    int arr1[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int arr2[] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
    int arr3[] = {9, 8, 7, 3, 4, 5, 6, 2, 1, 0};
    int arr4[] = {9, 1, 7, 3, 5, 4, 6, 2, 8, 0};
    int arr5[] = {9, 1, 7, 5, 6, 3, 4, 2, 8, 0};
    int arr6[] = {0, 1, 2, 3, 4, 10, 5, 6, 7, 8, 9};
    int arr7[] = {9, 1, 7, 3, 5, 11, 4, 6, 2, 8, 0};
    cout << format("arr1 + arr1: {}\n", benchmark<10000000>(arr1, arr1, true));
    cout << format("arr1 + arr2: {}\n", benchmark<10000000>(arr1, arr2, true));
    cout << format("arr1 + arr3: {}\n", benchmark<10000000>(arr1, arr3, true));
    cout << format("arr1 + arr4: {}\n", benchmark<10000000>(arr1, arr4, true));
    cout << format("arr1 + arr5: {}\n", benchmark<10000000>(arr1, arr5, true));
    cout << format("arr6 + arr7: {}\n", benchmark<10000000>(arr6, arr7, false));

    system("pause");
}

@miscco
Copy link
Contributor

miscco commented Jul 7, 2021

I am getting slightly different results

2021-07-07T15:48:55+02:00
Running C:\Users\mschellenbergercosta\source\repos\Benchmark_is_permutation\out\build\x64-Release\Benchmark_is_permutation\Benchmark_is_permutation.exe
Run on (8 X 1992 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_same/8                     22.9 ns         19.1 ns     49777778
BM_same/64                     127 ns          123 ns      5600000
BM_same/512                    870 ns          816 ns       746667
BM_same/1024                  1719 ns         1650 ns       407273
BM_same/8192                 10274 ns        10463 ns        74667
BM_same_miscco/8              20.0 ns         18.6 ns     34461538
BM_same_miscco/64              131 ns          129 ns      4977778
BM_same_miscco/512            1042 ns         1046 ns       746667
BM_same_miscco/1024           1990 ns         1995 ns       344615
BM_same_miscco/8192          17498 ns        17648 ns        40727
BM_same_adam/8                13.1 ns         12.8 ns     56000000
BM_same_adam/64               49.7 ns         49.2 ns     14933333
BM_same_adam/512               426 ns          424 ns      1659259
BM_same_adam/1024              861 ns          854 ns       896000
BM_same_adam/8192             7860 ns         8022 ns        89600
BM_reversed/8                  149 ns          148 ns      4977778
BM_reversed/64               10182 ns        10010 ns        64000
BM_reversed/512             592309 ns       585938 ns         1120
BM_reversed/1024           2345326 ns      2351589 ns          299
BM_reversed/8192         171859100 ns    171875000 ns            4
BM_reversed_miscco/8          21.1 ns         21.0 ns     32000000
BM_reversed_miscco/64          134 ns          132 ns      4977778
BM_reversed_miscco/512        1000 ns         1004 ns       746667
BM_reversed_miscco/1024       1998 ns         1950 ns       344615
BM_reversed_miscco/8192      17849 ns        17264 ns        40727
BM_reversed_adam/8            15.4 ns         15.7 ns     40727273
BM_reversed_adam/64           63.2 ns         64.2 ns     11200000
BM_reversed_adam/512           493 ns          496 ns      1544828
BM_reversed_adam/1024          984 ns          984 ns       746667
BM_reversed_adam/8192         9283 ns         9417 ns        89600
BM_shuffled/8                  190 ns          185 ns     13937778
BM_shuffled/64               11226 ns        11161 ns        56000
BM_shuffled/512             597549 ns       599888 ns         1120
BM_shuffled/1024           2398843 ns      2351589 ns          299
BM_shuffled/8192         173120725 ns    171875000 ns            4
BM_shuffled_miscco/8           155 ns          157 ns      4480000
BM_shuffled_miscco/64        10392 ns        10498 ns        64000
BM_shuffled_miscco/512      601231 ns       599888 ns         1120
BM_shuffled_miscco/1024    2422690 ns      2403846 ns          299
BM_shuffled_miscco/8192  182383850 ns    183593750 ns            4
BM_shuffled_adam/8             161 ns          160 ns      4480000
BM_shuffled_adam/64          10105 ns        10045 ns        74667
BM_shuffled_adam/512        591250 ns       599888 ns         1120
BM_shuffled_adam/1024      2377680 ns      2351589 ns          299
BM_shuffled_adam/8192    181883750 ns    183593750 ns            4

So essentially your algorithm obliterates mine in sorted case and is better in the pure reversal case. Otherwise it performs roughly equal

@CaseyCarter CaseyCarter changed the title Optimize is_permutation for reversed sequences Optimize is_permutation for reversed sequences Jul 14, 2021
@StephanTLavavej StephanTLavavej self-assigned this Aug 25, 2021
Copy link
Member

@StephanTLavavej StephanTLavavej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good! (And a video review of this will be available in the near future.) Just a few issues, nothing major.

@StephanTLavavej StephanTLavavej removed their assignment Sep 24, 2021
@StephanTLavavej StephanTLavavej self-assigned this Sep 28, 2021
@StephanTLavavej StephanTLavavej removed their assignment Sep 28, 2021
Copy link
Contributor

@mnatsuhara mnatsuhara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome! Thanks for this great work :)

I've made a few comments, but they're for small comment nitpicks and one suggested restructuring (hence, no changes requested).

If you don't want to reset testing for these comments, perhaps @StephanTLavavej can add them to his list of cleanups for the next big cleanup PR! :)

@mnatsuhara mnatsuhara removed their assignment Nov 11, 2021
@StephanTLavavej
Copy link
Member

Thanks! Since this code is fresh in our minds, I'll validate and push changes instead of deferring cleanups until later. 🧠

StephanTLavavej and others added 4 commits November 11, 2021 16:54
Co-authored-by: Miya Natsuhara <46756417+mnatsuhara@users.noreply.github.com>
Co-authored-by: Miya Natsuhara <46756417+mnatsuhara@users.noreply.github.com>
@StephanTLavavej
Copy link
Member

@AdamBucior @mnatsuhara I've pushed changes, plus a merge with main since it had been a while.

@StephanTLavavej StephanTLavavej self-assigned this Nov 12, 2021
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 254fca2 into microsoft:main Nov 13, 2021
@StephanTLavavej
Copy link
Member

Thanks for implementing this major optimization in the STL's highest-complexity algorithm! 🚀 😻 🎉

@AdamBucior AdamBucior deleted the is_permutation-optimization branch November 13, 2021 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<xutility>: is_permutation() could detect reversed sequences
4 participants