
[FEATURE] EPR wavelettree #75

Merged · 22 commits · Nov 16, 2021
Conversation

@rrahn rrahn commented Jul 15, 2020

Finalises and improves the original PR for the EPR dictionary.
Supersedes #70
resolves seqan/product_backlog#21

This is a more efficient re-implementation of rank_support_int_v, roughly 20% faster than SeqAn2. For comparison, the SeqAn2 numbers:

Run on (8 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
Load Average: 2.72, 2.72, 2.78
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
rank<seqan::Dna>/100                    36.9 ns         36.7 ns     18915954
rank<seqan::Dna>/10000                  39.2 ns         38.8 ns     18693086
rank<seqan::Dna>/1000000                38.2 ns         38.0 ns     18565079
rank<seqan::Dna>/100000000              96.3 ns         96.0 ns      7416747
rank<seqan::Dna>/1000000000              118 ns          117 ns      5913111
rank<seqan::AminoAcid>/100              33.6 ns         33.5 ns     20678980
rank<seqan::AminoAcid>/10000            33.3 ns         33.2 ns     20833581
rank<seqan::AminoAcid>/1000000          80.8 ns         80.5 ns      8452371
rank<seqan::AminoAcid>/100000000         171 ns          170 ns      4500421
rank<seqan::AminoAcid>/1000000000        224 ns          222 ns      3202225

And the new EPR rank (using interleaved superblocks):

Run on (8 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
Load Average: 3.51, 2.90, 2.72
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
rank<4>/100               28.5 ns         28.3 ns     24970214
rank<4>/10000             28.2 ns         28.1 ns     24002853
rank<4>/1000000           28.8 ns         28.7 ns     24594369
rank<4>/100000000         73.5 ns         73.2 ns      9967818
rank<4>/1000000000        87.3 ns         87.1 ns      8899059
rank<27>/100              26.0 ns         25.9 ns     27164528
rank<27>/10000            27.1 ns         26.9 ns     26758205
rank<27>/1000000          60.9 ns         60.7 ns     11156445
rank<27>/100000000         113 ns          113 ns      6447750
rank<27>/1000000000        183 ns          181 ns      3406459

Old SDSL EPR branch (using an aligned int_vector, which wastes bits):

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
rank<4>/100               28.2 ns         28.0 ns     22793433
rank<4>/10000             27.8 ns         27.7 ns     24657611
rank<4>/1000000           30.9 ns         30.3 ns     22921885
rank<4>/100000000          110 ns          109 ns      6638280
rank<4>/1000000000         142 ns          141 ns      4901103
rank<27>/100              26.3 ns         26.1 ns     25351021
rank<27>/10000            27.3 ns         27.1 ns     26736436
rank<27>/1000000          59.7 ns         59.4 ns      9280866
rank<27>/100000000         178 ns          177 ns      4100545
rank<27>/1000000000        234 ns          230 ns      2543253
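
For readers unfamiliar with the structure, here is a minimal sketch of the core idea behind rank over a small alphabet, with the superblock counts for all symbols stored next to each other ("interleaved") so that one lookup touches a single contiguous entry. This is illustrative only, not the sdsl-lite implementation: the real EPR dictionary uses different block geometry and interleaves counts with the text; all names and block sizes below are made up.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Count occurrences of the 2-bit symbol c among the first n <= 32 symbols
// packed into the 64-bit word w (symbol j occupies bits 2j..2j+1).
inline uint64_t count_symbol(uint64_t w, uint8_t c, unsigned n)
{
    uint64_t const pat = 0x5555555555555555ULL * c; // c replicated into all fields
    uint64_t x = ~(w ^ pat);        // fields equal to c become 0b11
    x &= x >> 1;                    // low bit of a field set iff the field matched
    x &= 0x5555555555555555ULL;
    if (n < 32)
        x &= (1ULL << (2 * n)) - 1; // keep only the first n symbols
    return static_cast<uint64_t>(__builtin_popcountll(x)); // GCC/Clang builtin
}

// Toy rank dictionary over a 2-bit alphabet (sigma = 4, e.g. DNA).
struct rank_2bit
{
    static constexpr size_t words_per_sb = 4; // 128 symbols per superblock

    std::vector<uint64_t> text;               // 32 symbols per word
    std::vector<std::array<uint64_t, 4>> sb;  // interleaved cumulative counts:
                                              // all 4 symbol counts of one
                                              // superblock are stored adjacently

    explicit rank_2bit(std::vector<uint8_t> const & syms)
    {
        text.assign((syms.size() + 31) / 32, 0);
        for (size_t i = 0; i < syms.size(); ++i)
            text[i / 32] |= uint64_t(syms[i] & 3u) << (2 * (i % 32));

        std::array<uint64_t, 4> acc{};
        for (size_t w = 0; w < text.size(); ++w)
        {
            if (w % words_per_sb == 0)
                sb.push_back(acc);                       // counts before this superblock
            for (uint8_t c = 0; c < 4; ++c)
                acc[c] += count_symbol(text[w], c, 32);
        }
        if (sb.empty())
            sb.push_back(acc); // empty text: one all-zero entry
    }

    // Occurrences of symbol c in positions [0, i).
    uint64_t rank(uint8_t c, size_t i) const
    {
        size_t const w = i / 32;
        size_t const s = std::min(w / words_per_sb, sb.size() - 1);
        uint64_t r = sb[s][c];                           // one superblock lookup
        for (size_t j = s * words_per_sb; j < w; ++j)
            r += count_symbol(text[j], c, 32);           // full words in between
        if (i % 32)
            r += count_symbol(text[w], c, i % 32);       // partial last word
        return r;
    }
};
```

The point of the interleaving is cache behaviour: a query for any symbol reads the same superblock entry, so one cache line serves all symbols, whereas storing one count array per symbol (as in the aligned int_vector variant above) spreads a query's data across distant memory locations.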

@rrahn force-pushed the feature/fast_epr_wavelettree branch from 6b73bf9 to 4784365 on July 15, 2020, 16:44

h-2 commented Jul 21, 2020

Thank you for working on this first of all!

These are my small benchmark results. Note that none of these use the official search interface; instead they include code that avoids allocations inside the search (see seqan/seqan3#1991). Otherwise, additional slowdown occurs.

Also, this uses a SeqAn3 version from around February, since later changes broke the interfaces I added to avoid said allocations.

12 Threads

                                  lambda2   lambda3 WT   lambda3 EPR Seiler   lambda3 EPR Rahn
Wall-clock time of search+extend      22s          81s                  43s                32s
CPU-time of search-step              170s         789s                 389s               249s
Index size (on-disk)               397 MB       361 MB               711 MB             736 MB
Max. RSS                          1018 MB       608 MB               998 MB            1052 MB

Reading of the database and pre-processing are not included in the wall-clock time. [Deserialisation is a lot slower than SeqAn2's, by the way, around 5x, but this is somewhat expected.]

CPU time is measured within the program, so it is not 100% accurate, but it is a good indication and is consistent with the total wall-clock times.

6 Threads

                                  lambda2   lambda3 WT   lambda3 EPR Seiler   lambda3 EPR Rahn
Wall-clock time of search+extend      30s         114s                  58s                45s
CPU-time of search-step              126s         595s                 277s               193s
Index size (on-disk)               397 MB       361 MB               711 MB             736 MB
Max. RSS                           900 MB       568 MB               932 MB             985 MB

Conclusion

The new branch significantly improves performance for Lambda, but it is still notably slower than lambda2/SeqAn2. I know that lazy-loading and online processing of sequences in lambda3 adds about 10% run-time, but currently we are still at +50% run-time. All other steps that happen as part of the wall-clock time (alignment, output) use the same SeqAn2-based code in all versions of Lambda.
So even if we attribute an additional +10% to unknown sources, it seems that the search itself incurs +30% run-time. I have done no profiling, so I don't want to guess where this comes from, but it could very well be something outside the rank computation, which you obviously improved quite a bit!

In summary: This PR is a significant improvement and I wholeheartedly support merging it in the current form. But the search as a whole still hasn't caught up to SeqAn2 and more research in the area needs to be done.

P.S.: The serialised index of the new EPR is much bigger than in SeqAn2, although this is not reflected in memory usage. Any clue why?

@eseiler self-requested a review on September 10, 2020, 11:32
Comment on lines 383 to 386
//!\brief Does nothing for the rank_support_int structure.
void set_vector(const int_vector<>* /*other_text*/ ){} // TODO: Check where this interface is needed, since it is dangerous?
// I would be able to reset the text without recomputing the rank support structure which is in general a
// bad design.
You don't need it, since you store the text somewhere else (unless some other class expects the function to be available).
Usually you need this interface for serialisation, since you have to set the supported vector after deserialising the rank support vector.

@eseiler force-pushed the feature/fast_epr_wavelettree branch 2 times, most recently from da84588 to 9af1d0c on November 11, 2021, 14:57

eseiler commented Nov 11, 2021

From what I've seen, we cannot use wt_byte_test directly, as we do not return the original ranks.
E.g. T=mississippi (example01.txt):
Both the EPR and the other WTs have sigma=5, but the EPR returns ranks 0-4, while the other WTs return the original character ranks.
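
The mismatch can be made concrete with a small sketch (illustrative code, not the test suite's actual helpers): a byte wavelet tree answers rank for the original character values, whereas a structure over the effective alphabet first remaps the occurring characters to dense values 0..sigma-1 (in the example above, sigma=5 presumably because a sentinel character is counted alongside i, m, p, s).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Map each character occurring in `text` to its dense rank in the effective
// alphabet, i.e. to a value in 0..sigma-1, ordered by character value.
std::array<uint8_t, 256> effective_alphabet(std::string const & text)
{
    std::array<bool, 256> seen{};
    for (unsigned char c : text)
        seen[c] = true;

    std::array<uint8_t, 256> map{};
    uint8_t next = 0;
    for (int c = 0; c < 256; ++c)
        if (seen[c])
            map[static_cast<size_t>(c)] = next++;
    return map;
}
// For "mississippi" this yields i -> 0, m -> 1, p -> 2, s -> 3: a query on a
// structure over the effective alphabet must use the mapped value, while a
// byte wavelet tree is queried with the original character, so the two cannot
// share the same test verbatim.
```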

@eseiler force-pushed the feature/fast_epr_wavelettree branch 3 times, most recently from 66ea712 to 6dafa61 on November 11, 2021, 18:32

rrahn commented Nov 12, 2021

@eseiler many thanks for all your efforts!!! That totally 🤘
Please let me know if there is anything you'd like to discuss regarding the design.

@eseiler force-pushed the feature/fast_epr_wavelettree branch from 6dafa61 to a47c364 on November 12, 2021, 20:07

eseiler commented Nov 15, 2021

@rrahn @h-2

I'll try to have a closer look once again this week (last week I just focused on getting CI to work).
If I don't discover major things, would there be any objections to merging this PR?


rrahn commented Nov 15, 2021

I think it should be fine for now, since it can be tested thoroughly before the next release.
It seems there is still room for improvement regarding the space used by the serialised index, as well as performance relative to SeqAn2.

@eseiler changed the title from Feature/fast epr wavelettree to [FEATURE] EPR wavelettree on Nov 16, 2021
@eseiler merged commit 317f63c into xxsds:master on Nov 16, 2021

h-2 commented Nov 16, 2021

Thank you both for your work on this!

@eseiler mentioned this pull request on Oct 20, 2022
Successfully merging this pull request may close these issues.

Implement the EPR dictionary