cleanup + switch to GitHub Actions + add fst automata support + massive compile time improvements #6

BurntSushi · 2020-02-22T00:02:34Z

(This PR can't be merged until a new release of fst is out.)

`cargo test --lib --no-default-features` now builds, though it doesn't actually test anything. Without std, there's really no way to test this crate.

When the 'fst1' feature is enabled, this will provide impls of the `fst::Automaton` trait for every DFA type (dense and sparse). Note that this is not intended to be released until the fst 0.4 (or 1.0) release. Also, the 'fst1' feature is disabled by default since it is quite niche. (Well, regex-automata is itself niche, but fst support is probably even more niche!)
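To make the shape of such an impl concrete, here is a minimal sketch. It assumes a local stand-in trait that mirrors the core methods of fst's `Automaton` trait (`start`, `is_match`, `can_match`, `accept`) so that the example compiles without the fst crate. `ToyDenseDfa`, `literal` and `run` are hypothetical names for illustration and are not part of regex-automata or fst.

```rust
/// Local stand-in mirroring the core of fst's `Automaton` trait.
/// (Assumption: simplified; the real trait lives in the fst crate.)
pub trait Automaton {
    type State;
    fn start(&self) -> Self::State;
    fn is_match(&self, state: &Self::State) -> bool;
    fn can_match(&self, _state: &Self::State) -> bool {
        true
    }
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State;
}

/// Hypothetical toy dense DFA: `table[state * 256 + byte]` is the next state.
/// State 0 is the dead state, from which no match is possible.
pub struct ToyDenseDfa {
    table: Vec<usize>,
    start: usize,
    matches: Vec<bool>,
}

impl ToyDenseDfa {
    /// Builds a DFA that matches exactly the given literal.
    pub fn literal(lit: &[u8]) -> ToyDenseDfa {
        // State i + 1 means "i bytes of the literal have been matched".
        let nstates = lit.len() + 2;
        let mut table = vec![0usize; nstates * 256];
        for (i, &b) in lit.iter().enumerate() {
            table[(i + 1) * 256 + b as usize] = i + 2;
        }
        let mut matches = vec![false; nstates];
        matches[lit.len() + 1] = true;
        ToyDenseDfa { table, start: 1, matches }
    }
}

impl Automaton for ToyDenseDfa {
    type State = usize;

    fn start(&self) -> usize {
        self.start
    }
    fn is_match(&self, state: &usize) -> bool {
        self.matches[*state]
    }
    fn can_match(&self, state: &usize) -> bool {
        // Once in the dead state, a caller can stop searching early.
        *state != 0
    }
    fn accept(&self, state: &usize, byte: u8) -> usize {
        self.table[*state * 256 + byte as usize]
    }
}

/// Drives any automaton over an input, the way a caller might.
pub fn run<A: Automaton>(aut: &A, input: &[u8]) -> bool {
    let mut state = aut.start();
    for &b in input {
        if !aut.can_match(&state) {
            return false;
        }
        state = aut.accept(&state, b);
    }
    aut.is_match(&state)
}
```

A DFA's state identifier maps naturally onto the trait's associated `State` type, and the dead state gives `can_match` a cheap answer, which is presumably what makes DFAs a good fit for this kind of trait.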

This adds a reverse compilation benchmark and a '\w' compilation benchmark.

We can now use regex-syntax's UTF-8 support directly.

This adds sub-commands to the debug tool and makes it easier to inspect and benchmark automata construction. We even add a 'debug-nfa' sub-command that uses an undocumented API to build an NFA directly. Hopefully soon regex-automata will export a proper NFA API.

This fixes a number of warnings, removes various allow(..) blocks and other nominal things.

This simplifies the NFA quite a bit by implementing reversal in one place: the implementation of concatenation. This way, we avoid wasting time reversing the HIR, along with the extra allocations that come with it. We also eliminate some intermediate allocations when compiling certain opcodes, such as alternations.
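The idea that reversal only needs to be handled at concatenation can be illustrated with a toy. The sketch below is not the NFA compiler from this PR: it assumes a hypothetical three-variant `Hir` and, instead of emitting NFA states, expands the expression into the finite set of strings it matches. The only branch that looks at the `reverse` flag is the one for `Concat`.

```rust
/// Hypothetical, heavily reduced HIR for illustration only.
pub enum Hir {
    Byte(u8),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
}

/// Expands `hir` into every string it matches. When `reverse` is true,
/// every string comes out reversed, without ever building a reversed HIR.
pub fn expand(hir: &Hir, reverse: bool) -> Vec<Vec<u8>> {
    match hir {
        // A single byte reads the same in both directions.
        Hir::Byte(b) => vec![vec![*b]],
        // Alternation is direction-agnostic: just recurse.
        Hir::Alt(alts) => alts.iter().flat_map(|h| expand(h, reverse)).collect(),
        Hir::Concat(parts) => {
            // The only place where `reverse` changes behavior:
            // visit the sub-expressions back to front.
            let ordered: Vec<&Hir> = if reverse {
                parts.iter().rev().collect()
            } else {
                parts.iter().collect()
            };
            let mut acc: Vec<Vec<u8>> = vec![vec![]];
            for part in ordered {
                let suffixes = expand(part, reverse);
                let mut next = Vec::with_capacity(acc.len() * suffixes.len());
                for prefix in &acc {
                    for suffix in &suffixes {
                        let mut s = prefix.clone();
                        s.extend_from_slice(suffix);
                        next.push(s);
                    }
                }
                acc = next;
            }
            acc
        }
    }
}
```

For `(ab|c)d`, the forward expansion is `abd` and `cd`, and the reverse expansion is `dba` and `dc`.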

A sparse state permits combining a bunch of alternations into one state without a proliferation of 'alt' chains. Its use is somewhat limited since it requires that all choices have equal priority, but it does have one very specific use case: large UTF-8 automata. We don't use the new sparse state in this commit, but a subsequent commit will introduce a more efficient means of compiling UTF-8 automata which will utilize sparse states.
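As a rough illustration of what a sparse state buys, the sketch below assumes an NFA state enum with hypothetical `Range`, `Sparse`, `Union` and `Match` variants. The names and layout are assumptions for this example and may not match the crate's internal representation. A sparse state holds sorted, non-overlapping byte ranges, so one lookup replaces a walk through a chain of alternations.

```rust
use std::cmp::Ordering;

/// A byte range `start..=end` leading to state `next`.
#[derive(Clone, Copy, Debug)]
pub struct Transition {
    pub start: u8,
    pub end: u8,
    pub next: usize,
}

/// Hypothetical NFA state representation for illustration only.
pub enum State {
    /// A single byte range.
    Range(Transition),
    /// Many byte ranges of equal priority, sorted and non-overlapping.
    Sparse(Vec<Transition>),
    /// Epsilon transitions to alternatives, in priority order.
    Union(Vec<usize>),
    Match,
}

impl State {
    /// Returns the state reached by consuming `byte`, if any.
    pub fn step(&self, byte: u8) -> Option<usize> {
        match self {
            State::Range(t) => {
                if t.start <= byte && byte <= t.end {
                    Some(t.next)
                } else {
                    None
                }
            }
            // Because the ranges are sorted and disjoint, a binary search
            // finds the (at most one) range containing `byte`.
            State::Sparse(ranges) => ranges
                .binary_search_by(|t| {
                    if t.end < byte {
                        Ordering::Less
                    } else if t.start > byte {
                        Ordering::Greater
                    } else {
                        Ordering::Equal
                    }
                })
                .ok()
                .map(|i| ranges[i].next),
            // These states do not consume input.
            State::Union(_) | State::Match => None,
        }
    }
}
```

The equal-priority restriction mentioned above shows up here as the disjointness requirement: since at most one range can contain a given byte, there is never a choice to prioritize.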

This improves the byte code for expressions like 'a{2,5}' in a way that makes epsilon closures a bit smaller. This makes traversing the NFA faster, particularly if it's used directly for matching.
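One plausible way to see the effect, sketched under assumptions: compile the optional tail of `a{2,5}` either as a chain `a?a?a?`, where skipping one optional lands on the next one, or with every skip branch jumping straight to the end. The toy below builds both programs and computes epsilon closures. `State`, `build_chained`, `build_skipping` and `epsilon_closure` are hypothetical names, and this is not necessarily the exact byte code the commit produces.

```rust
use std::collections::BTreeSet;

/// Hypothetical toy NFA state for illustration only.
#[derive(Debug)]
pub enum State {
    Byte { byte: u8, next: usize },
    Union(Vec<usize>),
    Match,
}

/// `b{min,max}` as `min` copies of `b` followed by a chain of `b?`,
/// where skipping one optional falls through to the next optional.
pub fn build_chained(byte: u8, min: usize, max: usize) -> Vec<State> {
    let mut states = Vec::new();
    for _ in 0..min {
        let next = states.len() + 1;
        states.push(State::Byte { byte, next });
    }
    for _ in min..max {
        let id = states.len();
        states.push(State::Union(vec![id + 1, id + 2]));
        states.push(State::Byte { byte, next: id + 2 });
    }
    states.push(State::Match);
    states
}

/// Same language, but every skip branch jumps directly to the final state.
pub fn build_skipping(byte: u8, min: usize, max: usize) -> Vec<State> {
    // Index of the final state: `min` byte states plus two per optional.
    let end = min + 2 * max.saturating_sub(min);
    let mut states = Vec::new();
    for _ in 0..min {
        let next = states.len() + 1;
        states.push(State::Byte { byte, next });
    }
    for _ in min..max {
        let id = states.len();
        states.push(State::Union(vec![id + 1, end]));
        states.push(State::Byte { byte, next: id + 2 });
    }
    states.push(State::Match);
    states
}

/// All states reachable from `start` without consuming input.
pub fn epsilon_closure(states: &[State], start: usize) -> BTreeSet<usize> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![start];
    while let Some(id) = stack.pop() {
        if !seen.insert(id) {
            continue;
        }
        if let State::Union(alts) = &states[id] {
            stack.extend(alts.iter().copied());
        }
    }
    seen
}
```

For `a{2,5}`, both programs have 9 states. After matching `aa`, the closure of state 2 contains 7 states in the chained program but only 3 in the skipping one, since the skip branch no longer passes through the remaining unions.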

This uses Daciuk's algorithm to compile nearly minimal forward UTF-8 automata. For the reverse case, we use a data structure of my own devising (a "range trie") to organize our Unicode character class such that it is valid input to Daciuk's algorithm. While this is always beneficial when constructing a DFA, in the reverse case, it does add potentially significant overhead to building the NFA. Therefore, we retain a simpler minimizer based on caching suffixes (like the one found in the regex crate). This also re-organizes the `nfa` module into a few different parts, adds more docs and cleans up some things.
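For readers unfamiliar with Daciuk's algorithm, here is a minimal sketch of its core: building a minimal acyclic automaton incrementally from sorted input. It makes several simplifying assumptions. It works on plain byte strings rather than UTF-8 byte ranges, it uses an unbounded registry (so the result is fully minimal rather than "nearly minimal"), and `Node`, `Builder` and `accepts` are hypothetical names unrelated to the code in this PR.

```rust
use std::collections::HashMap;

/// A frozen automaton node. Transitions are sorted because input is sorted.
#[derive(Clone, Debug, Default, PartialEq, Eq, Hash)]
pub struct Node {
    pub is_final: bool,
    pub trans: Vec<(u8, usize)>,
}

/// A node on the path of the most recently inserted key.
#[derive(Default)]
struct Unfinished {
    node: Node,
    /// Byte of the pending transition to the next unfinished node.
    last: Option<u8>,
}

pub struct Builder {
    nodes: Vec<Node>,
    /// Maps an already frozen node to its id, so equal nodes are shared.
    registry: HashMap<Node, usize>,
    /// `unfinished[0]` is the root.
    unfinished: Vec<Unfinished>,
}

impl Builder {
    pub fn new() -> Builder {
        Builder {
            nodes: Vec::new(),
            registry: HashMap::new(),
            unfinished: vec![Unfinished::default()],
        }
    }

    /// Keys must be inserted in strictly increasing lexicographic order.
    pub fn insert(&mut self, key: &[u8]) {
        // Length of the prefix shared with the previous key.
        let mut prefix = 0;
        while prefix < key.len()
            && prefix < self.unfinished.len()
            && self.unfinished[prefix].last == Some(key[prefix])
        {
            prefix += 1;
        }
        // Everything past the shared prefix can never change again.
        self.freeze_down_to(prefix + 1);
        for &b in &key[prefix..] {
            self.unfinished.last_mut().unwrap().last = Some(b);
            self.unfinished.push(Unfinished::default());
        }
        self.unfinished.last_mut().unwrap().node.is_final = true;
    }

    /// Freezes the deepest unfinished nodes until `len` of them remain.
    fn freeze_down_to(&mut self, len: usize) {
        while self.unfinished.len() > len {
            let child = self.unfinished.pop().unwrap();
            let id = self.intern(child.node);
            let parent = self.unfinished.last_mut().unwrap();
            let byte = parent.last.take().unwrap();
            parent.node.trans.push((byte, id));
        }
    }

    /// Returns the id of an equivalent frozen node, adding it if needed.
    fn intern(&mut self, node: Node) -> usize {
        if let Some(&id) = self.registry.get(&node) {
            return id;
        }
        let id = self.nodes.len();
        self.nodes.push(node.clone());
        self.registry.insert(node, id);
        id
    }

    /// Returns all nodes and the id of the root.
    pub fn finish(mut self) -> (Vec<Node>, usize) {
        self.freeze_down_to(1);
        let root = self.unfinished.pop().unwrap().node;
        let root_id = self.intern(root);
        (self.nodes, root_id)
    }
}

/// Walks the automaton from `root` and reports whether `key` is accepted.
pub fn accepts(nodes: &[Node], root: usize, key: &[u8]) -> bool {
    let mut id = root;
    for &b in key {
        match nodes[id].trans.iter().find(|&&(tb, _)| tb == b) {
            Some(&(_, next)) => id = next,
            None => return false,
        }
    }
    nodes[id].is_final
}
```

Inserting `bat`, `cat` and `mat` yields 4 nodes instead of the 10 a plain trie would need, because the shared suffix `at` is frozen once and reused. The sorted-input requirement is also why the reverse case needs extra work to put its input in a suitable order, as the description above notes.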

fst 0.4.0 depends on a newer version of Rust than 1.28, but since it's an optional dependency, this is fine.

style: switch to rustfmt

2ca646d

BurntSushi force-pushed the ag/ci branch 6 times, most recently from 328772d to f852567 Compare February 22, 2020 03:05

BurntSushi mentioned this pull request Feb 22, 2020

lots of polishing, regular maintenance, preparing for 0.4 (or 1.0?) release BurntSushi/fst#96

Merged

9 tasks

BurntSushi force-pushed the ag/ci branch 8 times, most recently from f4d3f38 to 6450622 Compare March 4, 2020 12:53

BurntSushi force-pushed the ag/ci branch from 6450622 to c7a29c5 Compare March 8, 2020 16:00

BurntSushi changed the title ~~cleanup + switch to GitHub Actions + add fst automata support~~ cleanup + switch to GitHub Actions + add fst automata support + massive compile time improvements Mar 8, 2020

BurntSushi force-pushed the ag/ci branch 2 times, most recently from 6a73cc0 to d58937c Compare March 8, 2020 23:18

BurntSushi added 10 commits March 8, 2020 19:28

debug: update deps

32c7fc6

tests: make cargo test --lib --no-default-features work

1fba563

It doesn't actually test anything, but it builds. Without std, there's really no way to test this crate.

ci: switch to GitHub Actions

a91c061

bench: update to criterion 0.3

5c0880d

bench: add reverse compilation benchmark

9c02234

And also add a '\w' compilation benchmark.

deps: remove utf8-ranges dependency

900b583

We can now use regex-syntax's UTF-8 support directly.

polish: fix warnings and such

6b72bf0

This fixes a number of warnings, removes various allow(..) blocks and other nominal things.

BurntSushi added 4 commits March 8, 2020 19:28

nfa: improve repetition compilation

ef7f6c7

This improves the byte code for expressions like 'a{2,5}' in a way that makes epsilon closures a bit smaller. This makes traversing the NFA faster, particularly if it's used directly for matching.

deps: upgrade to fst 0.4.0

aa2654c

fst 0.4.0 depends on a newer version of Rust than 1.28, but since it's an optional dependency, this is fine.

BurntSushi force-pushed the ag/ci branch from d58937c to aa2654c Compare March 8, 2020 23:28

BurntSushi merged commit 9340dcf into master Mar 9, 2020

BurntSushi deleted the ag/ci branch March 9, 2020 00:14
