This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

cleanup + switch to GitHub Actions + add fst automata support + massive compile time improvements #6

Merged
merged 15 commits into from
Mar 9, 2020


BurntSushi
Owner

@BurntSushi BurntSushi commented Feb 22, 2020

(This PR can't be merged until a new release of fst is out.)

@BurntSushi BurntSushi force-pushed the ag/ci branch 6 times, most recently from 328772d to f852567 Compare February 22, 2020 03:05
@BurntSushi BurntSushi force-pushed the ag/ci branch 8 times, most recently from f4d3f38 to 6450622 Compare March 4, 2020 12:53
@BurntSushi BurntSushi changed the title from "cleanup + switch to GitHub Actions + add fst automata support" to "cleanup + switch to GitHub Actions + add fst automata support + massive compile time improvements" Mar 8, 2020
@BurntSushi BurntSushi force-pushed the ag/ci branch 2 times, most recently from 6a73cc0 to d58937c Compare March 8, 2020 23:18
It doesn't actually test anything, but it builds. Without std, there's
really no way to test this crate.
When the 'fst1' feature is enabled, this will provide impls of the
`fst::Automaton` trait for every DFA type (dense and sparse).

Note that this is not intended to be released until the fst 0.4 (or 1.0)
release. Also, the 'fst1' feature is disabled by default since it is
quite niche. (Well, regex-automata is itself niche, but fst support is
probably even more niche!)
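
For readers unfamiliar with the shape of such an impl, here is a minimal sketch. It is not the code from this PR: the `Automaton` trait below is a local stand-in that only mirrors a few methods of `fst::Automaton` so the example compiles without the fst crate, and `TinyDfa`, `literal` and `run` are hypothetical names made up for illustration.

```rust
/// Local stand-in mirroring a few methods of `fst::Automaton`
/// (assumption: declared here only so the sketch has no dependencies).
pub trait Automaton {
    type State;
    fn start(&self) -> Self::State;
    fn is_match(&self, state: &Self::State) -> bool;
    fn can_match(&self, state: &Self::State) -> bool;
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State;
}

/// Hypothetical table-driven DFA. State 0 is the dead state.
pub struct TinyDfa {
    start: usize,
    accepting: Vec<bool>,
    /// transitions[state][byte] = next state
    transitions: Vec<[usize; 256]>,
}

impl TinyDfa {
    /// Builds a DFA that matches exactly the literal `lit`.
    pub fn literal(lit: &[u8]) -> TinyDfa {
        // One dead state plus one state per matched prefix length.
        let n = lit.len() + 2;
        let mut transitions = vec![[0usize; 256]; n];
        for (i, &b) in lit.iter().enumerate() {
            transitions[i + 1][b as usize] = i + 2;
        }
        let mut accepting = vec![false; n];
        accepting[n - 1] = true;
        TinyDfa { start: 1, accepting, transitions }
    }
}

impl Automaton for TinyDfa {
    // The automaton state is just the DFA state identifier.
    type State = usize;

    fn start(&self) -> usize {
        self.start
    }

    fn is_match(&self, state: &usize) -> bool {
        self.accepting[*state]
    }

    fn can_match(&self, state: &usize) -> bool {
        // Once in the dead state, no further input can lead to a match.
        *state != 0
    }

    fn accept(&self, state: &usize, byte: u8) -> usize {
        self.transitions[*state][byte as usize]
    }
}

/// Drives any automaton over `input`, stopping early on a dead state.
pub fn run<A: Automaton>(aut: &A, input: &[u8]) -> bool {
    let mut state = aut.start();
    for &b in input {
        if !aut.can_match(&state) {
            return false;
        }
        state = aut.accept(&state, b);
    }
    aut.is_match(&state)
}
```

The point of the design is that the DFA's state identifier doubles as the automaton state, so each step is a single table lookup, and the dead state lets a caller stop early.
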
And also add a '\w' compilation benchmark.
We can now use regex-syntax's UTF-8 support directly.
This adds sub-commands to it and makes it easier to inspect and benchmark
automata construction.

We even add a 'debug-nfa' sub-command that uses an undocumented API to
build an NFA directly. Hopefully soon regex-automata will export a proper
NFA API.
This fixes a number of warnings, removes various allow(..) blocks
and other nominal things.
This simplifies the NFA quite a bit by implementing reversal in one
place: the implementation of concatenation. This way, we don't need to
waste time reversing the HIR or pay for the extra allocations that come
with it.

We also eliminate some intermediate allocations when compiling certain
opcodes, such as alternations.
A sparse state permits combining a bunch of alternations into one
state without a proliferation of 'alt' chains. Its use is somewhat
limited since it requires that all choices have equal priority, but
it does have one very specific use case: large UTF-8 automata.

We don't use the new sparse state in this commit, but a subsequent
commit will introduce a more efficient means of compiling UTF-8
automata which will utilize sparse states.
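
As an illustration of the idea only (the real NFA types in this crate are not shown in this conversation), the sketch below uses a hypothetical `State` enum. `to_sparse` folds an alternation of single-range states into one sparse state, and refuses when the ranges overlap, since overlapping choices would depend on priority. `sparse_next` then finds a transition by binary search.

```rust
use std::cmp::Ordering;

/// Hypothetical NFA state type, loosely modeled on the description above.
#[derive(Debug, Clone, PartialEq)]
pub enum State {
    /// A single byte range transition.
    Range { start: u8, end: u8, next: usize },
    /// An ordered alternation: earlier alternates have higher priority.
    Union { alternates: Vec<usize> },
    /// Many non-overlapping byte ranges, all of equal priority.
    Sparse { ranges: Vec<(u8, u8, usize)> },
    Match,
}

/// Tries to fold an alternation of `Range` states into one `Sparse` state.
/// Returns `None` if any alternate is not a `Range` or if ranges overlap.
pub fn to_sparse(states: &[State], alternates: &[usize]) -> Option<State> {
    let mut ranges = Vec::with_capacity(alternates.len());
    for &id in alternates {
        match states.get(id)? {
            State::Range { start, end, next } => {
                ranges.push((*start, *end, *next));
            }
            _ => return None,
        }
    }
    ranges.sort();
    // Overlapping ranges would make the result depend on priority.
    if ranges.windows(2).any(|w| w[0].1 >= w[1].0) {
        return None;
    }
    Some(State::Sparse { ranges })
}

/// Looks up the transition for `byte` in sorted, non-overlapping ranges.
pub fn sparse_next(ranges: &[(u8, u8, usize)], byte: u8) -> Option<usize> {
    ranges
        .binary_search_by(|&(start, end, _)| {
            if byte < start {
                Ordering::Greater
            } else if byte > end {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .ok()
        .map(|i| ranges[i].2)
}
```
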
This improves the byte code for expressions like 'a{2,5}' in a way
that makes epsilon closures a bit smaller. This makes traversing the
NFA faster, particularly if it's used directly for matching.
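
The text above does not spell out the new byte code, so the following is only a sketch of one way such an improvement can work. It assumes the optional part of 'a{2,5}' is laid out so that every "skip" branch jumps straight to the end (as in 'aa(a(a(a)?)?)?') rather than into the next optional (as in 'aaa?a?a?'). `Inst`, `chained` and `nested` are hypothetical names, not this crate's API.

```rust
use std::collections::BTreeSet;

/// Hypothetical, minimal NFA instruction set.
#[derive(Clone, Copy)]
pub enum Inst {
    Byte { byte: u8, next: usize },
    Split { first: usize, second: usize },
    Match,
}

/// All instructions reachable from `start` without consuming input.
pub fn epsilon_closure(prog: &[Inst], start: usize) -> BTreeSet<usize> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![start];
    while let Some(id) = stack.pop() {
        if !seen.insert(id) {
            continue;
        }
        if let Inst::Split { first, second } = prog[id] {
            stack.push(second);
            stack.push(first);
        }
    }
    seen
}

/// Emits `min` mandatory `a` instructions.
fn prefix(min: usize) -> Vec<Inst> {
    (0..min).map(|i| Inst::Byte { byte: b'a', next: i + 1 }).collect()
}

/// `a{min,max}` where each skip branch falls into the next optional.
pub fn chained(min: usize, max: usize) -> Vec<Inst> {
    let mut prog = prefix(min);
    for _ in min..max {
        let split = prog.len();
        prog.push(Inst::Split { first: split + 1, second: split + 2 });
        prog.push(Inst::Byte { byte: b'a', next: split + 2 });
    }
    prog.push(Inst::Match);
    prog
}

/// `a{min,max}` where each skip branch jumps directly to the end.
pub fn nested(min: usize, max: usize) -> Vec<Inst> {
    let end = min + 2 * (max - min); // index of the final Match
    let mut prog = prefix(min);
    for _ in min..max {
        let split = prog.len();
        prog.push(Inst::Split { first: split + 1, second: end });
        prog.push(Inst::Byte { byte: b'a', next: split + 2 });
    }
    prog.push(Inst::Match);
    prog
}

/// Simulates the NFA by tracking the set of live instructions.
pub fn is_match(prog: &[Inst], input: &[u8]) -> bool {
    let mut current = epsilon_closure(prog, 0);
    for &b in input {
        let mut next = BTreeSet::new();
        for &id in &current {
            if let Inst::Byte { byte, next: target } = prog[id] {
                if byte == b {
                    next.extend(epsilon_closure(prog, target));
                }
            }
        }
        current = next;
    }
    current.iter().any(|&id| matches!(prog[id], Inst::Match))
}
```

In this sketch both layouts accept the same inputs, but the closure taken right after the two mandatory bytes (instruction 2) contains 7 instructions in the chained layout and 3 in the nested one.
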
This uses Daciuk's algorithm to compile nearly minimal forward UTF-8
automata. For the reverse case, we use a data structure of my own
devising (a "range trie") to organize our Unicode character class such
that it is valid input to Daciuk's algorithm. While this is always
beneficial when constructing a DFA, in the reverse case, it does add
potentially significant overhead to building the NFA. Therefore, we
retain a simpler minimizer based on caching suffixes (like the one found
in the regex crate).
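
To make the "caching suffixes" idea concrete, here is a minimal sketch. It is not the code from this PR or from the regex crate: `Compiler`, `add_sequence` and `MATCH` are hypothetical names, and it does not implement Daciuk's algorithm or the range trie. Each UTF-8 byte sequence is compiled back to front, and a state is reused whenever an identical (byte range, next state) pair has already been built.

```rust
use std::collections::HashMap;

/// Sentinel "next state" meaning the sequence is complete.
pub const MATCH: usize = usize::MAX;

/// Hypothetical compiler that shares states between common suffixes.
pub struct Compiler {
    /// states[id] = (range start, range end, next state)
    pub states: Vec<(u8, u8, usize)>,
    cache: HashMap<(u8, u8, usize), usize>,
}

impl Compiler {
    pub fn new() -> Compiler {
        Compiler { states: Vec::new(), cache: HashMap::new() }
    }

    /// Returns an existing state for this transition, or creates one.
    fn add(&mut self, start: u8, end: u8, next: usize) -> usize {
        let key = (start, end, next);
        if let Some(&id) = self.cache.get(&key) {
            return id;
        }
        let id = self.states.len();
        self.states.push(key);
        self.cache.insert(key, id);
        id
    }

    /// Compiles one sequence of byte ranges, last range first, so that
    /// identical suffixes map to the same states. Returns the id of the
    /// state for the first range in the sequence.
    pub fn add_sequence(&mut self, seq: &[(u8, u8)]) -> usize {
        let mut next = MATCH;
        for &(start, end) in seq.iter().rev() {
            next = self.add(start, end, next);
        }
        next
    }
}
```

For example, the four byte-range sequences that cover three-byte UTF-8 encodings, `[E0][A0-BF][80-BF]`, `[E1-EC][80-BF][80-BF]`, `[ED][80-9F][80-BF]` and `[EE-EF][80-BF][80-BF]`, would need 12 states if compiled independently. With the cache above they need 8, because the trailing `[80-BF]` state is shared by all four and the `[80-BF][80-BF]` suffix is shared by two.
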

This also re-organizes the `nfa` module into a few different parts, adds
more docs and cleans up some things.
fst 0.4.0 depends on a newer version of Rust than 1.28, but since it's
an optional dependency, this is fine.
@BurntSushi BurntSushi merged commit 9340dcf into master Mar 9, 2020
@BurntSushi BurntSushi deleted the ag/ci branch March 9, 2020 00:14