This repository has been archived by the owner on Jul 7, 2023. It is now read-only.
cleanup + switch to GitHub Actions + add fst automata support + massive compile time improvements #6
Merged
It doesn't actually test anything, but it builds. Without std, there's really no way to test this crate.
When the 'fst1' feature is enabled, this will provide impls of the `fst::Automaton` trait for every DFA type (dense and sparse). Note that this is not intended to be released until the fst 0.4 (or 1.0) release. Also, the 'fst1' feature is disabled by default since it is quite niche. (Well, regex-automata is itself niche, but fst support is probably even more niche!)
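As a rough illustration of what such an impl involves, here is a minimal sketch. The `Automaton` trait below is a local stand-in loosely modeled on the one in fst, and `ToyDfa` and `run` are hypothetical names invented for this example; none of this is the code from the commit.

```rust
/// Local stand-in trait (an assumption for this sketch, not fst's actual trait).
trait Automaton {
    type State;
    fn start(&self) -> Self::State;
    fn is_match(&self, state: &Self::State) -> bool;
    fn can_match(&self, state: &Self::State) -> bool;
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State;
}

/// Hypothetical dense DFA: `table[state * 256 + byte]` is the next state.
/// State 0 is the dead state.
struct ToyDfa {
    table: Vec<usize>,
    start: usize,
    accepting: Vec<bool>,
}

impl ToyDfa {
    /// Build a DFA that matches exactly the byte string `lit`.
    fn literal(lit: &[u8]) -> ToyDfa {
        // One dead state plus one state per consumed prefix length.
        let n = lit.len() + 2;
        let mut table = vec![0usize; n * 256];
        for (i, &b) in lit.iter().enumerate() {
            // State i + 1 means "i bytes consumed so far".
            table[(i + 1) * 256 + b as usize] = i + 2;
        }
        let mut accepting = vec![false; n];
        accepting[lit.len() + 1] = true;
        ToyDfa { table, start: 1, accepting }
    }
}

impl Automaton for ToyDfa {
    type State = usize;

    fn start(&self) -> usize {
        self.start
    }
    fn is_match(&self, state: &usize) -> bool {
        self.accepting[*state]
    }
    fn can_match(&self, state: &usize) -> bool {
        // Once in the dead state, no further input can lead to a match.
        *state != 0
    }
    fn accept(&self, state: &usize, byte: u8) -> usize {
        self.table[*state * 256 + byte as usize]
    }
}

/// Drive any automaton over an input, bailing out early on a dead state.
fn run<A: Automaton>(aut: &A, input: &[u8]) -> bool {
    let mut state = aut.start();
    for &b in input {
        if !aut.can_match(&state) {
            return false;
        }
        state = aut.accept(&state, b);
    }
    aut.is_match(&state)
}
```

With this shape, `run(&ToyDfa::literal(b"abc"), b"abc")` walks the table one byte at a time, which is the kind of byte-by-byte stepping an fst traversal needs from an automaton.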
And also add a '\w' compilation benchmark.
We can now use regex-syntax's UTF-8 support directly.
This adds sub-commands to it and makes it easier to inspect and benchmark automata construction. We even add a 'debug-nfa' sub-command that uses an undocumented API to build an NFA directly. Hopefully soon regex-automata will export a proper NFA API.
This fixes a number of warnings, removes various allow(..) blocks and other nominal things.
This simplifies the NFA quite a bit by implementing reversal in one place: the implementation of concatenation. This way, we don't need to waste time reversing the HIR, nor pay for the extra allocations that come with it. We also eliminate some intermediate allocations when compiling certain opcodes, such as alternations.
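The idea of handling reversal only at concatenation can be sketched with a toy expression type. `Hir` here is a hypothetical two-variant enum, and the "compiled program" is just the byte order the automaton would consume; the real compiler emits NFA states instead.

```rust
/// Hypothetical, heavily simplified expression tree for this sketch.
enum Hir {
    Byte(u8),
    Concat(Vec<Hir>),
}

/// Append the bytes in the order a compiled automaton would consume them.
/// Reversal is handled only in the `Concat` arm: the sub-expressions are
/// visited back to front, so the tree itself is never rewritten.
fn compile(hir: &Hir, reverse: bool, out: &mut Vec<u8>) {
    match hir {
        Hir::Byte(b) => out.push(*b),
        Hir::Concat(parts) => {
            if reverse {
                for p in parts.iter().rev() {
                    compile(p, reverse, out);
                }
            } else {
                for p in parts {
                    compile(p, reverse, out);
                }
            }
        }
    }
}
```

Because nested concatenations also see the `reverse` flag, the whole expression comes out reversed without a separate pass over the tree.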
A sparse state permits combining a bunch of alternations into one state without a proliferation of 'alt' chains. Its use is somewhat limited since it requires that all choices have equal priority, but it does have one very specific use case: large UTF-8 automata. We don't use the new sparse state in this commit, but a subsequent commit will introduce a more efficient means of compiling UTF-8 automata which will utilize sparse states.
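To make the trade-off concrete, here is a small sketch comparing the two encodings of the same set of byte ranges. The `State` enum and the helper names are assumptions made for illustration and do not mirror the crate's internal types.

```rust
type StateId = usize;

/// Hypothetical NFA state type for this sketch.
enum State {
    /// A single byte range leading to one next state.
    Range { start: u8, end: u8, next: StateId },
    /// Many byte ranges in one state; all choices have equal priority.
    Sparse { ranges: Vec<(u8, u8, StateId)> },
    /// Epsilon transitions to each alternate, in priority order.
    Union { alternates: Vec<StateId> },
    Match,
}

/// Encode the ranges as one `Range` state each, tied together by a `Union`.
fn with_union(ranges: &[(u8, u8)]) -> Vec<State> {
    let mut states = vec![State::Match]; // id 0
    let mut alternates = Vec::new();
    for &(start, end) in ranges {
        alternates.push(states.len());
        states.push(State::Range { start, end, next: 0 });
    }
    states.push(State::Union { alternates });
    states
}

/// Encode the same ranges as a single `Sparse` state.
fn with_sparse(ranges: &[(u8, u8)]) -> Vec<State> {
    let ranges: Vec<(u8, u8, StateId)> =
        ranges.iter().map(|&(s, e)| (s, e, 0)).collect();
    vec![State::Match, State::Sparse { ranges }]
}

/// Look up the next state for `byte` in a sparse state's ranges.
fn sparse_next(ranges: &[(u8, u8, StateId)], byte: u8) -> Option<StateId> {
    ranges
        .iter()
        .find(|&&(s, e, _)| s <= byte && byte <= e)
        .map(|&(_, _, next)| next)
}
```

For three ranges, the union encoding needs five states (three ranges, one union, one match) while the sparse encoding needs two, and stepping over a byte involves no epsilon transitions at all.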
This improves the byte code for expressions like 'a{2,5}' in a way that makes epsilon closures a bit smaller. This makes traversing the NFA faster, particularly if it's used directly for matching.
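One way to see how the shape of the program affects closure size is to compare two hand-built programs for 'a{2,5}'. This is a sketch under assumptions: the `State` type is hypothetical, and the two layouts (optional parts that fall through to each other versus optional parts that each exit straight to the match state) are illustrative rather than a dump of what the compiler emitted before and after this commit.

```rust
use std::collections::BTreeSet;

/// Hypothetical NFA state type for this sketch.
enum State {
    Byte { byte: u8, next: usize },
    Union(Vec<usize>),
    Match,
}

/// All states reachable from `start` through epsilon (union) transitions.
fn epsilon_closure(nfa: &[State], start: usize) -> BTreeSet<usize> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![start];
    while let Some(id) = stack.pop() {
        if !seen.insert(id) {
            continue;
        }
        if let State::Union(alts) = &nfa[id] {
            stack.extend(alts.iter().copied());
        }
    }
    seen
}

fn a(next: usize) -> State {
    State::Byte { byte: b'a', next }
}

/// Like `aaa?a?a?`: skipping an optional `a` lands on the next union.
fn chained() -> Vec<State> {
    vec![
        a(1),
        a(2),
        State::Union(vec![3, 4]),
        a(4),
        State::Union(vec![5, 6]),
        a(6),
        State::Union(vec![7, 8]),
        a(8),
        State::Match,
    ]
}

/// Like `aa(a(a(a)?)?)?`: skipping any optional part exits to the match state.
fn nested() -> Vec<State> {
    vec![
        a(1),
        a(2),
        State::Union(vec![3, 8]),
        a(4),
        State::Union(vec![5, 8]),
        a(6),
        State::Union(vec![7, 8]),
        a(8),
        State::Match,
    ]
}
```

Both programs have nine states and accept the same strings, but the closure of state 2 visits every remaining state in the chained layout and only the states it directly needs in the nested one.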
This uses Daciuk's algorithm to compile nearly minimal forward UTF-8 automata. For the reverse case, we use a data structure of my own devising (a "range trie") to organize our Unicode character class such that it is valid input to Daciuk's algorithm. While this is always beneficial when constructing a DFA, in the reverse case, it does add potentially significant overhead to building the NFA. Therefore, we retain a simpler minimizer based on caching suffixes (like the one found in the regex crate). This also re-organizes the `nfa` module into a few different parts, adds more docs and cleans up some things.
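The simpler suffix-caching idea mentioned above can be sketched as follows. This is not Daciuk's algorithm and not the range trie; it only shows how compiling each UTF-8 byte-range sequence back to front, with a cache keyed on (range, next state), lets sequences share common suffixes. `Builder` and `Transition` are hypothetical names for this example.

```rust
use std::collections::HashMap;

type StateId = usize;

/// A state with a single byte-range transition. Id 0 is the final state.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Transition {
    start: u8,
    end: u8,
    next: StateId,
}

/// Hypothetical builder for this sketch.
struct Builder {
    /// `states[i]` holds the transition of state id `i + 1`.
    states: Vec<Transition>,
    /// Maps an already-built transition to its state id.
    cache: HashMap<Transition, StateId>,
}

impl Builder {
    fn new() -> Builder {
        Builder { states: Vec::new(), cache: HashMap::new() }
    }

    /// Add one sequence of byte ranges and return its entry state.
    /// Ranges are compiled last to first so identical suffixes hit the cache.
    fn add_sequence(&mut self, seq: &[(u8, u8)]) -> StateId {
        let mut next = 0; // the final state
        for &(start, end) in seq.iter().rev() {
            let t = Transition { start, end, next };
            next = match self.cache.get(&t) {
                Some(&id) => id,
                None => {
                    self.states.push(t);
                    let id = self.states.len();
                    self.cache.insert(t, id);
                    id
                }
            };
        }
        next
    }
}
```

Feeding it the four byte-range sequences that cover three-byte UTF-8 encodings (U+0800 through U+FFFF, surrogates excluded) yields 8 states instead of the 12 a naive compile would create, since the trailing `80-BF` state and one `80-BF 80-BF` tail are reused.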
fst 0.4.0 depends on a newer version of Rust than 1.28, but since it's an optional dependency, this is fine.
(This PR can't be merged until a new release of `fst` is out.)