Rewrite for spaCy v3 #173
Closed
Commits:
- Various small fixes
- fix align indices in split_by_doc
This has diverged too much from the version on master, as this targets spaCy v3. It doesn't really make sense to do a merge here. Instead I've labelled this "develop". When we're ready to release we'll name the stuff on master something like …
The current `spacy-transformers` had to go to quite some effort to work around limitations in spaCy v2 and Thinc v7. It also had to take on a lot of tasks that the Transformers library now handles itself. Limitations in Thinc (and the fact that the library was undocumented!) were particularly painful, because transformers really aren't too useful unless you can get in and fiddle with the model architectures. With the forthcoming spaCy v3, Thinc v8, and Hugging Face's constant awesome improvements, things are now much nicer, so we can make this library much, much smaller.
I also want to make a slightly different trade-off in the library. Previously we tried to do a lot and offer a lot in the extension attributes. This made it hard to keep up with all of the different transformer models as they're released. It also sometimes meant that the wrapper could get in the way of the underlying transformer models.
The new trade-off is to simply do less, at least in terms of the alignments and extension attributes. We now offer just one extension attribute, `doc._.trf_data`, which provides a `spacy_transformers.types.TransformerData` object: a simple dataclass that holds the tensor outputs for the doc, the tokens data, and alignment information.
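As a rough sketch of how this could look in use (the attribute names `tokens`, `tensors` and `align` are assumptions based on the description above, and the `en_core_web_trf` pipeline name is hypothetical):

```python
import spacy

# Hypothetical pipeline name; any pipeline that includes the Transformer
# component would behave the same way.
nlp = spacy.load("en_core_web_trf")
doc = nlp("spaCy v3 makes transformers much easier to work with.")

# doc._.trf_data is the TransformerData dataclass described above. The
# attribute names below are assumptions based on that description.
trf_data = doc._.trf_data
print(trf_data.tokens)   # wordpiece token data from the transformers tokenizer
print(trf_data.tensors)  # the transformer's tensor outputs for this doc
print(trf_data.align)    # alignment between spaCy tokens and wordpieces
```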
If you want more extension attributes, it's easy to design and set them yourself by providing a custom `annotation_setter` function. Your function will receive a batch of documents and a `FullTransformerBatch` object that holds the input and output objects passed from the transformers library, so you know you'll be able to implement whatever you need.
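For instance, a custom setter might look something like this. The `doc_data` attribute on the batch object is an assumption; the description above only promises access to the transformers inputs and outputs:

```python
from spacy.tokens import Doc

# Hypothetical extension attribute, registered for illustration.
Doc.set_extension("trf_last_hidden", default=None)

def custom_annotation_setter(docs, trf_batch):
    # Assumption: the FullTransformerBatch exposes per-doc output data;
    # the exact attribute name may differ in the final API.
    for doc, data in zip(docs, trf_batch.doc_data):
        doc._.trf_last_hidden = data
```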
The previous version also went to some effort to rebatch data by sentence, to allow prediction on long documents. I still believe in this idea, but hard-coding it could easily get in the way. Instead, the `Transformer` component now lets you provide a function that maps a batch of documents to a batch of `Span` objects. You can even have spans that overlap, or which only cover subsets of the `Doc` objects. The `doc._.trf_data` object will tell you which spans the transformers data refers to, making it easy to use the output.
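A minimal span-mapping function, assuming the batch-of-docs-in, batch-of-spans-out signature described above, could look like this:

```python
from typing import List
from spacy.tokens import Doc, Span

def sentence_spans(docs: List[Doc]) -> List[List[Span]]:
    # One list of spans per doc: here, simply its sentences, so that long
    # documents get processed by the transformer sentence by sentence.
    return [list(doc.sents) for doc in docs]
```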
The workflow for training models with `spacy-transformers` is also dramatically better, using the improvements from spaCy v3 and Thinc v8. The main workflow is to write a config file, using Thinc's new config system. You can find two early example config files here:

`Transformer` pipeline component: https://github.com/explosion/spacy-transformers/blob/feature/spacy-v3/examples/listen/joint-dep-pos-distilbert.cfg

You run the config files with the examples/train_from_config.py script (in future you'll actually use `spacy train-from-config`). I'm not that satisfied with the names of everything yet, but all the pieces are in place and it works (although I still need to tune the hyper-parameters to get better models).
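To give a feel for the config system itself, here is a small sketch using Thinc v8's `Config` class. The section and option names in the string are illustrative assumptions, not the exact schema of the example files linked above:

```python
from thinc.api import Config

# A fragment in the general style of the example .cfg files. The keys
# shown are assumptions for illustration only.
CONFIG_STR = """
[nlp]
lang = "en"
pipeline = ["transformer", "tagger", "parser"]

[training]
max_epochs = 10
"""

config = Config().from_str(CONFIG_STR)
print(config["nlp"]["pipeline"])  # ['transformer', 'tagger', 'parser']
```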
The `Transformer` pipeline component lets you run the transformer once to set the `doc._.trf_data` extension, and also lets downstream components use the transformer features and pass gradients back to the transformer, allowing easy multitask learning. I'm hoping we can have a pipeline where we run one transformer model shared between a whole pipeline, including tagging, parsing, NER, morphology, coref and SRL.
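In construction terms, a shared-transformer pipeline could be assembled roughly like this (this sketch uses the spaCy v3 API as it later shipped; at the time of this PR the exact names were still in flux, and the model name is an illustrative assumption):

```python
import spacy

nlp = spacy.blank("en")

# One shared Transformer component at the front of the pipeline.
nlp.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased"}})

# Downstream components would be configured (via their model settings) to
# listen to the shared transformer's output and backpropagate into it.
nlp.add_pipe("tagger")
nlp.add_pipe("parser")
```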
TODO