
Consider pinning your spaCy version in requirements.txt? #178

Open

honnibal opened this issue Nov 15, 2017 · 1 comment

@honnibal

I just noticed that your requirements.txt doesn't pin to any particular version of spaCy or NLTK.

We've recently pushed spaCy 2, and while we've endeavoured to keep breaking changes to a minimum, it's a pretty big release: https://github.com/explosion/spaCy/releases/tag/v2.0.2

Even if the API doesn't change, there's the potential for problematic train/test skew for you if we make bug fixes to the tokenization, especially for languages other than English. Our compatibility policy is that changes that can affect statistical models can be made on minor releases --- e.g. spaCy 2.1.0 might fix some bug in the Hungarian tokenizer that affects a large number of tokens for that language. This means that sometimes, models trained with one minor version will suffer decreased accuracy if another version of the library is used at runtime.

There are also potential performance considerations. There's currently an open ticket about performance degradation of the tokenizer. It's unfortunate that this problem made it into the release, and we're working on it. But in the meantime, users who do a fresh installation of torchtext might find their preprocessing is much slower.
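For illustration, a pinned entry in requirements.txt could look something like the following; the exact bounds are only an example, and the project would want to use whichever versions its models were actually trained and tested with:

```
# Example pins only -- substitute the versions actually tested against.
spacy>=2.0.0,<2.1.0   # allow 2.0.x patch releases; exclude minor releases that may change tokenization
nltk>=3.2,<4.0        # illustrative range, not a recommendation
```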

@jekbradbury (Contributor)

Our policy so far has been to treat SpaCy and NLTK as optional dependencies and use whatever version the user already has installed. Choosing the "spacy" tokenizer option is just a convenience for manually creating a lambda that calls SpaCy's English tokenizer.
But that's not actually incompatible with providing a version in requirements.txt, since the optional dependencies there aren't installed or checked by pip install torchtext, so we'll go ahead and pin.
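For context, a minimal sketch of the manual route described above, assuming spaCy 2.x with the en_core_web_sm model installed and the Field-based torchtext data API of that era (the model name and field arguments are illustrative, not the project's actual code):

```python
import spacy
from torchtext import data  # torchtext's Field-based API (circa 2017)

# Load spaCy's English pipeline; the model name depends on the installed spaCy version.
spacy_en = spacy.load("en_core_web_sm")

def tokenize_en(text):
    # Run only spaCy's tokenizer and return plain token strings.
    return [tok.text for tok in spacy_en.tokenizer(text)]

# Roughly equivalent to Field(tokenize="spacy"): the spaCy version used is
# whatever happens to be installed in the user's environment.
TEXT = data.Field(tokenize=tokenize_en, lower=True)
```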
