Incremental search indexing #701
Comments
Does the system in use (or SQLAlchemy) have anything like Django signals? That is how you would typically set something like this up on a Django project. If I had some guidance on the signals part of this bug, I could help implement the rest.
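For readers unfamiliar with the pattern being referenced: Django's signals let code register receivers that fire when a model is saved or deleted, which is how search-index updates are usually hooked in. The sketch below is a minimal hand-rolled stand-in for that dispatcher (real Django code would use `django.db.models.signals.post_save`); the `Signal` class, `update_search_index` receiver, and the string-based "instance" are illustrative inventions, not Warehouse or Django code.

```python
# Minimal stand-in for Django's signal dispatch, to illustrate the idea.
# Real code would connect a receiver to django.db.models.signals.post_save.
class Signal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        # Register a callback to run whenever the signal is sent.
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Notify every registered receiver of the event.
        for receiver in self._receivers:
            receiver(sender, **kwargs)


post_save = Signal()
reindexed = []


def update_search_index(sender, instance=None, **kwargs):
    # In a real project this would enqueue a task to reindex one record.
    reindexed.append(instance)


post_save.connect(update_search_index)
post_save.send(sender="Project", instance="warehouse")
print(reindexed)  # ['warehouse']
```

The key property is decoupling: the model-saving code never needs to know that a search index exists, only that a `post_save` signal fires.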
Yes! It's a tad more complicated than in Django, but you can see something similar already done at warehouse/cache/origin/__init__.py#L20-L49. SQLAlchemy only tells you what has changed during a flush (which is distinct from committing the transaction), so you need one hook to record which things need to be reindexed, and another to actually trigger the reindex. Although, now that I think about it, since this would be done in Celery (or should be, anyway), you only need the flush hook, because our Celery tasks automatically get buffered in memory and get sent after the transaction is committed.
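The two hooks described above can be sketched with SQLAlchemy's session events: an `after_flush` listener records which objects changed, and an `after_commit` listener hands the accumulated work to the task queue only once the transaction is durable. This is a simplified sketch, not Warehouse's actual code; the `Project` model, `reindex_queue` list (standing in for a Celery task), and SQLite database are all assumptions for the demo.

```python
from itertools import chain

from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Project(Base):
    __tablename__ = "projects"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)


# Stand-in for enqueuing a Celery reindex task.
reindex_queue = []


@event.listens_for(Session, "after_flush")
def record_changed_projects(session, flush_context):
    # A flush is not a commit: just remember what changed for now.
    # (Deleted objects would trigger an index *removal* in real code.)
    changed = session.info.setdefault("changed_projects", set())
    for obj in chain(session.new, session.dirty, session.deleted):
        if isinstance(obj, Project):
            changed.add(obj.name)


@event.listens_for(Session, "after_commit")
def enqueue_reindex(session):
    # Only after the transaction commits do we hand work to the queue.
    changed = session.info.pop("changed_projects", set())
    if changed:
        reindex_queue.append(sorted(changed))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Project(name="warehouse"))
    session.add(Project(name="pip"))
    session.commit()

print(reindex_queue)  # [['pip', 'warehouse']]
```

Because Warehouse's Celery tasks are buffered until commit (as noted above), the `after_commit` half here is effectively what that buffering already provides, leaving only the flush hook to write.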
(In case @robhudson doesn't know: during the Warehouse team's current grant-funded effort to improve Warehouse, we're prioritizing getting it to the point where we can redirect pypi.python.org to pypi.org so the site is more sustainable and reliable; more details at our development roadmap.) Because I figure we'll want to get this sorted out before we flip the switch, I'm marking this for the launch milestone, but @ewdurbin can of course override me if I'm wrong.
In our meeting this week we decided this belongs in a future milestone on our development roadmap and does not have to block the launch. Anyone who wants to help work on this (such as Rob): for directions on getting set up, see our Getting Started Guide. If you are working on this issue and have questions, please feel free to ask them here.
@honzakral I really appreciated your help in #727 (comment) and asked my colleagues: what would be the highest-value Elasticsearch question in Warehouse we could ask you to advise or help with? I got the answer: incremental search (this issue). I'd be deeply appreciative if you could share your thoughts or even help us add this feature to Warehouse.
I am always happy to help! In Django I just use this code [0] to make it happen. It's a sample project so a bit simplistic, but it works. For more production-ready code you'd want to capture the DB IDs of new/updated records (no distinction needed, both will result in an index operation). The question is how frequently this happens: if a lot (> hundreds/second) you might want to push the serialized documents into some queue and have one process read from the queue and stream to Elasticsearch via the `streaming_bulk` helper.

0 - https://github.com/HonzaKral/es-django-example/blob/master/qa/models.py#L129-L146
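The queue-consumer idea above can be sketched without a live Elasticsearch cluster by focusing on the generator that drains the queue and yields bulk actions; in production you would feed such a generator to `elasticsearch.helpers.streaming_bulk`. The `queued_actions` function, the `(doc_id, source)` queue-item shape, and the `projects` index name are assumptions for illustration, not code from this thread.

```python
import queue


def queued_actions(q, index="projects"):
    """Drain a queue of (doc_id, source) pairs and yield bulk actions.

    In production this generator would be passed to
    elasticsearch.helpers.streaming_bulk(es_client, queued_actions(q)),
    which streams actions to Elasticsearch in batches.
    """
    while True:
        try:
            doc_id, source = q.get_nowait()
        except queue.Empty:
            return
        if source is None:
            # A deleted record becomes a delete action.
            yield {"_op_type": "delete", "_index": index, "_id": doc_id}
        else:
            # New and updated records both become index actions,
            # which is why no new/updated distinction is needed.
            yield {"_op_type": "index", "_index": index,
                   "_id": doc_id, "_source": source}


q = queue.Queue()
q.put((1, {"name": "warehouse"}))
q.put((2, None))  # a deletion
actions = list(queued_actions(q))
print(actions[0]["_op_type"], actions[1]["_op_type"])  # index delete
```

Using a single consumer process this way also serializes writes to the index, which avoids racing updates for the same document.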
I'm interested in taking this on if nobody else has already started. I'll get the project set up and re-familiarize myself with the code base over the next few days and report back with any questions or comments. |
@robhudson Please feel free to go ahead and work on this! And as you probably know, we're available to talk.
First pass of this is complete in #3693!
Partial/Incremental index updates are now live again. We're monitoring for empty search results and will reopen if this attempt goes poorly. |
Currently, the only supported way to update the search index is via periodic bulk indexing. Ideally we'll have something that can subscribe to the changes made to the database and trigger an update (or removal, in the case of deletion) of a project in the Elasticsearch index.