
Incremental search indexing #701

Closed
dstufft opened this issue Sep 26, 2015 · 11 comments
Labels
- feature request
- requires triaging (maintainers need to do initial inspection of issue)
- search (Elasticsearch, search filters, and so on)

Comments

@dstufft
Member

dstufft commented Sep 26, 2015

Currently, the only supported way to update the search index is via periodic bulk indexing. Ideally we'll have something that can subscribe to changes made to the database and trigger an update (or a removal, in the case of deletion) of a project in the Elasticsearch index.

@robhudson
Contributor

Does the system in use (or SQLAlchemy) have anything like Django signals? That is how you would typically set something like this up in a Django project.

If I had some guidance on the signals part of this bug I could help implement the rest.
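For reference, SQLAlchemy's closest analogue to Django signals is its event system, e.g. a Session `after_flush` listener. Below is a minimal, hypothetical sketch of that mechanism (not Warehouse's actual code; the `Project` model and `changed` list are illustrative):

```python
# Sketch: using SQLAlchemy's event system (its analogue to Django signals)
# to observe which objects a flush created, modified, or deleted.
from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Project(Base):
    __tablename__ = "projects"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

changed = []  # records (operation, project name) per flush

@event.listens_for(Session, "after_flush")
def record_changes(session, flush_context):
    # session.new / session.dirty / session.deleted describe what
    # this flush wrote to the database.
    for obj in list(session.new) + list(session.dirty):
        if isinstance(obj, Project):
            changed.append(("index", obj.name))
    for obj in session.deleted:
        if isinstance(obj, Project):
            changed.append(("delete", obj.name))

engine = create_engine("sqlite://")  # in-memory DB for the example
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Project(name="warehouse"))
    session.commit()  # commit flushes, which fires the listener

print(changed)  # -> [('index', 'warehouse')]
```

Note the hook fires on flush, not commit, which is exactly the distinction discussed below.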

@dstufft
Member Author

dstufft commented Jun 2, 2016

Yes! It's a tad more complicated than in Django, but you can see something similar already done at warehouse/cache/origin/__init__.py#L20-L49. SQLAlchemy only tells you what has changed during a flush (which is distinct from committing the transaction), so you need one hook to record what needs to be reindexed, and another to actually trigger the reindex.

ALTHOUGH, now that I think about it, since this would be done in Celery (or should be, anyway), you only need the flush hook, because our Celery tasks automatically get buffered in memory and are sent after the transaction is committed.
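The buffer-until-commit behavior described above can be sketched with plain Python (a hypothetical illustration of the pattern; `TaskBuffer` and the names here are not Warehouse's actual code, where Celery provides this behavior):

```python
# Sketch: task calls made during a transaction are buffered in memory
# and only dispatched once the transaction commits, so a flush hook
# alone is enough to enqueue reindex work safely.
class TaskBuffer:
    """Buffers task calls during a transaction; dispatches on commit."""

    def __init__(self, send):
        self._send = send       # real dispatch, e.g. celery_task.delay
        self._pending = []

    def delay(self, *args):
        self._pending.append(args)   # buffered, not sent yet

    def after_commit(self):
        for args in self._pending:   # drain the buffer post-commit
            self._send(*args)
        self._pending.clear()

    def after_rollback(self):
        self._pending.clear()        # aborted transactions send nothing

sent = []
reindex = TaskBuffer(lambda project_id: sent.append(project_id))

# Simulated flush hook: record the project IDs that changed.
for project_id in (1, 7):
    reindex.delay(project_id)
assert sent == []      # still buffered; transaction not yet committed

reindex.after_commit()  # commit happened; tasks are dispatched now
print(sent)             # -> [1, 7]
```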

@nlhkabu nlhkabu added the requires triaging maintainers need to do initial inspection of issue label Jul 2, 2016
@nlhkabu nlhkabu added the search Elasticsearch, search filters, and so on label May 13, 2017
@brainwane
Contributor

(In case @robhudson doesn't know: during the Warehouse team's current grant-funded effort to improve Warehouse, we're prioritizing getting it to the point where we can redirect pypi.python.org to pypi.org so the site is more sustainable and reliable; more details at our development roadmap.)

Because I figure we'll want to get this sorted out before we flip the switch, I'm marking this for the launch milestone, but @ewdurbin can of course override me if I'm wrong.

@brainwane brainwane modified the milestones: 4: Launch: redirect pypi.python.org to pypi.org, 6. Post Legacy Shutdown Mar 1, 2018
@brainwane
Contributor

In our meeting this week we decided this belongs in a future milestone on our development roadmap and does not have to block the launch.

Anyone who wants to help work on this (such as Rob): For directions for getting set up, see our Getting Started Guide. If you are working on this issue and have questions, please feel free to ask them here, in #pypa-dev on Freenode, or on the pypa-dev mailing list. Thanks!

@brainwane
Contributor

@honzakral I really appreciated your help in #727 (comment) and asked my colleagues: what would be the highest-value ElasticSearch question in Warehouse we could ask you to advise or help with? I got the answer: incremental search (this issue). I'd be deeply appreciative if you could share your thoughts or even help us add this feature to Warehouse.

@honzakral
Contributor

I am always happy to help! In Django I just use this code (0) to make it happen. It's a sample project, so a bit simplistic, but it works.

For more production-ready code you'd want to capture the DB IDs of new/updated (no distinction needed; both result in an index operation) and removed models, and send them to a Celery task that serializes and indexes them into Elasticsearch.

The question is how frequently this happens. If a lot (hundreds per second or more), you might want to push the serialized documents onto a queue and have one process read from that queue and stream to Elasticsearch via the bulk helpers (1). If not, you can index directly from the Celery worker by calling .save() on the DocType subclass.

0 - https://github.com/HonzaKral/es-django-example/blob/master/qa/models.py#L129-L146
1 - http://elasticsearch-py.readthedocs.io/en/master/helpers.html#bulk-helpers
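The high-volume path described above can be sketched with the standard library alone (a hypothetical illustration: `fake_bulk` stands in for elasticsearch-py's bulk helpers, which would need a live cluster):

```python
# Sketch: serialized documents are pushed onto a queue, and a single
# consumer drains them in batches that would be handed to elasticsearch-py's
# bulk helpers (helpers.bulk / helpers.streaming_bulk) in a real deployment.
import queue

def drain_in_batches(q, batch_size):
    """Yield lists of up to batch_size items until the queue is empty."""
    batch = []
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

docs = queue.Queue()
for pid in range(5):  # five serialized documents awaiting indexing
    docs.put({"_id": pid, "name": f"project-{pid}"})

batches = []
def fake_bulk(actions):  # placeholder for helpers.bulk(client, actions)
    batches.append(list(actions))

for batch in drain_in_batches(docs, batch_size=2):
    fake_bulk(batch)

print([len(b) for b in batches])  # -> [2, 2, 1]
```

Batching this way amortizes the per-request overhead of talking to Elasticsearch, which is the point of the bulk helpers.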

@robhudson
Contributor

I'm interested in taking this on if nobody else has already started. I'll get the project set up and re-familiarize myself with the code base over the next few days and report back with any questions or comments.

@brainwane
Contributor

@robhudson Please feel free to go ahead and work on this! And as you probably know, we're available to talk in #pypa-dev on Freenode, or on the pypa-dev mailing list.

@ewdurbin
Member

First pass of this is complete in #3693!

@ewdurbin
Member

See #3700 and #3693 for an ultimately failed attempt, as inspiration!

@ewdurbin
Member

Partial/Incremental index updates are now live again. We're monitoring for empty search results and will reopen if this attempt goes poorly.
