
Incremental search indexing #701

Closed
dstufft opened this issue Sep 26, 2015 · 11 comments
Labels
- feature request
- requires triaging (maintainers need to do initial inspection of issue)
- search (Elasticsearch, search filters, and so on)

Comments

@dstufft
Member

dstufft commented Sep 26, 2015

Currently, the only supported way to update the search index is via periodic bulk indexing. Ideally we'll have something that can subscribe to changes made to the database and trigger an update (or a removal, in the case of deletion) of a project in the Elasticsearch index.

@robhudson
Contributor

Does the system in use (or SQLAlchemy) have anything like Django signals? That is how you would typically set something like this up in a Django project.

If I had some guidance on the signals part of this bug I could help implement the rest.
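For reference, SQLAlchemy's closest analogue to Django signals is its event system, e.g. a Session `after_flush` listener. Below is a minimal, hypothetical sketch of that mechanism (not Warehouse's actual code; the `Project` model and `changed` list are illustrative):

```python
# Sketch: using SQLAlchemy's event system (its analogue to Django signals)
# to observe which objects a flush created, modified, or deleted.
from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Project(Base):
    __tablename__ = "projects"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

changed = []  # records (operation, project name) per flush

@event.listens_for(Session, "after_flush")
def record_changes(session, flush_context):
    # session.new / session.dirty / session.deleted describe what
    # this flush wrote to the database.
    for obj in list(session.new) + list(session.dirty):
        if isinstance(obj, Project):
            changed.append(("index", obj.name))
    for obj in session.deleted:
        if isinstance(obj, Project):
            changed.append(("delete", obj.name))

engine = create_engine("sqlite://")  # in-memory DB for the example
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Project(name="warehouse"))
    session.commit()  # commit flushes, which fires the listener

print(changed)  # -> [('index', 'warehouse')]
```

Note the hook fires on flush, not commit, which is exactly the distinction discussed below.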

@dstufft
Member Author

dstufft commented Jun 2, 2016

Yes! It's a tad more complicated than in Django, but you can see something similar already done at warehouse/cache/origin/__init__.py#L20-L49. SQLAlchemy only tells you what has changed during a flush (which is distinct from committing the transaction), so you need one hook to record what needs to be reindexed, and another to actually trigger the reindex.

ALTHOUGH, now that I think about it, since this would be done in Celery (or should be, anyway), you only need the flush hook, because our Celery tasks automatically get buffered in memory and are sent after the transaction is committed.
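The buffer-until-commit behavior described above can be sketched with plain Python (a hypothetical illustration of the pattern; `TaskBuffer` and the names here are not Warehouse's actual code, where Celery provides this behavior):

```python
# Sketch: task calls made during a transaction are buffered in memory
# and only dispatched once the transaction commits, so a flush hook
# alone is enough to enqueue reindex work safely.
class TaskBuffer:
    """Buffers task calls during a transaction; dispatches on commit."""

    def __init__(self, send):
        self._send = send       # real dispatch, e.g. celery_task.delay
        self._pending = []

    def delay(self, *args):
        self._pending.append(args)   # buffered, not sent yet

    def after_commit(self):
        for args in self._pending:   # drain the buffer post-commit
            self._send(*args)
        self._pending.clear()

    def after_rollback(self):
        self._pending.clear()        # aborted transactions send nothing

sent = []
reindex = TaskBuffer(lambda project_id: sent.append(project_id))

# Simulated flush hook: record the project IDs that changed.
for project_id in (1, 7):
    reindex.delay(project_id)
assert sent == []      # still buffered; transaction not yet committed

reindex.after_commit()  # commit happened; tasks are dispatched now
print(sent)             # -> [1, 7]
```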

@nlhkabu nlhkabu added the requires triaging maintainers need to do initial inspection of issue label Jul 2, 2016
@nlhkabu nlhkabu added the search Elasticsearch, search filters, and so on label May 13, 2017
@brainwane
Contributor

(In case @robhudson doesn't know: during the Warehouse team's current grant-funded effort to improve Warehouse, we're prioritizing getting it to the point where we can redirect pypi.python.org to pypi.org so the site is more sustainable and reliable; more details at our development roadmap.)

Because I figure we'll want to get this sorted out before we flip the switch, I'm marking this for the launch milestone, but @ewdurbin can of course override me if I'm wrong.

@brainwane brainwane modified the milestones: 4: Launch: redirect pypi.python.org to pypi.org, 6. Post Legacy Shutdown Mar 1, 2018
@brainwane
Contributor

In our meeting this week we decided this belongs in a future milestone on our development roadmap and does not have to block the launch.

Anyone who wants to help work on this (such as Rob): For directions for getting set up, see our Getting Started Guide. If you are working on this issue and have questions, please feel free to ask them here, in #pypa-dev on Freenode, or on the pypa-dev mailing list. Thanks!

@brainwane
Contributor

@honzakral I really appreciated your help in #727 (comment) and asked my colleagues: what would be the highest-value ElasticSearch question in Warehouse we could ask you to advise or help with? I got the answer: incremental search (this issue). I'd be deeply appreciative if you could share your thoughts or even help us add this feature to Warehouse.

@honzakral
Contributor

I am always happy to help! In Django I just use this code (0) to make it happen. It's a sample project, so a bit simplistic, but it works.

For more production-ready code you'd want to capture the DB IDs of new/updated (no distinction needed; both result in an index operation) and removed models, and send them to a Celery task that serializes and indexes them into Elasticsearch.

The question is how frequently this happens. If a lot (hundreds per second or more), you might want to push the serialized documents onto a queue and have one process read from that queue and stream to Elasticsearch via the bulk helpers (1). If not, you can index directly from the Celery worker by calling .save() on the DocType subclass.

0 - https://github.com/HonzaKral/es-django-example/blob/master/qa/models.py#L129-L146
1 - http://elasticsearch-py.readthedocs.io/en/master/helpers.html#bulk-helpers
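The high-volume path described above can be sketched with the standard library alone (a hypothetical illustration: `fake_bulk` stands in for elasticsearch-py's bulk helpers, which would need a live cluster):

```python
# Sketch: serialized documents are pushed onto a queue, and a single
# consumer drains them in batches that would be handed to elasticsearch-py's
# bulk helpers (helpers.bulk / helpers.streaming_bulk) in a real deployment.
import queue

def drain_in_batches(q, batch_size):
    """Yield lists of up to batch_size items until the queue is empty."""
    batch = []
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

docs = queue.Queue()
for pid in range(5):  # five serialized documents awaiting indexing
    docs.put({"_id": pid, "name": f"project-{pid}"})

batches = []
def fake_bulk(actions):  # placeholder for helpers.bulk(client, actions)
    batches.append(list(actions))

for batch in drain_in_batches(docs, batch_size=2):
    fake_bulk(batch)

print([len(b) for b in batches])  # -> [2, 2, 1]
```

Batching this way amortizes the per-request overhead of talking to Elasticsearch, which is the point of the bulk helpers.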

@robhudson
Contributor

I'm interested in taking this on if nobody else has already started. I'll get the project set up and re-familiarize myself with the code base over the next few days and report back with any questions or comments.

@brainwane
Contributor

@robhudson Please feel free to go ahead and work on this! And as you probably know, we're available to talk in #pypa-dev on Freenode, or on the pypa-dev mailing list.

@ewdurbin
Member

First pass of this is complete in #3693!

@ewdurbin
Member

See #3700 and #3693 for an ultimately failed attempt, as inspiration!

@ewdurbin
Member

Partial/Incremental index updates are now live again. We're monitoring for empty search results and will reopen if this attempt goes poorly.
