… finalize behavior.
Codecov Report
| | lsif-clean-sqlite (base) | #5477 | +/- |
| --- | --- | --- | --- |
| Coverage | 47.39% | 47.47% | +0.07% |
| Files | 745 | 747 | +2 |
| Lines | 45876 | 45919 | +43 |
| Branches | 2711 | 2704 | -7 |
| Hits | 21742 | 21799 | +57 |
| Misses | 22112 | 22093 | -19 |
| Partials | 2022 | 2027 | +5 |
@chrismwendt @lguychard I would actually like to merge this into the other sqlite branch so that we don't pollute master with two large commits (one that kind of undoes the other). It will also give @felixfbecker a chance to do another pass of https://github.com/sourcegraph/sourcegraph/pull/5332 without a weird context switch (I assume it would be easier since he's been living in the other set of diffs).
The Problem
The old SQLite database stored everything in a document blob: a gzipped, JSON-encoded payload containing the document's ranges, monikers, package information, hovers, and definition and reference results.
This allowed us to answer any query that doesn't involve looking at two files with a single, simple SQL query returning that blob.
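To make the old layout concrete, here is a rough sketch of that blob round-trip in TypeScript. The `DocumentData` shape and field names are assumptions for illustration, not the actual schema.

```typescript
import { gzipSync, gunzipSync } from 'zlib'

// Hypothetical shape of the per-document payload described above.
interface DocumentData {
    ranges: { [id: string]: { startLine: number; startCharacter: number; endLine: number; endCharacter: number } }
    monikers: { [id: string]: { kind: string; scheme: string; identifier: string } }
    packageInformation: { [id: string]: { name: string; version: string } }
    hovers: { [id: string]: string }
}

// Serialize a document into the gzipped, JSON-encoded blob stored in SQLite.
function encodeDocument(document: DocumentData): Buffer {
    return gzipSync(Buffer.from(JSON.stringify(document)))
}

// Decode a blob pulled out of SQLite back into a document.
function decodeDocument(blob: Buffer): DocumentData {
    return JSON.parse(gunzipSync(blob).toString())
}
```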
However, after removing the result sets at import time so that the server no longer has to trace graph edges at query time (which is undesirable -- why do it on every query when you can do it just once?), the size of the definition and reference results became apparent. Note that this didn't cause a problem by itself, but it did reveal the extent of one that already existed.
Removing the result sets produces much larger document blobs, which will become a problem at some point: in order to answer queries quickly about a range in a large document, we may need to pull multiple megabytes of unrelated information out of a SQLite file.
The Solution
Re-model the database so that documents no longer track their own definition and reference results (but they do retain their ranges, monikers, package information, and hovers). Profiling has shown that the OVERWHELMING proportion of the data is in these two fields.
We now put definition and reference results in another table. However, experiments over the Labor Day holiday showed that storing data at this scale one row per result is infeasible (the per-tuple overhead is too high at insertion time and too large on disk), so we need the same gzipped, JSON-encoded trickery here as well.
So far, we can't:
- store it all in one giant blob (it would be much larger than a document),
- store it along with a document or as a sibling of a document (it would not be easy to share the same definition or reference results between documents), or
- store it in individual rows (due to the required throughput of the converter and the rarity of rare earth materials required to produce enough disk space).
What we can do is shard definition and reference results over several rows, with a chunk count that scales with the size of the input dump. Any identifier for a definition or reference result in a document can then be mapped (using the same hash function and the total number of chunks) to the id of the result chunk that contains it. Answering a query now requires loading a second blob for definition and reference results, but these chunks can be cached in memory in the same way document blobs are. See the code for details!
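To make the chunking scheme concrete, here is a minimal TypeScript sketch of the read path. It assumes a `resultChunks(id, data)` table whose `data` column holds a gzipped, JSON-encoded map; the table name, hash function, and helper names are illustrative, not the actual implementation.

```typescript
import { gunzipSync } from 'zlib'
import Database from 'better-sqlite3'

// Hypothetical shape of one result chunk: a map from definition/reference
// result ids to the locations they resolve to.
interface ResultChunk {
    [resultId: string]: { documentPath: string; rangeId: string }[]
}

// A simple string hash; the real converter may use something different.
function hashKey(resultId: string): number {
    let hash = 0
    for (const char of resultId) {
        hash = (hash * 31 + char.charCodeAt(0)) | 0
    }
    return Math.abs(hash)
}

// In-memory cache of decoded chunks, analogous to the document blob cache.
const chunkCache = new Map<number, ResultChunk>()

// Resolve a definition or reference result id to its locations by loading
// (or reusing) the chunk that the id hashes to.
function lookupResult(db: Database.Database, numChunks: number, resultId: string) {
    const chunkId = hashKey(resultId) % numChunks

    let chunk = chunkCache.get(chunkId)
    if (chunk === undefined) {
        const row = db.prepare('SELECT data FROM resultChunks WHERE id = ?').get(chunkId) as { data: Buffer }
        chunk = JSON.parse(gunzipSync(row.data).toString()) as ResultChunk
        chunkCache.set(chunkId, chunk)
    }

    return chunk[resultId]
}
```

The important property is that the converter and the query path compute the chunk id from the result identifier with the same hash function and total chunk count, so no per-result index table is needed.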
Results
Uploading is now 2-3x faster 🎉 for select benchmarks from this document.