… finalize behavior.
Codecov Report
| | lsif-clean-sqlite (base) | #5477 | +/- |
| --- | --- | --- | --- |
| Coverage | 47.39% | 47.47% | +0.07% |
| Files | 745 | 747 | +2 |
| Lines | 45876 | 45919 | +43 |
| Branches | 2711 | 2704 | -7 |
| Hits | 21742 | 21799 | +57 |
| Misses | 22112 | 22093 | -19 |
| Partials | 2022 | 2027 | +5 |
@chrismwendt @lguychard I would actually like to merge this into the other sqlite branch so that we don't pollute master with two large commits (one that kind of undoes the other). It will also give @felixfbecker a chance to do another pass of https://github.com/sourcegraph/sourcegraph/pull/5332 without a weird context switch (I assume it would be easier since he's been living in the other set of diffs).
The Problem
The old SQLite database stored everything in a document blob: a gzipped, JSON-encoded payload containing the document's ranges, monikers, package information, hovers, and definition and reference results.
This allowed us to answer any query that doesn't involve looking at two files with a single, simple SQL query returning that blob.
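To make the old layout concrete, here is a rough sketch of that blob round-trip in TypeScript. The `DocumentData` shape and field names are assumptions for illustration, not the actual schema.

```typescript
import { gzipSync, gunzipSync } from 'zlib'

// Hypothetical shape of the per-document payload described above.
interface DocumentData {
    ranges: { [id: string]: { startLine: number; startCharacter: number; endLine: number; endCharacter: number } }
    monikers: { [id: string]: { kind: string; scheme: string; identifier: string } }
    packageInformation: { [id: string]: { name: string; version: string } }
    hovers: { [id: string]: string }
}

// Serialize a document into the gzipped, JSON-encoded blob stored in SQLite.
function encodeDocument(document: DocumentData): Buffer {
    return gzipSync(Buffer.from(JSON.stringify(document)))
}

// Decode a blob pulled out of SQLite back into a document.
function decodeDocument(blob: Buffer): DocumentData {
    return JSON.parse(gunzipSync(blob).toString())
}
```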
However, after removing the result sets at import time so that the server no longer has to trace graph edges at query time (which is undesirable -- why do it on every query when you can do it just once?), the size of the definition and reference results became apparent. Note that this didn't cause a problem by itself, but it did reveal the extent of one that already existed.
Removing the result sets produces much larger document blobs, which will become a problem at some point: in order to answer queries quickly about a range in a large document, we may need to pull multiple megabytes of unrelated information out of a SQLite file.
The Solution
Re-model the database so that documents no longer track their own definition and reference results (but they do retain their ranges, monikers, package information, and hovers). Profiling has shown that the OVERWHELMING proportion of the data is in these two fields.
We now put definition and reference results in another table. However, experiments over the Labor Day holiday showed that storing data at this scale one row per result is infeasible (the per-tuple overhead is too high at insertion time and too large on disk), so we need the same gzipped, JSON-encoded trickery here as well.
So far, we can't:
- store it all in one giant blob (it would be much larger than a document),
- store it along with a document or as a sibling of a document (it would not be easy to share the same definition or reference results between documents), or
- store it in individual rows (due to the required throughput of the converter and the rarity of rare earth materials required to produce enough disk space).
What we can do is shard definition and reference results over several rows, with a chunk count that scales with the size of the input dump. Any identifier for a definition or reference result in a document can then be mapped (using the same hash function and the total number of chunks) to the id of the result chunk that contains it. Answering a query now requires loading a second blob for definition and reference results, but these chunks can be cached in memory in the same way document blobs are. See the code for details!
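To make the chunking scheme concrete, here is a minimal TypeScript sketch of the read path. It assumes a `resultChunks(id, data)` table whose `data` column holds a gzipped, JSON-encoded map; the table name, hash function, and helper names are illustrative, not the actual implementation.

```typescript
import { gunzipSync } from 'zlib'
import Database from 'better-sqlite3'

// Hypothetical shape of one result chunk: a map from definition/reference
// result ids to the locations they resolve to.
interface ResultChunk {
    [resultId: string]: { documentPath: string; rangeId: string }[]
}

// A simple string hash; the real converter may use something different.
function hashKey(resultId: string): number {
    let hash = 0
    for (const char of resultId) {
        hash = (hash * 31 + char.charCodeAt(0)) | 0
    }
    return Math.abs(hash)
}

// In-memory cache of decoded chunks, analogous to the document blob cache.
const chunkCache = new Map<number, ResultChunk>()

// Resolve a definition or reference result id to its locations by loading
// (or reusing) the chunk that the id hashes to.
function lookupResult(db: Database.Database, numChunks: number, resultId: string) {
    const chunkId = hashKey(resultId) % numChunks

    let chunk = chunkCache.get(chunkId)
    if (chunk === undefined) {
        const row = db.prepare('SELECT data FROM resultChunks WHERE id = ?').get(chunkId) as { data: Buffer }
        chunk = JSON.parse(gunzipSync(row.data).toString()) as ResultChunk
        chunkCache.set(chunkId, chunk)
    }

    return chunk[resultId]
}
```

The important property is that the converter and the query path compute the chunk id from the result identifier with the same hash function and total chunk count, so no per-result index table is needed.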
Results
Uploading is now 2-3x faster 🎉 for select benchmarks from this document.