Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document state storage #50298

Closed
wants to merge 5 commits into from
Closed

Document state storage #50298

wants to merge 5 commits into from

Conversation

tmat
Copy link
Member

@tmat tmat commented Jan 7, 2021

  • Refactors storage of document ids and states in ProjectState to a dedicated data structure: TextDocumentStates<TState>.
  • Remove sorting of the states by document id that was added to make serialization deterministic. Instead, use the order the documents were added to the project.
  • Use TextDocumentStates<TState> for source generated documents as well.
  • Improve performance of Project.Documents and Project.DocumentIds: up to 3.3x and 3.2x faster, respectively. Fixes Experiment with targetted perf improvements around Project.Documents. #48682.

Before

Method DocumentCount Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Project.DocumentIds 0 24.42 ns 0.194 ns 0.181 ns - - - -
Project.Documents 0 78.83 ns 0.503 ns 0.446 ns 0.0772 - - 128 B
Project.DocumentIds 100 8,172.07 ns 8.030 ns 7.118 ns 0.0305 - - 72 B
Project.Documents 100 24,651.12 ns 98.105 ns 91.767 ns 0.0916 - - 201 B
Project.DocumentIds 10000 859,548.56 ns 399.791 ns 373.965 ns - - - 80 B
Project.Documents 10000 4,203,826.43 ns 19,733.173 ns 18,458.422 ns - - - 256 B

After

Method DocumentCount Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Project.DocumentIds 0 84.48 ns 0.302 ns 0.283 ns 0.0434 - - 72 B
Project.Documents 0 151.68 ns 1.314 ns 1.230 ns 0.1206 - - 201 B
Project.DocumentIds 100 2,755.75 ns 1.143 ns 1.013 ns 0.0420 - - 72 B
Project.Documents 100 11,579.30 ns 56.896 ns 53.221 ns 0.1068 - - 201 B
Project.DocumentIds 10000 265,656.30 ns 143.748 ns 134.462 ns - - - 76 B
Project.Documents 10000 1,274,979.73 ns 12,721.797 ns 11,899.977 ns - - - 208 B

The cost of updating a document however is now very high and might need mitigation (*).

Before

Method DocumentCount Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Solution.WithDocumentText 100 12,661.10 ns 252.509 ns 319.343 ns 0.8240 0.2594 - 5172 B
Solution.WithDocumentText 10000 13,574.50 ns 270.030 ns 500.518 ns 0.8545 0.2289 - 5425 B

After (with SetItem re-hashing)

Method DocumentCount Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Solution.WithDocumentText 100 31,357.70 ns 620.325 ns 849.107 ns 1.8921 0.4883 - 11879 B
Solution.WithDocumentText 10000 1,318,279.70 ns 10,640.558 ns 9,953.185 ns 78.1250 23.4375 - 298152 B

After (with SetItem copying)

Method DocumentCount Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Solution.WithDocumentText 100 17,312.28 ns 344.128 ns 900.524 ns 1.4343 0.3662 - 9121 B
Solution.WithDocumentText 10000 86,724.05 ns 1,303.059 ns 1,155.128 ns 78.6133 24.0479 - 297676 B

Measured on

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen Threadripper 3960X, 1 CPU, 48 logical and 24 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT

Implementation

Uses ImmutableSegmentedDictionary as a backing storage for a map of document id to the corresponding document state.

This data structure allows us to implement very fast lookup and ordered enumeration. The cost is shifted to mutation operations (an addition, removal and update of documents). In practice, this cost of additional/removal shouldn't be high since additions and removals of documents is batched.

(*) Additional potential memory optimizations of updating operations:

  • the segments that do not change when an item is added/removed/removed can be reused.
  • store values in a separate array, so that the overhead of copying the array on update is smaller
  • implement heuristic for re-hashing vs copying in Add, Remove and SetItem operations (e.g. only rehash when the fill of the table is above some percentage threshold)

Relies on ImmutableSegmentedDictionary storing entries in the order they were added until an item is removed. Documents this behavior as a guaranteed property of the data structure. The ordering can be restored after removal of an item if Compact is called.

Adds ImmutableSegmentedDictionary.GetAddedEntry(int index) method that allows to index into the underlying entries array. This method throws if an item has been removed from the dictionary (without Compact called). This API allows us to implement Project.GetXxxDocumentIds methods via a direct index into the underlying entries array of the ImmutableSegmentedDictionary and completely avoid building and maintaining a separate list of DocumentIds.

@tmat tmat requested a review from a team as a code owner January 7, 2021 01:04
@tmat
Copy link
Member Author

tmat commented Jan 7, 2021

@jasonmalinowski PTAL

@tmat tmat marked this pull request as draft January 8, 2021 00:59
@tmat tmat force-pushed the SGAPIs branch 2 times, most recently from 17ddaa7 to ab1d621 Compare January 8, 2021 23:01
@tmat tmat changed the title Source generated document project APIs Document state storage Jan 8, 2021
@tmat tmat marked this pull request as ready for review January 8, 2021 23:09
@CyrusNajmabadi
Copy link
Member

@jasonmalinowski any concern about:

Instead, use the order the documents were added to the project.

We dont' have anything happening ambiently that would cause that order to change, right? i.e. when typing, it's not lke we remove projects/re-add to the end?

@CyrusNajmabadi
Copy link
Member

Add internal API Solution.GetDocumentAsync that retrieves a source document, regardless of whether it's source generated or not.

I don't think i like that. Primarily because wouldn't that really be confusing when reviewing code? In some places, GetDocument will return only non-generated files, but in others it can return generated files? Honestly, i'd prefer we be consistent with the (wordy) GetAllRegularAndSourceGeneratedDocumentsAsync function as it's totally evident and helps bifurcate the API n clear lines.

@tmat
Copy link
Member Author

tmat commented Jan 12, 2021

@CyrusNajmabadi
How about Solution.GetDocumentAsync(DocumentId id, bool includeSourceGenerated = false) ?

Copy link
Member

@jasonmalinowski jasonmalinowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like it. A few more detail thoughts;

  • I like that the enumerator of states is always going to give you the ordered result, in that it prevents some bugs that otherwise might happen by accident. That said, I'm not sure about the perf impacts.
  • I do worry about the use of ImmutableArray<> for storing the IDs in order; I believe the use of ImmutableList was because the array can get large (easily multiple hundreds of items) and we might see overhead during some loads. I see @CyrusNajmabadi experimented with changing this in Experiment with targetted perf improvements around Project.Documents. #48682, so I'll defer to him and @sharwell as far as what they're seeing for the tradeoff and go with that. So treat this signoff as contingent on them also signing off.
  • I would like some of the methods on TextDocumentStates renamed; generally rename "value" to "state" since that's what it's returning.

@@ -821,7 +821,7 @@ public async Task TestAnalyzerConfigFile_Properties()

Assert.Equal(1, project.Documents.Count());
Assert.Equal(1, project.AnalyzerConfigDocuments.Count());
Assert.Equal(1, project.State.AnalyzerConfigDocumentIds.Count());
Assert.Equal(1, project.State.AnalyzerConfigDocumentStates.Count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even sure why we have this assert.

=> Map.TryGetValue(documentId, out var state) ? state : null;

public TState GetRequiredValue(DocumentId documentId)
=> Map.TryGetValue(documentId, out var state) ? state : throw ExceptionUtilities.Unreachable;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'd throw unreachable here; that's usually used if we have code we think we can't hit but compiler flow analysis doesn't know better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you throw? It's the same InvalidOperationException as Contract.ThrowIfNull and friends. ExceptionUtilities.Unreachable means we don't think it's gonna be reached, but it's not something that's supposed to leak out of public API as an expected outcome of that API.

@tmat
Copy link
Member Author

tmat commented Jan 22, 2021

@jasonmalinowski Re perf: Planning to update the PR to use segmented dictionary/array Sam is working on.

@tmat
Copy link
Member Author

tmat commented Feb 15, 2021

@CyrusNajmabadi @sharwell @jasonmalinowski PTAL - implemented optimizations on top of previous changes.

@tmat tmat force-pushed the SGAPIs branch 2 times, most recently from 4196855 to 62a18c6 Compare February 16, 2021 18:27
Base automatically changed from master to main March 3, 2021 23:53
@tmat tmat closed this Mar 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Experiment with targetted perf improvements around Project.Documents.
5 participants