
feat: Add support for branchable collections #3216

Merged
merged 13 commits into from
Nov 8, 2024

Conversation

AndrewSisley
Contributor

@AndrewSisley AndrewSisley commented Nov 7, 2024

Relevant issue(s)

Resolves #3038

Description

Adds support for branchable collections.

Does not add support for syncing the collection-level commits via the P2P system (broken out into follow-up work: lots of people are working in that space, significant changes are required, and I'm not as familiar with that part of the code, so it will take longer). This does mean that the somewhat surprising (to me at least) implementation of Collection.Merge is currently untested.

Commits are queryable via the commits GQL queries.

Time travel queries do not work due to an existing bug: #3214 - once that bug is fixed, I expect there to be some more work (definitely testing) to get them working with branchable collections.

This is a breaking change due to the moving/namespacing of the existing document headstore keys.
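For readers unfamiliar with the headstore, the namespacing can be pictured with a minimal, self-contained sketch. The key layouts and function names below are illustrative assumptions, not DefraDB's actual key format; the point is only that document heads move under their own namespace so that collection-level heads can be stored alongside them:

```go
package main

import "fmt"

// Hypothetical key layouts. Before this change, document head keys sat
// directly under the headstore root; afterwards they are namespaced.
func oldDocHeadKey(docID, cid string) string {
	return fmt.Sprintf("/heads/%s/%s", docID, cid)
}

// Document heads move under a "/d" namespace...
func newDocHeadKey(docID, cid string) string {
	return fmt.Sprintf("/heads/d/%s/%s", docID, cid)
}

// ...making room for collection heads under a sibling "/c" namespace.
func newCollectionHeadKey(colRootID uint32, cid string) string {
	return fmt.Sprintf("/heads/c/%d/%s", colRootID, cid)
}

func main() {
	fmt.Println(oldDocHeadKey("bae-123", "cid-a"))
	fmt.Println(newDocHeadKey("bae-123", "cid-a"))
	fmt.Println(newCollectionHeadKey(1, "cid-b"))
}
```

Under this kind of layout, data written at the old keys is not found under the new ones, which is why the move is a breaking change.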

@AndrewSisley AndrewSisley added feature New feature or request area/crdt Related to the (Merkle) CRDT system area/collections Related to the collections system labels Nov 7, 2024
@AndrewSisley AndrewSisley added this to the DefraDB v0.15 milestone Nov 7, 2024
@AndrewSisley AndrewSisley requested a review from a team November 7, 2024 02:05
@AndrewSisley AndrewSisley self-assigned this Nov 7, 2024
@AndrewSisley AndrewSisley changed the title feature: Add support for branchable collections feat: Add support for branchable collections Nov 7, 2024

codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 83.67347% with 48 lines in your changes missing coverage. Please review.

Project coverage is 77.45%. Comparing base (2f53878) to head (3d2ee10).
Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
internal/keys/headstore_collection.go 57.41% 20 Missing and 3 partials ⚠️
internal/core/crdt/ipld_union.go 72.22% 5 Missing ⚠️
internal/keys/headstore_doc.go 77.78% 3 Missing and 1 partial ⚠️
internal/db/collection.go 85.00% 2 Missing and 1 partial ⚠️
internal/db/collection_delete.go 85.00% 2 Missing and 1 partial ⚠️
internal/db/fetcher/dag.go 86.36% 2 Missing and 1 partial ⚠️
internal/keys/headstore.go 75.00% 2 Missing ⚠️
internal/merkle/crdt/collection.go 85.71% 2 Missing ⚠️
internal/request/graphql/schema/collection.go 83.33% 1 Missing and 1 partial ⚠️
internal/planner/commit.go 96.30% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3216      +/-   ##
===========================================
+ Coverage    77.38%   77.45%   +0.06%     
===========================================
  Files          378      382       +4     
  Lines        34941    35178     +237     
===========================================
+ Hits         27039    27244     +205     
- Misses        6274     6304      +30     
- Partials      1628     1630       +2     
Flag Coverage Δ
all-tests 77.45% <83.67%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
client/collection_description.go 80.26% <100.00%> (+0.26%) ⬆️
internal/core/block/block.go 86.73% <100.00%> (+0.07%) ⬆️
internal/core/crdt/collection.go 100.00% <100.00%> (ø)
internal/core/crdt/composite.go 71.08% <100.00%> (ø)
internal/db/definition_validation.go 94.77% <100.00%> (+0.09%) ⬆️
internal/db/errors.go 61.01% <100.00%> (+0.59%) ⬆️
internal/db/fetcher/versioned.go 64.92% <100.00%> (-6.28%) ⬇️
internal/db/iterator.go 78.57% <100.00%> (ø)
internal/db/merge.go 80.93% <100.00%> (+4.09%) ⬆️
internal/keys/datastore_doc.go 82.18% <100.00%> (ø)
... and 17 more

... and 8 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -88,6 +88,21 @@ type CollectionDescription struct {
// At the moment this can only be set to `false` if this collection sources its data from
// another collection/query (is a View).
IsMaterialized bool

// IsMaterialized defines whether the history of this collection is tracked as a single,
Member

typo: says IsMaterialized

Contributor Author

@AndrewSisley AndrewSisley Nov 7, 2024

"typo" 😁 Cheers, fixing now

  • fix typo

Contributor

@islamaliev islamaliev left a comment

LGTM! Good work, Andy.

)

// Collection is a simple CRDT type that tracks changes to the contents of a
// collection in a similar way to a document composite commit.
Contributor

suggestion: this documentation doesn't make it clear how this differs from composite. It would be nice to add some more details.

Contributor Author

@AndrewSisley AndrewSisley Nov 7, 2024

Okay - cheers, I'll see what I can do :)

  • Expand/clarify doc

@@ -31,13 +31,18 @@ type HeadFetcher struct {
func (hf *HeadFetcher) Start(
ctx context.Context,
txn datastore.Txn,
prefix keys.HeadStoreKey,
prefix immutable.Option[keys.HeadstoreKey],
Contributor

nitpick: I don't think making prefix optional makes sense here, as an empty prefix also has a meaning, whereas None means there is no value. I would say it's reasonable to always pass some value for prefix; it keeps the API cleaner.

Contributor Author

@AndrewSisley AndrewSisley Nov 7, 2024

None of the implementations of HeadstoreKey would result in an empty prefix that allows scanning across the entire headstore (required in order to return collection and doc commits).

We could pass nil and check for that, or we could implement a Nil/All HeadstoreKey type (very weird), but both of those seem a lot less pleasant than using option.

It does sound like why this param is optional should be documented on Start though, so I'll add something there.

  • Doc prefix param
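The semantic distinction being debated here - None meaning "no prefix at all, scan the whole headstore" versus an empty-but-present prefix - can be sketched with a minimal Option type. This is an illustration of the semantics only; the Option type and fetchHeads function below are simplified stand-ins, not DefraDB's immutable package or HeadFetcher:

```go
package main

import (
	"fmt"
	"strings"
)

// Option is a minimal stand-in for immutable.Option.
type Option[T any] struct {
	value T
	some  bool
}

func Some[T any](v T) Option[T] { return Option[T]{value: v, some: true} }
func None[T any]() Option[T]    { return Option[T]{} }

// fetchHeads filters keys by prefix. A None prefix scans the entire
// headstore, returning both document and collection heads - there is no
// HeadstoreKey value that could express "match everything".
func fetchHeads(store []string, prefix Option[string]) []string {
	if !prefix.some {
		return store // None: scan everything
	}
	var out []string
	for _, k := range store {
		if strings.HasPrefix(k, prefix.value) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	store := []string{"/d/doc1/cidA", "/d/doc2/cidB", "/c/1/cidC"}
	fmt.Println(len(fetchHeads(store, None[string]()))) // all heads
	fmt.Println(len(fetchHeads(store, Some("/d/doc1")))) // one doc's heads
}
```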

Collaborator

If None, the entire headstore will be scanned - for example, in order to fetch document and collection heads.

Is that something we even want to allow?

Contributor Author

Yes, most of the tests in this PR rely on that, and it is the default behaviour of commits queries.

Comment on lines +726 to +734
updateEvent := event.Update{
Cid: link.Cid,
SchemaRoot: c.Schema().Root,
Block: headNode,
}

txn.OnSuccess(func() {
c.db.events.Publish(event.NewMessage(event.UpdateName, updateEvent))
})
Contributor

question: do we really need to send another Update event? Can't the receiver of the event simply check on its side if the collection is branchable and handle it accordingly?

suggestion: if we do need to send another event, then I would say it serves a different purpose, and in that case we should introduce another event type.

Contributor Author

@AndrewSisley AndrewSisley Nov 7, 2024

Can't the receiver of the event simply check on its side if the collection is branchable and handle it accordingly?

I hadn't considered that, I'll think about that and get back to you :)

  • Consider event stuff

if we do need to send another event, then I would say it serves a different purpose, and in that case we should introduce another event type.

I disagree with this: it is an update event, an update of a collection. It needs to be handled for exactly the same reasons that a (doc) update event is currently handled, and in pretty much the same way (some rework required, as P2P is very focused on docIDs etc. at the moment, as noted in the PR description).

Contributor Author

@AndrewSisley AndrewSisley Nov 7, 2024

I hadn't considered that, I'll think about that and get back to you

I think I prefer the extra events. It feels more correct to send independent updates for the doc and the collection, especially so as the event bus is public and can be subscribed to by users.

Keeping as-is, unless dissuaded by future argument :)

Collaborator

I also prefer the extra event but the reason for why I prefer it is missing from this implementation.

I would expect a CreateMany to generate only one collection block linking to all the docs created (same for updates or mix of creates and updates). This means that we would have one update event for every doc and one update event at the end for the collection.
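The batching shape described above - CreateMany emitting one update event per document plus a single collection-level event whose block links them all - could be sketched roughly as follows. All types and names here are hypothetical, since this batching is not implemented in this PR:

```go
package main

import "fmt"

// Update is a hypothetical stand-in for the update event payload.
type Update struct{ Cid string }

// createMany sketches the proposed batching: each document commit emits
// its own update event, and a single collection block links all of them,
// emitting one final collection-level event.
func createMany(docCids []string, publish func(Update)) {
	var links []string
	for _, cid := range docCids {
		publish(Update{Cid: cid}) // one event per document
		links = append(links, cid)
	}
	colCid := fmt.Sprintf("col(%d links)", len(links))
	publish(Update{Cid: colCid}) // one event for the collection block
}

func main() {
	var events []Update
	createMany([]string{"cidA", "cidB", "cidC"}, func(u Update) {
		events = append(events, u)
	})
	fmt.Println(len(events)) // 3 doc events + 1 collection event
}
```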

Contributor Author

I would expect a CreateMany to generate only one collection block linking to all the docs created (same for updates or mix of creates and updates). This means that we would have one update event for every doc and one update event at the end for the collection.

I was planning on building that here, but then I realised it is not specific to collection-level commits. The same system can (should) be applied to all crdt/blocks.

Collaborator

The same system can (should) be applied to all crdt/blocks.

Can you expand on this? Why would that apply to composites or field level CRDTs?

Contributor Author

Just as multiple collection blocks can be created in a single transaction, so too can multiple composite and field level blocks. Grouping any of these together into a single block should work consistently regardless of type.

E.g. if in a single transaction Age on Document1 in branchable collection 1 is updated twice, the same mechanic for rolling the would-be-two collection level commits into a single collection commit should also apply to the field and composite level commits.

There is nothing special about collection-level commits here, although the need for it is possibly higher.

Collaborator

I would say this is definitely not the case for field level CRDTs. They only point to their previous head (they don't have links to other items), plus they don't cause network updates on their own. It may apply to composites for multiple updates to the same document in one transaction, but I'm not sure there is much benefit to it, as it would create multiple heads for the updated fields in question only to save a composite block.

Contributor Author

I would say this is definitely not the case for field level CRDTs.

We disagree, and it makes me very glad I did not try and do that work within this PR :)

Comment on lines +166 to +187
if c.def.Description.IsBranchable {
collectionCRDT := merklecrdt.NewMerkleCollection(
txn,
keys.NewCollectionSchemaVersionKey(c.Schema().VersionID, c.ID()),
keys.NewHeadstoreColKey(c.def.Description.RootID),
)

link, headNode, err := collectionCRDT.Save(ctx, []coreblock.DAGLink{{Link: link}})
if err != nil {
return err
}

updateEvent := event.Update{
Cid: link.Cid,
SchemaRoot: c.Schema().Root,
Block: headNode,
}

txn.OnSuccess(func() {
c.db.events.Publish(event.NewMessage(event.UpdateName, updateEvent))
})
}
Contributor

nitpick: extract duplicated code

Results: map[string]any{
"commits": []map[string]any{
{
"cid": testUtils.NewUniqueCid("collection"),
Contributor

suggestion (out of scope): this string always makes me think that it has a meaning.

Contributor Author

Yeah, I understand that - I think when I next use NewUniqueCid I should force myself to use another type (probs int) to see how that feels.


// TODO: This test documents an unimplemented feature. Tracked by:
// https://github.com/sourcenetwork/defradb/issues/3212
func TestQueryCommitsBranchables_AcrossPeerConnection(t *testing.T) {
Contributor

suggestion: for complex tests I would like to see a more descriptive name, so that I don't need to dig into the test's body to understand what it tests.

Contributor Author

@AndrewSisley AndrewSisley Nov 7, 2024

I'll expand this one, probably to TestQueryCommitsBranchables_SyncsAcrossPeerConnection

  • Expand test name, make sure you update the linked ticket too

GQL queries.

Currently this property is immutable and can only be set on collection creation, however
that will change in the future.`,
Collaborator

suggestion: We should probably add the todo comment with the linked issue above so that we don't forget to edit this once we add the feature.

Contributor Author

@AndrewSisley AndrewSisley Nov 8, 2024

Not a problem, although that issue does not exist yet, so I'll be adding another item to sit in our long backlog.

  • Create and link issue for toggling branchableness

Collaborator

I thought it did already exist 🤷‍♂️

Collaborator

@fredcarle fredcarle left a comment

Overall this is well done. I like the tests that document the unimplemented features as well.

Just a few comments to resolve before approval.

// collection in a similar way to a document composite commit, only simpler,
// without the need to track status and a simpler [Merge] function.
type Collection struct {
store datastore.DSReaderWriter
Collaborator

todo: This field can be removed as it's never used.

Contributor Author

@AndrewSisley AndrewSisley Nov 8, 2024

Nice spot :)

  • Rm col.store

// schemaVersionKey is the schema version datastore key at the time of commit.
//
// It can be used to identify the collection datastructure state at the time of commit.
schemaVersionKey keys.CollectionSchemaVersionKey
Collaborator

suggestion: I don't think there is any reason to have to create this schema version key only to extract the schemaVersionID later on. I think going directly with the schemaVersionID would be preferable.

Contributor Author

@AndrewSisley AndrewSisley Nov 8, 2024

Will have a look; not a lot of thought went into this, I copy-pasted from another CRDT and dropped the unused props.

  • rm col.schemaversionKey?

Contributor Author

Leaving as is; changing it would introduce a significant difference between the usage patterns of this type and the other CRDTs. The correctness of moving it also depends on the untested Merge function (interfaced, so it cannot be easily changed).

return nil
}

func (c *Collection) Append() *CollectionDelta {
Collaborator

suggestion: Set might be a better name for this. There is no reason for it to be different than the composite CRDT.

Contributor Author

@AndrewSisley AndrewSisley Nov 8, 2024

Can I change the function name of the composite? Set is a really bad name for a function that doesn't set anything.

  • Rename something

Contributor Author

I renamed composite.Set to Append

Collaborator

Append is equally a bad name for something that doesn't append anything, though. Set does set fields on the delta object.

Contributor Author

It creates a new delta object; it sets nothing on the host. Although Append doesn't really append to the host either. Both names are bad, but at least they are now consistent.

The two functions are constructors, nothing more. They could perhaps be renamed to NewDelta.

Note: I don't see this as something terribly important, and am not very keen on thinking much about it.

Collaborator

Can you leave it as Set for now? It doesn't make sense to me to choose the objectively less accurate name Append when Set was there previously. I have a vague memory of John deciding to use Set because it would have been the correct naming in CRDT lingo. If that's not the case and you really don't want to use Set, NewDelta is definitely the better choice.

Contributor Author

done :)
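For context, the function being named in this thread is essentially a delta constructor: it builds a new delta from the CRDT's current state without mutating the CRDT itself. A simplified sketch (all types and fields below are hypothetical, not DefraDB's actual CRDT types) of why the naming reads oddly either way:

```go
package main

import "fmt"

// CollectionDelta is a simplified stand-in for the CRDT delta payload.
type CollectionDelta struct {
	Priority        uint64
	SchemaVersionID string
}

// Collection is a simplified stand-in for the collection CRDT.
type Collection struct {
	priority        uint64
	schemaVersionID string
}

// Set constructs a new delta from the CRDT's current state. Despite the
// name, it mutates nothing on the host - it is effectively a constructor,
// which is why NewDelta was floated as an alternative name.
func (c *Collection) Set() *CollectionDelta {
	return &CollectionDelta{
		Priority:        c.priority + 1,
		SchemaVersionID: c.schemaVersionID,
	}
}

func main() {
	c := &Collection{priority: 1, schemaVersionID: "v1"}
	d := c.Set()
	fmt.Println(d.Priority, c.priority) // delta advances; host unchanged
}
```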

Collaborator

@fredcarle fredcarle left a comment

LGTM. I think the CRDT code is starting to be much more approachable 👍

@AndrewSisley AndrewSisley merged commit 85bbdc0 into sourcenetwork:develop Nov 8, 2024
42 of 43 checks passed
@AndrewSisley AndrewSisley deleted the 3038-brchbl-cols branch November 8, 2024 20:53
@fredcarle
Collaborator

bug bash: Manually tested running 2 nodes, creating docs on a branchable collection, running standard and time travelling queries on collection commits, and inspecting commit queries. All works as expected.

ChrisBQu pushed a commit to ChrisBQu/defradb that referenced this pull request Feb 21, 2025