Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add contraints and embedding directives #3405

Merged
merged 34 commits into from
Feb 7, 2025

Conversation

fredcarle
Copy link
Collaborator

@fredcarle fredcarle commented Jan 25, 2025

Relevant issue(s)

Resolves #3350
Resolves #3351

Description

Sorry for having 3 technically separate features as part of 1 PR. The reason for this is that I started working on embeddings and the size contraint was initially part of the embedding. Since we discussed that it could be applied to all array fields (and later even to String and Blob), I extracted it into a contraints directive that has a size parameter (more contraints can be added in the future). Furthermore, embeddings returned from ML models are arrays of float32. This caused some precision issues because we only supported float64. When saving the float32 array, querying it would return an float64 array with slight precision issues. I decided to add the float32 type.

You can review the first commit for contraint and embedding related code and the second commit for the float related changes. Some float stuff might have leaked in the first commit. Sorry for this. I tried hard to separate the float32 related changes.

Note that the gql.Float type is now Float64 internally.

type User {
  points: Float
}

is the same as

type User {
  points: Float64
}

The embedding generation relies on a 3rd party package called chromem-go to call the model provider API. As long as one of the supported provider API is configured and accessible, the embeddings will be generated when adding new documents. I've added a step in the test workflow that will run the embedding specific tests on linux only (this is because installation on windows and mac is less straight forward) using Ollama (because it runs locally).

The call to the API has to be done synchronously otherwise the docID/CID won't be representative of the contents. The only alternative would be for the system to automatically update the document when returning from the API call but I see that as a inferior option as it hides the update step from the user. It could also make doc anchoring more complicated as the user would have to remember to wait on the doc update before anchoring the doc at the right CID.

We could avoid having embedding generation support and let the users do that call themselves and store the embedding vectors directly. However, having it as a feature allows us to support RAG vector search which would let users get started with AI with very little work. This seems to be something our partners are looking forward to.

I don't see the 3rd party API call inline with a mutation as a problem since this is something that has to be configured by users and users will expect the mutation calls to take more time as a result.

If you're interested in running it locally, install Ollama and define a schema like so

type User {
    name: String
    about: String
    name_v: [Float32!] @constraints(size: 768) @embedding(fields: ["name", "about"], provider: "ollama", model: "nomic-embed-text",  url: "http://localhost:11434/api") // contraint is optional and localhost:11434 is the default port for Ollama
}

Next steps:

  • Support templates for the content sent to the model.
  • Add the _similarity operation to calculate the cosine similarity between two arrays.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

make test and manual testing

Specify the platform(s) on which this was tested:

  • MacOS

@fredcarle fredcarle added feature New feature or request area/query Related to the query component area/schema Related to the schema system labels Jan 25, 2025
@fredcarle fredcarle added this to the DefraDB v0.16 milestone Jan 25, 2025
@fredcarle fredcarle requested a review from a team January 25, 2025 00:35
@fredcarle fredcarle self-assigned this Jan 25, 2025
@fredcarle fredcarle changed the title feat: Add contraints and embedding directives and Float32 type support feat: Add contraints and embedding directives Jan 25, 2025
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the calls out to 3rd party APIs needs expanded documentation and discussion with the team. I have reviewed very little so far but will continue whilst that issue is being resolved.

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given the production code a first-pass review, and will shortly look through the tests. The code looks good, most of the comments are about moving stuff around, or questions/thoughts RE the feature design.

thought: It looks very tedious and error prone to introduce new types atm, although numerics are probably most painful. Perhaps we should review certain aspects in the future to improve this.

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the tests, thanks for adding them, I think there are a couple of important test gaps though, please add tests for:

  1. Patching a size constraint, including removing it
  2. Patching an embedding property, and it's children:
    2.1 Adding an embedding description to a field that didnt have one before
    2.2 Mutating each of the embedding properties on a field that declared one on create
    2.3 Removing an embedding property from a field that had one on create
  3. Including a size constraint in a View declaration, if you don't/haven't blocked it off, please include a test that shows what happens if that constraint is violated
  4. Including an embedding property in a View declaration (I assume we want this to error for now).

jsimnz
jsimnz previously requested changes Jan 29, 2025
Copy link
Member

@jsimnz jsimnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking this as "request for changes" (you can dismiss at any time), even tho there aren't any explcit todos.

I have mainly focused on the definition/description code along with the collection/doc updating. Speed ran through the encoding section and the gql parsing.

Haven't reviewed the tests.

Main question/discussion points are "when" you are generating the embedding in the call path, and the location of the EmbeddingDescription.

@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch 2 times, most recently from ae1d158 to 7f1cb9b Compare February 4, 2025 23:56
Copy link

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 83.48624% with 216 lines in your changes missing coverage. Please review.

Project coverage is 78.60%. Comparing base (34eab3f) to head (58ca89e).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
client/normal_new.go 55.45% 45 Missing ⚠️
internal/db/embedding.go 66.29% 24 Missing and 6 partials ⚠️
internal/planner/sum.go 36.84% 24 Missing ⚠️
client/document.go 69.01% 17 Missing and 5 partials ⚠️
internal/core/encoding.go 47.62% 20 Missing and 2 partials ⚠️
internal/request/graphql/schema/collection.go 81.82% 10 Missing and 4 partials ⚠️
internal/encoding/float.go 84.52% 10 Missing and 3 partials ⚠️
client/schema_field_description.go 62.07% 11 Missing ⚠️
internal/db/errors.go 0.00% 8 Missing ⚠️
internal/db/fetcher/indexer_matchers.go 58.82% 7 Missing ⚠️
... and 6 more
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3405      +/-   ##
===========================================
+ Coverage    78.34%   78.60%   +0.26%     
===========================================
  Files          395      396       +1     
  Lines        36286    37333    +1047     
===========================================
+ Hits         28426    29342     +916     
- Misses        6194     6315     +121     
- Partials      1666     1676      +10     
Flag Coverage Δ
all-tests 78.60% <83.49%> (+0.26%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
client/collection_description.go 81.93% <100.00%> (+1.66%) ⬆️
client/collection_field_description.go 71.43% <100.00%> (+1.43%) ⬆️
client/ctype.go 93.55% <100.00%> (ø)
client/definitions.go 75.88% <100.00%> (+0.24%) ⬆️
client/errors.go 67.05% <100.00%> (+6.23%) ⬆️
client/normal_array.go 98.11% <100.00%> (+0.11%) ⬆️
client/normal_array_of_nillables.go 97.03% <100.00%> (+0.19%) ⬆️
client/normal_nil.go 95.12% <100.00%> (+0.84%) ⬆️
client/normal_nillable_array.go 98.80% <100.00%> (+0.09%) ⬆️
client/normal_nillable_array_of_nillables.go 100.00% <100.00%> (ø)
... and 34 more

... and 16 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34eab3f...58ca89e. Read the comment docs.

@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch 7 times, most recently from 117e1fd to e54b83e Compare February 5, 2025 17:31
@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch from 9a2a8d4 to 58ca89e Compare February 7, 2025 19:34
Copy link
Member

@shahzadlone shahzadlone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the requested suggestions :) great work overall!

LGTM

@fredcarle fredcarle merged commit 730eb15 into sourcenetwork:develop Feb 7, 2025
43 of 44 checks passed
@fredcarle fredcarle deleted the fredcarle/feat/embeddings branch February 7, 2025 20:35
@islamaliev
Copy link
Contributor

Bug bash result:
Launched ollama locally, ran different queries (create, update) embeddings work as expected

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/query Related to the query component area/schema Related to the schema system feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add size constraint to array types Create embeddings based on document fields
5 participants