Feature/kypher/embeddings #685
Merged
This version uses an HDF5-based vector store implemented in vecstore.py.
- new master store/column-based store scheme
- allow multiple different storage options at the same time
- support vector normalization
- support custom query translation using new function API and vector functions
- first reasonably complete version that supports everything we need except NN-indexing
New completely customizable SQL function translation and call API.
Initial revision of vector function definitions that support call-context-specific translations and optimizations for vector access, dot product and cosine similarity.
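As a rough illustration of why vector normalization matters for these functions: once vectors are stored with unit length, cosine similarity reduces to a plain dot product, which is the kind of optimization a call-context-specific translation can exploit. The sketch below is plain Python with hypothetical helper names (`dot`, `normalize`, `cosine_similarity`); it is not the code or the SQL translation from this PR.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from raw (unnormalized) vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    # Both expressions compute the same quantity; the second needs
    # no norms at query time if the vectors were normalized on import.
    print(cosine_similarity(a, b))
    print(dot(normalize(a), normalize(b)))
```

Normalizing once at import time trades a small amount of preprocessing for cheaper similarity computations at query time.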
Currently we translate in a hybrid mode, using the new API where available and falling back on the old one otherwise. Future versions of the query translator will replace the old API entirely.
- removed obsolete experimental code
- also merged in 'kgtk_type' function which we recently added to 'dev', since it will need to be accounted for in the refactoring of SQL function definitions.
This lets us know what exactly is happening in --debug output.
- added classes to support definition of built-in and aggregate functions
- forward-declared all SQL functions to their respective modules to minimize unnecessary imports
- refactored core functions to new SQL function API
- added definitions for built-in aggregate functions
- handle translation of special functions such as 'cast', etc.
- refactored KGTK literal functions to new SQL function API
- refactored KGTK math functions to new SQL function API
- removed all old-style user function definitions and moved them to their new homes using the new SQL function API
- we kept the old-style registration function for backwards compatibility and added new low-level load functions that can be called by the new API
- found and fixed a long-standing bug that caused different results when we reused an NN search index
- made NN search index size-limited based on index RAM spec
merge dev into feature/kypher/embeddings
Also removed now obsolete LRU-cache directives which don't seem to be necessary anymore for good performance and just consume RAM.
This greatly improves Faiss parallelism when nearest neighbors are searched. Unfortunately, we currently need to manually supply or compute a proper termination bound, otherwise the last batch won't be computed correctly. Future versions should address this through some automated process. Also improved translation and handling/documentation of default values for various control parameters.
…exes Otherwise appropriate vector indexes won't be built when necessary, e.g., when a nearest neighbor index is requested as part of an index redefinition.
…i-i2/kgtk into feature/kypher/embeddings
- use realpath on DB file so we can find associated derived files such as Faiss indexes from links
- generalized vector import format guessing and parsing to support base64 and the generalized formats documented in the manuals
- implemented random sampling for inline stores
- use random sampling for single-batch and multi-batch clustering on inline stores
- adjusted DEFAULT_CENTROID_SIZE to 4096 so we have less of a chance to get very large clusters
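To make the idea of format guessing concrete, here is a minimal sketch of a parser that accepts either a text-encoded or a base64-encoded vector. The function name `parse_vector`, the little-endian float32 layout for base64 payloads, and the comma/whitespace separators are all assumptions made for illustration; the formats Kypher-V actually accepts are the ones documented in the manuals.

```python
import base64
import binascii
import struct
from typing import List


def parse_vector(text: str) -> List[float]:
    """Guess the encoding of a serialized vector and decode it.

    Assumed formats (hypothetical, for illustration only):
      * numbers separated by commas and/or whitespace
      * base64 of packed little-endian float32 values
    Note: a base64 payload that happens to look like a number
    (e.g. "1234") would be read as text by this simple heuristic.
    """
    text = text.strip()
    if not text:
        raise ValueError("empty vector string")

    # First guess: plain text numbers.
    try:
        return [float(tok) for tok in text.replace(",", " ").split()]
    except ValueError:
        pass

    # Second guess: base64-encoded float32 array.
    try:
        raw = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("unrecognized vector format") from exc
    if len(raw) == 0 or len(raw) % 4 != 0:
        raise ValueError("base64 payload is not a float32 array")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


if __name__ == "__main__":
    encoded = base64.b64encode(struct.pack("<3f", 1.0, -2.5, 0.25)).decode("ascii")
    print(parse_vector(encoded))
    print(parse_vector("1.0, 2.0 3"))
```

Trying the cheapest, least ambiguous format first keeps the guesser simple; a production importer would more likely let the user state the format explicitly when guessing is unreliable.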
- generalized vector function translation support
- implemented new vector functions for arithmetic, text and base64-based input and output
- kvec_topk_cos_sim now inherits default nprobe from index spec
- kvec_topk_cos_sim now raises an error if no ANNS index is defined
- various other fixes and robustness improvements
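For readers unfamiliar with top-k similarity search, the following sketch shows the exact, exhaustive computation that an ANNS index approximates. It is deliberately not how `kvec_topk_cos_sim` works: that function relies on an ANNS index and raises an error without one, and parameters such as `nprobe` have no counterpart in a brute-force scan. The function name `topk_cos_sim` and the dictionary-based store are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def topk_cos_sim(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k ids most similar to `query`, best match first.

    Exhaustive scan over all vectors: exact but O(n) per query,
    which is what an approximate index is meant to avoid.
    """
    def norm(v: Sequence[float]) -> float:
        return math.sqrt(sum(x * x for x in v))

    query_norm = norm(query)
    if query_norm == 0.0:
        raise ValueError("query must not be the zero vector")

    scored = []
    for vec_id, vec in vectors.items():
        vec_norm = norm(vec)
        if vec_norm == 0.0:
            continue  # similarity is undefined for zero vectors; skip them
        sim = sum(q * x for q, x in zip(query, vec)) / (query_norm * vec_norm)
        scored.append((sim, vec_id))

    # nlargest sorts by similarity, highest first.
    return [(vec_id, sim) for sim, vec_id in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for vec_id, sim in topk_cos_sim([1.0, 0.0], store, k=2):
        print(vec_id, round(sim, 4))
```

An approximate index restricts the scan to a few candidate clusters per query, trading some recall for a large reduction in the number of vectors compared.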
Files needed for examples in Kypher-V manual.
Documented --append data option, auxiliary graph caches, --dont-optimize, kgtk_values and Kypher-V.
Use implicit anchors based on section headers where possible for cross-references.
Initially needed for join-controller insertion for Kypher-V full similarity joins.
'kvec_topk_cosine_similarity' needs a join controller virtual edge to compute full similarity joins via dynamic scaling. This change adds that edge to the query if necessary and then retranslates.
Example dataset needed for Kypher-V documentation.
Merge Kypher-V and other Kypher updates back into dev.