Feature/kypher/embeddings #685

Merged: 104 commits into dev, Nov 18, 2022

chalypso
Collaborator

Merge Kypher-V and other Kypher updates back into dev.

chalypso added 30 commits April 22, 2022 14:07
This version uses an HDF5-based vector store implemented by vecstore.py.
- new master store/column-based store scheme
- allow multiple different storage options at the same time
- support vector normalization
- support custom query translation using new function API and vector functions
- first reasonably complete version that supports everything we need except NN-indexing
New completely customizable SQL function translation and call API.
Initial revision of vector function definitions that support
call-context-specific translations and optimizations for vector
access, dot product and cosine similarity.
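
As a rough illustration of the vector functions mentioned above, the following Python sketch shows why normalizing vectors when they are stored lets cosine similarity reduce to a plain dot product. The function names here are hypothetical and are not the Kypher-V API; the actual translation happens at the SQL level and is not shown in this PR description.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return list(v) if norm == 0.0 else [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity computed from raw (unnormalized) vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 0.0 if denom == 0.0 else dot(a, b) / denom

a, b = [3.0, 4.0], [4.0, 3.0]
# For pre-normalized vectors the dot product already is the cosine similarity.
assert math.isclose(cosine_similarity(a, b), dot(normalize(a), normalize(b)))
```

Normalizing once at import time trades a small up-front cost for a cheaper similarity computation on every query.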
Currently we translate in a hybrid mode, using the new API where
available and falling back on the old one otherwise. Future versions
of the query translator will completely replace the old API with the new.
- removed obsolete experimental code
- also merged in 'kgtk_type' function which we recently added to 'dev',
  since it will need to be accounted for in the refactoring of
  SQL function definitions.
This lets us know what exactly is happening in --debug output.
- added classes to support definition of built-in and aggregate functions
- forward-declared all SQL functions to their respective modules to minimize
  unnecessary imports
- refactored core functions to new SQL function API
- added definitions for built-in aggregate functions
- handle translation of special functions such as 'cast', etc.
- refactored KGTK literal functions to new SQL function API
- refactored KGTK math functions to new SQL function API
- removed all old-style user function definitions and moved them to
  their new homes using the new SQL function API
- we kept the old-style registration function for backwards
  compatibility and added new low-level load functions that can be
  called by the new API
chalypso and others added 29 commits September 22, 2022 14:34
- found and fixed a long-standing bug that caused different results
  when we reused an NN search index
- made NN search index size-limited based on index RAM spec
merge dev into feature/kypher/embeddings
Also removed now obsolete LRU-cache directives which don't seem to be
necessary anymore for good performance and just consume RAM.
This greatly improves Faiss parallelism when nearest neighbors are
searched.  Unfortunately, we currently need to manually supply or
compute a proper termination bound, otherwise the last batch won't be
computed correctly.  Future versions should address this through some
automated process.

Also improved translation and handling/documentation of default values
for various control parameters.
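
The commit message above does not say how the termination bound is used internally. Purely as a generic illustration of why a batched computation needs to know where to stop, here is a minimal Python sketch in which an explicit `total` clips the final batch. The name `batch_ranges` and its parameters are assumptions for illustration, not Kypher-V or Faiss code.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches.

    'total' plays the role of a termination bound: it clips the last
    batch so it is neither skipped nor extended past the end of the data.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# 10 items in batches of 4: the last batch holds only 2 items.
assert list(batch_ranges(10, 4)) == [(0, 4), (4, 8), (8, 10)]
```

Without a known `total`, a loop like this cannot tell a short final batch from the end of the input, which is the kind of problem a manually supplied bound works around.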
…exes

Otherwise appropriate vector indexes won't be built when necessary,
e.g., when a nearest neighbor index is requested as part of an index
redefinition.
- use realpath on DB file so we can find associated derived
  files such as Faiss indexes from links
- generalized vector import format guessing and parsing to
  support base64 and the generalized formats documented in
  the manuals
- implemented random sampling for inline stores
- use random sampling for single-batch and multi-batch clustering
  on inline stores
- adjusted DEFAULT_CENTROID_SIZE to 4096 so we have less of
  a chance to get very large clusters
- generalized vector function translation support
- implemented new vector functions for arithmetic, text and
  base64-based input and output
- kvec_topk_cos_sim now inherits default nprobe from index spec
- kvec_topk_cos_sim now raises an error if no ANNS index is defined
- various other fixes and robustness improvements
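
The import format guessing and base64 support listed above could look roughly like the following Python sketch. It assumes vectors are serialized as little-endian float32 values and that text vectors are comma- or whitespace-separated numbers; the formats actually accepted are the ones documented in the manuals, and the function names below are hypothetical.

```python
import base64
import struct

def encode_vector(values):
    """Pack floats as little-endian float32 and base64-encode them."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")

def decode_vector(text):
    """Inverse of encode_vector: base64 text -> list of floats."""
    raw = base64.b64decode(text, validate=True)
    if len(raw) % 4 != 0:
        raise ValueError("payload is not a whole number of float32 values")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

def parse_vector(text):
    """Guess the format: try separated numbers first, then base64."""
    try:
        return [float(tok) for tok in text.replace(",", " ").split()]
    except ValueError:
        return decode_vector(text)
```

Trying the numeric text format first is a simplification: a short string of digits is also valid base64, so a real importer would need a more careful heuristic or an explicit format option.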
Files needed for examples in Kypher-V manual.
Documented --append data option, auxiliary graph caches, --dont-optimize,
kgtk_values and Kypher-V.
Use implicit anchors based on section headers where possible
for cross-references.
Initially needed for join-controller insertion for Kypher-V full
similarity joins.
'kvec_topk_cosine_similarity' needs a join controller virtual edge to
compute full similarity joins via dynamic scaling.  This change adds
the edge to the query if necessary and then retranslates.
Example dataset needed for Kypher-V documentation.
@chalypso chalypso merged commit 0d5ec9b into dev Nov 18, 2022