Add option for SBT creation to be localized #925

Closed
wants to merge 103 commits into from
103 commits
586395b
added base of localization test with sigleaf
phoenixAja Jan 30, 2020
e79b8ab
added asserts and node mapping for localization unit test
phoenixAja Jan 31, 2020
6a49712
Merge branch 'master' into phoenixaja/index_localization
olgabot Mar 31, 2020
b19fafa
Initial creation of LocalizedSBT class
olgabot Mar 31, 2020
c915e6f
Move LocalizedSBT test to test_sbtmh.py
olgabot Mar 31, 2020
e344cd6
Use signatures directly to add, as one would when creating the index
olgabot Mar 31, 2020
6f08bad
Use SBT super init and add minhash signature specific things
olgabot Mar 31, 2020
0f5049c
Get most similar leaf for new node
olgabot Mar 31, 2020
0ea6da5
Add option to return the actual leaf in "SBT.search"
olgabot Mar 31, 2020
f92a782
Initial attempt at adding new signature and displacing others
olgabot Mar 31, 2020
e8708fe
Whitespace
olgabot Mar 31, 2020
5210a55
Move LocalizedSBT to after SigLeaf definition
olgabot Mar 31, 2020
285eb04
Return tuple of leaf and leaf position in "SBT.find"
olgabot Mar 31, 2020
217c1ba
Update SBT.search tests to include check for leaf position
olgabot Mar 31, 2020
71a6e6e
Only allow for number of children per node to be 2, aka raise an erro…
olgabot Mar 31, 2020
e081de0
Add comment about what return_leaf does
olgabot Mar 31, 2020
79e8da0
Move all new child displacement into try/except
olgabot Mar 31, 2020
fcf7b47
Don't test for different number of children in LocalizedSBT
olgabot Mar 31, 2020
4995c55
Okay getting closer! adding a node is not failing but tests are
olgabot Mar 31, 2020
a760477
Okay well tests are failing and I'm not sure why...
olgabot Mar 31, 2020
3081f4e
Add comment of expected values
olgabot Mar 31, 2020
dc115a1
Add note about how parent node positions vs children is computed beca…
olgabot Mar 31, 2020
b9a69b2
Use sys.float_info.epsilon to set minimum threshold greater than 0
olgabot Mar 31, 2020
dbd234d
Update k-mers and add similarity matrix for reference
olgabot Apr 1, 2020
76f7f07
p --> parent for clarity
olgabot Apr 1, 2020
d730c6b
Fix example + formatting in docstring
olgabot Apr 1, 2020
ffd418c
Swap order of similarity matrices so track_abundance=True goes first
olgabot Apr 1, 2020
feb84c7
Extract adding a new parent into its own method
olgabot Apr 1, 2020
89b6a5d
Add ascii art of trees and check tree construction didn't happen inco…
olgabot Apr 1, 2020
c53dd5c
Capitalization
olgabot Apr 1, 2020
4d3d0a8
track_abundance=True test works so far!
olgabot Apr 1, 2020
52bcd56
Remove try/except because it was catching errors I didn't want it to
olgabot Apr 1, 2020
9a32c0f
Add docstring to SearchMinHashesFindBest.search
olgabot Apr 1, 2020
42d2b6a
omg tests are working!!!
olgabot Apr 1, 2020
42c221f
Add option to SBT.find() about whether to return the integer position…
olgabot Apr 1, 2020
dea3b48
Add --localized option to 'sourmash index'
olgabot Apr 1, 2020
6fc5a1a
Add localized=False argument to create_sbt_index
olgabot Apr 1, 2020
2a5077b
Add --localized test for sourmash index commandline
olgabot Apr 1, 2020
b33b5d2
add localized fixture
olgabot Apr 1, 2020
d499675
Somehow --scaled got lost?
olgabot Apr 1, 2020
4346936
Remove unused code
olgabot Apr 1, 2020
55779bc
Undo whitespace changes
olgabot Apr 6, 2020
6a052d9
Expand docstring for LocalizedSBT
olgabot Apr 6, 2020
e6d1dde
Set localized as default SBT creation
olgabot Apr 6, 2020
dc1d31b
--localized --> --not-localized for 'sourmash index' flags
olgabot Apr 6, 2020
dc31e9c
Change test for index to test --not-localized
olgabot Apr 6, 2020
8c4a71a
Re-add return_pos fixture
olgabot Apr 6, 2020
b255cb6
broke apart localized SBT a tiny bit
phoenixAja Apr 6, 2020
4cbffdf
Update sourmash/sbtmh.py
olgabot Apr 7, 2020
338fbac
Update sourmash/sbtmh.py
olgabot Apr 7, 2020
fe4a956
Update sourmash/sbtmh.py
olgabot Apr 13, 2020
832861c
Update sourmash/sbtmh.py
olgabot Apr 13, 2020
b756008
Update sourmash/sbtmh.py
olgabot Apr 13, 2020
c1bdfa6
Merge branch 'phoenixaja/index_localization' into phoenix/refactor-lo…
olgabot Apr 13, 2020
512ebdf
Merge pull request #6 from czbiohub/phoenix/refactor-localized-sbt
olgabot Apr 13, 2020
6e90a8f
Revert "SBT localization refactor"
olgabot Apr 13, 2020
1c18072
Merge pull request #7 from czbiohub/revert-6-phoenix/refactor-localiz…
olgabot Apr 13, 2020
516da88
Re-do of SBT localization refactor (#9)
olgabot Apr 13, 2020
acef608
Add similarity matrix and ASCII tree drawings of what should happen a…
olgabot Apr 13, 2020
cb6a6cd
Refactor to maybe_displace_child method
olgabot Apr 15, 2020
0fb5caa
Add name and comment to SBT.insert
olgabot Apr 15, 2020
7e25102
Expand comment
olgabot Apr 15, 2020
12dec1f
Make _missing_nodes a property of SBT
olgabot Apr 15, 2020
a8cd435
Add test to ensure similar pairs share parents regardless of construc…
olgabot Apr 15, 2020
55dd7d9
Check for presence of all signatures in both sorted and randomized data
olgabot Apr 15, 2020
0d78ccd
Add random seed for checking adversarial signature ordering
olgabot Apr 15, 2020
24b854e
Refactor out insert_dissimilar_leaf and displace_child_with_new_leaf …
olgabot Apr 15, 2020
733d7f6
Skip adversarial signatures testing for adding multiple of the same s…
olgabot Apr 16, 2020
8be87ce
trying to fix the issue of adversarial indexes..
olgabot Apr 16, 2020
5385c34
Update sourmash/sbtmh.py
phoenixAja Apr 17, 2020
50329f9
Update sourmash/sbtmh.py
phoenixAja Apr 17, 2020
e5ac760
Update sourmash/commands.py
phoenixAja Apr 17, 2020
e5c48d5
Initial addition of unit testing signatures used in gather
olgabot Apr 17, 2020
459bf9f
Add comments explaining what calls should be happening
olgabot Apr 17, 2020
8471fc1
Localized SBT tests are working! Now to fix everything else..
olgabot Jun 14, 2020
36e8c3b
More conditions for building missing nodes
olgabot Jun 14, 2020
a4ffb66
Merge in sourmash/master
olgabot Jun 14, 2020
84de401
Fix half the tests with "import math" :)
olgabot Jun 14, 2020
b12e050
Move cached_property to utils.py
olgabot Jun 17, 2020
89e520c
Use cached_property and @luizirber's implementation of _missing_nodes
olgabot Jun 17, 2020
9c04a42
Remove setting of _missing_nodes for old trees
olgabot Jun 17, 2020
f49f385
Fix test_sourmash.test_do_sourmash_check_sbt_filenames by using md5su…
olgabot Jun 17, 2020
c0c9d4c
fix tests for saving index zipfiles
olgabot Jun 17, 2020
055685a
update comment
olgabot Jun 17, 2020
88f2801
Add separate test_search.py
olgabot Jun 17, 2020
f12506a
Refactor loading matching signatures into tree, to separate function
olgabot Jun 17, 2020
7ef803d
Ignore empty internal nodes for search
olgabot Jun 17, 2020
1663bcc
Add option to return number of signatures loaded
olgabot Jun 17, 2020
b9dcfde
Add separate tests for search_databases and gather_databases
olgabot Jun 17, 2020
438c3f5
Add ignore_empty to search_minhashes_containment
olgabot Jul 18, 2020
aad7425
Make naming more clear and always update internal nodes after pushing…
olgabot Jul 18, 2020
ac1ca07
Different cases for Leaf/SigLeaf
olgabot Jul 18, 2020
7afc419
Keep original filenaming thing
olgabot Jul 18, 2020
1e6d9c9
Update search and gather results
olgabot Jul 18, 2020
2c7b2fb
Updated tests, but who knows if this is correct...
olgabot Jul 18, 2020
c661af7
Merge in latest master
olgabot Jul 18, 2020
baa5b04
"fixed" more tests
olgabot Jul 18, 2020
4d1995e
Merge branch 'master' into phoenixaja/index_localization
olgabot Jul 29, 2020
05fbf59
Add @ctb sbt print code
olgabot Jul 29, 2020
92c8aa1
Refactor @ctb's printing code to separate out printing and string ret…
olgabot Jul 29, 2020
975ba88
Committing all changes for now, will undo black formatting later
olgabot Jul 29, 2020
0a6fefb
Keep trying to get sbt stuff to work
olgabot Aug 11, 2020
09f0456
always naively insert to next available leaf pos if no similarity found
phoenixAja Oct 24, 2020

7 changes: 7 additions & 0 deletions sourmash/cli/index.py
@@ -75,6 +75,13 @@ def subparser(subparsers):
'--scaled', metavar='FLOAT', type=float, default=0,
help='downsample signatures to the specified scaled factor'
)
subparser.add_argument(
'--not-localized', action='store_true', default=False,
help='Do not create a localized SBT index. A localized index guarantees that '
'any two leaves sharing a parent are more similar to each other than to '
'leaves not sharing a parent. Localized indices are required for building '
'a nearest neighbor graph from the SBT.'
)
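The guarantee described in the help text above can be illustrated with a small check. This is only a sketch under stated assumptions: it reads the guarantee as "siblings are at least as similar to each other as either is to any non-sibling", it uses plain Python sets of integers in place of MinHash signatures, and `jaccard` and `is_localized` are hypothetical names that do not come from sourmash.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of hashes."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_localized(leaves, sibling_pairs):
    """Return True if every sibling pair is at least as similar to each
    other as either sibling is to any leaf outside the pair.

    leaves: dict mapping a leaf name to a set of hashes (stand-in for a signature)
    sibling_pairs: list of (name, name) tuples for leaves sharing a parent
    """
    for left, right in sibling_pairs:
        within = jaccard(leaves[left], leaves[right])
        for other in leaves:
            if other in (left, right):
                continue
            for member in (left, right):
                if jaccard(leaves[member], leaves[other]) > within:
                    return False
    return True


# Toy data: A/B overlap heavily, C/D overlap heavily, nothing else overlaps.
leaves = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {10, 11, 12, 13},
    "D": {10, 11, 12, 14},
}
good_pairs = [("A", "B"), ("C", "D")]  # similar leaves share a parent
bad_pairs = [("A", "C"), ("B", "D")]   # similar leaves are split apart
```

With this toy data, `is_localized(leaves, good_pairs)` holds and `is_localized(leaves, bad_pairs)` does not, because A is closer to B than to its assigned sibling C.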
Comment on lines +78 to +84
Member

I don't think this option is necessary: if d != 2 then build the current SBT, otherwise always build the LocalizedSBT.

Contributor

well, it's nice for debugging :)
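The two comments above could be reconciled roughly as follows: dispatch on the number of children by default, and keep an explicit opt-out for debugging. This is a hedged sketch, not the PR's implementation. `SBT` and `LocalizedSBT` here are minimal stand-ins for the real classes, and `make_tree` is a hypothetical helper name.

```python
class SBT:
    """Stand-in for the existing tree class (not the sourmash implementation)."""

    def __init__(self, n_children=2):
        self.n_children = n_children


class LocalizedSBT(SBT):
    """Stand-in for the localized tree; the commit list indicates it only
    supports two children per node and raises an error otherwise."""

    def __init__(self, n_children=2):
        if n_children != 2:
            raise ValueError("LocalizedSBT requires exactly 2 children per node")
        super().__init__(n_children=n_children)


def make_tree(n_children=2, not_localized=False):
    """Pick the tree class: localized only when d == 2 and not opted out."""
    if n_children != 2 or not_localized:
        return SBT(n_children=n_children)
    return LocalizedSBT(n_children=n_children)
```

Under this arrangement the flag never changes behavior for `d != 2`, so it serves purely as a debugging escape hatch for binary trees.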

add_moltype_args(subparser)


87 changes: 47 additions & 40 deletions sourmash/commands.py
@@ -345,10 +345,11 @@ def index(args):
set_quiet(args.quiet)
moltype = sourmash_args.calculate_moltype(args)

if args.append:
tree = load_sbt_index(args.sbt_name)
if args.traverse_directory:
inp_files = list(sourmash_args.traverse_find_sigs(args.signatures,
args.force))
else:
tree = create_sbt_index(args.bf_size, n_children=args.n_children)
inp_files = list(args.signatures)

if args.sparseness < 0 or args.sparseness > 1.0:
error('sparseness must be in range [0.0, 1.0].')
@@ -357,31 +358,34 @@ def index(args):
args.scaled = int(args.scaled)
notify('downsampling signatures to scaled={}', args.scaled)

inp_files = list(args.signatures)
if args.from_file:
more_files = sourmash_args.load_file_list_of_signatures(args.from_file)
inp_files.extend(more_files)
notify('loading {} files into SBT', len(inp_files))

if not inp_files:
error("ERROR: no files to index!? Supply on command line or use --from-file")
sys.exit(-1)
tree, n = load_matching_signatures_into_tree(
inp_files, args.ksize, moltype, args.scaled, args.append, args.sbt_name,
return_n=True)

notify('loading {} files into SBT', len(inp_files))
notify('loaded {} sigs; saving SBT under "{}"', n, args.sbt_name)
tree.save(args.sbt_name, sparseness=args.sparseness)

progress = sourmash_args.SignatureLoadingProgress()

def load_matching_signatures_into_tree(filenames, ksize, moltype, scaled=0,
append=False, sbt_name=None, bf_size=1e5,
n_children=2, return_n=False):
if append:
tree = load_sbt_index(sbt_name)
else:
tree = create_sbt_index(bf_size, n_children=n_children)

n = 0
ksizes = set()
moltypes = set()
nums = set()
scaleds = set()
for f in inp_files:
siglist = sourmash_args.load_file_as_signatures(f,
ksize=args.ksize,
select_moltype=moltype,
traverse=args.traverse_directory,
yield_all_files=args.force,
progress=progress)
for f in filenames:
if n % 100 == 0:
notify('\r...reading from {} ({} signatures so far)', f, n, end='')
siglist = sig.load_signatures(f, ksize=ksize,
select_moltype=moltype)

# load all matching signatures in this file
ss = None
@@ -390,8 +394,8 @@ def index(args):
moltypes.add(sourmash_args.get_moltype(ss))
nums.add(ss.minhash.num)

if args.scaled:
ss.minhash = ss.minhash.downsample_scaled(args.scaled)
if scaled:
ss.minhash = ss.minhash.downsample_scaled(scaled)
scaleds.add(ss.minhash.scaled)

tree.insert(ss)
@@ -400,32 +404,35 @@ def index(args):
if not ss:
continue

# check to make sure we aren't loading incompatible signatures
if len(ksizes) > 1 or len(moltypes) > 1:
error('multiple k-mer sizes or molecule types present; fail.')
error('specify --dna/--protein and --ksize as necessary')
error('ksizes: {}; moltypes: {}',
", ".join(map(str, ksizes)), ", ".join(moltypes))
sys.exit(-1)

if nums == { 0 } and len(scaleds) == 1:
pass # good
elif scaleds == { 0 } and len(nums) == 1:
pass # also good
else:
error('trying to build an SBT with incompatible signatures.')
error('nums = {}; scaleds = {}', repr(nums), repr(scaleds))
sys.exit(-1)

check_signature_compatibilty_to_tree(ksizes, moltypes, nums, scaleds)
notify('')

# did we load any!?
if n == 0:
error('no signatures found to load into tree!? failing.')
sys.exit(-1)
if return_n:
return tree, n
else:
return tree

notify('loaded {} sigs; saving SBT under "{}"', n, args.sbt_name)
tree.save(args.sbt_name, sparseness=args.sparseness)

def check_signature_compatibilty_to_tree(ksizes, moltypes, nums, scaleds):
Contributor

@ctb ctb Jul 19, 2020

in sourmash_args, we have check_tree_is_compatible; they do slightly different things and I think it's fine to use this new function, but we should name it something else.

Contributor

(maybe prefix it with an _ to indicate that it's a helper function that shouldn't be used outside this module?)

# check to make sure we aren't loading incompatible signatures
if len(ksizes) > 1 or len(moltypes) > 1:
error('multiple k-mer sizes or molecule types present; fail.')
error('specify --dna/--protein and --ksize as necessary')
error('ksizes: {}; moltypes: {}',
", ".join(map(str, ksizes)), ", ".join(moltypes))
sys.exit(-1)
if nums == {0} and len(scaleds) == 1:
pass # good
elif scaleds == {0} and len(nums) == 1:
pass # also good
else:
error('trying to build an SBT with incompatible signatures.')
error('nums = {}; scaleds = {}', repr(nums), repr(scaleds))
sys.exit(-1)


def search(args):
21 changes: 2 additions & 19 deletions sourmash/lca/lca_db.py
@@ -4,27 +4,11 @@
import json
import gzip
from collections import OrderedDict, defaultdict, Counter
import functools

import sourmash
from sourmash._minhash import get_max_hash_for_scaled
from sourmash.logging import notify, error, debug
from sourmash.logging import notify, debug
from sourmash.index import Index


def cached_property(fun):
"""A memoize decorator for class properties."""
@functools.wraps(fun)
def get(self):
try:
return self._cache[fun]
except AttributeError:
self._cache = {}
except KeyError:
pass
ret = self._cache[fun] = fun(self)
return ret
return property(get)
from sourmash.utils import cached_property
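For reference, here is how the relocated decorator behaves. The decorator body is the one shown in this hunk, with indentation restored by inference because the page scrape dropped it. The `Tree` class is a hypothetical example that exists only to exercise it.

```python
import functools


def cached_property(fun):
    """A memoize decorator for class properties."""
    @functools.wraps(fun)
    def get(self):
        try:
            return self._cache[fun]
        except AttributeError:
            # First cached access on this instance: create the cache dict.
            self._cache = {}
        except KeyError:
            # Cache exists but this property has not been computed yet.
            pass
        ret = self._cache[fun] = fun(self)
        return ret
    return property(get)


class Tree:
    """Hypothetical class used only to demonstrate the decorator."""

    def __init__(self, leaves):
        self.leaves = leaves
        self.calls = 0

    @cached_property
    def n_leaves(self):
        self.calls += 1
        return len(self.leaves)


tree = Tree(["a", "b", "c"])
first = tree.n_leaves   # computed and stored in tree._cache
second = tree.n_leaves  # served from the cache; n_leaves body is not re-run
```

Note that nothing invalidates the cache: if `tree.leaves` is modified after the first access, `tree.n_leaves` keeps returning the stale value.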


class LCA_Database(Index):
@@ -166,7 +150,6 @@ def __repr__(self):

def signatures(self):
"Return all of the signatures in this LCA database."
from sourmash import SourmashSignature
for v in self._signatures.values():
yield v
