configuration file and user-provided lineages #24

taylorreiter · 2020-05-13T15:41:03Z

The current test data configuration file has an input for lineages:

# lineages CSV (see `sourmash lca index`) for signatures in query databases
lineages_csv: test-data/podar-lineage.csv

From the config file alone, it's unclear if it is necessary for the user to provide lineages, and if they do not provide lineages, what will happen/how the config file should be filled out.


taylorreiter · 2020-05-13T15:46:02Z

I see from the tara-delmont conf file that lineages can be specified as:

# lineages CSV (see `sourmash lca index`) for signatures in query databases
lineages_csv: /home/ctbrown/sourmash_databases/gtdb/gtdb-lineages.csv

So the lineages csv contains the lineages that will be tested for presence/contamination in the MAG? Should gtdb-lineages.csv be the default? Can this db be downloaded by charcoal so the user doesn't have to think about it?

ctb · 2020-05-13T15:47:36Z

On Wed, May 13, 2020 at 08:41:19AM -0700, Taylor Reiter wrote:

> The current test data configuration file has an input for lineages:
>
> # lineages CSV (see `sourmash lca index`) for signatures in query databases
> lineages_csv: test-data/podar-lineage.csv
>
> From the config file alone, it's unclear if it is necessary for the user to provide lineages, and if they do not provide lineages, what will happen/how the config file should be filled out.

These are lineages for the database(s), not the genomes, e.g. gtdb. Suggested wording welcome! `provided_lineages` is the set of optional overrides on input genomes. Hmm, one good addition might actually be to provide a separate, roughly system-wide config file that lists the query databases and lineages. That way you only have to specify them there, and they flow through to the rest of the projects.
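A minimal sketch of how such a layered setup could work, assuming settings are plain dictionaries once loaded. The `query_databases` key, the database path, and the `effective_config` helper are hypothetical; only `lineages_csv` and its GTDB path come from the config files quoted in this thread.

```python
from collections import ChainMap

# Hypothetical system-wide settings, specified once per install.
SYSTEM_CONFIG = {
    "query_databases": ["/path/to/gtdb.sbt.zip"],  # hypothetical key and path
    "lineages_csv": "/home/ctbrown/sourmash_databases/gtdb/gtdb-lineages.csv",
}


def effective_config(project_config, system_config=SYSTEM_CONFIG):
    """Return project settings layered over the system-wide defaults."""
    # Keys left unset (None) in the project fall back to system values.
    overrides = {k: v for k, v in project_config.items() if v is not None}
    # ChainMap looks up keys in the first mapping before the second.
    return dict(ChainMap(overrides, system_config))
```

With this layout a project config only needs to mention `lineages_csv` when it wants something other than the system-wide value, as the test data config does with `test-data/podar-lineage.csv`.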

taylorreiter · 2020-05-13T15:54:13Z

Yes, I'm partial to only having to worry about databases once. So somehow setting it up so that charcoal automatically downloads and configures databases for the user the first time the tool is used, and then propagates the paths of those databases to all charcoal runs unless the user overrides them or wants to switch databases.

As for wording, I think lineages_csv is fine, but maybe adding something to the comment above like

# lineages CSV containing reference lineages to test for contamination.
# Must correspond to signatures in query databases (e.g. gtdb.csv).
# See `sourmash lca index` to generate your own.

Although that's kind of bad English and still not totally clear.

ctb · 2020-05-13T15:54:43Z

> I see from the tara-delmont conf file that lineages can be specified as:
>
> # lineages CSV (see `sourmash lca index`) for signatures in query databases
> lineages_csv: /home/ctbrown/sourmash_databases/gtdb/gtdb-lineages.csv
>
> So the lineages csv contains the lineages that will be tested for presence/contamination in the MAG?

Exactly so.

> Should gtdb-lineages.csv be the default? Can this db be downloaded by charcoal so the user doesn't have to think about it?

I think that's a pretty reasonable approach, yes!

maybe we can provide some commands like --

charcoal download_db - download databases
charcoal config check - check location etc of databases
charcoal config generate - generate a new config file

The trickiest bit(s) here are that we need to figure out good default locations for downloaded databases and so on. luckily w/sbt.zip support they're small enough that doing it on a per-install basis is probably ok, and we can support central installs if needed.
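As a rough illustration of the command layout proposed above, here is a minimal `argparse` sketch. The subcommand names and help strings are taken from the list in this comment; everything else (the `build_parser` and `describe` helpers, and the use of `argparse` at all) is an assumption, not charcoal's actual CLI.

```python
import argparse


def build_parser():
    """Build a parser for the three proposed commands (sketch only)."""
    parser = argparse.ArgumentParser(prog="charcoal")
    sub = parser.add_subparsers(dest="command", required=True)

    # charcoal download_db
    sub.add_parser("download_db", help="download databases")

    # charcoal config check / charcoal config generate
    config = sub.add_parser("config", help="inspect or create config files")
    config_sub = config.add_subparsers(dest="action", required=True)
    config_sub.add_parser("check", help="check location etc of databases")
    config_sub.add_parser("generate", help="generate a new config file")
    return parser


def describe(argv):
    """Return the parsed command as a string, e.g. 'config check'."""
    args = build_parser().parse_args(argv)
    if args.command == "config":
        return f"config {args.action}"
    return args.command
```

Nesting `check` and `generate` under a `config` subparser keeps room for further config actions later without adding more top-level commands.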

ctb · 2020-05-13T16:01:07Z

(Side note: `sourmash lca index` consumes such files, but does not produce them. That's more of a `sourmash taxonomy` kind of thing, though that doesn't exist yet.)

taylorreiter · 2020-05-13T17:04:08Z

> charcoal download_db - download databases
> charcoal config check - check location etc of databases
> charcoal config generate - generate a new config file

I love this idea!

> The trickiest bit(s) here are that we need to figure out good default locations for downloaded databases and so on. luckily w/sbt.zip support they're small enough that doing it on a per-install basis is probably ok, and we can support central installs if needed.

Could we do something like `charcoal download_db -p /home/tereiter/charcoal_db`, where the user provides a path after `-p` and somehow charcoal then knows about that path?

ctb · 2020-05-13T17:13:11Z

On Wed, May 13, 2020 at 10:04:23AM -0700, Taylor Reiter wrote:

> > The trickiest bit(s) here are that we need to figure out good default locations for downloaded databases and so on. luckily w/sbt.zip support they're small enough that doing it on a per-install basis is probably ok, and we can support central installs if needed.
>
> Could we do something like `charcoal download_db -p /home/tereiter/charcoal_db`, where the user provides a path after `-p` and somehow charcoal then knows about that path?

Yep. I think we would need to (try to) write to the central charcoal config file, which (in conda) would be user-writeable if it's in the package. Whether or not this is a good idea... less sure :). But we could provide a user-override environment variable, too. Yay complexity.
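One way the lookup could be ordered, sketched under several assumptions: the environment variable name `CHARCOAL_DB_DIR`, the `db_dir` config key, the default directory, and the precedence itself (explicit `-p` path, then environment variable, then central config, then default) are all hypothetical and not settled anywhere in this thread.

```python
import os
from pathlib import Path

ENV_VAR = "CHARCOAL_DB_DIR"  # hypothetical variable name
DEFAULT_DB_DIR = Path.home() / ".charcoal" / "db"  # hypothetical default


def resolve_db_dir(cli_path=None, central_config=None, environ=None):
    """Pick the database directory from the most specific source available."""
    environ = os.environ if environ is None else environ
    central_config = central_config or {}

    if cli_path:  # e.g. the path given after -p
        return Path(cli_path)
    if environ.get(ENV_VAR):  # user-override environment variable
        return Path(environ[ENV_VAR])
    if central_config.get("db_dir"):  # value stored in the central config
        return Path(central_config["db_dir"])
    return DEFAULT_DB_DIR
```

Passing `environ` explicitly keeps the function easy to test. Letting the environment variable win over the central config means a user can redirect databases even when the installed config file is not writeable.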

ctb · 2020-05-18T14:39:25Z

remaining bits transferred to #61

This was referenced May 13, 2020:

better config testing/error messages #28 (Merged)
provide project initiation and configuration subcommands #46 (Merged)
provide automatic download/install options for database(s) and lineage(s) #61 (Open)

ctb closed this as completed May 18, 2020
