microbial_fraction: Improve docs, detail multi-copy plasmid issue.
wwood committed Aug 28, 2024
1 parent 55a9f22 commit badfe68
Showing 3 changed files with 9 additions and 5 deletions.
6 changes: 4 additions & 2 deletions docs/preludes/microbial_fraction_prelude.md
@@ -1,10 +1,12 @@
The SingleM `microbial_fraction` ('SMF') subcommand estimates the fraction of reads in a metagenome that are microbial, compared to everything else e.g. eukaryote- or phage-derived. Here we define 'microbial' as either bacterial or archaeal, including their plasmids.

-SingleM `microbial_fraction` also estimates the average genome size of microbial cells in the sample.
+SingleM `microbial_fraction` also estimates the average genome size (AGS) of microbial cells in the sample.

The main conceptual advantage of this method over other tools is that it does not require reference sequences of the non-microbial genomes that may be present (e.g. those of an animal host). Instead, it uses a SingleM taxonomic profile of the metagenome to "add up" the components of the community which are microbial. The remaining components are non-microbial e.g. host, diet, or phage.

-Roughly, the number of microbial bases is estimated by summing the genome sizes of each species/taxon after multiplying by their coverage in the taxonomic profile. The microbial fraction is then calculated as the ratio of microbial bases to the total metagenome size.
+Roughly, the number of microbial bases is estimated by summing the genome sizes of each species/taxon after multiplying by their coverage in the taxonomic profile generated by SingleM [pipe](/tools/pipe). The microbial fraction is then calculated as the ratio of microbial bases to the total metagenome size.
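
The calculation described in this paragraph can be sketched roughly as follows. This is an illustration only, not SingleM's actual implementation: the function name, the `(genome_size, coverage)` input shape, and all numbers are hypothetical.

```python
def estimate_microbial_fraction(taxa, total_metagenome_bases):
    """Estimate the microbial fraction of a metagenome.

    taxa: iterable of (genome_size_bp, coverage) pairs, one per species/taxon
          (a hypothetical input shape, not SingleM's profile format).
    total_metagenome_bases: total number of bases sequenced.
    """
    # Microbial bases = sum over taxa of genome size multiplied by coverage.
    microbial_bases = sum(genome_size * coverage for genome_size, coverage in taxa)
    # Microbial fraction = microbial bases / total metagenome size.
    return microbial_bases / total_metagenome_bases


# Made-up example: two taxa in a 100 Mbp metagenome.
example_taxa = [(5_000_000, 10.0), (2_000_000, 5.0)]
example_fraction = estimate_microbial_fraction(example_taxa, 100_000_000)
print(f"{example_fraction:.2f}")
```

In this made-up case, 60 Mbp of the 100 Mbp metagenome is attributed to microbes; the rest would be treated as non-microbial.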

+One current limitation of the approach relates to multi-copy plasmids. In `microbial_fraction`, the genome size of each microbial species is estimated as the sum of the chromosome and plasmid sizes, since these are the sequences available for each genome. However, in a metagenome, a species' plasmid may occur in multiple copies per cell (e.g. if the plasmid is 'high copy number'). SingleM `microbial_fraction` does not account for plasmid copy number, leading to an underestimation of the microbial fraction when plasmids are multi-copy. However, we consider this to be a minor issue, since plasmids are typically small compared to chromosomes. The average genome size estimate is unaffected by this limitation since by definition each plasmid counts only once regardless of its copy number.
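
A small worked example may make the size of this effect concrete. It is a sketch under stated assumptions: the helper name and all sizes are hypothetical, and coverage is assumed to mean per-cell (chromosomal) coverage.

```python
def microbial_bases(chromosome_bp, plasmid_bp, coverage, plasmid_copies=1):
    """Bases contributed by one species (hypothetical helper, not a SingleM function).

    coverage is assumed to be per-cell (chromosomal) coverage.
    """
    return coverage * (chromosome_bp + plasmid_copies * plasmid_bp)


# Made-up species: 4 Mbp chromosome, 10 kbp plasmid, 10x coverage.
# The estimate counts the plasmid once, as in the reference genome.
estimated_bases = microbial_bases(4_000_000, 10_000, coverage=10)
# In the sample, assume the plasmid is present at 20 copies per cell.
actual_bases = microbial_bases(4_000_000, 10_000, coverage=10, plasmid_copies=20)

# Relative amount by which this species' contribution is underestimated.
shortfall = 1 - estimated_bases / actual_bases
print(f"estimated={estimated_bases} actual={actual_bases} shortfall={shortfall:.1%}")
```

With these made-up numbers the species' contribution is underestimated by roughly 4.5%, which is consistent with the text's view that the effect is minor when plasmids are small relative to chromosomes.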

To run `microbial_fraction`, first run `pipe` on your metagenome.

6 changes: 4 additions & 2 deletions docs/tools/microbial_fraction.md
@@ -4,11 +4,13 @@ title: SingleM microbial_fraction
# singlem microbial_fraction
The SingleM `microbial_fraction` ('SMF') subcommand estimates the fraction of reads in a metagenome that are microbial, compared to everything else e.g. eukaryote- or phage-derived. Here we define 'microbial' as either bacterial or archaeal, including their plasmids.

-SingleM `microbial_fraction` also estimates the average genome size of microbial cells in the sample.
+SingleM `microbial_fraction` also estimates the average genome size (AGS) of microbial cells in the sample.

The main conceptual advantage of this method over other tools is that it does not require reference sequences of the non-microbial genomes that may be present (e.g. those of an animal host). Instead, it uses a SingleM taxonomic profile of the metagenome to "add up" the components of the community which are microbial. The remaining components are non-microbial e.g. host, diet, or phage.

-Roughly, the number of microbial bases is estimated by summing the genome sizes of each species/taxon after multiplying by their coverage in the taxonomic profile. The microbial fraction is then calculated as the ratio of microbial bases to the total metagenome size.
+Roughly, the number of microbial bases is estimated by summing the genome sizes of each species/taxon after multiplying by their coverage in the taxonomic profile generated by SingleM [pipe](/tools/pipe). The microbial fraction is then calculated as the ratio of microbial bases to the total metagenome size.

+One current limitation of the approach relates to multi-copy plasmids. In `microbial_fraction`, the genome size of each microbial species is estimated as the sum of the chromosome and plasmid sizes, since these are the sequences available for each genome. However, in a metagenome, a species' plasmid may occur in multiple copies per cell (e.g. if the plasmid is 'high copy number'). SingleM `microbial_fraction` does not account for plasmid copy number, leading to an underestimation of the microbial fraction when plasmids are multi-copy. However, we consider this to be a minor issue, since plasmids are typically small compared to chromosomes. The average genome size estimate is unaffected by this limitation since by definition each plasmid counts only once regardless of its copy number.

To run `microbial_fraction`, first run `pipe` on your metagenome.

2 changes: 1 addition & 1 deletion singlem/main.py
@@ -397,7 +397,7 @@ def add_less_common_pipe_arguments(argument_group):
summarise_output_args.add_argument('--unaligned-sequences-dump-file',
help="Output unaligned sequences from in put archive OTU table to this file. After each read name '~N' is added which corresponds to the order of the read in the archive OTU table, so that no two sequences have the same read name. N>1 can happen e.g. when the input file contains paired reads. ~0 does not necessarily correspond to the first read in the original input sequence set, but instead to the order in the input archive OTU table.")

-read_fraction_description = 'Estimate the fraction of reads from a metagenome that are assigned to Bacteria and Archaea compared to e.g. eukaryote or phage.'
+read_fraction_description = 'Estimate the fraction of reads from a metagenome that are assigned to Bacteria and Archaea compared to e.g. eukaryote or phage. Also estimate average genome size.'
read_fraction_parser = bird_argparser.new_subparser('microbial_fraction', read_fraction_description, parser_group='Tools')
read_fraction_io_args = read_fraction_parser.add_argument_group('input')
read_fraction_io_args.add_argument('-p', '--input-profile', help="Input taxonomic profile file [required]", required=True)