Commit 35c1e25

Merge pull request #156 from Joon-Klaps/shannon-entropy
Add column in custom mpileup - Shannon entropy
2 parents d7f8178 + 9dd62c1 commit 35c1e25

8 files changed, +95 -44 lines

CHANGELOG.md (+1)

@@ -22,6 +22,7 @@ Initial release of Joon-Klaps/viralgenie, created with the [nf-core](https://nf-
 - Update docs ([#150](https://github.com/Joon-Klaps/viralgenie/pull/150))
 - Make custom-mpileup.py postion 1 index based and not 0 index to follow bcftools ([#153](https://github.com/Joon-Klaps/viralgenie/pull/153))
 - Update docs for more streamlined docs & figures ([#154](https://github.com/Joon-Klaps/viralgenie/pull/154))
+- Add column in custom mpileup - Shannon entropy ([#156](https://github.com/Joon-Klaps/viralgenie/pull/156))

 ### `Fixed`

README.md (+1 -1)

@@ -34,7 +34,7 @@
     - Read UMI deduplication ([`HUMID`](https://humid.readthedocs.io/en/latest/usage.html))
     - Low complexity and quality filtering ([`bbduk`](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/), [`prinseq++`](https://github.com/Adrian-Cantu/PRINSEQ-plus-plus))
     - Host-read removal ([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/))
-3. Metagenomic diveristy mapping
+3. Metagenomic diversity mapping
     - Performs taxonomic classification and/or profiling using one or more of:
         - [`Kraken2`](https://ccb.jhu.edu/software/kraken2/)
         - [`Bracken`](https://ccb.jhu.edu/software/bracken/)(optional)

bin/custom_mpileup.py (+81 -36)

@@ -26,6 +26,7 @@ def parse_args(argv=None):
     parser.add_argument("--alignment", type=Path, help="Input BAM file prefix")
     parser.add_argument("--reference", type=Path, help="Reference FASTA file")
     parser.add_argument("--prefix", type=str, help="Name of the output file")
+    parser.add_argument("--k", type=int, help="Pseudocount to add to the total for shannon entropy calculation", default=50)
     parser.add_argument(
         "-l",
         "--log-level",
@@ -36,51 +37,95 @@ def parse_args(argv=None):
     return parser.parse_args(argv)


-def process_mpileup(filename: Path, reference: Path) -> NDArray:
+def process_mpileup(filename: Path, reference: Path, k: int) -> NDArray:
     """
     Process mpileup data using numpy vectorized operations.
-
-    Args:
-        filename: Path to the alignment file (BAM/CRAM/SAM)
-        reference: Path to the reference FASTA file
-
-    Returns:
-        NDArray: Array with columns [position, A, C, G, T, insertions, deletions, consensus]
     """
-    # Initialize FASTA file properly
     fasta = pysam.FastaFile(str(reference))
-
     alignment_file = pysam.AlignmentFile(filename, "rc" if filename.suffix == ".cram" else "rb", reference_filename=str(reference))

     # Convert generator to structured numpy array
-    data = np.array(
-        [
-            (r["pos"]+1, r["ref"], r["A"], r["C"], r["G"], r["T"], r["insertions"], r["deletions"], "N")
-            for r in pysamstats.stat_variation(alignment_file, fafile=fasta)
-        ],
-        dtype=[
-            ("pos", int),
-            ("ref", "U1"),
-            ("A", int),
-            ("C", int),
-            ("G", int),
-            ("T", int),
-            ("ins", int),
-            ("del", int),
-            ("consensus", "U1"),
-        ],
+    stats = list(pysamstats.stat_variation(alignment_file, fafile=fasta))
+    n_rows = len(stats)
+
+    # Create structured array in one go
+    data = np.zeros(n_rows, dtype=[
+        ("pos", int), ("ref", "U1"), ("A", int), ("C", int),
+        ("G", int), ("T", int), ("ins", int), ("del", int),
+        ("consensus", "U1"), ("entropy", float), ("weighted_entropy", float)
+    ])
+
+    # Fill arrays using vectorized operations
+    data["pos"] = np.array([r["pos"] + 1 for r in stats])
+    data["ref"] = np.array([r["ref"] for r in stats])
+    for base in "ACGT":
+        data[base] = np.array([r[base] for r in stats])
+    data["ins"] = np.array([r["insertions"] for r in stats])
+    data["del"] = np.array([r["deletions"] for r in stats])
+    data["consensus"] = "N"
+
+    # Create nucleotide matrix for vectorized operations
+    nucleotides = np.stack([data[base] for base in "ACGT"], axis=1)
+    total_coverage = np.sum(nucleotides, axis=1)
+
+    # Vectorized consensus calculation
+    max_counts = np.max(nucleotides, axis=1)
+    mask = np.divide(max_counts, total_coverage, where=total_coverage > 0) >= 0.7
+    data["consensus"][mask] = np.array(["A", "C", "G", "T"])[np.argmax(nucleotides[mask], axis=1)]
+
+    # Calculate shannon entropy
+    data["entropy"] = shannon_entropy(nucleotides, total_coverage)
+    data["weighted_entropy"] = weighted_entropy(data["entropy"], total_coverage, k)
+
+    return data
+
+def shannon_entropy(nucleotides: NDArray, total_coverage: NDArray) -> NDArray:
+    """
+    Calculate the Shannon entropy of the nucleotide distribution
+    """
+    # Define epsilon for numerical stability
+    eps = 1e-10
+
+    # Calculate the frequency of each nucleotide
+    # Add epsilon to avoid division by zero and set a minimum threshold
+    frequencies = np.divide(
+        nucleotides,
+        total_coverage[:, np.newaxis],
+        where=total_coverage[:, np.newaxis] > eps
     )

-    # Extract nucleotide counts for consensus calculation
-    nucleotides = np.vstack([data[base] for base in "ACGT"]).T
-    total_coverage = nucleotides.sum(axis=1)
-    max_counts = nucleotides.max(axis=1)
+    # Set very small frequencies to zero to avoid floating point errors
+    frequencies = np.where(frequencies < eps, 0.0, frequencies)

-    # Update consensus column where conditions are met
-    mask = np.divide(max_counts, total_coverage, where=total_coverage > 0) >= 0.7
-    data["consensus"][mask] = np.array(["A", "C", "G", "T"])[nucleotides[mask].argmax(axis=1)]
+    # Normalize frequencies to ensure they sum to 1
+    row_sums = np.sum(frequencies, axis=1, keepdims=True)
+    frequencies = np.divide(frequencies, row_sums, where=row_sums > eps)

-    return data
+    # Calculate the Shannon entropy only for non-zero frequencies
+    with np.errstate(divide='ignore', invalid='ignore'):
+        entropy = -np.sum(
+            np.where(
+                frequencies > eps,
+                frequencies * np.log2(frequencies),
+                0.0
+            ),
+            axis=1
+        )
+
+    # Replace NaN values with 0.0
+    entropy = np.nan_to_num(entropy)
+
+    # Round to 3 decimal places and ensure -0.0 is converted to 0.0
+    entropy = np.where(entropy == -0.0, 0.0, np.round(entropy, 3))
+
+    return entropy
+
+def weighted_entropy(entropy: NDArray, total_coverage: NDArray, k: int) -> NDArray:
+    """
+    Correct the Shannon entropy by multiplying it with N/(N+k)
+    """
+    correction = total_coverage / (total_coverage + k)
+    return np.round(entropy * correction, 3)


 def write_csv(matrix: NDArray, prefix: str) -> None:
@@ -91,7 +136,7 @@ def write_csv(matrix: NDArray, prefix: str) -> None:
         matrix: NumPy array containing the mpileup results
         output: Path to the output file
     """
-    header = ["Position", "Reference", "A", "C", "G", "T", "Insertions", "Deletions", "Consensus"]
+    header = ["Position", "Reference", "A", "C", "G", "T", "Insertions", "Deletions", "Consensus", "Entropy", "Weighted Entropy"]
     with open(f"{prefix}.tsv", "w", newline="", encoding="utf-8") as file:
         writer = csv.writer(file, delimiter="\t")
         writer.writerow(header)
@@ -101,7 +146,7 @@ def write_csv(matrix: NDArray, prefix: str) -> None:
 def main():
     args = parse_args()
     logger.info("Starting mpileup processing")
-    matrix = process_mpileup(args.alignment, args.reference)
+    matrix = process_mpileup(args.alignment, args.reference, args.k)
     write_csv(matrix, args.prefix)
     logger.info("Mpileup processing completed")
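For intuition about the new columns, here is a minimal standalone sketch of the entropy math introduced above (assumes only `numpy`; the helper is a simplified version of the diff's `shannon_entropy`, and the counts are toy values, not pipeline output):

```python
import numpy as np

def shannon_entropy(nucleotides: np.ndarray, total_coverage: np.ndarray) -> np.ndarray:
    """Per-position Shannon entropy (bits) of A/C/G/T counts, simplified from the script."""
    # Per-base frequencies; rows with zero coverage stay all-zero.
    freqs = np.divide(
        nucleotides,
        total_coverage[:, np.newaxis],
        out=np.zeros(nucleotides.shape, dtype=float),
        where=total_coverage[:, np.newaxis] > 0,
    )
    # Sum -p * log2(p) over the four bases, treating p == 0 as contributing 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.sum(np.where(freqs > 0, freqs * np.log2(freqs), 0.0), axis=1)
    return np.round(np.nan_to_num(h), 3)

# Toy counts (A, C, G, T) for three positions: skewed, uniform, uncovered.
counts = np.array([
    [90, 10, 0, 0],    # 90% A -> low diversity
    [25, 25, 25, 25],  # uniform -> maximal entropy of 2 bits
    [0, 0, 0, 0],      # no coverage
])
coverage = counts.sum(axis=1)

entropy = shannon_entropy(counts, coverage)
k = 50  # pseudocount, as in the new --k option (default 50)
weighted = np.round(entropy * coverage / (coverage + k), 3)

print(entropy)   # [0.469 2.    0.   ]
print(weighted)  # [0.313 1.333 0.   ] -- N/(N+k) shrinks low-coverage entropy toward 0
```

With the flags added in `parse_args`, an invocation would look something like `python bin/custom_mpileup.py --alignment sample.bam --reference ref.fasta --prefix sample --k 50` (file names hypothetical), producing `sample.tsv` with the two extra columns.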

docs/output.md (+7 -2)

@@ -577,7 +577,12 @@ __Summary statistics__:

 #### Custom - mpileup like file

-To facilitate the intra host analysis, a mpileup like file is generated. This file contains the depth of every nucleotide at each position of the reference.
+To facilitate the intra-host analysis, an mpileup-like file is generated. This file contains the depth of every nucleotide at each position of the reference, as well as the Shannon entropy and a weighted Shannon entropy based on the following formulae.
+
+- Shannon entropy: $$ H = -\sum_{i=1}^4 p_i \log_2 p_i $$
+- Weighted Shannon entropy: $$ w(H) = \frac{N}{N+k} \cdot H $$
+
+Where $N$ is the total number of bases at a position, $k$ is the pseudocount (default 50), and $p_i$ is the frequency of nucleotide $i$.

 ???- abstract "Output files - variants"

@@ -656,7 +661,7 @@ Variant files are visualized in the MultiQC report.

 The consensus sequences are generated by [`BCFTools`](http://samtools.github.io/bcftools/bcftools.html) or [`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html). The consensus sequences are stored in the directory `consensus/` or in the iterations directory `assembly/polishing/iterations/it#/consensus`.

-`BCFtools` will use the filtered variants file whereas, `iVar` will redetermine the variants to collapse in the consensus using their own workflow, read more about their differences in the [consensus calling section](./workflow/variant_and_refinement.md#consensus-calling).
+`BCFtools` will use the filtered variants file, whereas `iVar` will redetermine the variants to collapse in the consensus using its own workflow; read more about their differences in the [consensus calling section](./workflow/variant_and_refinement.md#4-consensus-calling).

 ???- abstract "Output files - iterations & variants"
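As a worked example of the entropy formulae above (toy numbers, not pipeline output): a position covered by 90 `A` and 10 `C` reads has $N = 100$ and $p = (0.9, 0.1, 0, 0)$, so

$$ H = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469, \qquad w(H) = \frac{100}{100 + 50} \cdot 0.469 \approx 0.313 $$

The same base frequencies at only 10x coverage give $w(H) = \frac{10}{60} \cdot 0.469 \approx 0.078$: the pseudocount $k$ shrinks the entropy estimate at positions with little supporting coverage.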

docs/parameters.md (+1 -1)

@@ -51,7 +51,7 @@ Options related to the trimming, low complexity and host removal steps of the re
 | `host_k2_library` | Kraken2 library(s) required to remove host and contamination <details><summary>Help</summary><small>Only used when no host kraken2 database is specified.</small></details>| human |
 | `skip_host_fastqc` | Skip the fastqc step after host & contaminants were removed | |

-## Metagenomic diveristy
+## Metagenomic diversity

 Parameters used to determine the metagenomic diversity of the sample

docs/usage.md (+1 -1)

@@ -90,7 +90,7 @@ An example samplesheet file consisting of both single- and paired-end data may l
 Viralgenie can in addition to constructing de novo consensus genomes map the sample reads to a series of references. These references are provided through the parameter `--mapping_constraints`.

 An example mapping constraint samplesheet file consisting of 5 references, may look something like the one below.
-> This is for 5 references, 2 of them being a multi-fasta file, only one of the multi-fasta needs to undergo [reference selection](./workflow/variant_and_refinement.md#selection-of-reference).
+> This is for 5 references, 2 of them being multi-fasta files; only one of the multi-fastas needs to undergo [reference selection](./workflow/variant_and_refinement.md#1a-selection-of-reference).


 === "TSV"

docs/workflow/metagenomic_diversity.md (+1 -1)

@@ -13,7 +13,7 @@ Viralgenie offers two main tools for the classification of reads and a summary v

 Feel free to reach out and suggest more classifiers. However, if the main goal of your project is to establish the presence of a virus within a sample and are therefore only focused on metagenomic diversity, have a look at [taxprofiler](https://nf-co.re/taxprofiler/)

-> The read classification can be skipped with the argument `--skip_read_classification`, classifiers should be specified with the parameter `--read_classifiers 'kaiju,kraken2'` (no spaces, no caps). See the [parameters classification section](../parameters.md#read-classification) for all relevant arguments to control the classification steps.
+> The read classification can be skipped with the argument `--skip_read_classification`; classifiers should be specified with the parameter `--read_classifiers 'kaiju,kraken2'` (no spaces, no caps). See the [parameters classification section](../parameters.md#metagenomic-diversity) for all relevant arguments to control the classification steps.

 ## Kaiju
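Put together, a run that keeps read classification and uses both classifiers mentioned in the doc text above might look like the following sketch (the `-profile` and `--input` arguments are assumed from standard nf-core conventions and are not part of this diff):

```bash
nextflow run Joon-Klaps/viralgenie \
    -profile docker \
    --input samplesheet.csv \
    --read_classifiers 'kaiju,kraken2'
```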

nextflow_schema.json (+2 -2)

@@ -203,7 +203,7 @@
             },
             "fa_icon": "fas fa-bahai"
         },
-        "metagenomic_diveristy": {
+        "metagenomic_diversity": {
             "title": "Metagenomic diversity",
             "type": "object",
             "description": "Parameters used to determine the metagenomic diversity of the sample",
@@ -903,7 +903,7 @@
            "$ref": "#/$defs/preprocessing_options"
        },
        {
-           "$ref": "#/$defs/metagenomic_diveristy"
+           "$ref": "#/$defs/metagenomic_diversity"
        },
        {
            "$ref": "#/$defs/assembly"