Add bam support for counting intervals #40

Merged: 96 commits, Dec 3, 2024
b4c2115
first working pass at using noodles to read bam and report counts
donaldcampbelljr Oct 24, 2024
ef19243
add attempt at determining Map length (commented out)
donaldcampbelljr Oct 24, 2024
c7cb01b
add bedGraph count compression
donaldcampbelljr Oct 25, 2024
80bf9c5
finish adding bedGraph count compression
donaldcampbelljr Oct 25, 2024
65e1a62
fix mutability issue for bam reading into chrom vec
donaldcampbelljr Oct 25, 2024
a0dfb75
change sorted to start instead of all for BedGraphToBigWigArgs
donaldcampbelljr Oct 25, 2024
2410c30
Merge pull request #41 from databio/dev_bedGraph_compress_counts
donaldcampbelljr Oct 25, 2024
32f3072
Add bam and bai examples, cargo fmt
donaldcampbelljr Oct 28, 2024
b0dc3a4
begin refactoring to have bam processing be its own workflow
donaldcampbelljr Oct 28, 2024
de8be6d
refactor get_final_chromosomes to uniwig utils
donaldcampbelljr Oct 28, 2024
034fe51
add future instructions
donaldcampbelljr Oct 28, 2024
a944da1
add narrowpeak test
donaldcampbelljr Oct 29, 2024
fb5dccc
Merge remote-tracking branch 'origin/dev_bam_bedgraph_bigtools' into …
donaldcampbelljr Oct 29, 2024
36c2abb
change test back to using temp dir
donaldcampbelljr Oct 29, 2024
2fe8c81
add parallel reading of header from bam file
donaldcampbelljr Oct 29, 2024
ac1acc1
add out selection enum, begin adding match statements before counting
donaldcampbelljr Oct 29, 2024
a67409e
refactor to pass iterator to couting func for bam counting
donaldcampbelljr Oct 30, 2024
eb8f057
set up file for specific file output types immediately upon beginning…
donaldcampbelljr Oct 30, 2024
08d6922
some refactor to use new header/reader for each match of start end core
donaldcampbelljr Oct 31, 2024
64cc3fc
return a boxed file OR std_out
donaldcampbelljr Oct 31, 2024
5102119
cargo fmt
donaldcampbelljr Oct 31, 2024
7bf1df2
attempt to set up bedgraph streaming via stdin to bigtools (does not …
donaldcampbelljr Oct 31, 2024
1da7d0d
attempt using outb: BigWigWrite<File>
donaldcampbelljr Oct 31, 2024
158912d
Revert "attempt using outb: BigWigWrite<File>"
donaldcampbelljr Oct 31, 2024
ea4b141
rethink approach, some errors still need to be resolved
donaldcampbelljr Oct 31, 2024
ab4af2b
more work towards fixed_start_end_counts_bam_to_bw
donaldcampbelljr Oct 31, 2024
cc87462
fix compile errors
donaldcampbelljr Oct 31, 2024
7debb2b
cargo fmt
donaldcampbelljr Oct 31, 2024
2ecf009
cursor sort of works
donaldcampbelljr Nov 1, 2024
b32fd47
better, has erroneous overlaps
donaldcampbelljr Nov 1, 2024
1bab14a
fix zoom field missing and causing compilation error
donaldcampbelljr Nov 4, 2024
903c8fe
fix count issue, change to coordinate_position reporting
donaldcampbelljr Nov 4, 2024
bca6d50
add better error handling for processing bam records
donaldcampbelljr Nov 4, 2024
f48ee2b
attempt to write byte slices to cursor
donaldcampbelljr Nov 4, 2024
6b2377d
fix start path generation
donaldcampbelljr Nov 4, 2024
eb9ab0a
Revert "attempt to write byte slices to cursor"
donaldcampbelljr Nov 4, 2024
09c1583
debug lines for troubleshooting
donaldcampbelljr Nov 4, 2024
901826d
attempt to use new zoom attribute for struct, does not work
donaldcampbelljr Nov 4, 2024
356cf8b
fix missing zoom attribute
donaldcampbelljr Nov 4, 2024
a38737c
fix narrowpeak accumulation counting error
donaldcampbelljr Nov 4, 2024
c2db7b3
delete commented code from older implementation
donaldcampbelljr Nov 11, 2024
ecc8008
add new func create_bw_writer, add ends arm for bam to bw
donaldcampbelljr Nov 11, 2024
a2f7146
add func fixed_core_counts_bam_to_bw
donaldcampbelljr Nov 11, 2024
92d2767
remove debugging items
donaldcampbelljr Nov 11, 2024
24c4373
begin attempt to merge bw, some fields are private NOT public
donaldcampbelljr Nov 11, 2024
44c8926
more work towards merge, private fields need to be public
donaldcampbelljr Nov 12, 2024
e6096fa
all final bw merge using custom fork of bigtools
donaldcampbelljr Nov 12, 2024
64edd0c
attempt some refactor for pipes. Does not work due "does not live lon…
donaldcampbelljr Nov 14, 2024
7145f99
one error remaining, for file_path
donaldcampbelljr Nov 14, 2024
e88b89d
Finally compiles.
donaldcampbelljr Nov 14, 2024
a809a49
more refactoring into consumer thread, receive runtime error "broken …
donaldcampbelljr Nov 14, 2024
13622c2
better error handling but now it hangs
donaldcampbelljr Nov 15, 2024
55a7c39
this works for a bam and chromsizes file for 1 chrom
donaldcampbelljr Nov 15, 2024
d4a6387
More error handling, and removing incomplete files
donaldcampbelljr Nov 15, 2024
193079e
comment out bw merge for now
donaldcampbelljr Nov 15, 2024
073f1c7
limit number of error messages for easier debugging
donaldcampbelljr Nov 15, 2024
205b18c
move some error handling to a pre-processing step BEFORE spawning thr…
donaldcampbelljr Nov 15, 2024
a93ca42
attempt to drop writer if error, still causes hanging
donaldcampbelljr Nov 15, 2024
fe4c0cf
!this actually works! Survives no first record.
donaldcampbelljr Nov 19, 2024
5d1e58a
some clean up
donaldcampbelljr Nov 19, 2024
969e8b6
add END logic for fixed counting
donaldcampbelljr Nov 19, 2024
95c2697
add CORE logic for fixed counting
donaldcampbelljr Nov 19, 2024
3c77b7d
change default zoom to be 5, re-add merging bw files
donaldcampbelljr Nov 19, 2024
6820efc
begin refactor into wrapper functions to make code more readable
donaldcampbelljr Nov 19, 2024
fb08a66
more refactor into wrapper functions to make code more readable
donaldcampbelljr Nov 19, 2024
cbc0921
cargo fmt
donaldcampbelljr Nov 19, 2024
7be1267
begin adding variable_start_end_counts_bam_to_bw
donaldcampbelljr Nov 20, 2024
9260da2
some adjustments variable_start_end_counts_bam_to_bw
donaldcampbelljr Nov 20, 2024
27feeb6
fix logic for variable_start_end_counts_bam_to_bw to keep track of p…
donaldcampbelljr Nov 21, 2024
db05f30
add variable_core_counts_bam_to_bw
donaldcampbelljr Nov 21, 2024
3d16aee
edit error messages to aid in debug
donaldcampbelljr Nov 21, 2024
cd51417
add checking for first record to earlier in process of bam, rethink s…
donaldcampbelljr Nov 21, 2024
4763a60
add debug argument for more verbose messaging for non-existent chroms…
donaldcampbelljr Nov 21, 2024
c41dbfa
add cleaning up bigwig files AFTER merge
donaldcampbelljr Nov 21, 2024
d8de2d1
add output_bam_counts_non_bw
donaldcampbelljr Nov 21, 2024
8679df6
remove zoom args for now
donaldcampbelljr Nov 22, 2024
a75c766
change interval to use prev_count
donaldcampbelljr Nov 25, 2024
12610f3
begin refactor to support bed file as bam output
donaldcampbelljr Nov 25, 2024
e22794c
add bam_to_bed_no_counts and process_bed_in_threads
donaldcampbelljr Dec 2, 2024
e0d6f5c
add combining final bed files
donaldcampbelljr Dec 2, 2024
6c2a538
fix tests by adding debug bool
donaldcampbelljr Dec 2, 2024
a9ee5a5
add assessing flags via bit operations
donaldcampbelljr Dec 2, 2024
2b893de
some clean up
donaldcampbelljr Dec 2, 2024
6914ad7
more clean up
donaldcampbelljr Dec 2, 2024
14c6c3d
more more clean up
donaldcampbelljr Dec 2, 2024
02aaf86
more more more clean up
donaldcampbelljr Dec 2, 2024
2c6763d
last warnings cleaned
donaldcampbelljr Dec 2, 2024
2fe6a09
cargo fmt
donaldcampbelljr Dec 2, 2024
9227974
add counttype argument. Only works for bam processing
donaldcampbelljr Dec 2, 2024
ad0e96f
fix tests
donaldcampbelljr Dec 2, 2024
cc21a18
format
donaldcampbelljr Dec 2, 2024
86352d7
update docs
donaldcampbelljr Dec 3, 2024
211e2d3
update readme
donaldcampbelljr Dec 3, 2024
aa719c5
Merge pull request #47 from databio/dev_bam_rewrite
donaldcampbelljr Dec 3, 2024
534345e
use new release of bigtools
donaldcampbelljr Dec 3, 2024
7b28f0f
Merge branch 'dev' into dev_bam_bedgraph_bigtools
donaldcampbelljr Dec 3, 2024
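
Commit a9ee5a5 mentions assessing flags via bit operations. As a rough illustration only, the sketch below tests individual bits of a BAM record's FLAG field. The bit values come from the SAM specification; the function names and the choice of which records to skip are assumptions made for this example and are not taken from the PR.

```rust
/// FLAG bits as defined by the SAM specification.
pub const FLAG_UNMAPPED: u16 = 0x4;
pub const FLAG_REVERSE: u16 = 0x10;
pub const FLAG_SECONDARY: u16 = 0x100;
pub const FLAG_DUPLICATE: u16 = 0x400;
pub const FLAG_SUPPLEMENTARY: u16 = 0x800;

/// Hypothetical filter: keep a record only if none of the "skip" bits is set.
/// Which bits uniwig actually filters on is not stated in the PR.
pub fn is_countable(flags: u16) -> bool {
    const SKIP: u16 = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY;
    (flags & SKIP) == 0
}

/// True when the read is aligned to the reverse strand.
pub fn is_reverse_strand(flags: u16) -> bool {
    (flags & FLAG_REVERSE) != 0
}
```

For example, a FLAG of 83 (paired, properly paired, reverse strand, first in pair) passes `is_countable` and reports `is_reverse_strand` as true, while a FLAG of 4 (unmapped) is rejected.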
7 changes: 5 additions & 2 deletions gtars/Cargo.toml
@@ -20,12 +20,15 @@ ndarray-npy = "0.8.1"
 ndarray = "0.15.6"
 tempfile = "3.10.1"
 byteorder = "1.5.0"
-noodles = { version = "0.83.0", features = ["bam"] }
+noodles = { version = "0.83.0", features = ["bam", "sam", "bgzf"] }
 bstr = "1.10.0"
 rayon = "1.10.0"
 indicatif = "0.17.8"
+bigtools = "0.5.4"
+tokio = "1.40.0"
+os_pipe = "1.2.1"
+glob = "0.3.1"
-bigtools = "0.5.2"



 [dev-dependencies]
35 changes: 23 additions & 12 deletions gtars/src/uniwig/README.md
@@ -2,10 +2,10 @@

 ### Input Bed File

-Currently, Uniwig accepts a single `.bed` file. It should be sorted by chromosome. This single bed file will be used to create 3 wiggle files (`.wig`):
-`_start.wig` -> accumulations of start coordinates
-`_end.wig` -> accumulations of end coordinates
-`_core.wig` -> accumulations of peaks (between starts and ends)
+Currently, Uniwig accepts a single `.bed` file, `.narrowPeak` file, `.bam` file. It should be sorted by chromosome. This single file will be used to create 3 output files:
+`_start` -> accumulations of start coordinates
+`_end` -> accumulations of end coordinates
+`_core` -> accumulations of peaks (between starts and ends)

 The below script can be used to create a sorted bed file from a directory of bed files:

@@ -39,30 +39,41 @@ The chrom.sizes reference is an optional argument. Uniwig will default to using

 ### Usage
 ```
-Create wiggle files from a BED or BAM file
+Create accumulation files from a BED or BAM file

 Usage: gtars uniwig [OPTIONS] --file <file> --smoothsize <smoothsize> --stepsize <stepsize> --fileheader <fileheader> --outputtype <outputtype>

 Options:
   -f, --file <file> Path to the combined bed file we want to transform or a sorted bam file
-  -t, --filetype <filetype> 'bed' or 'bam' [default: bed]
+  -t, --filetype <filetype> Input file type, 'bed' 'bam' or 'narrowpeak' [default: bed]
   -c, --chromref <chromref> Path to chromreference
   -m, --smoothsize <smoothsize> Integer value for smoothing
   -s, --stepsize <stepsize> Integer value for stepsize
   -l, --fileheader <fileheader> Name of the file
   -y, --outputtype <outputtype> Output as wiggle or npy
-  -h, --help
+  -u, --counttype <counttype> Select to only output start, end, or core. Defaults to all. [default: all]
+  -p, --threads <threads> Number of rayon threads to use for parallel processing [default: 6]
+  -o, --score Count via score (narrowPeak only!)
+  -z, --zoom <zoom> Number of zoom levels (for bw file output only [default: 5]
+  -d, --debug Print more verbose debug messages?
+  -h, --help Print help

 ```

-### Create bigwig files from wiggle files
-```

-Once you have created wiggle files, you can convert them to bigWig files using `wigToBigWig` (see: https://genome.ucsc.edu/goldenPath/help/bigWig.html, https://github.com/ucscGenomeBrowser/kent/tree/master/src/utils/wigToBigWig):
+### Processing bam files to bw

+Example command
 ```
-./wigToBigWig ./test_rust_wig/_end.wig ./sourcefiles/hg38.chrom.sizes ./end_rust.bw
+gtars uniwig -f "test1_chr1_chr2.bam" -m 5 -s 1 -l /myoutput/directory/test_file_name -y bw -t bam -p 6 -c /genome/alias/hg38/fasta/default/hg38.chrom.sizes -u all

 ```


 ### Export types

-Currently only `.wig` and `.npy` are supported as output types.
+For Input types: `.bed` and `.narrowPeak`
+Output types include `.wig`, `.npy`, `.bedGraph`, and `.bw`

+For Input Types: `.bam`
+Output types include `.bw` and `.bed`
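
The README describes the `_start`, `_end`, and `_core` outputs as accumulations of start coordinates, end coordinates, and the span between them. A minimal sketch of that idea, written for illustration and not taken from the uniwig source: it assumes half-open, 0-based intervals on one chromosome, applies no smoothing or step size, and records each end at the last covered base. The `Interval`, `Tracks`, and `accumulate` names are hypothetical.

```rust
/// Half-open interval `[start, end)` on a single chromosome (BED-style, assumed).
#[derive(Debug, Clone, Copy)]
pub struct Interval {
    pub start: usize,
    pub end: usize,
}

/// Per-base counts for the three tracks named in the README.
#[derive(Debug, PartialEq)]
pub struct Tracks {
    pub start: Vec<u32>,
    pub end: Vec<u32>,
    pub core: Vec<u32>,
}

/// For every position in `0..chrom_len`, count how many intervals start there,
/// end there, or cover it. Smoothing and step size are deliberately left out.
pub fn accumulate(intervals: &[Interval], chrom_len: usize) -> Tracks {
    let mut start = vec![0u32; chrom_len];
    let mut end = vec![0u32; chrom_len];
    // Difference array: +1 where coverage begins, -1 just past where it stops.
    let mut diff = vec![0i64; chrom_len + 1];

    for iv in intervals {
        // Skip empty or out-of-range intervals instead of panicking.
        if iv.start >= iv.end || iv.end > chrom_len {
            continue;
        }
        start[iv.start] += 1;
        end[iv.end - 1] += 1; // last covered base, because `end` is exclusive
        diff[iv.start] += 1;
        diff[iv.end] -= 1;
    }

    // A running sum over the difference array yields per-base coverage.
    let mut running = 0i64;
    let core: Vec<u32> = diff[..chrom_len]
        .iter()
        .map(|d| {
            running += d;
            running as u32
        })
        .collect();

    Tracks { start, end, core }
}
```

Calling `accumulate` with the intervals `[2, 5)` and `[4, 7)` on a chromosome of length 8 gives a core track of `[0, 0, 1, 1, 2, 1, 1, 0]`. The difference array keeps the core computation linear in the chromosome length plus the number of intervals, rather than touching every base of every interval.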
19 changes: 17 additions & 2 deletions gtars/src/uniwig/cli.rs
@@ -9,7 +9,7 @@ use crate::uniwig::consts::UNIWIG_CMD;
 pub fn create_uniwig_cli() -> Command {
     Command::new(UNIWIG_CMD)
         .author("DRC")
-        .about("Create wiggle files from a BED or BAM file")
+        .about("Create accumulation files from a BED or BAM file")
         .arg(
             Arg::new("file")
                 .long("file")
@@ -61,6 +61,14 @@ pub fn create_uniwig_cli() -> Command {
                 .help("Output as wiggle or npy")
                 .required(true),
         )
+        .arg(
+            Arg::new("counttype")
+                .long("counttype")
+                .short('u')
+                .default_value("all")
+                .help("Select to only output start, end, or core. Defaults to all.")
+                .required(false),
+        )
         .arg(
             Arg::new("threads")
                 .long("threads")
@@ -81,9 +89,16 @@ pub fn create_uniwig_cli() -> Command {
             Arg::new("zoom")
                 .long("zoom")
                 .short('z')
-                .default_value("0")
+                .default_value("5")
                 .value_parser(clap::value_parser!(i32))
                 .help("Number of zoom levels (for bw file output only")
                 .required(false),
         )
+        .arg(
+            Arg::new("debug")
+                .long("debug")
+                .short('d')
+                .help("Print more verbose debug messages?")
+                .action(ArgAction::SetTrue),
+        )
 }
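
The new `--counttype` argument arrives as a plain string with the default `all`. One way such a value could be turned into a typed selection is sketched below; this is an illustration under assumptions, not the code from this PR. The `CountType` enum, its `selects` helper, and the case-insensitive matching are all hypothetical.

```rust
use std::str::FromStr;

/// Hypothetical typed form of the `--counttype` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CountType {
    Start,
    End,
    Core,
    All,
}

impl Default for CountType {
    /// Mirrors the CLI default of "all".
    fn default() -> Self {
        CountType::All
    }
}

impl FromStr for CountType {
    type Err = String;

    /// Parse the raw argument; accepting any letter case is a choice made here.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "start" => Ok(CountType::Start),
            "end" => Ok(CountType::End),
            "core" => Ok(CountType::Core),
            "all" => Ok(CountType::All),
            other => Err(format!(
                "unknown counttype '{}': expected start, end, core, or all",
                other
            )),
        }
    }
}

impl CountType {
    /// Which of the (start, end, core) tracks should be produced.
    pub fn selects(self) -> (bool, bool, bool) {
        match self {
            CountType::Start => (true, false, false),
            CountType::End => (false, true, false),
            CountType::Core => (false, false, true),
            CountType::All => (true, true, true),
        }
    }
}
```

With this in place, `"core".parse::<CountType>()` yields `Ok(CountType::Core)`, and an unrecognized value produces an error message instead of silently falling back to `all`.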