Allow commented lines in fragment files. #83

ghuls · 2025-02-06T10:54:47Z

Allow commented lines in fragment files as CellRanger-ATAC/CellRangerARC puts some commented lines at the start of the file.
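For illustration, a minimal sketch of the behaviour this PR asks for, written in Python rather than the Rust used by gtars. It assumes a tab-separated fragments file with five columns (chromosome, start, end, barcode, read support) and treats any line starting with `#` as a comment. The `Fragment` type, the `read_fragments` name, and the example header lines are hypothetical and not taken from the gtars code.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Fragment(NamedTuple):
    chrom: str
    start: int
    end: int
    barcode: str
    count: int


def read_fragments(handle: TextIO) -> Iterator[Fragment]:
    """Yield fragments from a tab-separated file, skipping '#' comment lines."""
    for line in handle:
        # Header lines start with '#'; parsing them as data would fail
        # when the start/end columns are converted to integers.
        if line.startswith("#") or not line.strip():
            continue
        chrom, start, end, barcode, count = line.rstrip("\n").split("\t")[:5]
        yield Fragment(chrom, int(start), int(end), barcode, int(count))


# Hypothetical file content: two comment lines followed by two fragments.
example = io.StringIO(
    "# id=example_sample\n"
    "# description=hypothetical header line\n"
    "chr1\t10000\t10250\tAAACGAAAGTCGATAA-1\t1\n"
    "chr1\t10300\t10480\tAAACGAATCCAGTTGA-1\t2\n"
)
fragments = list(read_fragments(example))
print(len(fragments))
```

For a gzipped fragments file, the same function could be fed a handle from `gzip.open(path, "rt")`, where `path` is a placeholder.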

0.2.0 Release - Gibson Les Paul

This reverts commit 800a47a.


nleroy917 · 2025-02-06T13:12:03Z

Hey @ghuls, thanks for the PR. I changed the base to dev since we try to avoid merging directly into master. It helps with releasing other things like the Python bindings.

A follow-up question: are you using the fragment scoring code, or did you just stumble on the code and see that you could add this line?

ghuls · 2025-02-06T13:30:43Z

I was trying out the code but it crashed on my fragments file (failed to parse integer).

We have code that does something similar in pycisTopic: https://github.com/aertslab/pycisTopic/blob/polars_1xx/src/pycisTopic/fragments.py#L1132-L1337 , but that computes the counts per cell barcode (and returns a sparse matrix instead).
I wrote my own wrapper around https://github.com/pyranges/ncls (intersection code behind PyRanges) backed by Polars for doing the intersection work (which is much faster than PyRanges).

nleroy917 · 2025-02-06T13:53:02Z

Gotcha!

That's really interesting. This is similar, I think, but it does counts by pseudobulk, not by cell barcode (unless you treat each barcode as its own pseudobulk 😀). I'm interested in identifying cell-type-specific peaks, so that's why I was doing this by pseudobulk, but it truly was just a stopgap for me to move forward in my analysis and get the output I needed.

This implementation I wrote uses binary interval search (BITS). I'm using the rust-lapper crate for that. It's very fast, and it can be even faster if you parallelize it intelligently.
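For illustration, a minimal Python sketch of the counting idea behind binary interval search. It is not the rust-lapper or gtars code; the `BitsCounter` name and the half-open `[start, end)` convention are assumptions made for the example. The number of stored intervals overlapping a query equals the number that start before the query ends minus the number that end at or before the query starts, and both terms come from a binary search over a sorted array.

```python
from bisect import bisect_left, bisect_right


class BitsCounter:
    """Count overlaps with two sorted arrays (half-open [start, end) intervals)."""

    def __init__(self, intervals):
        # Starts and ends are sorted independently; the pairing is not needed
        # when only the number of overlaps is required.
        self.starts = sorted(start for start, _ in intervals)
        self.ends = sorted(end for _, end in intervals)

    def count_overlaps(self, qstart, qend):
        # Intervals whose start lies before the query end.
        started_before_query_end = bisect_left(self.starts, qend)
        # Intervals that end at or before the query start cannot overlap.
        ended_before_query_start = bisect_right(self.ends, qstart)
        return started_before_query_end - ended_before_query_start


counter = BitsCounter([(100, 200), (150, 300), (400, 500), (450, 460)])
print(counter.count_overlaps(180, 420))
```

Each query costs two binary searches regardless of how many intervals overlap, which is why this approach suits counting rather than enumerating hits.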

The pycisTopic implementation you linked would probably have suited my needs and saved me from writing this; I would just need to group by pseudobulk at the end and aggregate.
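As a rough sketch of that last step, the following Python snippet sums per-barcode region counts into per-pseudobulk counts. The dictionary layout, the `aggregate_by_pseudobulk` name, and the barcode and peak labels are all hypothetical; pycisTopic itself returns a sparse matrix, which is not modelled here.

```python
from collections import defaultdict


def aggregate_by_pseudobulk(counts_per_barcode, barcode_to_pseudobulk):
    """Sum (barcode, region) counts into {pseudobulk: {region: count}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for (barcode, region), n in counts_per_barcode.items():
        group = barcode_to_pseudobulk.get(barcode)
        if group is None:
            # Barcodes without a pseudobulk assignment are ignored.
            continue
        totals[group][region] += n
    return {group: dict(regions) for group, regions in totals.items()}


# Hypothetical per-barcode counts and cluster assignments.
counts = {
    ("BC1", "peak_a"): 2,
    ("BC2", "peak_a"): 1,
    ("BC2", "peak_b"): 3,
    ("BC3", "peak_b"): 1,
    ("BC4", "peak_a"): 5,  # BC4 has no assignment below
}
assignments = {"BC1": "T_cell", "BC2": "T_cell", "BC3": "B_cell"}
pseudobulk_counts = aggregate_by_pseudobulk(counts, assignments)
print(pseudobulk_counts)
```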

Anyways... if you see value here or potential improvements, let me know! Thanks for the PR!

ghuls · 2025-02-10T08:52:18Z

Ah, I was not sure what gtars was using for intersecting regions (whether it was using its own intersection implementation or another Rust crate).

In pycisTopic we also use pseudobulk, but only after having done topic modeling (to alleviate the problem of clustering cells as consensus region counts per cell are quite sparse in sc/snATAC data).

nleroy917 · 2025-02-10T15:31:04Z

We are actually in the process of moving away from BITS/rust-lapper to AIList which is our own implementation. It was originally in C, and we had to port it to Rust though.
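For readers unfamiliar with the idea, here is a deliberately simplified Python sketch of an augmented interval list: intervals sorted by start, with a running maximum of end coordinates used to stop a backward scan early. It omits the sublist decomposition that AIList uses to limit the effect of long intervals, and it is not the gtars implementation. The `SimpleAugmentedList` name and the half-open `[start, end)` convention are assumptions.

```python
from bisect import bisect_left
from itertools import accumulate


class SimpleAugmentedList:
    """Sorted interval list augmented with a running maximum of ends."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.starts = [start for start, _ in self.intervals]
        # max_end[i] is the largest end among intervals[0..i].
        self.max_end = list(accumulate((end for _, end in self.intervals), max))

    def query(self, qstart, qend):
        """Return all stored intervals overlapping [qstart, qend)."""
        hits = []
        # Last interval whose start lies before the query end.
        i = bisect_left(self.starts, qend) - 1
        # Scan backward; once no earlier interval reaches past qstart, stop.
        while i >= 0 and self.max_end[i] > qstart:
            start, end = self.intervals[i]
            if end > qstart:
                hits.append((start, end))
            i -= 1
        hits.reverse()
        return hits


index = SimpleAugmentedList([(1, 100), (5, 10), (20, 30), (40, 50)])
print(index.query(25, 45))
```

In this example the long interval `(1, 100)` keeps the running maximum high, so the scan cannot stop early and visits every earlier interval. That is the case the sublist decomposition is meant to address.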

> In pycisTopic we also use pseudobulk, but only after having done topic modeling (to alleviate the problem of clustering cells as consensus region counts per cell are quite sparse in sc/snATAC data).

Interestingly, I do something super similar. It's by pseudobulk, but only after processing with SnapATAC2. Looking through the pycisTopic docs, it seems like it's assumed that the pseudobulks are already known? SnapATAC2 is really just to get the initial clustering/pseudobulks, which I then call peaks on separately and merge to get the final consensus set.

ghuls · 2025-02-11T09:18:13Z

> We are actually in the process of moving away from BITS/rust-lapper to AIList which is our own implementation. It was originally in C, and we had to port it to Rust though.

How does it compare with rust-lapper in speed?

> In pycisTopic we also use pseudobulk, but only after having done topic modeling (to alleviate the problem of clustering cells as consensus region counts per cell are quite sparse in sc/snATAC data).
>
> Interestingly, I do something super similar. It's by pseudobulk, but only after processing with SnapATAC2. Looking through the pyCisTopic docs, it seems like its assumed that the pseudobulks are already known? SnapATAC2 is really just to get the initial clustering/pseudobulks which I then call peaks on separately, then merge to get the final consensus set.

The pseudobulks are not known in advance. First we create a binary matrix for all cell barcodes over an initial set of consensus regions. This binary matrix is then used for topic modeling, whose output is used to cluster the cells (this works better than clustering on just the sparse binary/count matrix). From this clustering you can create your pseudobulks, and you can refine your consensus regions by combining the consensus regions made per pseudobulk (assuming these cell types are only a small percentage of all your cells, so those regions would be missed if you just took full-bulk consensus regions).
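The first step of that workflow, a binary barcode-by-region matrix, can be sketched in Python as below. This is only an illustration under stated assumptions: it stores the matrix as a set of `(barcode, region_index)` pairs, uses a naive scan over regions per chromosome instead of an interval index, and the `binarize_accessibility` name and the example data are hypothetical. It does not reproduce pycisTopic's code.

```python
from collections import defaultdict


def binarize_accessibility(fragments, regions):
    """Return the set of (barcode, region_index) pairs with at least one overlap.

    fragments: iterable of (chrom, start, end, barcode)
    regions:   list of (chrom, start, end), half-open coordinates
    """
    regions_by_chrom = defaultdict(list)
    for index, (chrom, start, end) in enumerate(regions):
        regions_by_chrom[chrom].append((start, end, index))

    accessible = set()
    for chrom, fstart, fend, barcode in fragments:
        # Naive scan; a real implementation would use an interval index here.
        for rstart, rend, index in regions_by_chrom.get(chrom, ()):
            if fstart < rend and fend > rstart:
                # A set keeps the entry binary however many fragments overlap.
                accessible.add((barcode, index))
    return accessible


# Hypothetical consensus regions and fragments.
regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
fragments = [
    ("chr1", 150, 250, "BC1"),
    ("chr1", 160, 180, "BC1"),
    ("chr1", 590, 700, "BC2"),
    ("chr2", 300, 400, "BC2"),
]
matrix_entries = binarize_accessibility(fragments, regions)
print(sorted(matrix_entries))
```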

nleroy917 and others added 4 commits January 13, 2025 13:43

Merge pull request databio#51 from databio/dev

bde7a3d

0.2.0 Release - Gibson Les Paul

add npy to wig

800a47a

Revert "add npy to wig"

5ce8434

This reverts commit 800a47a.

Allow commented lines in fragment files.

3698cb7

Allow commented lines in fragment files as CellRanger-ATAC/CellRangerARC puts some commented lines at the start of the file.

nleroy917 changed the base branch from master to dev February 6, 2025 13:04

nleroy917 merged commit 9d2ed66 into databio:dev Feb 10, 2025
3 of 4 checks passed
