Allow commented lines in fragment files. #83

Merged: 4 commits into databio:dev on Feb 10, 2025

ghuls
Contributor

@ghuls ghuls commented Feb 6, 2025

Allow commented lines in fragment files as CellRanger-ATAC/CellRangerARC puts some commented lines at the start of the file.
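A minimal sketch of the behaviour this PR asks for, written in Python for illustration rather than taken from the gtars Rust code. It assumes the usual tab-separated fragment columns (chromosome, start, end, barcode, count); the `Fragment` type, the `parse_fragments` helper, and the header lines in the example are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int
    end: int
    barcode: str


def parse_fragments(lines: Iterable[str]) -> Iterator[Fragment]:
    """Yield fragments, skipping blank lines and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Without this check, int() below fails on the header lines.
        if not line or line.startswith("#"):
            continue
        chrom, start, end, barcode = line.split("\t")[:4]
        yield Fragment(chrom, int(start), int(end), barcode)


# Hypothetical file content: two commented header lines, then two fragments.
example = [
    "# id=sample1\n",
    "# description=hypothetical header line\n",
    "chr1\t100\t250\tAAAC-1\t2\n",
    "chr1\t300\t420\tAAAG-1\t1\n",
]
fragments = list(parse_fragments(example))
```

Only the two data lines are parsed; the commented lines are ignored instead of triggering an integer parse error.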

nleroy917 and others added 4 commits January 13, 2025 13:43
0.2.0 Release - Gibson Les Paul
This reverts commit 800a47a.
Allow commented lines in fragment files as CellRanger-ATAC/CellRangerARC
puts some commented lines at the start of the file.
@nleroy917 nleroy917 changed the base branch from master to dev February 6, 2025 13:04
@nleroy917
Member

Hey @ghuls, thanks for the PR. I changed the base to dev since we try to avoid merging right into master. It helps with releasing other things, like the Python bindings.

A follow-up question... are you using the fragment scoring code, or did you just stumble on the code and see you could add this line?

@ghuls
Contributor Author

ghuls commented Feb 6, 2025

I was trying out the code but it crashed on my fragments file (failed to parse integer).

We have code that does something similar in pycisTopic: https://github.com/aertslab/pycisTopic/blob/polars_1xx/src/pycisTopic/fragments.py#L1132-L1337 , but that makes the counts per cell barcode (and returns a sparse matrix instead).
I wrote my own wrapper around https://github.com/pyranges/ncls (intersection code behind PyRanges) backed by Polars for doing the intersection work (which is much faster than PyRanges).

@nleroy917
Member

Gotcha!

That's really interesting. This is similar, I think, but it does counts by pseudobulk, not by cell barcode (unless you set each barcode as its own pseudobulk 😀). I'm interested in identifying cell-type-specific peaks, so that's why I was doing this by pseudobulk, but it truly was just a stopgap for me to move forward in my analysis and get the output I needed.

This implementation I wrote uses binary interval search (BITS). I'm using the rust-lapper crate for that. It's very fast, and it can be even faster if you parallelize it intelligently.
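As a rough illustration of the counting idea behind binary interval search, here is a minimal Python sketch. It is not the rust-lapper API; the `OverlapCounter` class is hypothetical, and it assumes half-open `[start, end)` intervals with `start < end`. The count of overlaps is the total number of intervals minus those that end at or before the query start and those that start at or after the query end, each found with one binary search.

```python
from bisect import bisect_left, bisect_right


class OverlapCounter:
    """Count overlaps with two sorted arrays and two binary searches."""

    def __init__(self, intervals):
        # Starts and ends are sorted independently of each other.
        self.starts = sorted(start for start, _ in intervals)
        self.ends = sorted(end for _, end in intervals)

    def count(self, qstart, qend):
        # Intervals that end at or before the query start cannot overlap.
        ends_before = bisect_right(self.ends, qstart)
        # Intervals that start at or after the query end cannot overlap.
        starts_after = len(self.starts) - bisect_left(self.starts, qend)
        return len(self.starts) - ends_before - starts_after


counter = OverlapCounter([(100, 200), (150, 300), (400, 500)])
n_wide = counter.count(180, 410)    # touches all three intervals
n_narrow = counter.count(200, 400)  # only (150, 300) overlaps
n_none = counter.count(600, 700)    # past every interval
```

This only counts overlaps; reporting which intervals overlap needs more bookkeeping than the two sorted arrays shown here.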

The pycisTopic implementation you linked probably would have suited my needs and saved me from writing this; I'd just need to group by pseudobulk at the end and aggregate.
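The "group by pseudobulk at the end and aggregate" step could look roughly like the Python sketch below. It assumes per-barcode counts are available as plain lists over the same regions; the function name, barcodes, and cluster labels are all hypothetical, and neither pycisTopic nor gtars is used here.

```python
def aggregate_by_pseudobulk(counts_per_barcode, barcode_to_pseudobulk):
    """Sum per-barcode region counts into per-pseudobulk region counts."""
    totals = {}
    for barcode, counts in counts_per_barcode.items():
        group = barcode_to_pseudobulk.get(barcode)
        if group is None:
            # Barcodes without a pseudobulk assignment are dropped.
            continue
        if group not in totals:
            totals[group] = list(counts)
        else:
            totals[group] = [a + b for a, b in zip(totals[group], counts)]
    return totals


# Hypothetical counts over three regions for four barcodes.
counts_per_barcode = {
    "AAAC-1": [1, 0, 2],
    "AAAG-1": [0, 1, 1],
    "AAAT-1": [3, 0, 0],
    "AAAA-1": [5, 5, 5],  # not assigned to any pseudobulk
}
barcode_to_pseudobulk = {
    "AAAC-1": "T_cell",
    "AAAG-1": "T_cell",
    "AAAT-1": "B_cell",
}
pseudobulk_counts = aggregate_by_pseudobulk(counts_per_barcode, barcode_to_pseudobulk)
```

Mapping every barcode to itself reproduces the per-barcode counts, which is the "each barcode as its own pseudobulk" case.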

Anyways... if you see value here or potential improvements, let me know! Thanks for the PR!

@ghuls
Contributor Author

ghuls commented Feb 10, 2025

Ah, I was not sure what gtars was using for intersecting regions (whether it was using its own intersection implementation or another Rust crate).

In pycisTopic we also use pseudobulk, but only after having done topic modeling (to alleviate the problem of clustering cells as consensus region counts per cell are quite sparse in sc/snATAC data).

@nleroy917
Member

We are actually in the process of moving away from BITS/rust-lapper to AIList which is our own implementation. It was originally in C, and we had to port it to Rust though.

> In pycisTopic we also use pseudobulk, but only after having done topic modeling (to alleviate the problem of clustering cells as consensus region counts per cell are quite sparse in sc/snATAC data).

Interestingly, I do something super similar. It's by pseudobulk, but only after processing with SnapATAC2. Looking through the pycisTopic docs, it seems like it's assumed that the pseudobulks are already known? SnapATAC2 is really just to get the initial clustering/pseudobulks, which I then call peaks on separately and then merge to get the final consensus set.
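One simple way to read "merge to get the final consensus set" is a plain union-merge of overlapping peaks across pseudobulks, sketched below in Python. This is an assumption for illustration only: the thread does not say which merging strategy is actually used, and the function name and peak coordinates are hypothetical.

```python
def merge_peak_sets(peak_sets):
    """Merge overlapping or touching peaks from several per-pseudobulk peak sets."""
    all_peaks = sorted(peak for peaks in peak_sets for peak in peaks)
    merged = []
    for chrom, start, end in all_peaks:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and the peak starts before the previous one ends:
            # extend the previous consensus region.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical peaks called separately on two pseudobulks.
t_cell_peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
b_cell_peaks = [("chr1", 150, 250), ("chr2", 100, 200)]
consensus = merge_peak_sets([t_cell_peaks, b_cell_peaks])
```

Peaks seen in only one pseudobulk, such as the `chr2` peak here, survive into the consensus set, which is the point of calling peaks per pseudobulk first.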

@nleroy917 nleroy917 merged commit 9d2ed66 into databio:dev Feb 10, 2025
3 of 4 checks passed
@ghuls
Contributor Author

ghuls commented Feb 11, 2025

> We are actually in the process of moving away from BITS/rust-lapper to AIList which is our own implementation. It was originally in C, and we had to port it to Rust though.

How does it compare with rust-lapper in speed?

> In pycisTopic we also use pseudobulk, but only after having done topic modeling (to alleviate the problem of clustering cells as consensus region counts per cell are quite sparse in sc/snATAC data).

> Interestingly, I do something super similar. It's by pseudobulk, but only after processing with SnapATAC2. Looking through the pycisTopic docs, it seems like it's assumed that the pseudobulks are already known? SnapATAC2 is really just to get the initial clustering/pseudobulks which I then call peaks on separately, then merge to get the final consensus set.

The pseudobulks are not known in advance. First we create a binary matrix for all cell barcodes over an initial set of consensus regions. This binary matrix is then used in topic modeling, whose output is used to cluster the cells (this works better than clustering on just the sparse binary/count matrix). From this clustering you can create your pseudobulks, and you can refine your consensus regions by combining consensus regions made per pseudobulk (useful when some cell types are only a small percentage of all your cells, so their regions would be missed if you just took full-bulk consensus regions).
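The first step described above, a binary barcode-by-region matrix, can be sketched in plain Python as follows. This is not the pycisTopic implementation (which, per the earlier comment, returns a sparse matrix and uses NCLS with Polars); the function name and data are hypothetical, and the naive nested loop is only meant to show what the matrix contains.

```python
def binary_accessibility_matrix(fragments, regions):
    """Return (barcodes, matrix) where matrix[i][j] is 1 if barcode i
    has at least one fragment overlapping region j, else 0."""
    barcodes = sorted({barcode for _, _, _, barcode in fragments})
    row_of = {barcode: i for i, barcode in enumerate(barcodes)}
    matrix = [[0] * len(regions) for _ in barcodes]
    for chrom, start, end, barcode in fragments:
        for j, (r_chrom, r_start, r_end) in enumerate(regions):
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == r_chrom and start < r_end and r_start < end:
                matrix[row_of[barcode]][j] = 1
    return barcodes, matrix


# Hypothetical initial consensus regions and fragments.
regions = [("chr1", 100, 200), ("chr1", 500, 600)]
fragments = [
    ("chr1", 150, 180, "AAAC-1"),
    ("chr1", 160, 190, "AAAC-1"),  # second hit in the same region still gives 1
    ("chr1", 550, 650, "AAAG-1"),
    ("chr2", 100, 200, "AAAG-1"),  # no region on chr2, so ignored
]
barcodes, matrix = binary_accessibility_matrix(fragments, regions)
```

The topic modeling and clustering steps that follow are not shown; they would take this matrix as input.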
