
Memory leak or stall? #40

Closed
schorlton opened this issue Sep 15, 2022 · 16 comments
@schorlton

Thanks for falco!

Running v1.2.0 installed from bioconda on a nanopore FASTQ

falco --format fastq -skip-report -t 1 -skip-summary nanopore.fastq.gz 
[limits]	using default limit cutoffs (no file specified)
[adapters]	using default adapters (no file specified)
[contaminants]	using default contaminant list (no file specified)
[Thu Sep 15 10:26:59 2022] Started reading file nanopore.fastq.gz
[Thu Sep 15 10:27:00 2022] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Thu Sep 15 10:27:13 2022] Finished reading file
[Thu Sep 15 10:27:13 2022] Writing text report to ./fastqc_data.txt

It looks like it generates the fastqc_data.txt properly, but then after that, it consumes over 32GB of RAM over several minutes until I kill it...
Is this expected? For comparison, FastQC processes the file in 18s within 1GB RAM.

Here are the stats of the file:

seqkit stats nanopore.fastq.gz 
file              format  type  num_seqs      sum_len  min_len  avg_len  max_len
nanopore.fastq.gz  FASTQ   DNA     30,720  204,534,138      100    6,658   35,768
@guilhermesena1
Collaborator

Hi

This may be a strange thing to ask, but would you be able to try running the same command without --format fastq (letting falco figure out the format) and let us know if you hit the same problem? It may help us zero in on the cause.

thank you!

@schorlton
Author

Thanks for your quick response! Running without --format fastq hit the same problem. I also tried running on the ungzipped file without --format fastq and that also hit the same issue. Thanks!

@guilhermesena1
Collaborator

guilhermesena1 commented Sep 15, 2022

Hm that's very strange indeed. I had a suspicion that --format fastq was forcing the program to interpret the gzipped file as uncompressed, thus leading to problems.

Would it be too much to ask if you could share either the entire dataset or a small subset that causes the problem on your machine? I'd be very interested in seeing if I can reproduce it too; this may be more complicated than I thought. I understand it's tricky to share private data, so no worries if you cannot. If it's more convenient, you can always send it to me at desenabr[at]usc[dot]edu.

Finally, and this is admittedly a hail mary: does the program also fail if you don't use any flags at all, i.e. just falco nanopore.fastq.gz?

Thank you!

@andrewdavidsmith
Collaborator

@schorlton If you are only able to share part of the dataset, please include something with the same structure for the read names unless they are really short and contiguous. Also, try to include any special characters that would be in the original file, for example by obtaining the smaller part with command line tools. Also, the most helpful part would have similar variability in lengths of reads. Thanks!

@schorlton
Author

Running falco nanopore.fastq.gz hits the same issue.
I've managed to reproduce on the first random public nanopore sequencing file I pulled from SRA:
SRR21529942.fastq.gz
Looking forward to your analysis.

@andrewdavidsmith
Collaborator

andrewdavidsmith commented Sep 15, 2022

I can report that, using the provided file:

  • The most recent release worked for me (1.2.0)
  • The most recent source worked for me
  • The most recent version on conda (1.2.0) did not work, nor did 1.1.0 from conda

If @guilhermesena1 notices the same, we will need to find the problem and update conda. In the meantime, @schorlton if you are in an environment where you can build falco from source, maybe you could try that using the source here. If you want to do that and encounter problems we can try to help. Otherwise we hope you'll wait for the fix on conda, which we will try to get done asap.

@schorlton
Author

Glad I'm not going crazy 😝
Thanks for your investigation - I'll stay tuned for a conda fix as I can use FastQC in the interim.
-Sam

@andrewdavidsmith
Collaborator

Problem identified. Hopefully we can have a fix soon on conda. If you need it before we can do that, you'll have to try building from source -- preferably the release I linked above.

@guilhermesena1
Collaborator

The problem can also be reproduced on a clone/release if we delete/rename the files in the Configuration directory, so it may have something to do with the defaults.

@schorlton if you need a quick fix, you can point falco at the configuration files inside your conda installation as follows:

$ falco -c /path/to/your/conda/opt/falco-1.2.0/Configuration/contaminant_list.txt \
        -a /path/to/your/conda/opt/falco-1.2.0/Configuration/adapter_list.txt \
        -l /path/to/your/conda/opt/falco-1.2.0/Configuration/limits.txt \
        [the rest of your parameters here]

I think when falco is compiled internally by conda (since only the binary is shipped), it still isn't finding the Configuration folder under /opt. The way fastqc handles this is to compile everything at the conda root directory, move everything to opt, and then make a symlink under the bin directory that points to the binary under opt. I didn't like that approach very much and was trying to avoid it, but I guess if we want to emulate their behavior, we will emulate their behavior. We'll try to fix this, but it may take a bit longer since it goes through a pull request.

@guilhermesena1
Collaborator

I think I found the source of the problem and pushed a possible fix at bca5f11

When the default values were used, the shortest adapter size wasn't being set properly, and subtracting 1 from it led to an unsigned integer underflow. You should also be able to run the code using just the -a flag from the command above. In any case, we'll roll this into a new release asap and update conda.

Sorry for the inconvenience and thank you again for sharing the issue with us!

@schorlton
Author

Thanks for the quick fix! Will test out the conda update once released.

@schorlton
Author

Unfortunately now getting a new error on 1.2.1 from bioconda:

falco -skip-summary -skip-report nanopore.fastq.gz 
[limitst]	using file /opt/conda/opt/falco-1.2.1/Configuration/limits.txt
instruction for limit duplication not found in file /opt/conda/opt/falco-1.2.1

@guilhermesena1
Collaborator

Yep, trying to figure this one out.
What's weird is that it is reading the correct file. Even more so: if you do

falco -l /opt/conda/opt/falco-1.2.1/Configuration/limits.txt [etc etc]

it runs OK, but not if you don't specify the file, even when the path is correct.

@guilhermesena1
Collaborator

guilhermesena1 commented Sep 19, 2022

Alright, I know better than to get my hopes up, but we made some changes to the most recent release and merged them into conda to address the issue.

As far as I can tell, a conda download of falco puts the config files at opt/falco/Configuration and the falco binary in conda correctly uses these files to process the data. @schorlton if you are still having issues and are still interested in testing the most recent changes, we would love to know if things are running correctly on your end too. Just bear in mind you may have to run conda remove falco prior to updating.

@schorlton
Author

It works - thank you! The only remaining issue is specifying the file format. Specifying --format fastq when the file is compressed FASTQ causes a crash. Omitting this works.

@guilhermesena1
Collaborator

We will create another issue to address this. Ideally fq and fq.gz should be treated equally, but the reading process differs between them, and technically passing --format fastq reports on the output that the file is being read as uncompressed, so failing makes sense since the input is then technically malformed. Regardless, this does warrant clarification.

I'll close this issue since the initial problem has been resolved, but any further problems with the stalling can be reported by reopening it.
