Ingest currently blocked by fetch-from-ncbi-virus #178

Closed
joverlee521 opened this issue Sep 11, 2023 · 2 comments · Fixed by #179
Labels
bug Something isn't working

Comments

@joverlee521
Contributor

Current Behavior

Because of the behavior described in nextstrain/ingest#18, the ingest pipeline's fetch from NCBI Virus no longer includes sequences. As a result, every record is dropped in the pipeline and the final outputs at s3://nextstrain-data/files/workflows/monkeypox/ are empty. This was first flagged internally by downstream CZI consumers on Slack.

We don't have insight into the undocumented NCBI Virus API and whether this new behavior is intentional, so the best thing might be to just switch to the NCBI Datasets CLI to fetch data.

@joverlee521 joverlee521 added the bug Something isn't working label Sep 11, 2023
@tsibley
Member

tsibley commented Sep 11, 2023

This sort of thing is why I want monitoring! (like what we did in SFS)

If we were monitoring the recordcount metadata we include, for example, we could have quickly caught this ourselves.

$ curl -I http://data.nextstrain.org/files/workflows/monkeypox/sequences.fasta.xz | grep recordcount:
x-amz-meta-recordcount: 0
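
A rough sketch of what such a check could look like, assuming the `x-amz-meta-recordcount` header shown above is the thing to watch. The function names `recordcount_from_headers` and `check_recordcount` are made up for illustration and are not part of any existing Nextstrain tooling; how the alert gets delivered is left out entirely.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the x-amz-meta-recordcount header (header name matched
# case-insensitively, trailing carriage returns stripped).
recordcount_from_headers() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-amz-meta-recordcount" { print $2; exit }'
}

# Hypothetical check: succeed only if the header is present and holds a
# positive integer; otherwise print an alert to stderr and return 1.
check_recordcount() {
  count="$(recordcount_from_headers)"
  case "$count" in
    ''|*[!0-9]*)
      echo "ALERT: missing or non-numeric recordcount: '$count'" >&2
      return 1
      ;;
  esac
  if [ "$count" -gt 0 ]; then
    echo "ok: recordcount=$count"
  else
    echo "ALERT: recordcount is 0" >&2
    return 1
  fi
}
```

It could be wired to the same request as above, e.g. `curl -sI http://data.nextstrain.org/files/workflows/monkeypox/sequences.fasta.xz | check_recordcount`, and run on a schedule so that a non-zero exit status triggers whatever notification is available.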

@AngieHinrichs

I just hit the same problem because I basically copied your undocumented API query format for my non-SARS-CoV-2 pipelines 😆 .

FWIW for SARS-CoV-2 I've been using NCBI's datasets command. I've been using some SARS-CoV-2-only features in datasets, but it seems to provide basic info for other virus genomes now too. For example, to get only FASTA and BioSample .jsonl for norovirus (metadata API query still works), I can run this command:

datasets download virus genome taxon norovirus --include genome,biosample

And then (after waiting through modem-like download speeds) I get a file ncbi_dataset.zip and then extract fasta from it like this:

unzip ncbi_dataset.zip

cp ncbi_dataset/data/genomic.fna ....
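
For what it's worth, the steps above could be wrapped so that an empty download fails loudly instead of silently flowing through a pipeline. This is only a sketch: `count_fasta_records` and `fetch_and_verify` are hypothetical names, the `datasets` invocation and the `ncbi_dataset/data/genomic.fna` path are copied from the commands above, and `-o` just tells `unzip` to overwrite files from a previous run.

```shell
# Hypothetical helper: count FASTA records (lines starting with '>') in a file.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_fasta_records() {
  grep -c '^>' "$1" || true
}

# Hypothetical wrapper: download a virus taxon with the datasets CLI,
# unpack it, and fail if the FASTA contains no records.
fetch_and_verify() {
  taxon="$1"
  datasets download virus genome taxon "$taxon" --include genome,biosample || return 1
  unzip -o ncbi_dataset.zip || return 1
  n="$(count_fasta_records ncbi_dataset/data/genomic.fna)"
  if [ "${n:-0}" -gt 0 ]; then
    echo "fetched $n sequences for $taxon"
  else
    echo "ALERT: no sequences fetched for $taxon" >&2
    return 1
  fi
}
```

Calling `fetch_and_verify norovirus` would then return a non-zero status whenever the download contains no sequences.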
