Ingest currently blocked by `fetch-from-ncbi-virus` #178

joverlee521 · 2023-09-11T21:26:12Z

Current Behavior

Because of the behavior described in nextstrain/ingest#18, the ingest pipeline does not include sequences in its fetch from NCBI Virus. As a result, all of the records are dropped in the pipeline and the final outputs to s3://nextstrain-data/files/workflows/monkeypox/ are empty. This was first flagged internally by downstream CZI consumers on Slack.

We don't have insight into the undocumented NCBI Virus API and whether this new behavior is intentional, so the best thing might be to just switch to the NCBI Datasets CLI to fetch data.

tsibley · 2023-09-11T22:04:53Z

This sort of thing is why I want monitoring! (like what we did in SFS)

If we were monitoring the recordcount metadata we include, for example, we could have quickly caught this ourselves.

$ curl -I http://data.nextstrain.org/files/workflows/monkeypox/sequences.fasta.xz | grep recordcount:
x-amz-meta-recordcount: 0
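
A minimal sketch of what such a check could look like, assuming the only signal is the `x-amz-meta-recordcount` header shown above. The function names (`parse_record_count`, `is_healthy`, `fetch_headers`) and the `minimum` threshold are hypothetical and not part of any existing Nextstrain tooling.

```python
from typing import Mapping, Optional
from urllib.request import Request, urlopen

# Header name taken from the curl output above; HTTP header names are
# case-insensitive, so they are compared in lowercase.
RECORDCOUNT_HEADER = "x-amz-meta-recordcount"


def parse_record_count(headers: Mapping[str, str]) -> Optional[int]:
    """Return the record count from response headers, or None if absent or malformed."""
    for name, value in headers.items():
        if name.lower() == RECORDCOUNT_HEADER:
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def is_healthy(headers: Mapping[str, str], minimum: int = 1) -> bool:
    """Treat a missing, malformed, or too-small record count as unhealthy.

    `minimum` is an assumed threshold; a real monitor might compare
    against the previous run's count instead.
    """
    count = parse_record_count(headers)
    return count is not None and count >= minimum


def fetch_headers(url: str, timeout: float = 30.0) -> dict:
    """Issue a HEAD request (the equivalent of `curl -I`) and return the headers."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

A scheduled job could call `is_healthy(fetch_headers(url))` against the `sequences.fasta.xz` URL above and alert when it returns `False`. With the headers shown in the curl output, the check would have failed because the count is 0.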

AngieHinrichs · 2023-09-11T22:18:17Z

I just hit the same problem because I basically copied your undocumented API query format for my non-SARS-CoV-2 pipelines 😆 .

FWIW for SARS-CoV-2 I've been using NCBI's datasets command. I've been using some SARS-CoV-2-only features in datasets, but it seems to provide basic info for other virus genomes now too. For example, to get only FASTA and BioSample .jsonl for norovirus (metadata API query still works), I can run this command:

datasets download virus genome taxon norovirus --include genome,biosample

And then (after waiting through modem-like download speeds) I get a file ncbi_dataset.zip and then extract fasta from it like this:

unzip ncbi_dataset.zip

cp ncbi_dataset/data/genomic.fna ....
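
For anyone scripting the unzip-and-copy step, here is a rough Python equivalent. It assumes the archive layout shown above (`ncbi_dataset/data/genomic.fna` inside `ncbi_dataset.zip`); the helper names `extract_fasta` and `count_fasta_records` are made up for illustration, and the destination path is whatever the caller chooses.

```python
import shutil
import zipfile
from pathlib import Path

# Member path as seen in the unzip/cp commands above.
FASTA_MEMBER = "ncbi_dataset/data/genomic.fna"


def extract_fasta(archive, destination, member: str = FASTA_MEMBER) -> Path:
    """Copy the FASTA member out of the zip without unpacking everything else."""
    destination = Path(destination)
    with zipfile.ZipFile(archive) as zf:
        if member not in zf.namelist():
            raise FileNotFoundError(f"{member} not found in {archive}")
        # Stream the member to disk so large genome sets are not held in memory.
        with zf.open(member) as src, destination.open("wb") as dst:
            shutil.copyfileobj(src, dst)
    return destination


def count_fasta_records(path) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path, "rb") as handle:
        return sum(1 for line in handle if line.startswith(b">"))
```

Checking `count_fasta_records` on the extracted file right after download is a cheap way to notice an empty fetch before it propagates to the final outputs.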

joverlee521 added the bug Something isn't working label Sep 11, 2023

joverlee521 mentioned this issue Sep 11, 2023

ingest: Switch to NCBI Datasets CLI to fetch data #179

Merged

3 tasks

nextstrain-bot added this to Nextstrain planning (archived) Sep 12, 2023

github-project-automation bot moved this to New in Nextstrain planning (archived) Sep 12, 2023

joverlee521 closed this as completed in #179 Sep 12, 2023

github-project-automation bot moved this from New to Done in Nextstrain planning (archived) Sep 12, 2023

joverlee521 mentioned this issue Sep 12, 2023

Ingest currently blocked by NCBI Virus changes nextstrain/rsv#36

Closed
