fetch-from-ncbi-virus does not include nucleotide sequences #18

joverlee521 · 2023-09-11T21:01:09Z

Current Behavior

Previously, the nucleotide sequence for each record was included as `sequence`, since we pull the nucleotide sequence as part of the NCBI Virus URL:

ingest/ncbi-virus-url

Line 78 in c97df23

('sequence', 'Nucleotide_seq'),

However, the monkeypox ingest workflow has been returning empty values for sequences.
Looking back at previous versions of s3://nextstrain-data/files/workflows/monkeypox/genbank.ndjson.xz:

2023-09-05 (version id c.cdLtg8OxV1Pyl8SSlWE1_dqKpQBT.z) - still included sequences for all records
2023-09-06 (version id PaqGNfdlQXH7eV9b.WVpaOm5ioQ1pVD2) - 240/6751 records did not include sequence
2023-09-07 (version id nImnSdA8NDGCJdVuDuMmsoFB_hveCkCC) - 6071/6762 records did not include sequence
2023-09-08 (version id UZ9VwlVMqVfAeP0sMux9qE4H1e6dGZRP) - none of the 6809 records included sequences
2023-09-09 (version id VWxHnqlAUVEGRU4_ngYsJuctK7Tftyyn) - none of the 6807 records included sequences
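Counts like the ones above can be reproduced with a short script. The following is a minimal sketch, assuming each line of the decompressed NDJSON is one record and that the sequence is stored under a `sequence` key; the function name `count_missing_sequences` is hypothetical and not part of the ingest scripts.

```python
import json


def count_missing_sequences(ndjson_lines, field="sequence"):
    """Return (missing, total) for records whose `field` is absent or empty.

    `ndjson_lines` is any iterable of strings, one JSON record per line.
    Blank lines are skipped and do not count as records.
    """
    missing = 0
    total = 0
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        record = json.loads(line)
        # A missing key, None, or an empty string all count as "no sequence".
        if not record.get(field):
            missing += 1
    return missing, total
```

For a downloaded copy of `genbank.ndjson.xz`, the lines could be read with `lzma.open(path, "rt")` from the standard library and passed straight to the function.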

I had wondered if there was a bug in the centralized ingest script, but running the recently deleted monkeypox fetch-from-genbank script returns the same results without sequences.

NCBI Virus observations

The nucleotide sequence field name has not changed, since you can still download the sequences in a FASTA file with the same field name:

https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/?fq={!tag=SeqType_s}SeqType_s:("Nucleotide")&fq=VirusLineageId_ss:(10244)&cmd=download&sort=SourceDB_s desc,CreateDate_dt desc,id asc&dlfmt=fasta&fl=AccVer_s,Definition_s,Nucleotide_seq

However, downloading in CSV or JSON format results in an empty column for Nucleotide_seq.
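To compare formats, it helps to build the same query with only `dlfmt` changed. This is a rough sketch that reuses the query parameters from the FASTA URL above; the helper name `build_download_url` is hypothetical, and any `dlfmt` value other than `fasta` (for example `csv`) is an assumption, since the API is undocumented.

```python
from urllib.parse import urlencode

# Endpoint taken from the FASTA download URL quoted in this issue.
BASE_URL = "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/"


def build_download_url(taxon_id, fields, dlfmt):
    """Build a download URL that differs from the one above only in its arguments."""
    params = [
        ("fq", '{!tag=SeqType_s}SeqType_s:("Nucleotide")'),
        ("fq", f"VirusLineageId_ss:({taxon_id})"),
        ("cmd", "download"),
        ("sort", "SourceDB_s desc,CreateDate_dt desc,id asc"),
        ("dlfmt", dlfmt),
        ("fl", ",".join(fields)),
    ]
    # urlencode percent-encodes the braces, quotes, and spaces in the values.
    return f"{BASE_URL}?{urlencode(params)}"


fields = ["AccVer_s", "Definition_s", "Nucleotide_seq"]
fasta_url = build_download_url(10244, fields, "fasta")
```

Fetching both variants and diffing the returned columns would show whether `Nucleotide_seq` is populated for a given format.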


joverlee521 · 2023-09-11T21:39:06Z

If the nucleotide sequences are only available for the FASTA file download, we could change the URL to download the FASTA format and then parse the FASTA file into NDJSON. However, I have not tested if all of the fields we currently download are available for the FASTA file download.
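A minimal sketch of that FASTA-to-NDJSON step could look like the following. It assumes the requested fields appear in the FASTA header separated by `|`, which I have not verified against the actual download; `fasta_to_ndjson` and the output key names are hypothetical.

```python
import json


def _make_record(header, chunks, header_fields, sep):
    """Combine one FASTA header and its sequence lines into a JSON string."""
    # Split at most len(header_fields) - 1 times so the last field keeps any extra separators.
    values = [v.strip() for v in header.split(sep, len(header_fields) - 1)]
    record = dict(zip(header_fields, values))
    record["sequence"] = "".join(chunks)
    return json.dumps(record)


def fasta_to_ndjson(fasta_lines, header_fields, sep="|"):
    """Yield one NDJSON line (a JSON string) per FASTA record.

    `header_fields` names the values in the header, in order.
    The separator is an assumption about the download format.
    """
    header = None
    chunks = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks, header_fields, sep)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks, header_fields, sep)
```

This only recovers the fields present in the header, so it does not answer the open question of whether every field we currently download is available in the FASTA format.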

joverlee521 · 2023-09-14T00:09:27Z

The NCBI Virus API has worked ~well for us the past 3 years, but this change really makes it less appealing to work with.
With the switch over to NCBI Datasets in monkeypox and rsv, the only missing fields from the NCBI Virus API are title and publications which we do not generally use in our pipelines anyways.

I think we can drop this route for NCBI data and just recommend the NCBI Datasets CLI.

Due to #18, the NCBI Virus API is more of a hassle to use. The data from NCBI Datasets CLI covers the standard fields that we use in pathogen ingests, so we can drop the use of the undocumented NCBI Virus API. If any pathogen needs additional custom fields that are not available through NCBI Datasets, the pipeline can use fetch-from-ncbi-entrez and parse the GenBank file.

Dropping due to the changes of the NCBI Virus API described in nextstrain/ingest#18 and we are dropping the scripts for this in nextstrain/ingest#19.

@AngieHinrichs

Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>

joverlee521 added the bug label Sep 11, 2023

joverlee521 mentioned this issue Sep 11, 2023

Ingest currently blocked by fetch-from-ncbi-virus nextstrain/mpox#178

Closed

nextstrain-bot added this to Nextstrain planning (archived) Sep 12, 2023

github-project-automation bot moved this to New in Nextstrain planning (archived) Sep 12, 2023

joverlee521 mentioned this issue Sep 12, 2023

Ingest currently blocked by NCBI Virus changes nextstrain/rsv#36

Closed

joverlee521 mentioned this issue Sep 14, 2023

Remove NCBI Virus #19

Merged

2 tasks

joverlee521 closed this as completed in #19 Sep 14, 2023

github-project-automation bot moved this from New to Done in Nextstrain planning (archived) Sep 14, 2023
