fetch-from-ncbi-virus does not include nucleotide sequences #18

Closed
joverlee521 opened this issue Sep 11, 2023 · 2 comments · Fixed by #19
Labels
bug Something isn't working

Comments

@joverlee521
Contributor

Current Behavior

Previously, each record included its nucleotide sequence as sequence, since we request the nucleotide sequence field as part of the NCBI Virus URL:

('sequence', 'Nucleotide_seq'),

However, the monkeypox ingest workflow has been returning empty values for sequences.
Looking back at previous versions of s3://nextstrain-data/files/workflows/monkeypox/genbank.ndjson.xz:

2023-09-05 (version id c.cdLtg8OxV1Pyl8SSlWE1_dqKpQBT.z) - still included sequences for all records
2023-09-06 (version id PaqGNfdlQXH7eV9b.WVpaOm5ioQ1pVD2) - 240/6751 records did not include sequence
2023-09-07 (version id nImnSdA8NDGCJdVuDuMmsoFB_hveCkCC) - 6071/6762 records did not include sequence
2023-09-08 (version id UZ9VwlVMqVfAeP0sMux9qE4H1e6dGZRP) - none of the 6809 records included sequences
2023-09-09 (version id VWxHnqlAUVEGRU4_ngYsJuctK7Tftyyn) - none of the 6807 records included sequences
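A per-version count like the ones above can be reproduced with a short script. This is a minimal sketch, not the script actually used for this issue: it assumes each NDJSON line is one record and that a missing sequence shows up as an absent, null, or empty `sequence` value. The function names are hypothetical.

```python
import json
import lzma


def count_missing_sequences(lines):
    """Return (missing, total) for an iterable of NDJSON lines."""
    missing = total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        # Assumption: an absent, null, or empty `sequence` counts as missing.
        if not record.get("sequence"):
            missing += 1
    return missing, total


def count_missing_in_file(path):
    """Count missing sequences in an xz-compressed NDJSON file."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return count_missing_sequences(handle)
```

Running `count_missing_in_file("genbank.ndjson.xz")` on a locally downloaded copy of each object version would give the missing and total counts for that day.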

I had wondered if there was a bug in the centralized ingest script, but running the recently deleted monkeypox fetch-from-genbank script returns the same results without sequences.

NCBI Virus observations

The nucleotide sequence field name has not changed, since you can still download the sequences in a FASTA file with the same field name:

https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/?fq={!tag=SeqType_s}SeqType_s:("Nucleotide")&fq=VirusLineageId_ss:(10244)&cmd=download&sort=SourceDB_s desc,CreateDate_dt desc,id asc&dlfmt=fasta&fl=AccVer_s,Definition_s,Nucleotide_seq

However, downloading in CSV or JSON format results in an empty column for Nucleotide_seq.
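To compare formats side by side, the download URL can be built from its query parameters. The sketch below only constructs the URL shown above and makes no request. The function name is hypothetical, and any `dlfmt` value other than `fasta` (for example `csv`) is an assumption, since the exact values are not stated in this issue and the API is undocumented.

```python
from urllib.parse import urlencode

# Endpoint taken from the FASTA download URL quoted in this issue.
ENDPOINT = "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/"


def build_download_url(dlfmt, fields, taxon_id=10244):
    """Build an NCBI Virus download URL for the given format and fields."""
    params = [
        ("fq", '{!tag=SeqType_s}SeqType_s:("Nucleotide")'),
        ("fq", f"VirusLineageId_ss:({taxon_id})"),
        ("cmd", "download"),
        ("sort", "SourceDB_s desc,CreateDate_dt desc,id asc"),
        ("dlfmt", dlfmt),
        ("fl", ",".join(fields)),
    ]
    # urlencode percent-encodes the values; the repeated "fq" key is kept.
    return f"{ENDPOINT}?{urlencode(params)}"


fasta_url = build_download_url(
    "fasta", ["AccVer_s", "Definition_s", "Nucleotide_seq"]
)
```

Swapping only the `dlfmt` argument while keeping the same `fl` list isolates the download format as the one variable that changes between requests.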

@joverlee521
Contributor Author

If the nucleotide sequences are only available through the FASTA download, we could change the URL to request the FASTA format and then parse the FASTA file into NDJSON. However, I have not tested whether all of the fields we currently download are available in the FASTA download.
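A rough sketch of that FASTA-to-NDJSON step is below. It rests on an untested assumption: that the FASTA header carries the non-sequence fields in `fl` order separated by `|` (for example `>AccVer |Definition`). The function name and the output keys `accession` and `title` are hypothetical choices for illustration.

```python
import json


def _to_record(header, chunks, header_fields):
    """Build one JSON line from a FASTA header and its sequence chunks."""
    # Assumption: header fields are pipe-delimited, in the order requested.
    values = [v.strip() for v in header.split("|", len(header_fields) - 1)]
    record = dict(zip(header_fields, values))
    record["sequence"] = "".join(chunks)
    return json.dumps(record)


def fasta_to_ndjson(fasta_lines, header_fields=("accession", "title")):
    """Yield one NDJSON line per FASTA record."""
    header = None
    chunks = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield _to_record(header, chunks, header_fields)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _to_record(header, chunks, header_fields)
```

Because the function takes any iterable of lines, it could be fed an open file handle or a streamed response body. Whether the other fields we currently download can be carried in the header this way remains to be checked.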

@joverlee521
Contributor Author

The NCBI Virus API has worked ~well for us for the past 3 years, but this change really makes it less appealing to work with.
With the switch over to NCBI Datasets in monkeypox and rsv, the only NCBI Virus API fields missing from NCBI Datasets are title and publications, which we do not generally use in our pipelines anyway.

I think we can drop this route for NCBI data and just recommend the NCBI Datasets CLI.

joverlee521 added a commit that referenced this issue Sep 14, 2023
Due to #18, the NCBI Virus
API is more of a hassle to use. The data from NCBI Datasets CLI covers
the standard fields that we use in pathogen ingests, so we can drop
the use of the undocumented NCBI Virus API.

If any pathogen needs additional custom fields that are not available
through NCBI Datasets, the pipeline can use fetch-from-ncbi-entrez
and parse the GenBank file.
@joverlee521 joverlee521 mentioned this issue Sep 14, 2023
2 tasks
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this issue Sep 14, 2023
Dropping due to the changes of the NCBI Virus API described in
nextstrain/ingest#18 and we are dropping the
scripts for this in nextstrain/ingest#19.
joverlee521 added a commit to nextstrain/avian-flu that referenced this issue May 21, 2024
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that
uses the `Strain_s` field. Based on this field, I was able to guess the
fields for serotype and segment.

Keeping the `isolate` field because some records use the `isolate` for
the strain name instead of the `strain` field.

Also removes the `sequence` field since that is no longer returned by
the API.²

¹ <#37 (comment)>
² <nextstrain/ingest#18>
joverlee521 added a commit to nextstrain/avian-flu that referenced this issue May 24, 2024
joverlee521 added a commit to nextstrain/avian-flu that referenced this issue May 29, 2024

1 participant