fetch-from-ncbi-virus does not include nucleotide sequences #18
Comments
If the nucleotide sequences are only available for the FASTA file download, we could change the URL to download the FASTA format and then parse the FASTA file into NDJSON. However, I have not tested if all of the fields we currently download are available for the FASTA file download.
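A minimal sketch of the FASTA-to-NDJSON idea, not tested against the real NCBI Virus download. It assumes the FASTA header is a delimiter-separated list of fields; the `|` delimiter, the field names `accession` and `strain`, and the function names `parse_fasta` and `fasta_to_ndjson_lines` are all hypothetical. Only the `sequence` key mirrors the field name we currently use.

```python
import io
import json


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_ndjson_lines(handle, fields=("accession", "strain"), delimiter="|"):
    """Yield one JSON string per FASTA record.

    ASSUMPTION: the header holds `fields` in order, separated by `delimiter`.
    The real header layout of the NCBI Virus FASTA download is not verified here.
    """
    for header, sequence in parse_fasta(handle):
        values = [value.strip() for value in header.split(delimiter)]
        record = dict(zip(fields, values))
        record["sequence"] = sequence
        yield json.dumps(record)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = io.StringIO(">ACC1|strainA\nACGT\nTTGA\n>ACC2|strainB\nGGCC\n")
    for ndjson_line in fasta_to_ndjson_lines(demo):
        print(ndjson_line)
```

Whether every metadata field we currently download can be placed in the FASTA header is still the open question; the sketch only covers the parsing step.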
The NCBI Virus API has worked ~well for us the past 3 years, but this change really makes it less appealing to work with. I think we can drop this route for NCBI data and just recommend the NCBI Datasets CLI.
Due to #18, the NCBI Virus API is more of a hassle to use. The data from NCBI Datasets CLI covers the standard fields that we use in pathogen ingests, so we can drop the use of the undocumented NCBI Virus API. If any pathogen needs additional custom fields that are not available through NCBI Datasets, the pipeline can use fetch-from-ncbi-entrez and parse the GenBank file.
Dropping due to the changes to the NCBI Virus API described in nextstrain/ingest#18; the scripts for this are being dropped in nextstrain/ingest#19.
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>
Current Behavior
Previously, the nucleotide sequence per record would be included as `sequence`, since we are pulling the nucleotide sequence as part of the NCBI Virus URL (ingest/ncbi-virus-url, line 78 in c97df23).
However, the monkeypox ingest workflow has been returning empty values for sequences.
Looking back at previous versions of s3://nextstrain-data/files/workflows/monkeypox/genbank.ndjson.xz:

- 2023-09-05 (version id `c.cdLtg8OxV1Pyl8SSlWE1_dqKpQBT.z`) - still included sequences for all records
- 2023-09-06 (version id `PaqGNfdlQXH7eV9b.WVpaOm5ioQ1pVD2`) - 240/6751 records did not include sequence
- 2023-09-07 (version id `nImnSdA8NDGCJdVuDuMmsoFB_hveCkCC`) - 6071/6762 records did not include sequence
- 2023-09-08 (version id `UZ9VwlVMqVfAeP0sMux9qE4H1e6dGZRP`) - none of the 6809 records included sequences
- 2023-09-09 (version id `VWxHnqlAUVEGRU4_ngYsJuctK7Tftyyn`) - none of the 6807 records included sequences

I had wondered if there was a bug in the centralized ingest script, but running the recently deleted monkeypox fetch-from-genbank script returns the same results without sequences.
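For anyone who wants to re-check a given version, counts like the ones above can be produced with something along these lines. This is a rough sketch and not the exact command used for this issue; the function name `count_missing_sequences` is hypothetical, and it assumes each NDJSON line is one record whose nucleotide sequence sits under the `sequence` key.

```python
import json


def count_missing_sequences(lines, field="sequence"):
    """Return (missing, total) for an iterable of NDJSON lines.

    A record counts as missing when `field` is absent, null, or an empty string.
    ASSUMPTION: one JSON object per non-blank line.
    """
    total = 0
    missing = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        total += 1
        if not record.get(field):
            missing += 1
    return missing, total
```

For a locally downloaded copy, the lines can be streamed without decompressing to disk by passing the handle from `lzma.open("genbank.ndjson.xz", "rt")` to the function.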
NCBI Virus observations
The nucleotide sequence field name has not changed, since you can still download the sequences in a FASTA file with the same field name:
However, downloading as CSV or JSON format results in an empty column for `Nucleotide_seq`.