I'm really enjoying the abstar program. I love how easy it is to access various parameters in the JSON output file using Python. I have come across an oddity that makes abstar output difficult to use in downstream applications: often the N-terminus, the C-terminus, or both are truncated. This occurs when running a standard command on the test data (abstar -o ./output -t ./temp --use-test-file). Here's a specific example using the VRC26.15 heavy chain from the test_hiv_bnab_hcs.fasta test set (VRC26.15 is just one of many sequences where this seems to occur):
For VRC26.15, the N-terminal residue should be a glutamate (E) encoded by the first 3 nucleotides of the raw query sequence (GAG). Instead, the alignment appears to begin with AG..., meaning that the first (5') G nucleotide does not contribute to the alignment. This results in an amino acid sequence that starts with "VQLV..." instead of the expected "EVQLV...". I'm wondering if this is somehow connected to Python's 0-based vs. 1-based indexing when slicing (maybe the query start parameter needs to be 0 instead of 1?).
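To make the off-by-one idea concrete, here is a minimal sketch. It is not abstar's actual code: the 15-nucleotide sequence is a made-up stand-in that encodes EVQLV (not the real VRC26.15 sequence), and `translate`, `query_start`, and the frame-trimming step are hypothetical names and assumptions. It shows that if a 1-based alignment start (the convention BLAST uses for coordinates) is used directly as a 0-based Python index, the first nucleotide is lost, and trimming the resulting partial codon to get back in frame removes the first residue.

```python
# Minimal codon table covering only the codons in this toy example.
CODONS = {"GAG": "E", "GTG": "V", "CAG": "Q", "CTG": "L"}

def translate(nt):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODONS[nt[i:i + 3]] for i in range(0, usable, 3))

query = "GAGGTGCAGCTGGTG"  # hypothetical stand-in encoding EVQLV
query_start = 1             # 1-based alignment start (assumed)

# Correct handling: convert the 1-based start to a 0-based index.
correct = query[query_start - 1:]

# Suspected bug: the 1-based value is used as if it were 0-based,
# so the slice begins at "AG..." and the leading G is lost.
shifted = query[query_start:]

# The first codon is now incomplete; dropping it to restore the
# reading frame discards the first residue entirely.
lost = len(query) - len(shifted)
in_frame = shifted[(3 - lost) % 3:]

print(translate(correct))   # EVQLV
print(shifted[:2])          # AG
print(translate(in_frame))  # VQLV
```

This reproduces the symptom described above (alignment starting at AG..., protein starting at VQLV...), but it only demonstrates that an indexing mismatch would be sufficient to cause it, not that it is the actual cause.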
VRC26.15 is a heavy chain variable domain, so I would expect the sequence to end with ...TVSS. However, the vdj_aa sequence is truncated by one S and reads ...TVS. The program does identify the correct ending ("J-GENE AA SEQUENCE: IWGQGTMVTVSS"), so the truncation appears to happen when the coding region is determined for the VDJ assembly.
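The same kind of coordinate mismatch could in principle explain the missing C-terminal S. The sketch below is an illustration under assumptions, not abstar's code: the 12-nucleotide string is a made-up stand-in encoding TVSS, and `translate`, `start`, and `end` are hypothetical. If the alignment end is a 1-based inclusive coordinate but is used as the stop of a Python slice after subtracting 1 (as one would for the start), the last nucleotide is dropped and the final codon becomes incomplete.

```python
# Minimal codon table covering only the codons in this toy example.
CODONS = {"ACC": "T", "GTC": "V", "TCC": "S", "TCA": "S"}

def translate(nt):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODONS[nt[i:i + 3]] for i in range(0, usable, 3))

j_end = "ACCGTCTCCTCA"       # hypothetical stand-in encoding TVSS
start, end = 1, len(j_end)   # 1-based, inclusive coordinates (assumed)

# Correct handling: Python slice stops are exclusive, so a 1-based
# inclusive end can be used as the stop value unchanged.
correct = j_end[start - 1:end]

# Suspected bug: subtracting 1 from the end as well drops the last
# nucleotide, leaving "TC" as an incomplete final codon.
truncated = j_end[start - 1:end - 1]

print(translate(correct))    # TVSS
print(translate(truncated))  # TVS
```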
I'm not very experienced with Python or coding in general, and although I have spent several days looking through the code, I can't figure out how this truncation occurs. Is this an issue with how abstar decides where the coding region is, or is chopping of the ends inherent to blastn alignments in general?
thanks!