A few minor corrections #9
Open: arendsee wants to merge 4 commits into The-Sequence-Ontology:master from arendsee:master
Without the IDs here we won't be able to keep these two alternative CDSs separate, since they are both in the same mRNA.
I don't quite understand. The parent mRNAs are different (`mRNA00001`, `mRNA00002`, and `mRNA00003`).
Yeah, but `cds00003` and `cds00004` both happen to be on the same mRNA in this example (`mRNA00003`).
Ah, I see: `cds00003` and `cds00004` refer to different proteins translated from the same mRNA (different start codons). OK, you're right.

The CDS ID should be the common ID shared by all genomic intervals in the CDS. This makes sense. So the CDS is considered a discontinuous feature? If this is the right way to do it, then pretty much everyone is doing it wrong by giving every CDS row a unique ID.
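To make the shared-ID convention concrete, here is a minimal Python sketch, not taken from the spec or from this PR's diff. It groups CDS rows by their `Parent` and `ID` attributes so that each group is one (possibly discontinuous) CDS. The sample rows, the names `mRNA_A`, `cds_A1`, and `cds_A2`, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical GFF3 rows, invented for illustration: one mRNA carrying two
# alternative CDSs, each split into two genomic intervals that share an ID.
EXAMPLE_GFF3 = """\
chr1\t.\tmRNA\t100\t900\t.\t+\t.\tID=mRNA_A;Parent=gene_A
chr1\t.\tCDS\t120\t300\t.\t+\t0\tID=cds_A1;Parent=mRNA_A
chr1\t.\tCDS\t500\t800\t.\t+\t2\tID=cds_A1;Parent=mRNA_A
chr1\t.\tCDS\t200\t300\t.\t+\t0\tID=cds_A2;Parent=mRNA_A
chr1\t.\tCDS\t500\t800\t.\t+\t1\tID=cds_A2;Parent=mRNA_A
"""


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-unescaping)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def group_cds_by_id(gff3_text):
    """Map (Parent, ID) to the list of (start, end) intervals of that CDS."""
    groups = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        cds_id = attrs.get("ID")
        if cds_id is None:
            continue  # rows without an ID cannot be grouped this way
        key = (attrs.get("Parent"), cds_id)
        groups[key].append((int(cols[3]), int(cols[4])))
    return dict(groups)


if __name__ == "__main__":
    for key, intervals in group_cds_by_id(EXAMPLE_GFF3).items():
        print(key, intervals)
```

With a shared ID the two alternative CDSs on `mRNA_A` stay separate. If every row had a unique ID, or no ID at all, the rows under `mRNA_A` could not be assigned to one CDS or the other from the attributes alone.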
Actually in my limited experience (NCBI and Ensembl downloads for human and mouse), they seem to be doing it "right", giving all segments of the discontinuous CDS feature the same ID.
That's good. Phytozome10, where I get most of my plant genomes, forces unique IDs. Incidentally, I think a group at my university is planning some giant analysis of GFF files, as part of a test of a new domain-specific language they are designing. Maybe I can get them to look into conventions in the "wild".
But anyway, I agree about the CDS IDs here. I'll push a commit in a bit that reverts the change.
Since a CDS is discontinuous, are CDS IDs strictly required, even in mRNAs that contain only a single CDS?
I'm not sure. Certainly not all CDSs are discontinuous, so I'd guess those would have even less use for an ID. For the discontinuous case I would probably personally still want to see all the chunks having the same ID just for explicitness, but I'm not sure the spec should mandate that.
So perhaps CDS IDs should be strictly required only when multiple CDSs share one mRNA.
When CDS IDs are used, should it be strictly required that each interval in the CDS share the same ID? For example, is this incorrect:
My fantasy: a registry of all GFF3 files in the wild (or at least a start with a representative sample), and a Jenkins or other CI job that regularly runs a GFF3 validator and reports deviations from the normative standard and from de facto conventions...
@nathandunn how hard would this be?
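One small piece of such a survey could look like the Python sketch below. It is only a heuristic under stated assumptions, not an existing validator: the function name `classify_cds_id_usage` and the category labels are hypothetical. It tallies, per parent feature, whether the CDS rows share an ID, carry a different ID on every row, or lack IDs.

```python
from collections import Counter, defaultdict


def classify_cds_id_usage(gff3_text):
    """Tally CDS ID conventions per Parent feature (heuristic sketch)."""
    ids_by_parent = defaultdict(list)
    for line in gff3_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        ids_by_parent[attrs.get("Parent")].append(attrs.get("ID"))

    tally = Counter()
    for ids in ids_by_parent.values():
        if any(cds_id is None for cds_id in ids):
            tally["missing_id"] += 1      # at least one CDS row has no ID
        elif len(ids) == 1:
            tally["single_row"] += 1      # nothing to compare
        elif len(set(ids)) == len(ids):
            tally["unique_per_row"] += 1  # every row has its own ID
        else:
            tally["shared_id"] += 1       # some rows share an ID
    return tally
```

Note that `unique_per_row` is ambiguous on its own: it also matches an mRNA that really does carry several single-interval CDSs, so a real check would need more context than this tally provides.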