A few minor corrections #9
This example was meant to show how ID tags are not needed for features without children. The example removed the ID entries for the exons, but not for the CDSs.
ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
Without the IDs here we won't be able to keep these two alternative CDSs separate, since they are both in the same mRNA.
I don't quite understand. The parent mRNAs are different (mRNA00001, mRNA00002, and mRNA00003).
Yeah, but cds00003 and cds00004 both happen to be on the same mRNA in this example (mRNA00003).
Ah, I see: cds00003 and cds00004 refer to different proteins translated from the same mRNA (different start codons). OK, you're right.
The CDS ID should refer to the common ID shared by all genomic intervals in the CDS. This makes sense. So the CDS is considered a discontinuous feature? If this is the right way to do it then pretty much everyone is doing it wrong, giving every CDS row a unique ID.
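To make the shared-ID idea concrete, here is a minimal Python sketch that rebuilds each discontinuous CDS by grouping rows on their ID attribute. The rows are illustrative, modeled on the example under discussion: the cds00004 coordinates and all phase values are assumptions, and `parse_attributes` and `group_cds` are hypothetical helper names, not part of any GFF3 library.

```python
from collections import defaultdict

# Illustrative rows modeled on the example under discussion; the cds00004
# coordinates and the phase values are assumptions made for this sketch.
ROWS = """\
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003
""".splitlines()


def parse_attributes(column9):
    """Split 'key=value;key=value' into a dict (no URL-decoding, no multi-values)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def group_cds(lines, attribute):
    """Group CDS rows by one attribute; return {value: [(start, end), ...]}."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Whitespace split because the rows above are space-separated;
        # real GFF3 files are tab-delimited, so use line.split("\t") there.
        cols = line.split(None, 8)
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        key = parse_attributes(cols[8]).get(attribute)
        groups[key].append((int(cols[3]), int(cols[4])))
    return dict(groups)


by_id = group_cds(ROWS, "ID")          # two CDSs, three segments each
by_parent = group_cds(ROWS, "Parent")  # one bucket holding all six rows
print(by_id)
print(by_parent)
```

Grouping on ID keeps the two alternative CDSs apart, while grouping on Parent alone merges all six rows into one bucket, which is the ambiguity raised earlier in the thread.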
Actually in my limited experience (NCBI and Ensembl downloads for human and mouse), they seem to be doing it "right", giving all segments of the discontinuous CDS feature the same ID.
That's good. Phytozome10, where I get most of my plant genomes, forces unique IDs. Incidentally, I think a group at my university is planning some giant analysis of GFF files, as part of a test of a new domain-specific language they are designing. Maybe I can get them to look into conventions in the "wild".
But anyway, I agree about the CDS IDs here. I'll push a commit in a bit that reverts the change.
Since a CDS is discontinuous, are CDS IDs strictly required, even in mRNAs that contain only a single CDS?
I'm not sure. Certainly not all CDSs are discontinuous, so I'd guess those would have even less use for an ID. For the discontinuous case I would probably personally still want to see all the chunks having the same ID just for explicitness, but I'm not sure the spec should mandate that.
So perhaps CDS IDs should be strictly required only when multiple CDSs share one mRNA.
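One rough way to spot the "multiple CDSs on one mRNA" case when IDs are absent is to look for overlapping CDS rows under the same parent, since segments of a single CDS cannot overlap. The sketch below is only a heuristic under that assumption: `needs_cds_ids` is a hypothetical name, and the second CDS's coordinates are assumed for illustration.

```python
def needs_cds_ids(segments):
    """Return True if any two (start, end) CDS rows of one mRNA overlap.

    Segments of a single CDS cannot overlap, so an overlap implies that at
    least two CDSs share the mRNA. The absence of overlap does not prove
    that there is only one CDS.
    """
    ordered = sorted(segments)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:  # GFF3 coordinates are 1-based, inclusive
            return True
    return False


# Rows of cds00002 from the example in this pull request: one CDS, three segments.
single_cds = [(1201, 1500), (5000, 5500), (7000, 7600)]

# Two alternative CDSs on one mRNA; the second start (3391) is an assumption.
two_cds = [(3301, 3902), (5000, 5500), (7000, 7600),
           (3391, 3902), (5000, 5500), (7000, 7600)]

print(needs_cds_ids(single_cds))
print(needs_cds_ids(two_cds))
```

Sorting by start means any overlapping pair also shows up as an overlapping adjacent pair, so one linear pass is enough.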
When CDS IDs are used, should it be strictly required that each interval in the CDS share the same ID? For example, is this incorrect:
Chr1 Araport11 gene 3631 5899 . + . ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1
Chr1 Araport11 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Note=NAC domain containing protein 1
Chr1 Araport11 five_prime_UTR 3631 3759 . + . ID=AT1G01010:five_prime_UTR:1;Parent=AT1G01010.1
Chr1 Araport11 exon 3631 3913 . + . ID=AT1G01010:exon:1;Parent=AT1G01010.1
Chr1 Araport11 CDS 3760 3913 . + 0 ID=AT1G01010:CDS:1;Parent=AT1G01010.1
Chr1 Araport11 exon 3996 4276 . + . ID=AT1G01010:exon:2;Parent=AT1G01010.1
Chr1 Araport11 CDS 3996 4276 . + 2 ID=AT1G01010:CDS:2;Parent=AT1G01010.1
Chr1 Araport11 exon 4486 4605 . + . ID=AT1G01010:exon:3;Parent=AT1G01010.1
Chr1 Araport11 CDS 4486 4605 . + 0 ID=AT1G01010:CDS:3;Parent=AT1G01010.1
Chr1 Araport11 exon 4706 5095 . + . ID=AT1G01010:exon:4;Parent=AT1G01010.1
Chr1 Araport11 CDS 4706 5095 . + 0 ID=AT1G01010:CDS:4;Parent=AT1G01010.1
Chr1 Araport11 exon 5174 5326 . + . ID=AT1G01010:exon:5;Parent=AT1G01010.1
Chr1 Araport11 CDS 5174 5326 . + 0 ID=AT1G01010:CDS:5;Parent=AT1G01010.1
Chr1 Araport11 CDS 5439 5630 . + 0 ID=AT1G01010:CDS:6;Parent=AT1G01010.1
Chr1 Araport11 exon 5439 5899 . + . ID=AT1G01010:exon:6;Parent=AT1G01010.1
Chr1 Araport11 three_prime_UTR 5631 5899 . + . ID=AT1G01010:three_prime_UTR:1;Parent=AT1G01010.1
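For reference, a small Python sketch that labels which ID convention a set of CDS rows follows. The tuples are the CDS rows from the Araport11 example above, reduced to ID, Parent, start, and end; `cds_id_style` and its labels are hypothetical names chosen for this sketch. It only describes the convention and does not settle whether the spec should forbid it.

```python
from collections import defaultdict

# CDS rows from the Araport11 example above: (ID, Parent, start, end).
CDS_ROWS = [
    ("AT1G01010:CDS:1", "AT1G01010.1", 3760, 3913),
    ("AT1G01010:CDS:2", "AT1G01010.1", 3996, 4276),
    ("AT1G01010:CDS:3", "AT1G01010.1", 4486, 4605),
    ("AT1G01010:CDS:4", "AT1G01010.1", 4706, 5095),
    ("AT1G01010:CDS:5", "AT1G01010.1", 5174, 5326),
    ("AT1G01010:CDS:6", "AT1G01010.1", 5439, 5630),
]


def cds_id_style(rows):
    """Label each parent mRNA by how its CDS rows use the ID attribute."""
    ids_by_parent = defaultdict(list)
    for cds_id, parent, _start, _end in rows:
        ids_by_parent[parent].append(cds_id)

    styles = {}
    for parent, ids in ids_by_parent.items():
        distinct = len(set(ids))
        if len(ids) == 1:
            styles[parent] = "single-row"      # nothing to compare
        elif distinct == 1:
            styles[parent] = "shared"          # one ID across all segments
        elif distinct == len(ids):
            styles[parent] = "unique-per-row"  # every row has its own ID
        else:
            # Some IDs repeat, e.g. two multi-segment CDSs on one mRNA.
            styles[parent] = "mixed"
    return styles


styles = cds_id_style(CDS_ROWS)
print(styles)
```

A tool that treats rows with the same ID as one discontinuous feature would read these rows as six separate one-segment CDSs rather than one six-segment CDS.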
> Maybe I can get them to look into conventions in the "wild".
My fantasy: a registry of all GFF3 files in the wild (or at least a start with a representative sample), and a Jenkins or other CI job that regularly runs a GFF3 validator and reports variations from the normative standard and de facto conventions....
@nathandunn how hard would this be?
6f5195d to fe73505
Just a few minor corrections.