A few minor corrections #9

arendsee · 2017-06-18T16:05:53Z

Just a few minor corrections.

This example was meant to show how ID tags are not needed for features without children. The example removed the ID entries for the exons, but not for the CDSs.

tmgreen · 2017-11-30T20:35:00Z

gff3.md

-    ctg123 . CDS             1201 1500  .  +  0  ID=cds00002;Parent=mRNA00002
-    ctg123 . CDS             5000 5500  .  +  0  ID=cds00002;Parent=mRNA00002
-    ctg123 . CDS             7000 7600  .  +  0  ID=cds00002;Parent=mRNA00002
-    ctg123 . CDS             3301 3902  .  +  0  ID=cds00003;Parent=mRNA00003


Without the IDs here we won't be able to keep these two alternative CDSs separate, since they are both in the same mRNA.

I don't quite understand. The parent mRNAs are different (mRNA00001, mRNA00002, and mRNA00003).

Yeah, but cds00003 and cds00004 both happen to be on the same mRNA in this example (mRNA00003).

Ah, I see: cds00003 and cds00004 refer to different proteins translated from the same mRNA (different start codons). OK, you're right.

The CDS ID should refer to the common ID shared by all genomic intervals in the CDS. This makes sense. So the CDS is considered a discontinuous feature? If this is the right way to do it then pretty much everyone is doing it wrong, giving every CDS row a unique ID.

Actually in my limited experience (NCBI and Ensembl downloads for human and mouse), they seem to be doing it "right", giving all segments of the discontinuous CDS feature the same ID.
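To make the "shared ID" reading concrete, here is a minimal Python sketch that groups CDS rows by their `ID` attribute, so that all rows carrying the same ID are treated as segments of one discontinuous feature. The function names (`parse_attributes`, `cds_segments_by_id`) are made up for illustration, and the parser is deliberately naive: it splits on whitespace rather than tabs and does no URL-decoding. The sample rows are the ones from the diff above.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (naive: no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def cds_segments_by_id(lines):
    """Map each CDS ID to the sorted (start, end) intervals that carry it."""
    segments = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: whitespace-separated columns; real GFF3 uses tabs.
        cols = line.split(None, 8)
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            segments[attrs["ID"]].append((int(cols[3]), int(cols[4])))
    return {cds_id: sorted(ivals) for cds_id, ivals in segments.items()}


example = [
    "ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002",
    "ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002",
    "ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002",
    "ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003",
]

for cds_id, intervals in cds_segments_by_id(example).items():
    print(cds_id, intervals)
```

With per-row unique IDs, the same grouping would yield one single-interval entry per row, and the segments would have to be stitched together via `Parent` instead.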

That's good. Phytozome10, where I get most of my plant genomes, forces unique IDs. Incidentally, I think a group at my university is planning some giant analysis of GFF files, as part of a test of a new domain-specific language they are designing. Maybe I can get them to look into conventions in the "wild".

But anyway, I agree about the CDS IDs here. I'll push a commit in a bit that reverts the change.

Since a CDS is discontinuous, are their IDs strictly required, even in mRNAs that contain only a single CDS?

I'm not sure. Certainly not all CDSs are discontinuous, so I'd guess those would have even less use for an ID. For the discontinuous case I would probably personally still want to see all the chunks having the same ID just for explicitness, but I'm not sure the spec should mandate that.

So perhaps CDS IDs should be strictly required only when multiple CDSs share one mRNA.
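A rough way to spot the case where IDs would be needed: if ID-less CDS rows under the same parent overlap each other, they cannot all be segments of a single CDS, so the file is ambiguous without IDs. The sketch below is only a heuristic under that assumption. The function name `ambiguous_parents` is made up, the parsing is whitespace-based rather than tab-based, and the sample coordinates are illustrative rather than copied from the spec.

```python
from collections import defaultdict


def ambiguous_parents(lines):
    """Return parents whose ID-less CDS rows overlap one another."""
    by_parent = defaultdict(list)
    for line in lines:
        cols = line.split(None, 8)  # assumption: whitespace-separated columns
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "ID" in attrs or "Parent" not in attrs:
            continue
        for parent in attrs["Parent"].split(","):
            by_parent[parent].append((int(cols[3]), int(cols[4])))

    flagged = set()
    for parent, intervals in by_parent.items():
        intervals.sort()
        # After sorting by start, any overlap shows up between neighbours.
        for (_, end1), (start2, _) in zip(intervals, intervals[1:]):
            if start2 <= end1:
                flagged.add(parent)
                break
    return sorted(flagged)


# Illustrative rows: mRNA00003 carries two overlapping, ID-less CDS rows.
rows = [
    "ctg123 . CDS 1201 1500 . + 0 Parent=mRNA00002",
    "ctg123 . CDS 5000 5500 . + 0 Parent=mRNA00002",
    "ctg123 . CDS 3301 3902 . + 0 Parent=mRNA00003",
    "ctg123 . CDS 3391 3902 . + 0 Parent=mRNA00003",
]

print(ambiguous_parents(rows))
```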

When CDS IDs are used, should it be strictly required that each interval in the CDS share the same ID? For example, is this incorrect:

    Chr1 Araport11 gene 3631 5899 . + . ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1
    Chr1 Araport11 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Note=NAC domain containing protein 1
    Chr1 Araport11 five_prime_UTR 3631 3759 . + . ID=AT1G01010:five_prime_UTR:1;Parent=AT1G01010.1
    Chr1 Araport11 exon 3631 3913 . + . ID=AT1G01010:exon:1;Parent=AT1G01010.1
    Chr1 Araport11 CDS 3760 3913 . + 0 ID=AT1G01010:CDS:1;Parent=AT1G01010.1
    Chr1 Araport11 exon 3996 4276 . + . ID=AT1G01010:exon:2;Parent=AT1G01010.1
    Chr1 Araport11 CDS 3996 4276 . + 2 ID=AT1G01010:CDS:2;Parent=AT1G01010.1
    Chr1 Araport11 exon 4486 4605 . + . ID=AT1G01010:exon:3;Parent=AT1G01010.1
    Chr1 Araport11 CDS 4486 4605 . + 0 ID=AT1G01010:CDS:3;Parent=AT1G01010.1
    Chr1 Araport11 exon 4706 5095 . + . ID=AT1G01010:exon:4;Parent=AT1G01010.1
    Chr1 Araport11 CDS 4706 5095 . + 0 ID=AT1G01010:CDS:4;Parent=AT1G01010.1
    Chr1 Araport11 exon 5174 5326 . + . ID=AT1G01010:exon:5;Parent=AT1G01010.1
    Chr1 Araport11 CDS 5174 5326 . + 0 ID=AT1G01010:CDS:5;Parent=AT1G01010.1
    Chr1 Araport11 CDS 5439 5630 . + 0 ID=AT1G01010:CDS:6;Parent=AT1G01010.1
    Chr1 Araport11 exon 5439 5899 . + . ID=AT1G01010:exon:6;Parent=AT1G01010.1
    Chr1 Araport11 three_prime_UTR 5631 5899 . + . ID=AT1G01010:three_prime_UTR:1;Parent=AT1G01010.1
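To tell which convention a given file follows, one could count how often each CDS ID occurs under each parent. The following is a sketch only: `cds_id_usage` and `convention` are hypothetical helper names, the category labels are ad hoc, and the parser assumes whitespace-separated columns. It says nothing about whether per-row IDs are "incorrect", only which pattern a file uses.

```python
from collections import Counter, defaultdict


def cds_id_usage(lines):
    """Return {parent: Counter({cds_id: number_of_rows})} for CDS rows."""
    usage = defaultdict(Counter)
    for line in lines:
        cols = line.split(None, 8)  # assumption: whitespace-separated columns
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "ID" in attrs and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                usage[parent][attrs["ID"]] += 1
    return usage


def convention(id_counts):
    """Label how CDS IDs are used under one parent (ad hoc categories)."""
    rows = sum(id_counts.values())
    if rows == 1:
        return "single row"
    if max(id_counts.values()) == 1:
        return "one ID per row"
    return "ID shared across rows"


# First three CDS rows of the Araport11 example in this thread.
araport_rows = [
    "Chr1 Araport11 CDS 3760 3913 . + 0 ID=AT1G01010:CDS:1;Parent=AT1G01010.1",
    "Chr1 Araport11 CDS 3996 4276 . + 2 ID=AT1G01010:CDS:2;Parent=AT1G01010.1",
    "Chr1 Araport11 CDS 4486 4605 . + 0 ID=AT1G01010:CDS:3;Parent=AT1G01010.1",
]
# Two CDS rows from the gff3.md diff in this thread.
spec_rows = [
    "ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002",
    "ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002",
]

for parent, ids in cds_id_usage(araport_rows + spec_rows).items():
    print(parent, convention(ids))
```

Running the same tally over many files would give a first impression of how common each pattern is in practice.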

> Maybe I can get them to look into conventions in the "wild".

My fantasy: a registry of all GFF3 files in the wild (or at least a start with a representative sample), and a Jenkins or other CI job that regularly runs a GFF3 validator and reports deviations from the normative standard and from de facto conventions.

@nathandunn how hard would this be?

arendsee added 4 commits June 18, 2017 11:05

Fix typo in fasta header

ab51466

Convert "name" to "Name" in attribute fields

a6eeb1b

Add missing newline

fd740b3

Remove CDS ID's from the simplified GFF example

b887c86

This example was meant to show how ID tags are not needed for features without children. The example removed the ID entries for the exons, but not for the CDSs.

arendsee changed the title ~~Fix typo in fasta header~~ A few minor corrections Jun 18, 2017

tmgreen reviewed Nov 30, 2017


alexhenrie force-pushed the master branch 6 times, most recently from 6f5195d to fe73505 Compare August 18, 2020 23:18


tmgreen Nov 30, 2017

arendsee Nov 30, 2017

tmgreen Nov 30, 2017

arendsee Nov 30, 2017

tmgreen Dec 1, 2017

arendsee Dec 1, 2017

arendsee Dec 1, 2017

tmgreen Dec 1, 2017

arendsee Dec 1, 2017

cmungall Dec 1, 2017
