A few minor corrections #9

Open: wants to merge 4 commits into base: master
67 changes: 34 additions & 33 deletions gff3.md
@@ -164,19 +164,19 @@ Note that several of the features, including the gene, its mRNAs and the CDSs, a
ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003
ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003
-ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
-ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
-ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
-ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
-ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
-ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
-ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
-ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
Comment:
Without the IDs here we won't be able to keep these two alternative CDSs separate, since they both belong to the same mRNA.

Comment (Author):
I don't quite understand. The parent mRNAs are different (mRNA00001, mRNA00002, and mRNA00003).

Comment:
Yeah, but cds00003 and cds00004 both happen to be on the same mRNA in this example (mRNA00003).

Comment (Author):
Ah, I see: cds00003 and cds00004 refer to different proteins translated from the same mRNA (different start codons). OK, you're right.

The CDS ID should be the common ID shared by all genomic intervals in the CDS. This makes sense. So the CDS is considered a discontinuous feature? If this is the right way to do it, then pretty much everyone is doing it wrong by giving every CDS row a unique ID.
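
To make the point above concrete, here is a minimal Python sketch that groups CDS rows by their ID attribute. It is only an illustration, not code from the spec or from any GFF3 library: the helper names `parse_attributes` and `group_cds_by_id` are hypothetical, and rows are split on whitespace only because the example in this diff uses spaces (real GFF3 is tab-delimited).

```python
from collections import defaultdict

# CDS rows for mRNA00003, copied from the example under discussion.
GFF_ROWS = """\
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003
"""


def parse_attributes(column9):
    """Split 'tag=value;tag=value' into a dict (no URL-decoding in this sketch)."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if "=" in pair)


def group_cds_by_id(text):
    """Map each CDS ID to the list of (start, end) intervals that carry it."""
    segments = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: whitespace-separated columns, as in the example rows.
        cols = line.split(None, 8)
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        # Rows without an ID all end up under the key None.
        segments[attrs.get("ID")].append((int(cols[3]), int(cols[4])))
    return dict(segments)


cds = group_cds_by_id(GFF_ROWS)

# The same rows with the ID attribute stripped, as in the proposed change.
without_ids = GFF_ROWS.replace("ID=cds00003;", "").replace("ID=cds00004;", "")
merged = group_cds_by_id(without_ids)
```

With the IDs in place, `cds` holds two separate three-interval features for mRNA00003. In `merged`, all six rows fall under a single key, so the two alternative start positions (3301 and 3391) can no longer be assigned to distinct CDSs.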

Comment:
Actually in my limited experience (NCBI and Ensembl downloads for human and mouse), they seem to be doing it "right", giving all segments of the discontinuous CDS feature the same ID.

Comment (Author):
That's good. Phytozome10, where I get most of my plant genomes, forces unique IDs. Incidentally, I think a group at my university is planning some giant analysis of GFF files, as part of a test of a new domain-specific language they are designing. Maybe I can get them to look into conventions in the "wild".

But anyway, I agree about the CDS IDs here. I'll push a commit in a bit that reverts the change.

Comment (Author):
Since a CDS is discontinuous, are CDS IDs strictly required, even in mRNAs that contain only a single CDS?

Comment:
I'm not sure. Certainly not all CDSs are discontinuous, so I'd guess those would have even less use for an ID. For the discontinuous case I would probably personally still want to see all the chunks having the same ID just for explicitness, but I'm not sure the spec should mandate that.

Comment (Author):
So perhaps CDS IDs should be strictly required only when multiple CDSs share one mRNA.

When CDS IDs are used, should it be strictly required that each interval in the CDS share the same ID? For example, is this incorrect:

Chr1    Araport11       gene    3631    5899    .       +       .       ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1
Chr1    Araport11       mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Note=NAC domain containing protein 1
Chr1    Araport11       five_prime_UTR  3631    3759    .       +       .       ID=AT1G01010:five_prime_UTR:1;Parent=AT1G01010.1
Chr1    Araport11       exon    3631    3913    .       +       .       ID=AT1G01010:exon:1;Parent=AT1G01010.1
Chr1    Araport11       CDS     3760    3913    .       +       0       ID=AT1G01010:CDS:1;Parent=AT1G01010.1
Chr1    Araport11       exon    3996    4276    .       +       .       ID=AT1G01010:exon:2;Parent=AT1G01010.1
Chr1    Araport11       CDS     3996    4276    .       +       2       ID=AT1G01010:CDS:2;Parent=AT1G01010.1
Chr1    Araport11       exon    4486    4605    .       +       .       ID=AT1G01010:exon:3;Parent=AT1G01010.1
Chr1    Araport11       CDS     4486    4605    .       +       0       ID=AT1G01010:CDS:3;Parent=AT1G01010.1
Chr1    Araport11       exon    4706    5095    .       +       .       ID=AT1G01010:exon:4;Parent=AT1G01010.1
Chr1    Araport11       CDS     4706    5095    .       +       0       ID=AT1G01010:CDS:4;Parent=AT1G01010.1
Chr1    Araport11       exon    5174    5326    .       +       .       ID=AT1G01010:exon:5;Parent=AT1G01010.1
Chr1    Araport11       CDS     5174    5326    .       +       0       ID=AT1G01010:CDS:5;Parent=AT1G01010.1
Chr1    Araport11       CDS     5439    5630    .       +       0       ID=AT1G01010:CDS:6;Parent=AT1G01010.1
Chr1    Araport11       exon    5439    5899    .       +       .       ID=AT1G01010:exon:6;Parent=AT1G01010.1
Chr1    Araport11       three_prime_UTR 5631    5899    .       +       .       ID=AT1G01010:three_prime_UTR:1;Parent=AT1G01010.1
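
The question above can at least be made checkable. The following Python sketch reports, per parent transcript, whether CDS rows share an ID or carry one ID per row. It does not decide which convention is correct; it only detects which one a file uses. The function name `cds_id_convention` and the category labels are hypothetical, and columns are split on whitespace as an assumption that matches the pasted examples (real GFF3 is tab-delimited).

```python
from collections import defaultdict


def cds_id_convention(gff_text):
    """Classify how the CDS rows of each parent transcript are identified.

    Returns {parent: label}, where label is one of
    "no-ids", "mixed", "single-row", "unique-per-row", "shared".
    """
    ids_by_parent = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # assumption: whitespace-separated columns
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        # A feature may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            ids_by_parent[parent].append(attrs.get("ID"))

    report = {}
    for parent, ids in ids_by_parent.items():
        present = [i for i in ids if i is not None]
        if not present:
            report[parent] = "no-ids"
        elif len(present) < len(ids):
            report[parent] = "mixed"           # some rows have an ID, some do not
        elif len(ids) == 1:
            report[parent] = "single-row"
        elif len(set(present)) == len(present):
            report[parent] = "unique-per-row"  # every row has its own ID
        else:
            report[parent] = "shared"          # at least one ID spans several rows
    return report


# Three of the Araport11 CDS rows quoted above.
araport_rows = """\
Chr1 Araport11 CDS 3760 3913 . + 0 ID=AT1G01010:CDS:1;Parent=AT1G01010.1
Chr1 Araport11 CDS 3996 4276 . + 2 ID=AT1G01010:CDS:2;Parent=AT1G01010.1
Chr1 Araport11 CDS 4486 4605 . + 0 ID=AT1G01010:CDS:3;Parent=AT1G01010.1
"""

# Two rows of cds00003 from the spec example in this diff.
spec_rows = """\
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
"""

araport_report = cds_id_convention(araport_rows)
spec_report = cds_id_convention(spec_rows)
```

On these inputs the Araport11 rows are labelled "unique-per-row" and the spec rows "shared". A tool that treats rows with the same ID as one feature would therefore read the first file as several independent CDS features under one mRNA.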

Comment:
> Maybe I can get them to look into conventions in the "wild".

My fantasy: a registry of all GFF3 files in the wild (or at least a start with a representative sample), and a Jenkins or other CI job that regularly runs a GFF3 validator and reports deviations from the normative standard and from de facto conventions.

@nathandunn how hard would this be?
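
As a rough idea of the kind of check such a job could run, here is a small Python sketch that flags attribute tags differing from a reserved GFF3 tag only by letter case, such as the `name` versus `Name` corrections elsewhere in this pull request. It is a hypothetical lint, not an existing validator: the function name `miscased_tags` is made up for illustration, and columns are split on whitespace as an assumption that matches the examples in this diff (real GFF3 is tab-delimited).

```python
# Reserved attribute tags defined by the GFF3 specification.
RESERVED_TAGS = (
    "ID", "Name", "Alias", "Parent", "Target", "Gap",
    "Derives_from", "Note", "Dbxref", "Ontology_term", "Is_circular",
)
_BY_LOWER = {tag.lower(): tag for tag in RESERVED_TAGS}


def miscased_tags(gff_text):
    """Yield (line_number, found_tag, reserved_tag) for near-miss attribute tags."""
    for number, line in enumerate(gff_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # assumption: whitespace-separated columns
        if len(cols) < 9:
            continue
        for pair in cols[8].split(";"):
            tag = pair.split("=", 1)[0].strip()
            expected = _BY_LOWER.get(tag.lower())
            # Report only tags that match a reserved tag case-insensitively
            # but are not spelled exactly like it.
            if expected is not None and tag != expected:
                yield number, tag, expected


sample = """\
chrX . gene XXXX YYYY . + . ID=gene01;name=resA
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
"""

findings = list(miscased_tags(sample))
for number, tag, expected in findings:
    print(f"line {number}: '{tag}' looks like reserved tag '{expected}'")
```

A lowercase `name` is not invalid by itself, since lowercase tags are free for application use, so a check like this is better reported as a warning than as an error.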

-ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
-ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003
-ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
-ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003
-ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003
+ctg123 . CDS 1201 1500 . + 0 Parent=mRNA00001
+ctg123 . CDS 3000 3902 . + 0 Parent=mRNA00001
+ctg123 . CDS 5000 5500 . + 0 Parent=mRNA00001
+ctg123 . CDS 7000 7600 . + 0 Parent=mRNA00001
+ctg123 . CDS 1201 1500 . + 0 Parent=mRNA00002
+ctg123 . CDS 5000 5500 . + 0 Parent=mRNA00002
+ctg123 . CDS 7000 7600 . + 0 Parent=mRNA00002
+ctg123 . CDS 3301 3902 . + 0 Parent=mRNA00003
+ctg123 . CDS 5000 5500 . + 1 Parent=mRNA00003
+ctg123 . CDS 7000 7600 . + 1 Parent=mRNA00003
+ctg123 . CDS 3391 3902 . + 0 Parent=mRNA00003
+ctg123 . CDS 5000 5500 . + 1 Parent=mRNA00003
+ctg123 . CDS 7000 7600 . + 1 Parent=mRNA00003

<dl>
<dt>NOTE 1<dd>
@@ -560,7 +560,7 @@ ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
...
-&gt;cnda0123
+&gt;cdna0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
@@ -582,13 +582,13 @@ The following section discusses how to represent "pathological" cases that arise
<pre>-----&gt;XXXXXXX*------&gt;</pre>
<p>The preferred representation is to create a gene, a transcript, an exon and a CDS:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=resA
+chrX . gene XXXX YYYY . + . ID=gene01;Name=resA
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
chrX . exon XXXX YYYY . + . Parent=tran01
chrX . CDS XXXX YYYY . + . Parent=tran01</pre>
<p>Some groups will find this redundant. A valid alternative is to omit the exon feature:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=resA
+chrX . gene XXXX YYYY . + . ID=gene01;Name=resA
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
chrX . CDS XXXX YYYY . + . Parent=tran01</pre>
<p>It is not recommended to parent the CDS directly onto the gene, because this will make it impossible to determine the UTRs (since the gene may validly include untranscribed regulatory regions).</p>
@@ -600,10 +600,10 @@ chrX . CDS XXXX YYYY . + . Parent=tran01</pre>
<pre>-----&gt;XXXXXXX*--&gt;BBBBBB*---&gt;ZZZZ*--&gt;AAAAAA*-----</pre>
<p>Since the single transcript corresponds to multiple genes that can be identified by genetic analysis, the recommended solution here is to create four "gene" objects and make them the parent for a single transcript. The transcript will contain a single exon (in the unspliced case) and four separate CDSs:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=resA
-chrX . gene XXXX YYYY . + . ID=gene02;name=resB
-chrX . gene XXXX YYYY . + . ID=gene03;name=resX
-chrX . gene XXXX YYYY . + . ID=gene04;name=resZ
+chrX . gene XXXX YYYY . + . ID=gene01;Name=resA
+chrX . gene XXXX YYYY . + . ID=gene02;Name=resB
+chrX . gene XXXX YYYY . + . ID=gene03;Name=resX
+chrX . gene XXXX YYYY . + . ID=gene04;Name=resZ
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02,gene03,gene04
chrX . exon XXXX YYYY . + . ID=exon00001;Parent=tran01
chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene01
@@ -620,7 +620,7 @@ chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene04</pre>
(yyyyyy is the intein)</pre>
<p>The preferred representation is to create one gene, one transcript, one exon, and one CDS. The CDS produces a pre-polypeptide using the "Derives_from" tag, and this polypeptide in turn gives rise to two mature_polypeptides, one each for the intein and the flanking protein:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=resA
+chrX . gene XXXX YYYY . + . ID=gene01;Name=resA
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
chrX . exon XXXX YYYY . + . Parent=tran01
chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
@@ -631,8 +631,8 @@ chrX . intein XXXX YYYY . + . ID=poly03;Parent=poly01</pre>
<p>Because the flanking mature_polypeptide has discontinuous coordinates on the genome, it appears twice with the same ID.</p>
<p>If the intein is immediately degraded, you may not wish to annotate it explicitly, and its line would be deleted from the example. However, if it has molecular activity, it may correspond to a gene, in which case:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=resA
-chrX . gene XXXX YYYY . + . ID=gene02;name=inteinA
+chrX . gene XXXX YYYY . + . ID=gene01;Name=resA
+chrX . gene XXXX YYYY . + . ID=gene02;Name=inteinA
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
chrX . exon XXXX YYYY . + . Parent=tran01
chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
@@ -651,20 +651,21 @@ leader
=======&gt;-----&gt;XXXXXXX*------&gt;</pre>
<p>The simplest way to represent this is to show the mRNA as being split across two discontinuous genomic locations:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
+chrX . gene XXXX YYYY . + . ID=gene01;Name=my_gene
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
chrX . exon XXXX YYYY . + . Parent=tran01
chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01</pre>
<p>However, this does not indicate which part of the transcript comes from the spliced leader. A preferred representation explicitly adds features for the spliced leader gene, the primary_transcript and the spliced_leader_RNA:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
-chrX . gene XXXX YYYY . + . ID=gene02;name=leader_gene
+chrX . gene XXXX YYYY . + . ID=gene01;Name=my_gene
+chrX . gene XXXX YYYY . + . ID=gene02;Name=leader_gene
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
chrX . primary_transcript XXXX YYYY . + . ID=pt01;Parent=tran01;Derives_from=gene01
chrX . spliced_leader_RNA XXXX YYYY . + . ID=sl01;Parent=tran01;Derives_from=gene02
-chrX . exon XXXX YYYY . + . Parent=tran01 chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01</pre>
+chrX . exon XXXX YYYY . + . Parent=tran01
+chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01</pre>
<p>As shown here, the mRNA derives from two genes ("my_gene" and the leader gene) and occupies disjunct coordinates on the genome. The primary_transcript, which encodes the body of the mRNA, is part of (has as its Parent) this mRNA. The same relationship applies to the spliced leader RNA. The Derives_from relationship is used to indicate which genes produced the primary transcript and spliced leader respectively.</p>
<p>The exon and CDS features follow in the normal fashion.</p>
</dd>
@@ -677,7 +678,7 @@ chrX . exon XXXX YYYY . + . Parent=tran01 chrX . CDS XXXX YYYY
============* CDS</pre>
<p>The representation of this is to make the CDS discontinuous:</p>
<pre>
-chrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
+chrX . gene XXXX YYYY . + . ID=gene01;Name=my_gene
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01;Ontology_term=SO:1000069
chrX . exon XXXX YYYY . + . Parent=tran01
chrX . CDS XXXX YYYY 0 + . ID=cds01;Parent=tran01
@@ -694,12 +695,12 @@ regulatory element
-----&gt;XXXXXXX*--&gt;BBBBBB*---&gt;ZZZZ*--&gt;AAAAAA*-----</pre>
<p>It can be indicated in GFF3 in this way:</p>
<pre>
-chrX . operon XXXX YYYY . + . ID=operon01;name=my_operon
+chrX . operon XXXX YYYY . + . ID=operon01;Name=my_operon
chrX . promoter XXXX YYYY . + . Parent=operon01
-chrX . gene XXXX YYYY . + . ID=gene01;Parent=operon01;name=resA
-chrX . gene XXXX YYYY . + . ID=gene02;Parent=operon01;name=resB
-chrX . gene XXXX YYYY . + . ID=gene03;Parent=operon01;name=resX
-chrX . gene XXXX YYYY . + . ID=gene04;Parent=operon01;name=resZ
+chrX . gene XXXX YYYY . + . ID=gene01;Parent=operon01;Name=resA
+chrX . gene XXXX YYYY . + . ID=gene02;Parent=operon01;Name=resB
+chrX . gene XXXX YYYY . + . ID=gene03;Parent=operon01;Name=resX
+chrX . gene XXXX YYYY . + . ID=gene04;Parent=operon01;Name=resZ
chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02,gene03,gene04
chrX . exon XXXX YYYY . + . ID=exon00001;Parent=tran01
chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene01