A few minor corrections #9

Open
wants to merge 4 commits into
base: master

Conversation

@arendsee arendsee commented Jun 18, 2017

Just a few minor corrections.

arendsee added 4 commits June 18, 2017 11:05
This example was meant to show how ID tags are not needed for features
without children. The example removed the ID entries for the exons, but
not for the CDSs.
@arendsee arendsee changed the title Fix typo in fasta header A few minor corrections Jun 18, 2017
ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003

Without the IDs here we won't be able to keep these two alternative CDSs separate, since they are both in the same mRNA.

Author

I don't quite understand. The parent mRNAs are different (mRNA00001, mRNA00002, and mRNA00003).

Yeah but cds00003 and cds00004 both happen to be on the same mRNA in this example (mRNA00003).

Author

Ah, I see: cds00003 and cds00004 refer to different proteins translated from the same mRNA (different start codons). OK, you're right.

The CDS ID should refer to the common ID shared by all genomic intervals in the CDS. This makes sense. So the CDS is considered a discontinuous feature? If this is the right way to do it then pretty much everyone is doing it wrong, giving every CDS row a unique ID.
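To make the point above concrete, here is a minimal Python sketch (not part of the pull request) that groups CDS rows first by `ID` and then by `Parent`. The helper names `parse_attributes` and `group_cds` are hypothetical, the rows are modelled on the example under review, and the cds00004 coordinates are assumed for illustration. Grouping by `ID` keeps the two alternative CDSs apart; grouping by `Parent` alone merges them.

```python
from collections import defaultdict

# Rows modelled on the example in this thread; the cds00004 coordinates
# are an assumption made for illustration.
GFF_ROWS = """\
ctg123\t.\tCDS\t3301\t3902\t.\t+\t0\tID=cds00003;Parent=mRNA00003
ctg123\t.\tCDS\t5000\t5500\t.\t+\t1\tID=cds00003;Parent=mRNA00003
ctg123\t.\tCDS\t3391\t3902\t.\t+\t0\tID=cds00004;Parent=mRNA00003
ctg123\t.\tCDS\t5000\t5500\t.\t+\t1\tID=cds00004;Parent=mRNA00003
"""


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def group_cds(gff_text, key):
    """Collect (start, end) intervals of CDS rows, keyed by one attribute."""
    groups = defaultdict(list)
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        groups[attrs.get(key)].append((int(cols[3]), int(cols[4])))
    return dict(groups)


by_id = group_cds(GFF_ROWS, "ID")          # two discontinuous CDS features
by_parent = group_cds(GFF_ROWS, "Parent")  # one undifferentiated pile of rows
```

With shared IDs, each key in `by_id` is one discontinuous CDS; without them, only `by_parent` is available and the two proteins cannot be told apart.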

Actually in my limited experience (NCBI and Ensembl downloads for human and mouse), they seem to be doing it "right", giving all segments of the discontinuous CDS feature the same ID.

Author

That's good. Phytozome10, where I get most of my plant genomes, forces unique IDs. Incidentally, I think a group at my university is planning some giant analysis of GFF files, as part of a test of a new domain-specific language they are designing. Maybe I can get them to look into conventions in the "wild".

But anyway, I agree about the CDS IDs here. I'll push a commit in a bit that reverts the change.

Author

Since a CDS is discontinuous, are CDS IDs strictly required, even in mRNAs that contain only a single CDS?

I'm not sure. Certainly not all CDSs are discontinuous, so I'd guess those would have even less use for an ID. For the discontinuous case I would probably personally still want to see all the chunks having the same ID just for explicitness, but I'm not sure the spec should mandate that.

Author

So perhaps CDS IDs should be strictly required only when multiple CDSs share one mRNA.

When CDS IDs are used, should it be strictly required that each interval in the CDS share the same ID? For example, is this incorrect:

Chr1    Araport11       gene    3631    5899    .       +       .       ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1
Chr1    Araport11       mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Note=NAC domain containing protein 1
Chr1    Araport11       five_prime_UTR  3631    3759    .       +       .       ID=AT1G01010:five_prime_UTR:1;Parent=AT1G01010.1
Chr1    Araport11       exon    3631    3913    .       +       .       ID=AT1G01010:exon:1;Parent=AT1G01010.1
Chr1    Araport11       CDS     3760    3913    .       +       0       ID=AT1G01010:CDS:1;Parent=AT1G01010.1
Chr1    Araport11       exon    3996    4276    .       +       .       ID=AT1G01010:exon:2;Parent=AT1G01010.1
Chr1    Araport11       CDS     3996    4276    .       +       2       ID=AT1G01010:CDS:2;Parent=AT1G01010.1
Chr1    Araport11       exon    4486    4605    .       +       .       ID=AT1G01010:exon:3;Parent=AT1G01010.1
Chr1    Araport11       CDS     4486    4605    .       +       0       ID=AT1G01010:CDS:3;Parent=AT1G01010.1
Chr1    Araport11       exon    4706    5095    .       +       .       ID=AT1G01010:exon:4;Parent=AT1G01010.1
Chr1    Araport11       CDS     4706    5095    .       +       0       ID=AT1G01010:CDS:4;Parent=AT1G01010.1
Chr1    Araport11       exon    5174    5326    .       +       .       ID=AT1G01010:exon:5;Parent=AT1G01010.1
Chr1    Araport11       CDS     5174    5326    .       +       0       ID=AT1G01010:CDS:5;Parent=AT1G01010.1
Chr1    Araport11       CDS     5439    5630    .       +       0       ID=AT1G01010:CDS:6;Parent=AT1G01010.1
Chr1    Araport11       exon    5439    5899    .       +       .       ID=AT1G01010:exon:6;Parent=AT1G01010.1
Chr1    Araport11       three_prime_UTR 5631    5899    .       +       .       ID=AT1G01010:three_prime_UTR:1;Parent=AT1G01010.1
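One way to survey which convention a file follows is sketched below in Python. This is a rough heuristic, not a validator: the function names `cds_ids_by_parent` and `classify` and the label strings are hypothetical, and the sample uses three CDS rows from the Araport11 example above. A result of "unique-per-row" only says that every CDS row under an mRNA has its own ID; it cannot tell whether those rows are meant to be one discontinuous CDS or several single-interval ones.

```python
from collections import defaultdict

# Three CDS rows taken from the Araport11 example in this thread.
ARAPORT_ROWS = "\n".join([
    "Chr1\tAraport11\tCDS\t3760\t3913\t.\t+\t0\tID=AT1G01010:CDS:1;Parent=AT1G01010.1",
    "Chr1\tAraport11\tCDS\t3996\t4276\t.\t+\t2\tID=AT1G01010:CDS:2;Parent=AT1G01010.1",
    "Chr1\tAraport11\tCDS\t4486\t4605\t.\t+\t0\tID=AT1G01010:CDS:3;Parent=AT1G01010.1",
])


def cds_ids_by_parent(gff_text):
    """Map each Parent to the list of IDs (or None) on its CDS rows."""
    ids = defaultdict(list)
    for line in gff_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # Parent may hold several comma-separated values; rows without a
        # Parent end up under the empty-string key.
        for parent in attrs.get("Parent", "").split(","):
            ids[parent].append(attrs.get("ID"))
    return ids


def classify(row_ids):
    """Label the ID convention used by the CDS rows of one parent."""
    if len(row_ids) == 1:
        return "single-row"
    present = [i for i in row_ids if i is not None]
    if not present:
        return "no-ids"
    if len(present) < len(row_ids):
        return "mixed"
    if len(set(present)) == len(present):
        return "unique-per-row"
    return "shared"


report = {parent: classify(ids)
          for parent, ids in cds_ids_by_parent(ARAPORT_ROWS).items()}
```

For the rows above, the only parent is `AT1G01010.1` and its rows are labelled "unique-per-row", which is the convention being asked about.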

> Maybe I can get them to look into conventions in the "wild".

My fantasy: a registry of all GFF3 files in the wild (or at least a representative sample to start with), and a Jenkins or other CI job that regularly runs a GFF3 validator and reports deviations from the normative standard and from de facto conventions.

@nathandunn how hard would this be?

@alexhenrie alexhenrie force-pushed the master branch 6 times, most recently from 6f5195d to fe73505 Compare August 18, 2020 23:18
3 participants