Output files

Jakub Vasicek edited this page Jul 8, 2024 · 12 revisions

Concatenated FASTA file

The main result of the pipeline is the concatenated FASTA file, consisting of the ProHap and/or ProVar output, reference sequences from Ensembl, and contaminant sequences. The resulting file has the following format:

>tag|accession|<positions_within_protein> <protein_starts> <matching_proteins> <reading_frames>
PROTEINSEQUENCE

The header of the protein entry is formatted as >tag|accession|description. The accession field is used as the identifier of the entry when annotating the peptide-spectrum matches (PSMs). The description field is then used by the annotation pipeline to align the peptide with transcripts, genes, and variant coordinates. Please note that there can be multiple protein entries sharing the same sequence - therefore, the description field may contain information about multiple proteins. The tag can be used to quickly distinguish between contaminants, canonical, haplotype, and variant sequences.
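As a minimal sketch of how such a header line could be taken apart, the snippet below splits on the first two `|` characters only, so that a `|` occurring inside the description would be preserved. The function name `split_header` and the example header are hypothetical and not part of the pipeline.

```python
def split_header(header_line):
    """Split a '>tag|accession|description' header into its three parts."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % header_line)
    # Split at most twice, so the description is returned untouched.
    parts = header_line[1:].rstrip("\n").split("|", 2)
    if len(parts) != 3:
        raise ValueError("expected tag|accession|description: %r" % header_line)
    tag, accession, description = parts
    return tag, accession, description


# Hypothetical header, for illustration only.
tag, accession, description = split_header(">generic_var|prot_1|0 0 var_1 0")
print(tag, accession, description)
```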

Optionally, the concatenated FASTA can be simplified, with the tag and the information contained in the description fields extracted to a tab-separated file. The simplified FASTA file then contains only the protein accession and the name of the associated gene. Contaminant sequences are additionally marked as contaminants. This option is recommended for compatibility with search engines and other tools.

The simplified FASTA file has the following format:

>accession GN=<gene_name>
PROTEINSEQUENCE
>accession CONTAMINANT GN=<contaminant protein name (e.g., CAH1_HUMAN)>
PROTEINSEQUENCE
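The simplified header can be read with a few string operations. The sketch below assumes the header looks exactly like the two templates above (accession, an optional `CONTAMINANT` marker, then `GN=`); the function name and the accessions in the usage lines are hypothetical.

```python
def parse_simplified_header(header_line):
    """Return (accession, is_contaminant, gene_name) for a simplified header."""
    body = header_line.lstrip(">").strip()
    # The accession is everything up to the first space.
    accession, _, rest = body.partition(" ")
    rest = rest.strip()
    is_contaminant = rest.startswith("CONTAMINANT")
    if is_contaminant:
        rest = rest[len("CONTAMINANT"):].strip()
    # The remainder is expected to be 'GN=<name>'; None if it is absent.
    gene_name = rest[len("GN="):] if rest.startswith("GN=") else None
    return accession, is_contaminant, gene_name


# Hypothetical accessions, for illustration only.
print(parse_simplified_header(">prot_1 GN=BRCA2"))
print(parse_simplified_header(">cont_1 CONTAMINANT GN=CAH1_HUMAN"))
```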

Possible tag values are:

  • generic_cont: At least one of the matching sequences is a contaminant.
  • generic_ensref: No matching contaminant, at least one of the matching sequences is an Ensembl canonical protein.
  • generic_var: No matching contaminant or canonical protein, at least one of the matching sequences is a variant protein (obtained by ProVar).
  • generic_enshap: No matching contaminant, canonical or variant protein, all of the matching sequences are non-canonical protein haplotypes (obtained by ProHap).
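
The tag values above follow a fixed priority: contaminant first, then canonical, then variant, then haplotype. A minimal sketch of that rule is given below; the kind labels (`"contaminant"`, `"canonical"`, `"variant"`, `"haplotype"`) and the function name are hypothetical and only illustrate the ordering, not how the pipeline implements it.

```python
# Priority order taken from the list of tag values, highest first.
TAG_PRIORITY = [
    ("contaminant", "generic_cont"),
    ("canonical", "generic_ensref"),
    ("variant", "generic_var"),
    ("haplotype", "generic_enshap"),
]


def assign_tag(kinds):
    """Pick the tag for an entry, given the kinds of all matching sequences."""
    kinds = set(kinds)
    for kind, tag in TAG_PRIORITY:
        if kind in kinds:
            return tag
    raise ValueError("no known sequence kind in %r" % sorted(kinds))


# A canonical protein that also matches a haplotype is tagged as canonical.
print(assign_tag(["haplotype", "canonical"]))
```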

The fields included in the description of the FASTA elements are the following:

  • positions_within_protein: positions of this sub-sequence within the whole protein sequences, delimited by semicolons
  • protein_starts: positions of the first residue (usually M) within the whole cDNA translation
  • matching_proteins: IDs of the whole protein sequences matching to this sub-sequence. Variant and haplotype IDs can be mapped to the metadata table provided.
  • reading_frames: Reading frames in which the matching sequences are translated, if known.
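
The description could be unpacked along these lines. This is a sketch under two assumptions that should be checked against an actual header: the four fields are separated by whitespace, and every field holds semicolon-delimited values with no further prefix. The `Description` type, the function name, and the example IDs are hypothetical.

```python
from typing import List, NamedTuple


class Description(NamedTuple):
    positions_within_protein: List[str]
    protein_starts: List[str]
    matching_proteins: List[str]
    reading_frames: List[str]


def parse_description(description):
    """Split the description into its four fields (assumed layout, see above)."""
    fields = description.split()
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    # Each field may describe several matching proteins, one value per protein.
    return Description(*(field.split(";") for field in fields))


# Hypothetical description with two matching proteins.
parsed = parse_description("0;15 0;0 prot_A;haplo_1 0;0")
print(parsed.matching_proteins)
```

Values are kept as strings because a reading frame may be unknown, in which case its representation in the file is not specified here.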

Furthermore, ProHap and ProVar produce additional files:

ProHap output

  1. Haplotype table file provided in a tab-separated text-file format. The columns given by ProHap are:
  • chromosome
  • TranscriptID: Identifier of the transcript in Ensembl format (ENSTxxx)
  • transcript_biotype: Biotype of the matching transcript in Ensembl.
  • HaplotypeID: ID of the haplotype sequence, matching to the ID in the FASTA entry description.
  • VCF_IDs: IDs of the matching lines in the VCF file if provided
  • DNA_changes: List of changes in the format position:REF>ALT, mapped to the DNA coordinates within the chromosome
  • allele_frequencies: List of allele frequencies of the variants included in the haplotype
  • cDNA_changes: List of changes in the format position:REF>ALT, mapped to the coordinates within the cDNA sequence of this transcript
  • all_protein_changes: List of amino acid changes in the format position:REF>position:ALT, mapped to the coordinates within the protein sequence. The start codon is at position 0, so if a change happens in the 5' untranslated region (UTR), its coordinates within the protein are negative.
  • variant_types: Consequence type of variant (e.g., SAV, inframe-indel, synonymous, ...) for every variant on the protein level
  • protein_changes: List of amino acid changes in the protein excluding synonymous variants.
  • reading_frame: Canonical reading frame for this transcript, if known.
  • protein_prefix_length: Number of codons in the 5' UTR
  • start_missing: Boolean - is the canonical annotation of the start codon missing for this transcript?
  • start_lost: Boolean - does one of the variants cause a loss of the start codon?
  • splice_sites_affected: List of splice sites affected by a variant, if any. (Splice site 0 happens between exon 1 and 2)
  • occurrence_count: Number of occurrences of this haplotype within the participants of the 1000 Genomes project (or within the cohort provided in the phased genotype VCF)
  • frequency: Frequency of this haplotype within the participants of the 1000 Genomes project (or within the cohort provided in the phased genotype VCF)
  • frequency_population: Frequency of this haplotype among populations (assignment of samples to populations given as input)
  • frequency_superpopulation: Frequency of this haplotype among superpopulations (assignment of samples to superpopulations given as input)
  2. FASTA file containing the translations of all haplotype cDNA sequences, before the removal of UTR sequences. The haplotype ID matching the table above is specified in the description part of the header line. Identical sequences are merged.

  3. Tab-separated file containing the list of samples in which each of the protein haplotype sequences has been predicted. For example, if the file contains the following:

HaplotypeID     samples
haplo_chr1_4    HG02572:2;HG02717:1

The haplotype sequence haplo_chr1_4 is encoded by the second copy of the respective gene in sample HG02572 (strand 2), and by the first copy in sample HG02717.
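
The `samples` value from the example above can be split into (sample, copy) pairs as sketched below. The function name `parse_samples` is hypothetical; the delimiters (`;` between samples, `:` before the copy number) are taken from the example line.

```python
def parse_samples(samples_field):
    """Turn 'HG02572:2;HG02717:1' into [('HG02572', 2), ('HG02717', 1)]."""
    pairs = []
    for item in samples_field.split(";"):
        # Split on the last ':' so the sample ID itself is left intact.
        sample, copy_number = item.rsplit(":", 1)
        pairs.append((sample, int(copy_number)))
    return pairs


for sample, copy_number in parse_samples("HG02572:2;HG02717:1"):
    print(sample, copy_number)
```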

ProVar output

  1. Metadata file provided in a tab-separated text-file format. The columns given by ProVar are:
  • chromosome
  • TranscriptID: Identifier of the transcript in Ensembl format (ENSTxxx)
  • transcript_biotype: Biotype of the matching transcript in Ensembl.
  • variantID: ID of the variant sequence (unique per transcript x allele), matching to the ID in the FASTA entry description.
  • vcfID: ID of the matching line in the VCF file, if provided
  • DNA_change: Change in the format position:REF>ALT, mapped to the DNA coordinates within the chromosome
  • cDNA_change: Change in the format position:REF>ALT, mapped to the coordinates within the cDNA of this transcript
  • protein_change: Amino acid change in the format position:REF>ALT, mapped to the coordinates within the protein sequence. The start codon is at position 0, so if the change happens in the 5' untranslated region (UTR), its coordinates within the protein are negative.
  • reading_frame: Canonical reading frame for this transcript, if known.
  • protein_prefix_length: Number of codons in the 5' UTR
  • start_missing: Boolean - is the canonical annotation of the start codon missing for this transcript?
  • start_lost: Boolean - does the variant cause a loss of the start codon?
  • splice_site_affected: Which splice site, if any, is affected by the variant. (Splice site 0 happens between exon 1 and 2)
  2. FASTA file containing the translations of all variant cDNA sequences, before the removal of UTR sequences. The variant ID matching the table above is specified in the description part of the header line. Identical sequences are merged.
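
Because the start codon is at position 0, a negative position in the `protein_change` column of the metadata file indicates a change in the 5' UTR. The sketch below parses a single value in the `position:REF>ALT` format and flags that case. The function name and the example values are hypothetical, and the exact spelling of REF and ALT in real files is not assumed beyond what the format string states.

```python
import re

# position:REF>ALT, where position may be negative (5' UTR).
CHANGE_RE = re.compile(r"^(-?\d+):([^>]*)>(.*)$")


def parse_protein_change(change):
    """Return (position, ref, alt, in_5prime_utr) for one protein_change value."""
    match = CHANGE_RE.match(change)
    if match is None:
        raise ValueError("not in position:REF>ALT format: %r" % change)
    position = int(match.group(1))
    # Position 0 is the start codon, so anything below 0 lies in the 5' UTR.
    return position, match.group(2), match.group(3), position < 0


# Hypothetical values, for illustration only.
print(parse_protein_change("12:A>T"))
print(parse_protein_change("-3:K>R"))
```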