Whole Genome Sequencing Pipeline

Example

First create the support file, which points to where the fastq files are located:

bash makesupportfile.sh data/ fq.gz > examples/support.txt

The resulting file should look like this:

code	f1	f2
sim_reads	data//sim_reads_1.fq.gz	data//sim_reads_2.fq.gz
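
For reference, here is a minimal sketch of what such a generator might do (an assumption about its behaviour, not the actual script): pair up *_1.<ext> and *_2.<ext> files under the given directory and derive the code from the filename.

#!/bin/bash
# makesupportfile.sh (illustrative sketch): pair <code>_1.<ext> / <code>_2.<ext> files
dir=$1; ext=$2
echo -e "code\tf1\tf2"
for f1 in "$dir"/*_1."$ext"; do
    f2=${f1/_1.$ext/_2.$ext}                 # matching second read file
    code=$(basename "$f1" "_1.$ext")         # sample code from the filename
    echo -e "$code\t$f1\t$f2"
done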

Then do the alignment:

bash WGS_pipeline.sh \
     --mode align \
     --supportFrame examples/support.txt \
     --reference 1kg \
     --aligner-tparam 320 \
     --inputFormat STDFQ \
     --projectID Project1

This creates the following directory structure:

Project1/
    README
    support.txt
    align/
        scripts/
        data/
        out/
        err/

support.txt is just a copy of the support file that was specified. README should be updated automatically, but this has not been implemented yet (it is also not clear what information would be useful there). If all the fastq.gz files specified in support.txt are found, this generates the script Project1/align/scripts/align.sh containing an SGE job array, where each job in the array aligns one sample. Check the output in Project1/align/scripts. This script can then be submitted to the cluster:

qsub Project1/align/scripts/align.sh

If certain BAM files already exist, they will not be regenerated unless the --force option is used; this argument applies to the other modes too (gvcf, combinegvcf, etc.). Once the jobs have been submitted, use qstat to check their status. Each job in the array writes its stdout log to Project1/align/out/ and its stderr log to Project1/align/err/.
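
For example, to rebuild existing BAMs, the same invocation as above can be rerun with the flag appended:

bash WGS_pipeline.sh \
     --mode align \
     --supportFrame examples/support.txt \
     --reference 1kg \
     --aligner-tparam 320 \
     --inputFormat STDFQ \
     --projectID Project1 \
     --force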

Once the job array finishes, you can check that all files exist by running the bash script again:

bash WGS_pipeline.sh \
     --mode align \
     --supportFrame examples/support.txt \
     --reference 1kg \
     --aligner-tparam 320 \
     --inputFormat STDFQ \
     --projectID Project1 \
     --extraID Project1_

The job array should now be empty if all files have been successfully created. If there are still scripts to be submitted, first check the log files for the type of error. If it is a transient error, such as a stale NFS handle due to a lost connection, then submit the jobs again.
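
For example, a quick scan of the error logs for that particular failure (the grep pattern is illustrative):

# list array tasks whose stderr mentions a stale NFS handle
grep -il "stale nfs" Project1/align/err/*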

Once the alignment is complete, you can move on to gvcf generation:

bash WGS_pipeline.sh \
     --mode gvcf \
     --supportFrame examples/support.txt \
     --reference 1kg \
     --aligner-tparam 320 \
     --inputFormat STDFQ \
     --projectID Project1

This follows the same process as described above, creating the same directory structure.

Overview

The two main scripts are:

  • WGS_submission_script.sh
  • WGS_pipeline.sh

WGS_submission_script.sh defines the command-line arguments with which WGS_pipeline.sh is called:

bash WGS_pipeline.sh --mode [align|gvcf|jointvcf] --supportFrame < > --reference [hg38_noAlt|1kg] --aligner-tparam 320 --inputFormat STDFQ --extraID < > --projectID < >

The pipeline can be run in three modes: alignment, single-sample variant calling, or joint variant calling:

  • align
  • gvcf
  • jointvcf

If run in align mode, WGS_pipeline will generate the following bash scripts:

  • cluster/submission/align.sh
  • cluster/submission/align_table.sh

If run in a variant-calling mode, the following bash scripts will be generated:

  • cluster/submission/makeVCF.sh
  • cluster/submission/makeVCF_table.sh

The first bash script is submitted to the cluster and uses the second script to specify the SGE job array; a sketch of this mechanism follows. Some parameters are common to both modes and are explained in the options below.
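
A plausible shape for that pair of scripts (an illustrative sketch of the mechanism, not the generated code; names and values are assumptions):

# align_table.sh (sketch): one entry per array task
tasks[1]=sample_A
tasks[2]=sample_B

# align.sh (sketch): SGE array directives, then a lookup by task id
#$ -t 1-2
#$ -o Project1/align/out/ -e Project1/align/err/
source align_table.sh
code=${tasks[$SGE_TASK_ID]}
echo "aligning sample $code"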

Options

supportFrame

The pipeline submission script assumes that you have written a "support" data frame. There's a convenience script to generate it:

bash makesupportfile.sh ../data/ fq.gz > ../examples/support.txt

If you want to generate it manually, the format of the first row is:

code f1 f2

where f1 and f2 are the paths to the paired-end fastq files and code is an identifier for the sample of interest. The code is used to create a subdirectory under aligned where the output goes.
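
Before submitting, it can be worth verifying that every path in the support file exists (an illustrative check, not part of the pipeline):

awk 'NR > 1 {print $2; print $3}' examples/support.txt |
    while read -r f; do [ -e "$f" ] || echo "missing: $f"; done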

reference

The reference build on which to do the alignment. Two reference builds are supported:

  • hg38_noAlt
  • 1kg

extraID

The parameter "extraID" defines the batch. It is added to the RG header of the BAM files and is used to distinguish samples in the merged VCF files.

Alignment (align mode)

This generates the scripts:

  • cluster/submission/align.sh
  • cluster/submission/align_table.sh

align.sh is submitted to the queue using qsub and reads the parameters for each task from the job array defined in align_table.sh.

The alignment is done using novoalign on the specified reference build, which generates a SAM stream. The SAM is piped to samblaster to mark duplicates and extract discordant and split reads. The output is then piped to samtools, which generates the BAM file. Finally novosort is run on the BAM.
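
A sketch of that pipe for one sample; the tool order comes from the text above, but every flag here is an illustrative assumption:

# illustrative pipe; flags are assumptions (-t 320 presumably carries --aligner-tparam)
novoalign -d 1kg.nix -f sample_1.fq.gz sample_2.fq.gz -F STDFQ -t 320 -o SAM |
    samblaster -d sample_disc.sam -s sample_split.sam |
    samtools view -Sb - > sample.bam
novosort -m 8G -i -o sample_sorted_unique.bam sample.bam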

The default output folder is:

aligned

There are two output files:

*_sorted_unique.bam
*_disc_sorted.bam

*_sorted_unique.bam contains sorted unique reads; this is the file that should be used to assess sequencing depth. *_disc_sorted.bam contains discordant reads and should be used for CNV calling.
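
For example, mean coverage can be estimated from the sorted unique BAM (standard samtools usage; the awk averaging step is illustrative):

samtools depth -a sample_sorted_unique.bam | awk '{sum += $3} END {if (NR) print sum / NR}'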

Single sample variant calling (gvcf mode)

The variant calling is done by GATK using HaplotypeCaller.

The output of HaplotypeCaller needs to be filtered, by variant recalibration (preferred) or hard-filtering, before use in downstream analyses.
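
As an illustration of the hard-filtering route, a GATK 3-style invocation (the thresholds follow GATK's generic SNP suggestions and are assumptions, not values taken from this pipeline):

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fasta -V sample.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filterName "hard_filter" \
    -o sample_filtered.vcf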

The VCF file contains the following information:

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared again
st the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for eac
h ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for ea
ch ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">

Note: joint calling is done in the next step using the full set of gvcf files; it is not usually done within a single batch.

Merging of samples

This is done by GATK's CombineGVCFs.
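
For example (GATK 3-style invocation; file names are placeholders):

java -jar GenomeAnalysisTK.jar -T CombineGVCFs \
    -R reference.fasta \
    --variant sample1.g.vcf --variant sample2.g.vcf \
    -o combined.g.vcf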

Joint variant calling (jointvcf mode)

Single-sample gVCFs called by HaplotypeCaller are combined and jointly genotyped per chromosome using GenotypeGVCFs. After this step, the generated joint VCFs can no longer be merged using CombineGVCFs.
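
An illustrative per-chromosome invocation (GATK 3-style; -L restricts genotyping to one chromosome):

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta -L chr20 \
    --variant combined.g.vcf \
    -o joint_chr20.vcf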

Variant selection, filtering and recalibration

All these steps are handled by the msample script.

Dependencies