First, create the support file, which points to where the fastq files lie:

```sh
bash makesupportfile.sh data/ fq.gz > examples/support.txt
```

The result should look like this:

```
code f1 f2
sim_reads data//sim_reads_1.fq.gz data//sim_reads_2.fq.gz
```
Then do the alignment:

```sh
bash WGS_pipeline.sh \
  --mode align \
  --supportFrame examples/support.txt \
  --reference 1kg \
  --aligner-tparam 320 \
  --inputFormat STDFQ \
  --projectID Project1
```
This creates the following directory structure:

```
Project1/
    README
    support.txt
    align/
        scripts/
        data/
        out/
        err/
```

`support.txt` is just a copy of the support file that was specified. `README` should be updated automatically, but this has not been implemented yet (it is also not clear what information would be useful).
If all the fastq.gz files specified in `support.txt` are found, this generates the script `Project1/align/scripts/align.sh`, containing an SGE job array in which each job performs the alignment for one sample. Check the output in `Project1/align/scripts`.
This script can then be submitted to the cluster:

```sh
qsub Project1/align/scripts/align.sh
```
If certain BAM files already exist, they will not be regenerated unless the `--force` option is used. The `--force` argument applies to the other modes too: gvcf, combinegvcf, etc.
Once the jobs have been submitted, use `qstat` to check their running status.
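For example, to list only your own jobs:

```sh
qstat -u "$USER"
```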
Each job in the array writes its logs to `Project1/align/out/` and `Project1/align/err/`.
Once the job array finishes, you can check that all files exist by running the bash script again:
```sh
bash WGS_pipeline.sh \
  --mode align \
  --supportFrame examples/support.txt \
  --reference 1kg \
  --aligner-tparam 320 \
  --inputFormat STDFQ \
  --projectID Project1 \
  --extraID Project1_
```
The job array should now be empty if all files have been created successfully. If there are still scripts to be submitted, first check the log files for the type of error. If it is a transient error, such as a stale NFS handle caused by a lost connection, submit the jobs again.
Once the alignment is complete you can move on to the gVCF generation:

```sh
bash WGS_pipeline.sh \
  --mode gvcfs \
  --supportFrame examples/support.txt \
  --reference 1kg \
  --aligner-tparam 320 \
  --inputFormat STDFQ \
  --projectID Project1
```
This follows the same process as described above, with the same directory structure created.
The two main scripts are:
- WGS_submission_script.sh
- WGS_pipeline.sh
`WGS_submission_script.sh` defines the command-line arguments with which `WGS_pipeline.sh` is called.
```sh
bash WGS_pipeline.sh --mode [align|gvcf|jointvcf] --supportFrame <support file> --reference [hg38_noAlt|1kg] --aligner-tparam 320 --inputFormat STDFQ --extraID <batch ID> --projectID <project ID>
```
The pipeline can be run in three modes: alignment, single-sample variant calling, or joint variant calling:
- align
- gvcf
- jointvcf
If run for alignment, WGS_pipeline will generate the following bash scripts:
- cluster/submission/align.sh
- cluster/submission/align_table.sh
If run for variant calling, the following bash scripts will be generated:
- cluster/submission/makeVCF.sh
- cluster/submission/makeVCF_table.sh
The first bash script is submitted to the cluster and uses the second to specify the SGE job array (see the sketch below). Some parameters are common to both modes; they are explained below.
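The pattern is roughly the following (a simplified sketch, not the generated scripts verbatim; the task count is an assumption):

```sh
#$ -S /bin/bash
#$ -t 1-2                  # SGE job array: one task per sample (count assumed)
# Each task looks up the command on the line of the table that
# matches its task ID, then runs it.
CMD=$(sed -n "${SGE_TASK_ID}p" cluster/submission/align_table.sh)
eval "$CMD"
```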
The pipeline submission script assumes that you have written a "support" data frame. There is a convenience script to generate it:

```sh
bash makesupport.sh ../data/ fq.gz > ../examples/support.txt
```
If you want to generate it manually, the format of the first row is:

```
code f1 f2
```

where f1 and f2 are the paths to the paired-end fastq files and code is an identifier for the exome of interest. The code is used to create a subdirectory under `aligned` where the output goes.
The `--reference` parameter selects the reference build on which to do the alignment. Two reference builds are supported:
- hg38_noAlt
- 1kg
The `--extraID` parameter defines the batch. It is added to the RG header of the BAM files and is used to distinguish samples in the merged VCF files.
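You can verify the batch prefix by inspecting the read group of a finished BAM; the file name and the exact ID/SM layout shown in the comment below are assumptions for illustration, not pipeline output:

```sh
samtools view -H sample_sorted_unique.bam | grep '^@RG'
# e.g. @RG  ID:Project1_sim_reads  SM:Project1_sim_reads   (layout assumed)
```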
Running the pipeline in align mode generates the scripts:
- cluster/submission/align.sh
- cluster/submission/align_table.sh
`align.sh` is submitted to the queue using qsub and reads its jobs from the job array in `align_table.sh`.
The alignment is done using novoalign against the specified reference build, which generates a SAM stream. The SAM is piped to samblaster to mark duplicates and extract discordant and split reads. The output is then piped to samtools, which generates the BAM file. Finally, novosort is run on the BAM.
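Conceptually, the per-sample command resembles the following sketch. The reference index name, file names, and resource settings are assumptions, and `-t 320` is assumed to be the threshold that `--aligner-tparam` passes through; the pipeline's exact flags may differ.

```sh
# novoalign -> samblaster -> samtools, then sort/index with novosort
novoalign -d 1kg.nix -f sample_1.fq.gz sample_2.fq.gz -F STDFQ -t 320 -o SAM \
  | samblaster -d sample_disc.sam -s sample_split.sam \
  | samtools view -Sb - > sample.bam
novosort -m 8G -t /tmp -i -o sample_sorted_unique.bam sample.bam
```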
The default output folder is `aligned`.
There are two output files:

- `*_sorted_unique.bam`
- `*_disc_sorted.bam`

`*_sorted_unique.bam` contains the sorted unique reads; this is the file that should be used to assess sequencing depth. `*_disc_sorted.bam` contains the discordant reads and should be used for CNV calling.
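For example, mean depth can be estimated from the sorted BAM (file name illustrative):

```sh
# Average depth over all covered positions
samtools depth sample_sorted_unique.bam | awk '{sum += $3} END {if (NR) print sum / NR}'
```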
The variant calling is done by GATK using HaplotypeCaller.
The output of HaplotypeCaller will need to be filtered by variant recalibration (best) or hard-filtering before use in downstream analyses.
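The pipeline generates these calls for you; a minimal sketch of the underlying gVCF invocation, assuming GATK 3.x and illustrative file names, would be:

```sh
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -R ref.fasta \
  -I sample_sorted_unique.bam \
  --emitRefConfidence GVCF \
  -o sample.g.vcf.gz
```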
The VCF file contains the following information:

```
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
```
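These annotations can be extracted for QC or filtering; for instance, with bcftools (assumed to be available; file name illustrative):

```sh
# Tabulate depth and mapping quality per site
bcftools query -f '%CHROM\t%POS\t%INFO/DP\t%INFO/MQ\n' joint.vcf.gz | head
```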
Note: joint calling is done using the full set of gVCF files in the next step; it is not usually done within a batch.
This is done by CombineGVCFs. Single-sample gVCFs called by HaplotypeCaller are combined per chromosome using GenotypeGVCFs. After this step, the generated joint VCFs cannot be merged using CombineGVCFs.
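Roughly, the underlying GATK 3.x commands would look like the following sketch (file and interval names are illustrative; the actual commands are generated by the pipeline):

```sh
# Merge the single-sample gVCFs into a multi-sample gVCF
java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fasta \
  -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
  -o batch.g.vcf.gz

# Joint genotyping, one chromosome at a time
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta \
  -L chr20 -V batch.g.vcf.gz \
  -o chr20_joint.vcf.gz
```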
All these steps are handled by the msample script.