GIP steps¶
Prepare genome and annotation¶
GIP prepares the genome assembly and annotation files in the first two processes: processGeneFunction and prepareGenome The output is stored in the gipOut/genome directory and includes:
db/ |
BWA database |
geneFunction.tsv |
gene functions list |
genome.chrSize |
chromosome size (bp) |
genome.dict |
genome sequence dictionary |
genome.fa |
genome sequence |
genome.fa.fai |
genome sequence index |
genome.gaps.gz |
genome gaps |
repeats/ |
repeats coordinates |
snpEff/ |
snpEff database |
--geneFunction parameter. This file must:include all the genes listed in the
--annotationfile;be a <Tab> separated list with the gene identifier in the first column, and the function in the second:
--geneFunction is not specified by default the geneFunction.tsv file reports the gene list with not available (NA) functions.--genome file where the repetitive positions have been lowercased.the ability to detect both transposable elements and simple repeats;
fast execution;
accuracy when dealing with genomes with unusual nucleotide composition (e.g. Leishmania).
--repeatLibraryMap reads and collect alignment statistics¶
BWA to map the reads on the reference genome;
Samtools modules including fixmate, sort and index to reformat the alignment files;
Picard MarkDuplicates with option VALIDATION_STRINGENCY=LENIENT to detect and remove optical or PCR read duplicates.
delDup parameter can be set to false to just label instead of removing reads duplicates. The files generated at this step are placed in the gipOut/samples/sampleId folder, and include:sampleId.bam |
alignment file |
sampleId.bam.bai |
alignment index file |
sampleId.MarkDup.table |
picard MarkDuplicates tabular output |
sampleId.MarkDup.histData |
picard MarkDuplicates histogram data output |
sampleId.MarkDup.hist.png |
plot of picard MarkDuplicates histogram data |
chrCoverageMedians_sampleId |
chromosome coverage table |
sampleId.alignmentMetrics.table |
picard CollectInsertSizeMetrics tabular output |
sampleId.insertSize.histData |
picard CollectInsertSizeMetrics histogram data output |
sampleId.insertSize.hist.png |
plot of picard CollectInsertSizeMetrics histogram data |
sampleId.insertSize.table |
picard CollectInsertSizeMetrics tabular output |
--bigWigOPT parameter. The default is bigWigOPT="--normalizeUsing RPKM --ignoreDuplicates --binSize 10 --smoothLength 30". The first two option indicates that the coverage values will be generated ignoring duplicated reads and applying an RPKM normalization. The two latter options control the size of the bigWig bins (bp) and the size of the window to average the number of reads. Please refer to the bamCoverage documentation for more details.Evaluate chromosome coverage¶
CHR |
chromosome identifier |
MEDIANCOV |
median chromosome sequencing coverage |
MEDIANCOVminus2MAD |
MEDIANCOV plus 2 median absolute deviation |
MEDIANCOVplus2MAD |
MEDIANCOV minus 2 median absolute deviation |
--chrs, listing the identifiers of the chromosomes of interest. Please note that the chromosome identifiers must be separated by white spaces and provided as a text string enclosed in backslash escaped quotation marks. For instance if the user what to limit the analysis to chromosomes chr1, chr2 and chr3 then parameter should be chrs="\"1 2 3\"".Measure genomic bin sequencing coverage¶
--binSize parameter (default 300) controls the bin size (i.e. the number of nucleotides for each bin).Computes the sequencing depth of each nucleotide without normalizing
Divides the genome in contiguous genomic bins whose size is determined by the
--binSizeparameter (default 300bp)Computes mean sequencing coverage scores for each bin
Normalizes the mean bin coverage by median chromosome sequencing coverage
Applies a GC-content correction on the normalized mean bin coverage (optional)
Estimates the mean MAPQ score for each bin
CGcorrect = true and is meant to limit potential sequencing biases during DNA amplification. Given the distribution of the normalized bin mean coverage scores and their GC-content, GIP fits a loess regression using a 5 folds cross validation to explore the loess span parameter (which relates with the fraction of points used to fit the local regressions, and influence the model smoothness).--covPerBinSigOPT parameter accepts a string of 3 parameters, and can be used to customize the detection of bin and segments of interest.–minLen - minimum segment length (bp) [int]
–pThresh - adjusted p-value threshold [num]
–padjust - multiple-testing correction method [num]
--covPerBinSigOPT default is "--minLen 0 --pThresh 0.001 --padjust BY". The available methods for multiple testing corrections are: “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”. Please refer to documentation of the p.adjust R function for more details.--customCoverageLimits parameter can be used to enforce an additional custom coverage cut-offs on the statistically significant bins and segments (and genes, see below). This parameter accepts two numbers: N1, N2 (default 1.5 0.5). Significant CNV bins and segments are selected to have a coverage > N1 (for amplifications) or < N2 (for depletions).sampleId.covPerBin.gz |
genomic bin coverage |
sampleId.covPerBin.plot.all.png |
bin coverage genome overview |
sampleId.covPerBin.plot.byChr.pdf |
bin coverage chromosome overview (slides) |
sampleId.covPerBin.plot.faceting.png |
bin coverage chromosome overview (multi-panel) |
sampleId.covPerBin.plot.tsv.gz |
bin coverage plots data |
sampleId.covPerBin.significant.bins.tsv.gz |
significant bins |
sampleId.covPerBin.significant.segments.tsv.gz |
significant segments |
sampleId.covPerBin.significant.stats |
statistical test info |
sampleId.bed |
mapped reads in bed format |
--MAPQ are shown in gray. The statistically significant bins corresponding to amplifications and depletions are shown respectively in orange and blue. The y-axis minimum and maximum limits can be set with the parameter --binPlotYlim (default "0 3"). Depending on the genome size the overview plots may result too small and unreadable. The parameter --binOverviewSize accepts two integers controlling respectively the plots heights and the widths (default "400 1000"). The values specified with the --customCoverageLimits parameter will be highlighted with red dashed lines. The sampleId.bed file is an intermediate file used by GIP from the quantification of genomic intervals. It is not automatically removed by GIP because it allows the user to re-execute the pipeline with the -resume option. However, if the user is not planning on re-executing GIP he/she can simply delete this file from the work/ directory to save disk space.--MAPQ value are not considered.sampleId.karyotype.medianCoverage |
median coverage of all bins |
sampleId.karyotype.allMedians.tsv |
chromosomes median somy scores |
sampleId.karyotype.boxplot.png |
somy scores boxplot |
sampleId.karyotype.ridges.png |
somy scores ridge plot |
Measure gene sequencing coverage¶
CGcorrect = true, see above). To detect statistically significant CNV genes GIP fits a Gaussian mixture distribution with 2 components. One distribution accounting for the vast majority of observations fitting the coverage of non-CNV genes (central distribution), and another distribution fitting the CNV genes (outliers distribution). The central distributions represents the-null hypothesis under which a given coverage value is merely caused by artefact fluctuations in sequencing depth, rather than a genuine, biologically meaningful gene amplification or depletion. To test CNV significance GIP uses the mean and the standard deviation of the central distribution and assigns a z-score and a p-value to all genes. Significant genes with a mean MAPQ score lower than --MAPQ are discarded. In the same way as for genomic bins, the parameter --customCoverageLimits can be used to enforce custom coverage threshold on significant genes. The parameter --covPerGeSigOPT accepts a string of 3 parameters and can be used to control the statical test.–pThresh - adjusted p-value threshold [num]
–padjust - method for multiple testing correction [num]
–minLen - minimum gene size (bp) [int]
covPerGeSigOPT="--pThresh 0.001 --padjust BH --minLen 0". As for genomic bins, the available methods for multiple testing corrections are: “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”. Please refer to documentation of the p.adjust R function for more details.sampleId.covPerGe.gz |
gene sequencing coverage |
sampleId.covPerGe.significant.tsv |
significant gene CNVs |
sampleId.covPerGe.significant.stats |
statistical test info |
sampleId.covPerGeKaryoplot/ |
folder with CNV genes plots |
The sampleId.covPerGeKaryoplot/ folder includes plot generated with the karyoploteR package. Only chromosomes hosting significant gene CNVs are shown. Amplified genes are shown in orange, whereas depleted genes are shown in blue. If any, the repetitive elements located in proximity of gene CNVs are marked in the bottom part of the plots. The --covPerGeRepeatRange parameter can be used to set the maximum distance (in nucleotides) from each gene CNVs in which repeats are labelled. To ease visualization, for each gene CNV are shown just the closest upstream and downstream repeats (if these are within the --covPerGeRepeatRange interval). To put the gene CNVs in context of possible larger CNV regions the figure also reports a gray slope indicating the normalized bin coverage scores. In most cases the normalized coverage values of genes and bins are very close. However, for certain genes much shorter than the bin size, the plots may show a discrepancy between bin and gene readouts.
Detect, annotate and filter single nucleotide variants¶
--MAPQ are not used for detection. The user can specify freebayes options through the --freebayesOPT parameter. Its default is:--freebayesOPT="--read-indel-limit 1 --read-mismatch-limit 3 --read-snp-limit 3 \
--min-alternate-fraction 0.05 --min-base-quality 5 --min-alternate-count 2 --pooled-continuous"
Please refer to the freebayes manual for more details. | GIP returns the following outputs in the gipOut/samples/sampleId/ folder:
sampleId.vcf.gz |
SNVs (gzip compressed vcf file) |
sampleId.vcf.gz.tbi |
tabix vcf index |
snpEff_summary_sampleId.genes.txt.gz |
SNVs per gene, snpEff summary table |
snpEff_summary_sampleId.html |
snpEff summary (html) |
--filterFreebayesOPT parameter can be used to set the following variables:–minFreq - Min. variant read frequency (VRF) [num]
–maxFreq - Max. VRF [num]
–minAO - Min. number of reads supporting the alternate allele [int]
–minMQMR - Min. mean mapping quality of observed reference alleles [num]
–minMQM - Min. mean mapping quality of observed alternate alleles [num]
–MADrange - Discard SNVs whose sequencing depth is > or < MADrange MADs from the chromosome median coverage [num]
–minAOhomopolymer - Min. number of reads supporting the alternate allele mapping inside an homopolymer [int]
–contextSpan - Size on each side of SNV genomic context (bp) [int]
–homopolymerFreq - Base frequency cut-off to consider a genomic context a homopolymer [num]
–hexagons - Replace SNV scatterplots with density hexagons
–randomSNVtoShow - Max number of random SNVs to be shown in scatterplots [num]
filterFreebayesOPT="--minFreq 0.1 --maxFreq 1 --minAO 2 --minAOhomopolymer 20 \
--contextSpan 5 --homopolymerFreq 0.4 --minMQMR 20 --minMQM 20 --MADrange 4 \
--randomSNVtoShow 50000"
singleVariants.df.gz |
SNVs (table) |
singleVariants.vcf.gz |
SNVs (gzip compressed vcf file) |
singleVariants.vcf.gz.tbi |
tabix vcf index |
single_allDensities.png |
VRF density plot |
single_allHists.png |
VRF histogram plot |
single_allHistsSqrt.png |
VRF histogram plot (sqrt scale) |
single_combinedDotPlotAndDistribution.pdf |
position/VRF plot with marginal distribution |
single_depthVsVRF.png |
VRF/depth plot |
single_depthVsVRFletters.png |
VRF/depth plot SNV chromosomes are mapped to different colors and letters |
single_onePlotPerChr.pdf |
position/VRF and density plots per chromosome |
single_onePlotPerChr_colouredByVariantType.pdf |
position/VRF colored by SNV type |
single_totVarPerChr.png |
num. SNVs per chromsome kb |
single_variantType.png |
occurrence of different SNV types |
single_variantTypeCombined.png |
occurrence of different SNV types equivalent variants combined |
single_VRFvsAO.png |
VRF/alternate allele read support |
single_VRFvsAOletters.png |
VRF/alternate allele read support SNV chromosomes are mapped to different colors and letters |
single_VRFvsPosFaceting.png |
position/VRF plot with different chromosomes in different panels |
snpEff_summary_sampleId.genes.txt.gz |
SNVs per gene, snpEff summary table |
snpEff_summary_sampleId.html |
snpEff summary (html) |
NS.stats |
NS analysis statistics |
NStable.tsv.gz |
NS analysis per gene |
pseudoReference.fa.gz |
genome sequence incorporating alternate alleles |
context/ |
folder containing the nucleotide frequency logo plots of the genomic contexts of different SNV types |
synonymous_variant
stop_retained_variant
start_retained
missense_variant
start_lost
stop_gained
stop_lost
coding_sequence_variant
Detect and filter structural variants¶
--MAPQ filter. The outputs are the .vcf bgzip compressed file gipOut/samples/sampleId/sampleId.delly.vcf.gz and its tabix index with .tbi extension.--filterDellyOPT parameter, and setting the following variables:–minMAPQ - minimum median mapping quality of paired-ends supporting the SV [int]
–chrEndFilter - number of bases spanning from the chromosome ends inwards. SVs overlapping such telomeric or sub-telomeric regions are discarded [int]
–rmLowQual - Remove delly predictions labelled as LowQual
–rmImprecise - Keep just delly predictions labelled as PRECISE
–topRc[Bnd|Ins|Del|Dup|Inv] - Select top SVs based on RC score [int]
–topHqCount[Bnd|Ins|Del|Dup|Inv] - Select top SVs based on DV+RV score [int]
–topHqPercent[Bnd|Ins|Del|Dup|Inv] - Select top SVs based on (DV+RV/DV+RV+DR+RR)*100 score [int]
--filterDellyOPT, none of these filter is used. To use these filters it is needed to specify the filter type with a suffix indicating the SV of interest: “Bnd” (break ends), “Ins” (insertions), “Del” (deletions), “Dup” (duplications) and “Inv” (inversions).--topHqCountInv 50 would select the 50 predicted inversions with the best DV+RV score.RC: Raw high-quality read counts or base counts for the SV
DV: # high-quality variant pairs
RV: # high-quality variant junction reads
DR: # high-quality reference pairs
RR: # high-quality reference junction reads
filterDellyOPT="--rmLowQual --chrEndFilter 100 --minMAPQ 50 --topHqPercentBnd 150 \
--topHqPercentIns 150 --topHqPercentDel 150 --topHqPercentDup 150 --topHqPercentInv 150"
output.vcf.gz |
compressed vcf file |
output.vcf.gz.tbi |
tabix index |
DEL.bed |
deletions coordinates |
DUP.bed |
tandem duplications coordinates |
INV.bed |
inversions coordinates |
BND.bed |
break end coordinates |
INS.bed |
insertions coordinates |
sampleId_circosData/ |
data for circos plot |
sampleId.SV.circos.png |
circos plot |
For circos plot representation the chromosomes of interest are binned in into genomic intervals whose size (bp) is regulated by --binSizeCircos (default 25000). In the inner part of circos plot the predicted break ends translocations events are shown as black lines. The karyotype color reflects the mean reads MAPQ score calculated for each genomic bin. Black indicates a MAPQ < 2, gray indicates a MAPQ ≥ 2 and < 20 and white indicates a MAPQ ≥ 20. Ticks positions and ticks labels are automatically assigned by GIP depending on genome size. If any, the position of insertions is indicated by red stripes on the karyotype.
Moving outwards the circos plot shows the tracks relative to predicted duplications (orange), deletions (blue) and inversions (green). The outmost track shows the genomic bin sequencing coverage (light blue bars) normalized by genome median coverage and ranging from 0 to 3. To ease visualization, amplifications with normalized coverage greater than 3 are shown with a value of 3.
Define and quantify gene clusters¶
Depending on the sequencing technology and the experimental design, annotated genes presenting very high levels of sequence similarity may be difficult to quantify.The length of the genomic reads and the fragment size influence the read MAPQ scores, thus the unicity of the read alignment.Instead of quantifying individual genes, GIP allows to quantify homologous genes as clusters. Given the set of gene coverage (.covPerGe.gz) files generated for each sample, GIP:
Selects genes that cannot be directly quantified, i.e. have a mean MAPQ lower than the
--MAPQvalue in all samplesRuns cd-hit-est with option “-g 1” to cluster these genes by sequence similarity
Evaluates the sequencing coverage of the genes belonging to clusters
Computes mean sequencing coverage for each gene cluster
The gene clusters analysis is run in the covPerClstr process, and the results are stored in the gipOut/covPerClstr folder.
clstrAnn.tsv |
predicted gene clusters (list format) |
clstrAnnFormat2.tsv |
predicted gene clusters (table format) |
sampleId.covPerClstr.gz |
mean sequencing gene cluster coverage (gzip compressed) |
lowMapq.clstr/ |
folder storing the gene cluster sequences |
Genes with low mean MAPQ in all samples but not clustering by sequence similarity are kept and part of the output. Normally these genes get a low MAPQ score either because they present internal repetitive sequences, or because their gene or pseudogene homologue is not annotated.