Parameter guidelines

GIP offers several non-architecture-related parameters that can be adjusted; they are described in the GIP steps section. This page provides tips on how to set these parameters for the analysis of your genomic sequencing experiment. These recommendations are based on our working experience. The best parametrization depends heavily on your data and your organism of interest, so it is advisable to explore and test different settings.

  • MAPQ: Use high values (e.g. 30-50) for a stringent parametrization. The lowest value is 0; set it to 0 to consider all genomic positions when calling genomic variants. Note that GIP relies on the BWA mapper which, when a read has multiple equally scoring mapping positions, randomly assigns it to one of the possible locations.

  • delDup: It is advisable to set this to “true” to remove duplicated reads which, in the context of WGS, mostly represent technical artifacts. However, at very high sequencing depth, duplicated reads could represent genuine DNA fragments originating from two different cells or two different chromatids.

  • CGcorrect: It is advisable to set this to “true” when the aim is to highlight copy number differences between genes or genomic bins of the same sample, especially if the library preparation included PCR amplification steps. Otherwise the measured differences could be explained by differences in GC content causing DNA amplification biases, rather than by a genuine biological signal.

  • chrs: For unfinished genome assemblies it is recommended to define the list of chromosome identifiers of interest. While sequencing reads will be mapped to the entire genome, downstream analyses will be limited to the genomic regions of interest (i.e. discarding scaffolds or contigs with potentially sub-optimal annotations). This improves both computing performance and visualization. If the genome assembly is of good quality, just set this parameter to “all”.

  • chrPlotYlim: The y-axis minimum and maximum values for the chromosome ploidy plot. The default, “0 8”, is wide and meant to be inclusive. For better visualizations it is recommended to adjust this range based on the target species. For instance, it is rare to observe Leishmania isolates with more than 5 copies of a given chromosome.

  • binPlotYlim: The y-axis minimum and maximum limits of the genomic bin sequencing coverage plots. The default, “0 3”, provides a compact visualization in which bin amplifications with normalized coverage values >3 are shown as 3.

  • binOverviewSize: Graphical parameter controlling the height and the width of the genomic bin coverage visualizations (default “400 1000”).

  • customCoverageLimits: This parameter can be used to enforce additional custom sequencing coverage thresholds. Significant CNV genomic bins and genes are retained only if their normalized coverage is above the first number or below the second one (default “1.5 0.5”). Assuming diploidy, 1.5 indicates that one of the two chromosomes carries one additional, amplified copy of the gene (bin). Likewise, 0.5 indicates that one of the two chromosomes lost a copy of the gene (bin). This parameter also determines the thresholds above which genes/bins are colored in orange (amplification) and below which they are colored in blue (depletion). It is often convenient to use these thresholds to limit the number of predictions and focus on the most relevant ones.

  • binSize: This parameter governs the resolution of genomic bin analyses. The smaller the value, the higher the number of genomic bins. GIP evaluates sequencing coverage at single nucleotide level (i.e. for each nucleotide of each bin), so the binSize value can be set smaller than the read size without causing measurement problems. It is recommended to adjust the --binSize value based on the reference genome size. The default, 300 (nucleotides), was successfully applied to the analysis of Leishmania, Candida and Plasmodium genomes. For the analysis of the human genome we used a binSize of 50000 nucleotides.

  • covPerBinSigOPT: For the identification of significant genomic CNV segments (i.e. “collapsed bins”) it is recommended to set “--padjust BY” to apply the Benjamini & Yekutieli (BY) multiple testing correction (as in the default). BY accounts for the dependence that arises from measuring sequencing coverage in adjacent genomic bins that possibly took part in the same DNA recombination event (i.e. bins that are physically connected, and so amplified or depleted “together”). In this scenario, bin coverage estimates cannot be considered independent.

  • covPerGeSigOPT: For the identification of significant CNV genes it is possible to control the false discovery rate using the Benjamini & Hochberg (BH) correction, which is valid under the assumption that genes are separate and that their sequencing coverage estimates are measured independently.

  • covPerGeRepeatRange: This parameter defines the maximum distance (in nucleotides) from each CNV gene within which repeat labels are shown in the relevant gene CNV plot. The default is 1000 nucleotides.

  • freebayesOPT: The options passed directly to Freebayes to call SNVs. See the Freebayes documentation at: https://github.com/freebayes/freebayes.

  • filterFreebayesOPT: This parameter defines the filtering options applied to the Freebayes predictions. It is possible to remove high frequency SNVs by reducing the value of “--maxFreq”. Such predictions can represent either genome assembly flaws or genuine homozygous variants. Similarly, it is possible to remove predictions with a variant allele frequency lower than “--minFreq”. While some of these predictions are sequencing artifacts, others could be genuine SNVs present in rare sub-clonal cell populations, and therefore potentially relevant biomarkers. Depending on the sample and the species, displaying all predicted SNVs can be uninformative. The “--randomSNVtoShow” option is recommended to control the number of displayed SNVs when dealing with cancer WGS datasets, or whenever the number of SNVs is large. The “--MADrange” option is useful to remove SNVs located in CNV regions, whose frequency may be confounding because it does not follow the one expected from the chromosome copy number.

  • filterDellyOPT: It is recommended to use the “--chrEndFilter” option to filter out predictions located at the chromosome ends. The rationale is that telomeric and sub-telomeric regions are often hard to assemble because of highly repetitive sequences, and as a consequence SV predictions mapping to these regions may be unreliable.

  • binSizeCircos: For circos representations the chromosomes of interest are binned into genomic intervals whose size (bp) is set by this parameter. The normalized sequencing coverage of each of these bins is shown as a track in the circos plot. Circos does not allow large numbers of genomic bins, and in any case displaying a very large number of bins is not informative. This parameter therefore needs to be adjusted according to the reference genome size. We successfully used a value of 25000 for Leishmania, Candida and Plasmodium genomes, while for the human genome we used a value of 2500000.

  • bigWigOPT: The string provided with this parameter is passed directly to the bamCoverage program of the deepTools2 suite. The tool documentation is available from https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html?highlight=bamcoverage. The GIP default is “--normalizeUsing RPKM --ignoreDuplicates --binSize 10 --smoothLength 30”.
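To make the customCoverageLimits and binPlotYlim descriptions above more concrete, the following is a minimal Python sketch of the threshold logic only. It is not GIP code: the function names `classify_coverage` and `clip_for_plot` are hypothetical, the statistical significance test that GIP applies before these thresholds is left out, and treating values exactly equal to a threshold as "neutral" is an assumption.

```python
def classify_coverage(norm_cov, upper=1.5, lower=0.5):
    """Label a normalized coverage value with two thresholds.

    upper and lower mirror the two numbers of customCoverageLimits
    (default "1.5 0.5"). Values exactly on a threshold are treated as
    neutral here, which is an assumption of this sketch.
    """
    if norm_cov > upper:
        return "amplification"  # would be colored orange
    if norm_cov < lower:
        return "depletion"      # would be colored blue
    return "neutral"


def clip_for_plot(norm_cov, ymin=0.0, ymax=3.0):
    """Clamp a value into the plotted range, as with binPlotYlim "0 3"."""
    return max(ymin, min(ymax, norm_cov))


if __name__ == "__main__":
    for cov in (0.3, 1.0, 1.6, 4.2):
        print(cov, classify_coverage(cov), clip_for_plot(cov))
```

With the default limits, a bin at normalized coverage 1.6 is labeled an amplification, while a bin at 4.2 is still an amplification but would be drawn at 3 in the compact plot.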
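The difference between the BH and BY corrections mentioned for covPerGeSigOPT and covPerBinSigOPT can be illustrated with a small, self-contained Python sketch. It follows the standard step-up definitions of the two procedures and is not taken from GIP; the function name `adjust_pvalues` and the example p-values are made up for illustration.

```python
def adjust_pvalues(pvalues, method="BH"):
    """Return adjusted p-values using the BH or BY step-up procedure.

    BY equals BH multiplied by the harmonic sum c(m) = 1 + 1/2 + ... + 1/m,
    which makes it valid under arbitrary dependence between tests.
    """
    if method not in ("BH", "BY"):
        raise ValueError("method must be 'BH' or 'BY'")
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1)) if method == "BY" else 1.0

    # Walk from the largest to the smallest p-value, keeping a running
    # minimum so that adjusted values are monotone and capped at 1.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]  # made-up example values
    print("BH:", adjust_pvalues(pvals, "BH"))
    print("BY:", adjust_pvalues(pvals, "BY"))
```

BY adjusted p-values are never smaller than the BH ones, so BY is the more conservative choice, which fits the case of adjacent bins whose coverage estimates are not independent.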
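When choosing binSize or binSizeCircos it can help to estimate how many bins a given value produces. The sketch below is a rough Python illustration, not GIP code: `count_bins` is a hypothetical helper, the chromosome lengths are invented toy values, and counting a trailing partial bin as a full bin is an assumption about the binning scheme.

```python
def count_bins(chrom_lengths, bin_size):
    """Count fixed-size bins over a set of chromosomes.

    chrom_lengths maps a chromosome identifier to its length in
    nucleotides. A final partial bin on each chromosome is counted,
    which is an assumption of this sketch.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    # Integer ceiling division avoids floating point rounding.
    return sum((length + bin_size - 1) // bin_size
               for length in chrom_lengths.values())


if __name__ == "__main__":
    toy_genome = {"chr1": 1_000_000, "chr2": 250_500}  # invented lengths
    for size in (300, 25_000):
        print(size, count_bins(toy_genome, size))
```

On this toy genome a bin size of 300 gives 4169 bins, while 25000 gives 51, which shows why a much larger value is needed for the circos track than for the bin coverage analysis.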
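The frequency filtering and sub-sampling described for filterFreebayesOPT can be sketched in a few lines of Python. This is only an illustration of the idea behind “--minFreq”, “--maxFreq” and “--randomSNVtoShow”, not the GIP implementation: the record layout, the function names, the threshold values and the inclusive comparison are all assumptions.

```python
import random


def filter_by_frequency(snvs, min_freq, max_freq):
    """Keep SNVs whose variant allele frequency lies in [min_freq, max_freq].

    Each SNV is a dict with a "freq" key (hypothetical record layout).
    Inclusive bounds are an assumption of this sketch.
    """
    return [s for s in snvs if min_freq <= s["freq"] <= max_freq]


def subsample_for_display(snvs, max_shown, seed=0):
    """Pick at most max_shown SNVs at random, reproducibly for a given seed."""
    if len(snvs) <= max_shown:
        return list(snvs)
    return random.Random(seed).sample(snvs, max_shown)


if __name__ == "__main__":
    calls = [  # invented example records
        {"pos": 101, "freq": 0.02},
        {"pos": 250, "freq": 0.48},
        {"pos": 777, "freq": 1.00},
    ]
    kept = filter_by_frequency(calls, min_freq=0.1, max_freq=0.9)
    print(kept)
    print(subsample_for_display(calls, max_shown=2))
```

Here the low-frequency call (possible artifact or rare sub-clonal variant) and the fixed call (possible assembly flaw or homozygous variant) are both removed, leaving only the intermediate-frequency SNV.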