Input and output organization
Input
GIP requires the following mandatory input parameters:
--genome | multi-FASTA genome reference file
--annotation | gene coordinates file (GTF format)
--index | list of sequencing data files
-c | nextflow configuration file
params{} scope.
The process{} scope can be used to customize the configuration of all GIP processes, including the allocation of memory or CPUs:
executor - indicates whether the processes must be executed on the local machine (default) or on a computing cluster (e.g. 'slurm');
clusterOptions - provides optional cluster configurations, such as the node partition on which to allocate jobs (e.g. '-p hubbioit --qos hubbioit');
container - sets the absolute path of the giptools singularity image used to execute the processes.
The singularity{} scope contains the configuration options that interface the nextflow pipeline with the singularity container. Singularity allows mounting (a.k.a. binding) host input data at specific, user-defined container locations. The GIP container (i.e. the giptools file) comes with a set of built-in folders that can be used as access points for data:
/fq
/genome
/annotation
/repLib
/geneFunction
/gipOut
/mnt
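The runOptions parameter can be used to specify bind points with the --bind (or -B) option. For example, setting runOptions='-B /home/user/data/assemblies:/genome' in the configuration file makes the host folder /home/user/data/assemblies available in the container at /genome, so a genome file stored there can be passed as --genome /genome/reference.fa.
singularity {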
enabled = true
runOptions = '-B /home/projects -B /home/user/data/assemblies:/genome -B /home/user/sequencingData:/fq'
}
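The effect of bind points on file paths can be illustrated with a short Python sketch that translates a host path into the path seen inside the container. The helper names parse_binds and to_container_path are hypothetical and are not part of GIP, Nextflow or Singularity. The sketch assumes the src[:dest[:opts]] bind specification, handles only the space-separated "-B spec" / "--bind spec" form, and ignores mount options.

```python
import shlex
from pathlib import PurePosixPath


def parse_binds(run_options):
    """Return (host_dir, container_dir) pairs found in a runOptions string.

    Hypothetical helper for illustration only.
    """
    tokens = shlex.split(run_options)
    binds = []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("-B", "--bind") and i + 1 < len(tokens):
            # One option value may hold several comma-separated specs.
            for spec in tokens[i + 1].split(","):
                parts = spec.split(":")
                src = parts[0]
                # Without an explicit destination, the host path is reused.
                dest = parts[1] if len(parts) > 1 and parts[1] else src
                binds.append((PurePosixPath(src), PurePosixPath(dest)))
            i += 2
        else:
            i += 1
    return binds


def to_container_path(host_path, binds):
    """Map a host path to its container path, or None if it is not bound."""
    host_path = PurePosixPath(host_path)
    best = None
    for src, dest in binds:
        try:
            rel = host_path.relative_to(src)
        except ValueError:
            continue  # host_path is not under this bind point
        # Prefer the most specific (longest) matching host directory.
        if best is None or len(src.parts) > len(best[0].parts):
            best = (src, dest, rel)
    if best is None:
        return None
    return str(best[1] / best[2])


run_options = (
    "-B /home/projects "
    "-B /home/user/data/assemblies:/genome "
    "-B /home/user/sequencingData:/fq"
)
binds = parse_binds(run_options)
print(to_container_path("/home/user/data/assemblies/reference.fa", binds))
```

With the bind points of the configuration above, the printed path is the one to pass to --genome. A path for which the function returns None is not covered by any bind point and would not be visible inside the container under these assumptions.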
Alternatively, set autoMounts=true and bind just a top-level folder containing all data folders.
singularity {
enabled = true
autoMounts = true
runOptions = '--bind /pasteur'
}
The index file (--index) must follow these rules:
TSV format (i.e. <Tab> separated)
First header row with the labels: sampleId read1 read2
All the following rows must indicate the sample identifier and the file names of the first and second paired-end sequencing data files in fastq.gz format
To combine the reads originating from multiple technical replicates, the fastq.gz files must be comma-separated and listed in the same order in read1 and read2
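The index file rules can be checked before launching a run. The following Python sketch is one possible way to do so; check_index is a hypothetical helper, not a GIP command, and the sample and file names in the example are made up for illustration. It verifies the header, the three tab-separated fields per row, and that read1 and read2 list the same number of comma-separated files. It cannot verify that the replicates are listed in the same order.

```python
import csv
import io

EXPECTED_HEADER = ["sampleId", "read1", "read2"]


def check_index(text):
    """Return a list of problems found in the index file content."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header must be: sampleId<Tab>read1<Tab>read2")
        return problems
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 tab-separated fields")
            continue
        sample_id, read1, read2 = row
        files1 = read1.split(",")
        files2 = read2.split(",")
        if not sample_id or "" in files1 or "" in files2:
            problems.append(f"line {line_no}: empty sample identifier or file name")
        if len(files1) != len(files2):
            problems.append(
                f"line {line_no}: read1 and read2 list different numbers of files"
            )
    return problems


# Made-up example: sampleB combines two technical replicates.
example = (
    "sampleId\tread1\tread2\n"
    "sampleA\tA_1.fastq.gz\tA_2.fastq.gz\n"
    "sampleB\tB_rep1_1.fastq.gz,B_rep2_1.fastq.gz\tB_rep1_2.fastq.gz,B_rep2_2.fastq.gz\n"
)
print(check_index(example))  # prints []
```

An empty list means that no problem was detected by these checks.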
Output
genome/ | reference genome data
samples/ | individual sample results
covPerClstr/ | gene cluster quantification
reports/ | report files
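As a quick sanity check after a run, the presence of the four subdirectories listed above can be tested with a few lines of Python. This is only a sketch: missing_subdirs is a hypothetical helper, and it assumes the result directory is named gipOut unless another name is given.

```python
from pathlib import Path

# Subdirectories listed in the table above.
EXPECTED_SUBDIRS = ("genome", "samples", "covPerClstr", "reports")


def missing_subdirs(result_dir="gipOut"):
    """Return the expected subdirectories that are absent from result_dir."""
    root = Path(result_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Calling missing_subdirs() returns an empty list when all four folders exist, and the names of the absent ones otherwise.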
When -resume is added to the command line, GIP re-runs only the processes that are affected by a parameter change and uses the cached results of all the other processes.
The --resultDir parameter can be used to set an alternative name to "gipOut" for the result directory.
In the following part we provide a description of the GIP steps operated by the Nextflow processes and of all result files.