The DNAscan pipeline consists of four stages: Alignment, Analysis, Annotation and Report generation, and can be run in three modes: Fast, Normal and Intensive, according to user requirements (Fig. 1 and Table 1). These modes have been designed to optimize computational effort without compromising performance for the type of genetic variant the user is testing (see mode recommendations in Table 2). The user can restrict the analysis to any sub-region of the human genome by proving either a region file in bed format, a list of gene names, or using the whole-exome option, reducing the processing time and generating region specific reports.
DNAscan accepts sequencing data in fastq.gz and as a Sequence Alignment Map (SAM) file (and its compressed version BAM). HISAT2 and BWA mem [6, 7] are used to map the reads to the reference genome (Fig. 1, left panel). This step is skipped if the user provides data in SAM or BAM formats. HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads to a reference genome. HISAT2 uses a new reference indexing scheme called a Hierarchical Graph FM index (HGFM) [8], thanks to which it can guarantee a high performance, comparable to state-of-the-art tools, in approximately one quarter of the time of BWA and Bowtie2 [9] (see Additional file 1).
Variant calling pipelines based on HISAT2 generally perform poorly on indels [10]. To address this issue, DNAscan uses BWA to realign soft-clipped and unaligned reads. This alignment refinement step is skipped if DNAscan is run in Fast mode.
Samblaster [11] is used to mark duplicates during the alignment step and Sambamba [12] to sort the aligned reads. Both the variant callers, Freebayes [13] and GATK Haplotype Caller (HC) [5] used in the following step, are duplicate-aware, meaning that they automatically ignore reads marked as duplicate. The user can optionally exclude it from the workflow according to the study design, e.g. when an intensive Polymerase Chain Reaction (PCR) amplification of small regions is required.
Various analyses are performed on the mapped sequencing data (Fig. 1, right panel): SNV and small indel calling is performed using Freebayes, whose reliability is well reported [14, 15]. However, taking advantage of the documented better performance of GATK HC in small indel calling, we decided to add a customised indel calling step to DNAscan, called Intensive mode. This step firstly extracts the genome positions for which an insertion or a deletion is present on the cigar of at least one read, and secondly calls indels using GATK HC on these selected positions. The reduced number of positions where this occurs allows for a targeted use of GATK HC, limiting the required computational effort and time. The resulting SNVs and small indel calls with genotype quality smaller then 20 and depth smaller than 10 are discarded. The user can customize these filters according to their needs (see GitHub [16] for details and a complete list of available filters).
Two Illumina developed tools, Manta [17] and Expansion Hunter [18] are used for detecting medium and large structural variants (> 50 bp) including insertions, deletions, translocations, duplications and known repeat expansions. These tools are optimised for high speed and can analyse a 40x WGS sample in about one hour using 4 threads, maintaining a very high performance.
DNAscan also has options to scan the sequencing data for microbial genetic material. It performs a computational subtraction of human host sequences to identify sequences of infectious agents including viruses, bacteria or fungi, by aligning the non-human or unaligned reads to the whole NCBI database [19,20,21] of known viral, bacterial or any custom set of microbial genomes and reporting the number of reads aligned to each non-human genome, its length and the number of bases covered by at least one read.
Variant calls are then annotated using Annovar [22]. The annotation includes the use of databases such as ClinVar [23], Exac [24], dbSNP [25] and dbNSFP [26] (more information about how to customise the annotation, e.g. by selecting alternative databases and/or focusing on specific genome regions, are available on GitHub).
DNAscan produces a wide set of quality control (QC) and result reports and provides utilities for visualisation and interpretation of the results.
MultiQC [27] is used to wrap up and visualise QC results. FastQC [28], Samtools [29] and Bcftools [30] are used to perform QC on the sequencing data, its alignment and the called variants. An example is available on GitHub [31]. A tab delimited file including all variants found within the selected region is also generated [32]. This report would include all annotations performed by Annovar [22] in a format that is easy to handle with any Excel-like software by users of all levels of expertise.
Three iobio services (bam.iobio, vcf.iobio and gene.iobio) are locally provided with the pipeline allowing for the visualisation of the alignment file [33], the called variants [34] and for a gene based visualisation and interpretation of the results [35].
Benchmarking every DNAscan component is not needed since a range of literature is available [14, 15, 17, 36, 37]. However, to our knowledge, none exists assessing HISAT2 [8] (the short-read mapper used by the pipeline) either for DNA read mapping or as part of DNA variant calling pipelines. In this manuscript, we both assess the performance of HISAT2 with BWA and Bowtie2 [9] mapping 1.25 billion WGS reads sequenced with the Illumina Hiseq X and 150 million simulated reads (see Additional file 1), and compare our SNV/indel calling pipeline in Fast, Normal and Intensive modes with the GATK BPW [5] and SpeedSeq [4] over the whole exome sequencing of NA12878. Illumina platinum calls are used as true positives [38].
We also show how DNAscan represents a powerful tool for medical and scientific use by analysing real DNA sequence data from two patients affected by Amyotrophic Lateral Sclerosis (ALS) and of HIV infected human cells. For the ALS patients we use both a gene panel of 10 ALS-related genes, whose feasibility for diagnostic medicine has been previously investigated [2], sequenced with the Illumina Miseq platform, and the WGS data from the Project MinE sequencing dataset [39]. DNAscan was used to look for SNVs, small indels, structural variants, and known repeat expansions. The WGS of an HIV infected human cell sample [40] was used to test DNAscan for virus detection.
To assess the performance of DNAscan in calling SNVs and indels, we used the Illumina Genome Analyzer II whole exome sequencing of NA12878. Illumina platinum calls [38] were used as true positives.
GATK BPW calls were generated using default parameters and following the indications on the GATK website [41] for germline SNVs and indels calling. These include the pre-processing and variant discovery steps for single sample, i.e. skipping the Merge and Join Genotype steps.
SpeedSeq calls were generated running the “align” and “var” commands as described on GitHub [42]. RTG Tools [43] (“vcfeval” command) was used to evaluate the calls. F-measure, Precision and Sensitivity are defined as in the following: , and , where Tp is true positives, Fp false positives and Fn false negatives.
Using DNAscan in Fast mode, we analysed real DNA sequence data from two ALS patients (case A and case B). Case A carries a non-synonymous mutation in the FUS gene [44] (variant C1561T, amino acid change R521C, variant dbSNP id rs121909670 [45]) known to be a cause of ALS (ClinVar id RCV000017611.25). A panel of 10 ALS related genes was sequenced with the Illumina Miseq platform for case A. The Miseq gene panel was designed and tested for diagnostic purposes [2]. For these 10 genes (BSCL2, CEP112, FUS, MATR3, OPTN, SOD1, SPG11, TARDBP, UBQLN2, and VCP), the full exon set was sequenced, generating over 825,000,222-base-long paired reads. DNAscan was used to call SNVs, indels, and structural variants on case A.
Case B has a confirmed C9orf72 expansion mutation on one allele, also known to be causative of ALS [46]. This expansion mutation is thousands of repeats long. and 40x WGS data was generated with the Illumina Hiseq X for case B. The WGS sample (paired reads, read length = 150, average coverage depth = 40) was sequenced as part of the Project MinE sequencing dataset [39]. For this sample we ran DNAscan on the whole genome. However, both for practical reasons and to simulate a specific medical diagnostic interest, we focused our analysis report on the 126 ALS related genes reported on the ALSoD webserver [47] and also looked for the C9orf72 repeat.
For both samples, we also reported variants linked to frontotemporal dementia, which is a neurodegenerative disease that causes neuronal loss, predominantly involving the frontal or temporal lobes, with a genetic and clinical overlap with ALS [48, 49].
Pathological C9orf72 gene hexanucleotide repeat expansions were determined using repeat primed PCR (RP-PCR), as previously described [50].











