Recently published software for the analysis of RNA-Seq data for gene fusions includes FusionMap, FusionSeq, FusionHunter, deFuse and Tophat-Fusion. Of these, both FusionMap and Tophat-Fusion can process SE read data. FusionMap uses a similar strategy to FusionFinder: it splits reads into smaller sections and, prior to filtering, identifies fusion candidates where sections align to different genes on an annotated genomic reference. Tophat-Fusion uses a two-stage process: it first aligns reads to a genomic reference with a modified version of the spliced alignment software Tophat, and then performs a post-processing step to overlay annotation and perform filtering.
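The read-splitting idea described above can be illustrated with a minimal sketch. It is not the implementation used by FusionFinder or FusionMap: the function names, the choice of taking the two ends of the read, and the gene-comparison rule are all assumptions made for illustration, and the alignment step that would assign a gene to each section is left out entirely.

```python
from typing import Optional, Tuple


def split_read_ends(read: str, k: int = 30) -> Optional[Tuple[str, str]]:
    """Return the first and last k bases of a single-end read.

    Hypothetical helper: real tools may orient, trim or tile the
    sections differently. Reads too short to yield two
    non-overlapping sections are skipped.
    """
    if len(read) < 2 * k:
        return None
    return read[:k], read[-k:]


def is_fusion_candidate(gene_left: Optional[str], gene_right: Optional[str]) -> bool:
    """Flag a read whose two sections were assigned to different genes.

    The gene assignments are assumed to come from an external
    alignment and annotation step that is not shown here.
    """
    if gene_left is None or gene_right is None:
        return False  # one section did not map to an annotated gene
    return gene_left != gene_right
```

A 76mer read split with k = 30 yields two 30mer sections and leaves the central 16 bases unused; a candidate is raised only when both sections map and map to different genes.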
To compare the performance of FusionFinder with FusionMap and Tophat-Fusion, we ran a full analysis with all three tools using the Levin dataset. For this dataset, comprising 14 million 76mer reads, a complete analysis with FusionFinder took approximately 3.1 hours on a single 2.4 GHz core of a multi-core AMD server, with a peak memory usage of 1.8 GB and using data from a local Ensembl (release 62) mirror. FusionMap (version 2012-03-03), run with comparable parameters (α = 25, β = 1, G = 2) and preloaded reference data using Mono (version 2.10.8) under 64-bit Linux, again on a single 2.4 GHz core, took 2.1 hours to complete the same analysis and consumed 7.2 GB of memory at its peak. Tophat-Fusion, run with comparable parameters (alignment phase: --fusion-min-dist 10000; post-processing: --num-fusion-reads 4 --num-fusion-pairs 0 --num-fusion-both 4) and reference data on the same platform, took 15 hours to complete and consumed 9.6 GB of memory at its peak during the post-processing step. Although the run time for FusionFinder is slightly longer than that of FusionMap on a single core, both are considerably faster than Tophat-Fusion. In addition, FusionFinder consumes far less memory under a Linux operating system than either FusionMap or Tophat-Fusion. It should be noted that both FusionMap and Tophat-Fusion can be run on multiple CPU cores: with the same dataset and parameters but with 5 CPU cores, the analyses took 0.8 hours and 4.5 hours respectively. Similarly, Bowtie can be run on multiple CPU cores, and using 5 cores for the alignment steps in the FusionFinder protocol reduces the total analysis time to 2.4 hours. We are currently working on a fully multithreaded version of FusionFinder. These data are summarised in Table 3, and a detailed breakdown of the resources used by FusionFinder at each step of the protocol can be found in Table S4.
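As a rough aid to reading these timings, the speedup obtained by moving from one core to five can be computed directly from the run times quoted above. The sketch below uses only those figures; the dictionary layout and function name are illustrative choices, and Table 3 remains the authoritative summary.

```python
# (single-core hours, five-core hours) as quoted in the text above.
# For FusionFinder only the Bowtie alignment steps used five cores.
RUN_TIMES_HOURS = {
    "FusionFinder": (3.1, 2.4),
    "FusionMap": (2.1, 0.8),
    "Tophat-Fusion": (15.0, 4.5),
}


def speedup(single_core_hours: float, multi_core_hours: float) -> float:
    """Ratio of single-core to multi-core wall-clock time."""
    return single_core_hours / multi_core_hours


speedups = {tool: speedup(*hours) for tool, hours in RUN_TIMES_HOURS.items()}

for tool, factor in speedups.items():
    print(f"{tool}: {factor:.1f}x faster on 5 cores")
```

The ratios come out at roughly 1.3 for FusionFinder, 2.6 for FusionMap and 3.3 for Tophat-Fusion, which is consistent with only part of the FusionFinder protocol being parallelised.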
In line with previous reports, our analysis of the Levin dataset with FusionMap confirmed the findings of Levin et al. and reported an additional 57 candidates in this dataset. In comparison, Tophat-Fusion reported 12 candidate fusions but did not successfully identify all of those reported by Levin et al. or the additional candidates reported by FusionFinder, even when we allowed for the detection of the read-through transcripts we observed. Tophat-Fusion reported only two of the three isoforms of NUP214:XKR3 reported by FusionFinder and did not report CEP170:RAD51L1, but did report an additional isoform of BCR:ABL1 which was not reported by FusionFinder or FusionMap. The results of these analyses are presented in Table S1 a and b.
To further compare the performance of each software we generated a simulated dataset of approximately 13.5 million SE 75 nucleotide reads (see Methods). The dataset contained normal background reads and “fusion reads” representing the transcript breakpoint of 55 fusion transcripts generated at random (see Table S2a). The simulated fusions were designed to represent both high and low levels of expression, with read numbers per fusion transcript ranging from 1 to 295. We ran FusionFinder, FusionMap and Tophat-Fusion against this dataset. FusionFinder was run with default parameters, generating 30mer pseudo PE reads. FusionMap was run with comparable parameters (α = 30, β = 1, G = 2, MinimalHit = 1), though we note that in order for FusionMap to detect any simulated fusions it was necessary to set the MinimalRescuedReadNumber parameter to 0. Tophat-Fusion was also run using comparable parameters for the post-processing step of the protocol (--num-fusion-reads 1 --num-fusion-pairs 0 --num-fusion-both 1). Sensitivity and Positive Predictive Value (PPV) were then calculated for the simulated dataset to assess the ability of each software to accurately detect simulated fusion genes (see Methods). Table 4 summarises the overall results of this analysis, whilst a plot of these data is shown in Figure 4. The raw results from each software can be found in Tables S2 b, c and d. The performance measures were calculated on subgroups of fusion genes, where subgroups were selected based on the number of reads (i) evidencing the fusion genes predicted by each software. For example, the point marked at 100 on the x axis of Figure 4 shows performance measures for all predicted fusion genes that were found to be evidenced by 100 or more reads.
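The threshold-based calculation behind Figure 4 can be sketched as follows. This is a minimal illustration, not the procedure defined in Methods: the function name and the toy fusion names are hypothetical, and the sketch assumes that sensitivity at each threshold is taken over all simulated fusions, whereas the Methods may define the denominator differently.

```python
import math
from typing import Dict, Set, Tuple


def performance_at_threshold(
    predictions: Dict[str, int],
    true_fusions: Set[str],
    min_reads: int,
) -> Tuple[float, float]:
    """Sensitivity and PPV for predictions supported by >= min_reads reads.

    predictions maps a predicted fusion (e.g. "GENE1:GENE2") to its
    number of supporting reads. Assumption: the sensitivity denominator
    is the full set of simulated fusions.
    """
    called = {name for name, reads in predictions.items() if reads >= min_reads}
    true_positives = len(called & true_fusions)
    false_positives = len(called - true_fusions)
    sensitivity = true_positives / len(true_fusions) if true_fusions else math.nan
    ppv = true_positives / (true_positives + false_positives) if called else math.nan
    return sensitivity, ppv


# Toy data with hypothetical fusion names, for illustration only.
simulated = {"A:B", "C:D", "E:F", "G:H"}
predicted = {"A:B": 120, "C:D": 40, "E:F": 3, "X:Y": 2, "A:Z": 60}

for threshold in (1, 50, 100):
    sens, ppv = performance_at_threshold(predicted, simulated, threshold)
    print(f"i >= {threshold}: sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

Raising the threshold discards weakly supported predictions, so PPV tends to rise while sensitivity falls; in the toy data, "A:Z" counts as a false positive in the same way that synonymous fusions do in the assessment.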
Among the fusion gene predictions made by each software are what we have termed “synonymous fusions”. These are where at least one of the identified gene partners has been inaccurately predicted because it shares high sequence similarity with the expected partner gene, possibly because it is a member of the same gene family (for example, an S100A3:SULT1 fusion may be detected as an S100A3:SULT1 fusion). Although the informed researcher would frequently be able to distinguish these fusions by visual comparison with other candidates in the output files, in our assessment of sensitivity and PPV such synonymous fusions were considered to be false positives to provide the most stringent assessment of each software.
It can be seen from Figure 4 and Table 4 that FusionFinder shows greater sensitivity and generally greater PPV than FusionMap in the detection of our simulated fusion genes. Figure 4 also shows that FusionFinder compares favourably against Tophat-Fusion, with an overall greater sensitivity and a comparable PPV. With regard to the overall sensitivity in Table 4, FusionFinder detected 87% of the 55 simulated fusion genes, FusionMap reported 58% and Tophat-Fusion reported 64%. Importantly, FusionFinder and Tophat-Fusion detected only 15 and 5 false positives respectively (some of which were synonymous fusions; see Table S2 b, c and d), giving them a comparable PPV. In contrast, FusionMap reported 582 false positives, which represents 95% of its returned results (Table 4). This has a considerable effect on the PPV in Figure 4 at low read levels, with the PPV of FusionMap remaining significantly lower than that of both FusionFinder and Tophat-Fusion. Consequently, in the case of FusionMap the user is returned a large list of potential fusion genes consisting primarily of false positives. In contrast, the candidates reported by FusionFinder will be more robust and more likely to be confirmed experimentally.
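The overall figures quoted above can be cross-checked with a short calculation. The true-positive counts below are not stated in the text: they are inferred by rounding the reported sensitivities against the 55 simulated fusions, so they are an assumption, and Table 4 holds the authoritative values.

```python
N_SIMULATED = 55

# Sensitivity and false-positive counts as quoted in the text.
REPORTED = {
    "FusionFinder": {"sensitivity": 0.87, "false_positives": 15},
    "FusionMap": {"sensitivity": 0.58, "false_positives": 582},
    "Tophat-Fusion": {"sensitivity": 0.64, "false_positives": 5},
}

summary = {}
for tool, values in REPORTED.items():
    # Assumption: true positives recovered by rounding sensitivity * 55.
    tp = round(values["sensitivity"] * N_SIMULATED)
    fp = values["false_positives"]
    summary[tool] = {
        "true_positives": tp,
        "ppv": tp / (tp + fp),
        "false_positive_fraction": fp / (tp + fp),
    }

for tool, stats in summary.items():
    print(
        f"{tool}: TP = {stats['true_positives']}, "
        f"PPV = {stats['ppv']:.2f}, "
        f"FP fraction = {stats['false_positive_fraction']:.2f}"
    )
```

Under this assumption the false-positive fraction for FusionMap comes out at about 0.95, matching the 95% quoted in the text.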
For two of the fifty-five simulated fusions, the partner genes contained repeats of the same class at the transcript breakpoint. These gene fusions were detected by FusionFinder but were subsequently removed by the RepeatMask filter. However, both of these appeared in the FusionMap results, suggesting that FusionMap does not take the presence of repeats into account. This could explain the occurrence of so many false positives in FusionMap’s results. Both of these simulated fusions were also filtered by Tophat-Fusion.
It should be noted that when FusionMap and Tophat-Fusion detected a simulated fusion gene, they consistently detected all of its simulated fusion reads. In contrast, although FusionFinder detected more simulated fusion genes, it did not consistently detect all fusion reads. This is because FusionFinder analyses data from an alignment using Bowtie’s default parameters, which do not provide results for multi-mapping reads. This means that, given two genes from the same family sharing high sequence identity, a read has an equal chance of hitting either of these genes. As a result, the expected fusion reads are distributed between all synonymous fusions. We are working on a method to analyse alignments of multi-mapping reads, which will significantly increase the number of reads detected.










