We developed two fast motif-finding methods, MOST and MOST+. The system diagram of MOST/MOST+ is shown in Figure 1 and described in the Methods section. MOST uses only sequences as input, whereas MOST+ can integrate additional experimental data.
MOST+ and MOST were compared with several widely used motif-finding tools, such as MEME, Trawler, WEEDER, CisFinder and HOMER, in terms of speed and accuracy. Different fragments of the mESC ChIP-seq dataset and the genome-wide promoter regions of the mouse genome were used as the gold standard datasets (see Methods for more details).
MOST outperformed all tested algorithms in terms of speed on datasets of all sizes (Figure 2, Table S1). MEME [9] had not produced any result after running for 4 days on a 16 Mbps dataset. WEEDER failed to produce a result when the input sequence length exceeded 20 Mbps. CisFinder was fast, but restricted the input size to at most 25 Mbps. In short, MOST+/MOST and HOMER [25] (version 4.2) were able to produce results for large datasets in a reasonable time.
For all methods, default parameter settings were used when possible (details can be found in Additional file 1). Word lengths ranged from 4 to 7 bps, with the exception of WEEDER (up to 8 bps).
The running time of MOST+ is largely insensitive to the input size, because counting and locating words takes little time. Memory consumption remains relatively low as the data size grows up to 100 Mbps (space complexity is linear; MOST+ requires 500 megabytes of memory for an 8 Mbps input).
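The word counting and locating step can be illustrated with a minimal sketch. This is not the MOST/MOST+ implementation: a plain dictionary stands in for whatever index the tool actually uses, and the function names `index_kmers` and `count_kmers` are hypothetical. It only shows why, for a fixed word length, storing every occurrence needs memory proportional to the input length.

```python
from collections import defaultdict


def index_kmers(sequences, k):
    """Map each k-mer to the list of (sequence index, offset) where it occurs.

    Illustrative stand-in only; the real tool's index structure is not shown here.
    """
    positions = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            word = seq[start:start + k]
            # Skip words containing ambiguous bases such as N.
            if set(word) <= set("ACGT"):
                positions[word].append((seq_id, start))
    return positions


def count_kmers(positions):
    """Derive occurrence counts from the position index."""
    return {word: len(hits) for word, hits in positions.items()}


positions = index_kmers(["ACGTACGT", "TTACGN"], 4)
counts = count_kmers(positions)
```

Each input position contributes at most one entry to the index, so the total number of stored locations never exceeds the total sequence length.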
MOST/MOST+ was designed to bridge gaps and find long gapped patterns. For instance, Sox2 (canonically represented by CATTGTT) and Oct4 (also known as POU5F1, canonically represented by ATGCAAAT) often occur as a heterodimer (characterized by motifs of the form OCT-N-SOX or OCT-3N-SOX). These motifs were detected simultaneously (Figure S1).
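To make the notion of a gapped pattern concrete, the sketch below scans a sequence for two half-sites separated by a fixed number of arbitrary bases. It is an illustration under simplifying assumptions, not the clustering procedure used by MOST/MOST+: the half-sites are matched exactly against the canonical strings quoted above, only one strand and one orientation are considered, and the helper name `find_gapped` is hypothetical.

```python
import re

# Canonical representations quoted in the text.
OCT = "ATGCAAAT"
SOX = "CATTGTT"


def find_gapped(sequence, left, right, gap):
    """Return start offsets where `left`, `gap` arbitrary bases, then `right` occur.

    Exact matching only; real binding sites are degenerate, so this is a toy scan.
    """
    body = re.escape(left) + "[ACGT]{%d}" % gap + re.escape(right)
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + body + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


one_gap = "GG" + OCT + "A" + SOX + "CC"    # OCT-N-SOX arrangement
three_gap = OCT + "ACG" + SOX              # OCT-3N-SOX arrangement
hits_n = find_gapped(one_gap, OCT, SOX, 1)
hits_3n = find_gapped(three_gap, OCT, SOX, 3)
```

A word-based method that cannot bridge the gap would see the two half-sites as unrelated words; treating the spacer as a wildcard is what allows both arrangements to be reported as single composite patterns.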
MOST+ also successfully detected the motif of SMAD1, which occurs relatively rarely in the tested ChIP-seq peak regions. For OCT4, MOST+ successfully detected a tandem repeat of the OCT4 core motif (Figure S1).
Alternative TFBSs and palindromes were found for Esrrb (estrogen-related receptor beta) when a larger word width (K = 11) was employed to capture more sophisticated co-occurrence and a more stringent clustering threshold was set to discriminate sub-patterns (Figure S1).
MOST reported results similar to those of the widely used algorithms. After epigenetic signals were added (MOST+ mode), motifs aligned better with known motifs in databases and the rank of the assayed TF rose (Table 1).
We observed that spuriously over-represented k-mers (such as tandem repeats that may not be motifs in our dataset) were more likely to be associated with a higher level of noise and asymmetry (Figure 3a and Figure S2), indicating that these features may help eliminate false positives. Furthermore, similar motifs can have distinct mark distribution patterns (Figure 3b). Guided by epigenetic marks, it would therefore be possible to distinguish similar motifs.
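The following sketch shows one way such noise and asymmetry features could be computed from a mark profile centred on the occurrences of a word. The definitions here are illustrative stand-ins chosen for simplicity, not the definitions used by MOST+ (those are given in Methods); the function names `asymmetry` and `roughness` and the example profiles are hypothetical.

```python
def asymmetry(profile):
    """Mismatch between a non-negative profile and its mirror image.

    Returns 0.0 for a perfectly symmetric profile and at most 1.0.
    Illustrative definition, assumed for this sketch.
    """
    total = sum(profile)
    if total == 0:
        return 0.0
    mirrored = profile[::-1]
    return sum(abs(a - b) for a, b in zip(profile, mirrored)) / (2 * total)


def roughness(profile):
    """Sum of jumps between adjacent bins relative to total signal.

    Used here as a simple proxy for noise; assumed for this sketch.
    """
    total = sum(profile)
    if total == 0 or len(profile) < 2:
        return 0.0
    jumps = sum(abs(b - a) for a, b in zip(profile, profile[1:]))
    return jumps / total


smooth_symmetric = [1, 2, 3, 2, 1]   # hypothetical profile around a genuine motif
one_sided = [0, 0, 0, 4]             # hypothetical fully lopsided profile
jagged = [3, 0, 3, 0, 3]             # hypothetical noisy profile
```

Under these stand-in definitions, a word whose surrounding mark signal is lopsided or jagged scores high on one of the two features and could be down-weighted as a likely false positive.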
In general, MOST+ found more motifs that exist in the motif database than MOST did, which means that more assayed TFs or co-factors were found. According to Bieda et al. [26], the motif of E2f1 is very hard to identify, possibly because of indirect binding. However, MOST+ did report a motif (featuring CGCCAT) that ranked second and resembled the motif of the E2f-family member E2f3. Detection of n-Myc, c-Myc, Zfx and SMAD1 also benefited from epigenetic marks, which helped highlight the assayed motifs.
MOST+ and existing motif-finding tools were compared using the ChIP-seq datasets of 13 TFs in mouse ES cells.
A general comparison was conducted based on alignments with known motifs in the reference database. Comparison with five other algorithms showed that MOST+ was among the best algorithms both in capturing major TFBSs (the binding sites of the assayed TFs) and in detecting co-factors. MOST+ and HOMER identified 11 of 13 major motifs (with the major motif ranked first in the results) with a significant alignment e-value, whereas DREME reported 10 motifs (Table 2). MOST+ recovered a number of co-factors comparable to that of DREME, which was devised for finding co-factors (Table 3). Like CisFinder, MOST+ could automatically determine a self-adaptive length for each cluster.
The site-level accuracy of the motif-finding methods was assessed using the pipeline described in the Methods section (Figure 4). Figure S3 shows a comparison of the motifs found by the different methods. The results show that MOST+ has the highest AUROC sum over the 13 TFs, though the situation varies from TF to TF (Figure 5, Figure S4). With parameters learned from part of the dataset, MOST+ achieved better accuracy in recovering motif positions on validation data, suggesting that motifs could reflect actual binding sites better if external signals were available. The AUC (Area Under the Curve) of the ROC increased when word counts and mark distribution scores were combined under an appropriate parameter setting. This supports the idea that epigenetic marks can help cluster words [4].
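For readers unfamiliar with the metric, the sketch below computes AUROC as the probability that a randomly chosen true site scores higher than a randomly chosen non-site, and shows on toy numbers how combining two scores can raise it. The weighted sum in `combined_score` is an assumption made for illustration and is not the MOST+ model; the scores and labels are invented toy values, not results from this study.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positives and negatives (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def combined_score(count_score, mark_score, weight):
    """Hypothetical linear blend of a word-count score and a mark-distribution score."""
    return (1.0 - weight) * count_score + weight * mark_score


# Toy data: two true sites followed by two non-sites.
labels = [1, 1, 0, 0]
count_scores = [0.9, 0.4, 0.6, 0.1]
mark_scores = [0.2, 0.9, 0.1, 0.3]

blended = [combined_score(c, m, 0.5) for c, m in zip(count_scores, mark_scores)]
auc_counts = auroc(count_scores, labels)
auc_blended = auroc(blended, labels)
```

In this toy case the count score alone misranks one true site below a non-site, while the blended score separates the two groups; whether such a gain appears on real data depends on the weight, which is why the parameters are learned on a training partition.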
With the parameter setting learned from the training and validation datasets, we compared MOST+ with five other algorithms on the remaining data partition. The learned parameters are given in Table S2 (see Additional file 1 for more detail). We included one feature at a time in our model to test its contribution to the improvement in accuracy. The results show that using the asymmetry feature alone contributes the most to the overall improvement in AUROC, while using the original signals of the assayed TF ChIP-seq data failed to show any advantage.
To validate the power of our algorithm, we also applied the model to human data. On average, MOST+ achieved a better AUROC than MOST, with the exception of JUND 1 and 2 (Figure S5). When epigenetic signals were added to the model, MOST+ found some novel co-factors that MOST could not find. For instance, with the aid of DNase I hypersensitivity information, additional motifs that strongly resembled STAT1, GABPA, LM4 and LM130 (the latter two being long motifs reported by Xie et al.) were found in the VDR (Vitamin D Receptor) datasets (Figure 6).
To date, a large portion of TFBSs remains undetected. Moreover, poor performance has often been reported in capturing motifs in generic promoters [27]. With the power of epigenetic marks, we hoped that a comprehensive search for motifs in all regions close to transcription start sites (TSSs) might reveal novel motifs. We performed an exhaustive search on 89 Mbps of mouse promoter regions. When more hints from ChIP-seq data were available, we applied a more stringent word-clustering threshold and masked repeats (see Additional file 1 for MOST+ command lines).
The results showed that 117 motifs discovered by MOST+ (16% of all motifs found) could be aligned with one or more motifs in our mouse motif databases (JASPAR, UNIPROBE and Chen et al., 541 in total, Figure S6). The left panel of Figure 7 shows an example of motifs discovered by MOST+. MOST+ also output hundreds of novel motif candidates that were not found in the mouse motif databases. However, the mark distributions centred on some of these motifs show non-trivial shapes. This indicates either a putative novel motif or a potential sequence mark of epigenetic modifications (right panel of Figure 7).










