A range of analyses spanning multiple levels of abstraction have been carried out, to identify plausible drug targets. The methodology can also be used more generally as a target identification pipeline that would be applicable to many drug discovery programmes. Starting from the entire proteome of Mtb H37Rv comprising 3,989 proteins, we have shortlisted 451 proteins as potential drug targets using a variety of filters, as depicted in Figs. 1 and 2. Fig. 1 illustrates a pictorial view of the targetTB pipeline while Fig. 2 shows a simplified view of the pipeline as a flowchart, illustrating the flow of this study. We first carry out a network analysis, where a full genome-scale interactome encoding several types of protein-protein interactions and protein-protein influences from metabolic pathways is reconstructed. Gene deletions that would significantly disrupt the network are then identified (List-A1). Next, we have studied the reactome through FBA (List-A2), to identify lethal gene deletions. This is further augmented with high-throughput gene essentiality data (List-A3). These system-level analyses together comprise Filter A. This is then integrated with sequence-level (Filter B) and structural analyses (Filter C) as described below (see Fig. 1). The expression of the gene encoding for the target is highly desirable (Filter E) and the list is further pruned by eliminating targets with high similarities to known ‘anti-targets’ in the human proteome (Filter F) and proteins in gut flora (Filter G). Those targets known to contribute to drug resistance in the pathogen are then prioritised. By analysis of similarity against several pathogenic proteomes, broad-spectrum targets as well as those unique to Mtb have also been identified. Various filters, lists and the numbers of proteins passed and eliminated at the various stages of the pipeline are given in Table 2.
A protein-protein interaction network comprising 3,405 nodes and 29,302 edges was constructed, which covered over 85% of the Mtb proteome. To evaluate the importance of a given protein in the context of the large interactome network, each node was individually deleted and its impact measured in terms of the number of shortest paths that are disrupted. Shortest paths in a network are quite critical to the structure of the network. Shortest paths in metabolic networks of Mtb and M. leprae have been identified and analysed by us earlier [44]. Samson and co-workers have earlier analysed a protein network in Saccharomyces cerevisiae, indicating that the analysis of shortest paths may provide an idea of network navigability as well as the efficiency with which a perturbation can spread throughout a network [45]. More recently, Wingender and co-workers have illustrated the importance of a similar metric, a ‘pairwise disconnectivity index’, for topological analysis of regulatory networks [46]. The disruption of shortest paths is expected to have a substantial effect on the network connectivity in protein networks as well. In the interactome network studied here, most of the node deletions do not significantly disrupt network connectivity. However, substantial effects (more than 5,000 disrupted shortest paths) were observed upon deletion of 431 of the 3,405 nodes (List-A1). These 431 proteins, for which a critical role in maintaining interactome network structure is suggested, were taken through further steps of filtering, in order to identify most useful drug targets. For example, for BirA (Rv3279c), close to 95,000 shortest paths in the network, were disrupted by its removal. A complete list of these proteins is provided as supplementary material [See Additional file 4].
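The node-deletion analysis can be illustrated on a toy network. The sketch below is not the code used in this study: it assumes that a shortest path counts as "disrupted" when the hop distance between a pair of remaining nodes increases, or the pair becomes disconnected, after a node is deleted. The graph, the node names and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source, removed=None):
    """Hop distances from `source` to every reachable node, ignoring `removed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v != removed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def disrupted_pairs(adj, deleted):
    """Count node pairs whose shortest path lengthens or is lost when
    `deleted` is removed (assumed definition of a disrupted shortest path)."""
    others = [n for n in adj if n != deleted]
    count = 0
    for s in others:
        before = bfs_distances(adj, s)
        after = bfs_distances(adj, s, removed=deleted)
        for t in others:
            # s < t visits each unordered pair once
            if s < t and t in before and after.get(t) != before[t]:
                count += 1
    return count

# Hypothetical toy interactome: B links A to the C-D pair.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

impact = {node: disrupted_pairs(adj, node) for node in adj}
print(impact)
```

In this toy graph only the deletion of B disrupts any pairs, since A can reach C and D only through B. On the real interactome, the same count would be compared against the threshold of 5,000 disrupted shortest paths.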
An FBA study using the iNJ 661 model [19] identified 188 of the 661 proteins studied as essential for the growth of the bacterium, whereas an additional 41 also had a significant impact on growth (the in silico knock-out mutants were slow growers) [19]. A separate FBA study using an independently derived genome-scale metabolic model (GSMN-TB) identified 259 of the 719 proteins studied as essential for growth [20]. While these two models are similar in many respects, there are subtle differences in their biomass functions for FBA, as well as in their coverage of the Mtb proteome. A total of 134 proteins were common to both lists of essential proteins. A third FBA study (MAP), carried out by us previously for the mycolic acid pathway alone, identified 15 proteins in the pathway as essential for the microbe. Put together, the three studies suggest 318 proteins to be essential for the microbe. A critical role in maintaining the metabolism of the bacterium is suggested for these 318 proteins (List-A2). We have also carried out a double knockout study on the Mtb iNJ 661 model, identifying 49 pairs of genes which, when knocked out together, produce a lethal phenotype.
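The logic of single and double knockout screens can be sketched on a toy network. Note that the example below does not perform FBA, which solves a linear programme over reaction fluxes. It substitutes a much simpler Boolean surrogate in which a mutant "grows" if biomass is still reachable from the nutrient. The reactions, the gene names and the one-gene-per-reaction assumption are all hypothetical.

```python
from itertools import combinations

# (gene, substrates, products): hypothetical toy reactions, one gene each.
REACTIONS = [
    ("g1", {"nutrient"}, {"X"}),
    ("g2", {"nutrient"}, {"X"}),      # alternative route to X
    ("g3", {"X"}, {"biomass"}),
    ("g4", {"X"}, {"byproduct"}),
]

def can_grow(knocked_out):
    """Boolean surrogate for growth: is biomass producible from the nutrient
    using only reactions whose gene has not been knocked out?"""
    available = {"nutrient"}
    changed = True
    while changed:
        changed = False
        for gene, substrates, products in REACTIONS:
            if (gene not in knocked_out
                    and substrates <= available
                    and not products <= available):
                available |= products
                changed = True
    return "biomass" in available

genes = sorted({gene for gene, _, _ in REACTIONS})

# Single knockouts that abolish growth.
single_lethal = [g for g in genes if not can_grow({g})]

# Pairs that are lethal together although neither gene is lethal alone.
double_lethal = [
    pair for pair in combinations(genes, 2)
    if not set(pair) & set(single_lethal) and not can_grow(set(pair))
]

print(single_lethal, double_lethal)
```

Here g3 is lethal on its own, while g1 and g2 are individually dispensable but lethal as a pair. This is the kind of relationship that a double knockout screen is designed to reveal.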
A high-throughput analysis of gene essentiality, using Transposon Site Hybridisation (TraSH) mutagenesis has been reported earlier. Genes, whose deletion produced slow-growing mutants, were also identified. These proteins (List-A3), taken together with Lists A1 and A2, form the list of proteins (List A) that are implicated to be essential, by systems-level analyses. We have combined the essentiality data, rather than take a consensus from the different system-level models discussed above, since each model has its own strengths and weaknesses. Many proteins are eliminated from the pipeline at this stage. For example, MabA (Rv1483), which has been suggested as a potential drug target [47, 48], was not found to be essential in any of the systems-level analyses. MshA (Rv0486), suggested as an essential component of mycothiol biosynthesis and essential for growth in Mtb Erdman strain [49], is also not found to be essential in any of the systems-level studies.
At the sequence level, comparison with the human (host) proteome can be useful in filtering out those targets that have detectable homologues in the human cells, in order to reduce the risk of adverse effects that arise due to unintended interaction of the drug with the host protein. For 3,611 of 3,989 Mtb proteins, no close homologues were observed in the human proteome. The remaining 378 proteins, for which close homologues were observed, were eliminated at this step. The 3,611 proteins (List-B) were taken through further steps in the targetTB pipeline. Proteins such as KasA (Rv2245), KasB (Rv2246), MabA (Rv1483), RmlB (Rv3464), which have been suggested as potential targets in earlier studies, have all been eliminated at this stage, due to the presence of close homologues in the human proteome.
Combining the systems and sequence level analyses, 942 proteins were shortlisted for further analysis. A list of these proteins is presented as supplementary material [See Additional file 4].
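The way the lists are combined can be expressed with set operations. The sketch below uses hypothetical protein identifiers and shows the difference between pooling the three system-level lists (as done for List A) and taking their consensus, followed by the intersection with the sequence-level List-B.

```python
# Hypothetical protein identifiers, for illustration only.
list_a1 = {"P1", "P2", "P3"}        # network analysis
list_a2 = {"P2", "P4"}              # flux balance analysis
list_a3 = {"P2", "P5"}              # experimental essentiality data
list_b = {"P1", "P2", "P4", "P6"}   # no close human homologue

# Filter A pools the evidence: essential in ANY system-level analysis.
list_a = list_a1 | list_a2 | list_a3

# A consensus (essential in ALL analyses) would be far more restrictive.
consensus = list_a1 & list_a2 & list_a3

# Proteins passing both Filter A and Filter B.
shortlisted = list_a & list_b

print(sorted(list_a), sorted(consensus), sorted(shortlisted))
```

Pooling retains a protein supported by any one analysis, which is consistent with treating each model as having its own strengths and weaknesses.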
Similarity between proteins is better captured through structural comparisons, where structural data for both proteins are available. In fact, what ultimately matters in determining the pharmacological profiles of drug molecules is the recognition of the drug molecules by various protein molecules at their binding sites. It is therefore important to compare binding sites in the various protein molecules in both the pathogen and the host. At this step, we want to critically weed out targets that share very high similarity with binding sites from the human ‘pocketome’, since targeting these may lead to adverse drug reactions, due to inadvertent binding with human proteins.
This type of analysis would become more meaningful if carried out at the proteome-scale. Advances in crystallography and various structural genomics projects [50–52] have led to the determination of 229 and 3,515 structures of Mtb and human, respectively. In the absence of experimentally determined structures, high-confidence homology models for 2,808 Mtb proteins and 16,000 human proteins were obtained from the ModBase database. The availability of such a large number of protein structures in both species makes it feasible to carry out a proteome-scale structural assessment of targetability. Identification of binding sites and further comparison of the identified binding sites are the next two challenging steps towards this goal. Two new algorithms that we have recently developed, PD and PM, enable us to carry out this comparison.
Of the 942 proteins shortlisted earlier in the pipeline, 773 had structures available in the PDB/ModBase databases. For these 773 proteins, the top 10 binding sites for each protein, identified using PD were compared with the top three binding pockets from LigsiteCSC. LigsiteCSC considers amino acid conservation at the putative sites, in the family of proteins. This automatically leads to identifying residues and hence the sites that are likely to be functionally important. Finding a consensus among top predictions between the two methods increases confidence in site prediction significantly. Some proteins such as DesA3 (Rv3229c), EmbB (Rv3795) and AccE5 (Rv3281) passed all other tests, but were not included in the H-List of high-confidence targets, since the structural analysis could not be performed.
A consensus between PD and LigsiteCSC was obtained so as to identify the most probable pockets that also contained conserved amino acid residues at the binding sites. Using this, 3,500 pockets were identified for 767 of the Mtb proteins. A similar exercise carried out for the human proteins identified 70,149 pockets. An all-versus-all comparison of the ‘pocketomes’ of Mtb and human was performed, using PM. This translated to 245,521,500 pairwise comparisons, corresponding to over three years of serial CPU time; the computation was completed on a BlueGene system within a week.
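The size of the all-versus-all comparison follows directly from the two pocket counts. The short calculation below reproduces it and derives two rough figures that are not reported in the text: an average time per comparison and a minimum parallel speed-up. Both assume exactly three years of serial time and exactly one week of wall-clock time, so they are estimates only.

```python
mtb_pockets = 3_500
human_pockets = 70_149

# Every Mtb pocket is compared with every human pocket.
comparisons = mtb_pockets * human_pockets
print(comparisons)  # 245521500

# Assumption: "over three years" is taken as exactly three 365-day years.
serial_seconds = 3 * 365 * 24 * 3600
seconds_per_comparison = serial_seconds / comparisons

# Assumption: "within a week" is taken as exactly seven days.
min_speedup = (3 * 365) / 7

print(round(seconds_per_comparison, 2), round(min_speedup))
```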
A PM score of 0.8 or more indicates high similarity between two binding pockets. This threshold was used as a filter to eliminate all those proteins in Mtb whose pockets closely matched any pocket of any protein in the human proteome. Of the 767 proteins, 145 had closely matching pockets in the human proteome and were therefore eliminated from the pipeline. It is possible that some of these Mtb proteins contain some pockets that are sufficiently different from pockets of human proteins. Such proteins may also be targetable, but would require a closer and more detailed analysis of all the pockets in the protein. The remaining 622 form a list of targets for anti-tubercular drugs. These proteins were taken through further steps of filtering to produce lists of highly viable targets.
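The pocket-similarity filter can be sketched as follows. The protein names, pocket names and scores are hypothetical, and each score is assumed to be the highest PM score obtained for that Mtb pocket against any human pocket.

```python
PM_THRESHOLD = 0.8  # a score of 0.8 or more indicates high similarity

# Hypothetical data: best PM score of each Mtb pocket against all human pockets.
best_pm_scores = {
    "protA": {"pocket1": 0.35, "pocket2": 0.62},
    "protB": {"pocket1": 0.91, "pocket2": 0.40},
    "protC": {"pocket1": 0.80},
}

def passes_filter(pocket_scores, threshold=PM_THRESHOLD):
    """A protein passes only if none of its pockets closely matches a human pocket."""
    return all(score < threshold for score in pocket_scores.values())

passed = sorted(p for p, s in best_pm_scores.items() if passes_filter(s))
eliminated = sorted(p for p in best_pm_scores if p not in passed)

# Eliminated proteins may still carry pockets that differ from all human pockets.
distinct_pockets = {
    p: sorted(name for name, score in best_pm_scores[p].items()
              if score < PM_THRESHOLD)
    for p in eliminated
}

print(passed, eliminated, distinct_pockets)
```

In this toy data set, protB is eliminated because of pocket1 even though pocket2 is dissimilar to every human pocket. Proteins of this kind are the ones that would merit a closer pocket-by-pocket analysis.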
Thus, of the 767 proteins that passed the A and B filters described above and had available structures, only 622 of them were found to pass this filter. This is despite the fact that sequence filtering was already carried out, re-emphasising the need for a multi-level target identification and validation scheme. The resulting proteins form the D-List, of targets that can be further explored for TB drug discovery.
While the fundamental determinants of the quality of a target have already been considered earlier, the following aspects are also of importance in selecting a quality target for drug design. The following filters were therefore used to further prune the identified list and in some cases to enrich the list with targets having additional benefits.
It is obvious that a target would be desirable only if it is expressed in the organism, at least under disease conditions. Expression data is available for over 3,900 genes in Mtb from various studies [30–32]. Of the shortlisted targets in the D-List, 529 are expressed, indicating their high viability as suitable targets. It must be noted here that the expression data are not comprehensive, especially in terms of the conditions that have been tested. The expression filter, while useful in understanding what is expressed and hence what is a useful target, should not be used to rule out otherwise useful targets. Until availability of more comprehensive data, this step is best used at the post-identification analysis stage. For example, proteins such as TrpD (Rv2192c), AroA (Rv3227), RibC (Rv1412) do not appear to be expressed in any of the experiments considered.
An ideal target should not only be recognised specifically by the drug directed against it, but should also be sufficiently different from those host proteins that have been termed anti-targets. Considering this aspect early in the drug discovery pipeline may prove to be very useful in minimising the risk of failure of the drug candidates in the later stages of drug discovery. Anti-targets include proteins such as the transporters and pumps, which modify the bio-availability of a drug by their efflux action, or those proteins that trigger hazardous side effects, such as the hERG protein, which when blocked causes the ‘sudden death syndrome’ [33]. This list is by no means complete, but has been included here, more from a conceptual perspective, to highlight the need for screening against anti-targets. Sequence comparisons carried out against 306 sequences belonging to the eight categories of anti-targets revealed that sequence homologues at a similarity of 30% for over 30% of the query length were observed for 11 of the targets from the D-List. Such a loose similarity measure is used, since it is desired to rule out even a remote similarity with any anti-target. Moreover, close homologues have already been eliminated by sequence analysis earlier. A structural analysis of the proteins, when more data become available, would be of immense utility in this regard. Serine/Threonine protein kinases such as PknB (Rv0014c), earlier proposed as a target [53], PknL (Rv2176) and PknH (Rv1266c), as well as cytochromes such as Cyp128 (Rv2268c) and Cyp132 (Rv1394c) were eliminated at this stage.
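The loose similarity criterion used for the anti-target screen can be written as a simple predicate over alignment hits. In the sketch below, the record layout, the field names and the example hits are hypothetical. The treatment of the boundaries (similarity of at least 30%, coverage strictly above 30% of the query) is an assumption based on the wording of the criterion.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One alignment between an Mtb target (query) and an anti-target."""
    target: str
    anti_target: str
    percent_similarity: float
    aligned_length: int
    query_length: int

def resembles_anti_target(hit, min_similarity=30.0, min_coverage=0.30):
    """True if the hit meets the similarity cut-off over enough of the query."""
    coverage = hit.aligned_length / hit.query_length
    return hit.percent_similarity >= min_similarity and coverage > min_coverage

# Hypothetical hits, for illustration only.
hits = [
    Hit("target_1", "anti_1", 32.0, 120, 300),  # similar over 40% of the query
    Hit("target_2", "anti_2", 45.0, 50, 400),   # similar, but only 12.5% coverage
    Hit("target_3", "anti_3", 25.0, 300, 300),  # full coverage, low similarity
]

flagged = {hit.target for hit in hits if resembles_anti_target(hit)}
print(flagged)
```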
The targets from the D-List were further compared to the protein sequences of hundreds of organisms that inhabit the gut of a healthy human. This was carried out to prune the list of identified drug targets, so that the drugs administered do not bind unintentionally to the proteins of the gut flora. Unintentional inhibition of gut flora proteins is known to lead to adverse effects and can promote pathogenic colonisation of the gut [54]. Drug interactions with gut flora are also believed to be the cause of idiosyncratic drug toxicity and reduced bio-availability of the drug [55, 56]. Similarity of the identified targets to such proteins therefore affects their suitability. The sequence analyses carried out here indicate that 79 proteins from the D-List had close homologues in the gut flora and were hence removed from the list of most viable targets. For example, FtsZ (Rv2150c) and Glf (Rv3809c) have homologues in gut flora and were hence eliminated at this stage. Interestingly, Icl (Rv0467), which has been particularly suggested as an attractive drug target [57] and also implicated in persistence [36], fails at this stage, due to the presence of homologues in gut flora.
At this stage of filtering, from the 622 targets in the D-List identified earlier, 163 have been eliminated, leaving behind a high-confidence list of 451 targets (H-List). Several known targets appear in this list. A comprehensive analysis of the passage of several known targets in the targetTB pipeline has been performed. Some of these targets are indicated in Table 3, while the complete list is available as supplementary material [See Additional File 5].
The expression of targets in the H-List under conditions of persistence was analysed from a set of microarray data. Of the H-List targets, 216 were up-regulated two-fold or more in at least one of the studies considered. These 216 targets form the I-List of targets, which may be useful in combating persistent Mtb infection. Some examples of proteins in the I-List are DesA1 (Rv0824c), DesA2 (Rv1094), DevS (Rv3132c), FadD32 (Rv3801c), KatG (Rv1908c), Pks13 (Rv3800c), CysH (Rv2392) and Wag31 (Rv2145c). CysH has also previously been shown to be important for Mtb persistence [58].
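The persistence criterion, up-regulation of two-fold or more in at least one study, can be sketched as below. The gene names and fold-change values are hypothetical, and each list is assumed to hold one fold change per microarray study.

```python
FOLD_CUTOFF = 2.0  # two-fold or more up-regulation

# Hypothetical fold changes under persistence conditions, one value per study.
fold_changes = {
    "gene_a": [1.1, 2.4, 0.8],
    "gene_b": [1.9, 1.5],
    "gene_c": [2.0],
}

# A target qualifies if ANY single study shows sufficient up-regulation.
i_list = sorted(
    gene for gene, values in fold_changes.items()
    if any(value >= FOLD_CUTOFF for value in values)
)
print(i_list)
```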
Phylogenetic profiling of Mtb proteins against various genomes gives a measure of the uniqueness of a particular target to the Mtb proteome. Phylogenetic profiling can also help in identifying important functional linkages of chosen targets. It is also useful for identifying targets that can be used for designing broad-spectrum anti-bacterials. The 451 shortlisted targets were compared with 228 pathogenic bacterial genomes (provided as supplementary material [See Additional file 6]). If the Mtb target has close homologues in more than 100 genomes, we refer to it as a possible broad-spectrum anti-bacterial target (J-List). Several proteins involved in lipid metabolism are present in this list, viz. InhA (Rv1484), FabH (Rv0533c), FabD (Rv2243), PcaA (Rv0470c) and the MmaA’s 1–4. IspF (Rv3581), which has been suggested as an attractive target in many pathogens [59], is also in the J-List. A main concern with such a strategy of targeting a multitude of bacteria in clinical therapy is the emergence of resistance in multiple organisms, which is highly undesirable. However, if the emergence of resistance is countered, as discussed below, having broad-spectrum targets could be of great advantage.
Proteins that were present only in mycobacteria were also identified by this analysis (K-List). This list is rich in mycobacteria PPE proteins and also contains proteins such as DevS, a sensor histidine kinase involved in a two-component signal transduction pathway.
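The classification of targets from their phylogenetic profiles can be sketched as follows. The genome labels, the choice of which genomes are mycobacterial and the three example profiles are all hypothetical. Only the panel size of 228 genomes and the cut-off of more than 100 genomes come from the text.

```python
# Hypothetical labels for a panel of 228 genomes.
GENOMES = [f"genome_{i:03d}" for i in range(228)]
MYCOBACTERIAL = set(GENOMES[:10])   # assumed mycobacterial subset
BROAD_SPECTRUM_MIN = 100            # close homologues in more than 100 genomes

# Hypothetical profiles: genomes in which each target has a close homologue.
profiles = {
    "target_broad": set(GENOMES[:150]),
    "target_myco_only": set(GENOMES[:5]),
    "target_intermediate": set(GENOMES[:40]),
}

def classify(genomes_with_homologue):
    """Assign a target to the J-List, the K-List or neither."""
    if len(genomes_with_homologue) > BROAD_SPECTRUM_MIN:
        return "J"          # possible broad-spectrum target
    if genomes_with_homologue <= MYCOBACTERIAL:
        return "K"          # homologues found only in mycobacteria
    return "neither"

assignment = {name: classify(profile) for name, profile in profiles.items()}
print(assignment)
```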
In a recent study, we identified possible pathways that would be involved in the emergence of drug resistance in Mtb [43]. We also proposed the concept of ‘co-targets’, referring to those proteins, which when inhibited simultaneously with a corresponding primary target, will help in reducing the emergence of resistance to the drug binding to that primary target. The importance of any protein in the H-List identified here will significantly increase if it also happens to be a constituent of the resistance pathways. These pathways comprise proteins that are predicted to be either directly responsible for generating resistance to the given drug, or serve as an important hub in the flow of information from the target of the given drug to the machinery of resistance. Proteins in the resistance pathways broadly belong to one of the four mechanisms, which are mediated by cytochromes, SOS related genes, antibiotic efflux pumps and genes involved in horizontal gene transfer (HGT). The putative targets in the H-List were analysed for their proximity to resistance-related proteins in the protein-protein interaction network described in [43]. Of 451 proteins in the H-List, 25 were closely involved in the resistance pathways and would therefore be significantly more useful as drug targets. Some notable examples are PolA (Rv1629), a protein involved in the SOS response, a cytochrome Cyp121 (Rv2276), which is also connected to 19 other cytochromes, and SecY (Rv0732), a protein connected to DnaE1 (SOS) and two other proteins, SecA1 and SecA2, implicated in HGT. Table 4 gives a list of these proteins and their association with resistance related proteins.
The various filters and the corresponding analyses that have been applied in this study to arrive at the final lists of targets are listed in Table 2. Of the 3,989 proteins that have been annotated in the Mtb genome, 622 proteins pass the filters of systems and sequence analyses, as well as the structural assessment (D-List). These proteins are then screened to eliminate those which are not expressed, as well as those which have homologues in gut flora, or similarity to anti-targets in the human proteome. A final list of 451 proteins is arrived at, which comprises the H-List. Of these, 216 proteins satisfy persistence criteria (I-List), while 186 are potential broad-spectrum anti-bacterial targets (J-List), and 66 targets are unique to mycobacteria (K-List). Proteins for which the analysis could not be performed, due to lack of available data at this time, are separated as lists A’ and C’, which may be considered for analysis once more data become available. Proteins that have been eliminated at various stages could still find use as drug targets under different scenarios. For example, those proteins eliminated due to non-essentiality to Mtb (AX-List) may contain pairs of proteins that could together be essential and may hence be useful, if targeted concurrently. In fact, the double knock-out studies using FBA carried out here clearly demonstrate this aspect. Similarly, proteins that have been eliminated due to some structural similarity with human targets (CX-List) may be useful as drug targets if the structural differences between the host and pathogen proteins could be exploited.
The functional classes of the 451 targets (H-List) identified by this study are indicated in Fig. 3. The list is also available as supplementary material [See Additional file 4]. This list includes several known targets and many that have been proposed as potential targets. Some known targets have been eliminated because they have failed one or more filters in the targetTB pipeline. The passage of known and proposed targets for anti-tubercular drugs in the targetTB pipeline is detailed in Table 3 (also see Additional File 5). Some examples of proteins that are in the H-List include known targets such as InhA, EmbA and FabH, as well as many targets that have been proposed for anti-tubercular drug discovery, such as GlfT2, a bi-functional UDP-galactofuranosyl transferase, the fatty acid synthase Fas, the pantothenate kinase PanK, a glutamine-synthetase adenylyltransferase GlnE and the sensor histidine kinase DevS. The list also indicates several proteins that have been suggested as potential drug targets in literature, but eliminated from the targetTB pipeline on account of failing one or more of the filters.
It is interesting to note that of the 451 targets in the H-List, over a half of them belong to the functional classes of ‘lipid metabolism’ and ‘intermediary metabolism and respiration’. It has been said that metabolism has often not been given sufficient importance in ‘intelligent’ drug design [60]. Our analysis is in support of that observation, highlighting several targets from lipid metabolism, particularly the critical pathway of mycolic acid biosynthesis, amino acid biosynthesis, menaquinone biosynthesis and mycothiol biosynthesis. Several of the metabolites produced in these pathways are essential for mycobacterial survival and hence, the pathways producing these metabolites are ideal candidates for anti-tubercular drug discovery. Many of these pathways do not have equivalent pathways in the human, making them even more suitable candidates for targeting.
Desaturases DesA1 and DesA2, which have been shown by us to be hallmarks of the mycolic acid biosynthesis pathway in Mtb [61], pass all the filters and are present in the H-List. They are also present in the I-List of targets expressed during persistence. These proteins thus appear to be highly viable targets for anti-tubercular drugs. AcpS (Rv2523c), an acyl-carrier-protein synthase involved in mycolic acid biosynthesis, also passes all the filters and is a potential target. TrxB2 (Rv3913), a probable thioredoxin reductase and LysA (Rv1293), a diaminopimelate decarboxylase which catalyses the conversion of diaminopimelic acid to lysine and ThrB (Rv1296), a probable homoserine kinase, which are also ranked very high (ranked two, four and six, respectively) in the metabolic list of prioritised targets reported by Schreiber and co-workers [16], are also targets of interest.
Two computational studies, outlining strategies for target identification, particularly for anti-tubercular drugs, have been reported earlier [15, 16]. We present an overview of the passage of the targets suggested in these studies in the targetTB pipeline, also outlining the advantages of the targetTB pipeline over the previously reported methods.
Based on a sequence analysis study, comparing enzymes in metabolic pathways between human and Mtb, Pennathur and co-workers proposed 186 proteins as suitable drug targets. Of these, 51 feature in our H-List and 129 do not, while six could not be considered for lack of sufficient functional data. Some examples of the 51 targets featuring in the H-List are AcpS (Rv2523c), AtpC (Rv1311), FabH (Rv0533c), FbpA (Rv3804c), FolB (Rv3607c), IspE (Rv1011), KatG (Rv1908c), LeuA (Rv3710), MenC (Rv0553), PanB (Rv2225), PanC (Rv3602c), PpdK (Rv1127c), GlfT1 (Rv3782) and TrpA (Rv1613). An account of how each of the 180 proteins proposed as targets in the study reported by Anishetty et al (2005) fare in the targetTB pipeline is given as supplementary material [See Additional file 7].
Of the 129 targets that do not pass the filters used in our study, but were predicted by Anishetty et al, 77 have been eliminated due to their non-essentiality in Mtb, as predicted by systems-level analyses, clearly demonstrating the need for incorporating systems-level studies. Of the remaining 52, one had a close homologue in the human proteome and 16 had a PM score of 0.8 or more, leading to their elimination. Of the remaining 35, 14 are not expressed under any of the conditions tested in the experiments considered (studies [30–32]), while 18 of them had homologues in gut flora (five failing both expression and gut flora filters). For the remaining eight, structural assessment through PD-LigsiteCSC-PM was infeasible due to the lack of an appropriate model. These observations reiterate the need for a comprehensive multi-level analysis for target identification, as demonstrated by the targetTB pipeline.
Schreiber and co-workers have reported a study in which they prioritise all proteins in the Mtb genome for use as drug targets. Their ranking is based on a consideration of metabolic choke-points, in vitro essentiality for growth and druggability as judged by sequence similarity to proteins capable of binding small molecule ligands, besides sequence analysis to identify unique proteins. Some concepts are similar between our study and that of Hasan et al, but our study differs from theirs in a number of ways: (i) to start with, the goal in our study is to identify a very high quality list of drug targets that are also computationally validated, whereas Hasan et al have aimed to prioritise all proteins in Mtb for their feasibility as drug targets (ii) a pipeline has been developed that filters out proteins at every stage, leading to a final list of very high quality targets at the same time eliminating the need for a blind consideration of all proteins at all stages. The pipeline is also useful for considering proteins eliminated at different steps, if required, with necessary caution. (iii) a rigorous FBA and network analysis have been carried out in our study, making the systems-level analysis much more comprehensive (iv) a comprehensive structural assessment of 767 proteins of Mtb that passed other filters in the pipeline, against 15,830 different human proteins, has been carried out. New algorithms developed by us have been used to identify and compare pockets, again rendering the structural analysis efficient and more importantly feasible, since it considers only the relevant features that describe drug recognition. In addition, we have considered (v) elimination of proteins similar to anti-targets and also (vi) those important in countering the emergence of drug resistance.
Hasan et al have proposed three lists of prioritised targets, based on different scoring schemes. In the metabolic list proposed by Hasan et al, 146 of the targets from the H-List are present in the top 500. Of the rest, 82 were eliminated due to the presence of sequence homologues in the human proteome. 107 were non-essential by systems analysis, while for eight, no data was available. Of the remaining 154, 43 were not feasible for structural analysis, while 49 had a PM Score of 0.8 or more. Two of the proteins had similarities with human anti-targets. Of the remaining 62, 36 had homologues in gut flora and 32 were not expressed (6 failed both filters). As a result, the final list of proteins that we have identified (H-List) differs significantly from those proposed by Hasan et al. A report of how the top 500 targets in each of the three lists proposed by Hasan et al (2006) fare in the targetTB pipeline is given as supplementary material [See Additional file 8].








