abmc是什么A global ＼(Anopheles＼ gambiae＼) gene co-expression network constructed from hundreds of experimental conditions with missing values-上海聚慕医疗器械有限公司

This section presents a brief overview of the expression matrix, and describes the steps to construct the AgGCN1.0 based on hundreds of conditions from 30 publications [29, 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60], including (1) data pre-processing, (2) PCC thresholding, (3) edge weight assignment, and (4) final network selection.

The data set used for constructing the An. gambiae GCN, Anopheles-gambiae_EXPR-STATS_VB-2019-02, is based on the data set used by MacCallum et al. [12], which was updated by addition of several new conditions. Anopheles-gambiae_EXPR-STATS_VB-2019-02 is available through Vectorbase (vectorbase.org; [61, 62]) at https://tinyurl.com/mr38a7hj. Anopheles-gambiae_EXPR-STATS_VB-2019-02 is based on the AgamP4.11 annotation of the An. gambiae PEST genome, and includes log2-transformed expression values of 13,080 genes across 291 conditions, collected from 35 data sets [29, 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50, 50,51,52,53,54,55,56,57,58,59,60, 63,64,65,66,67]. Each publication contributed on average of eight conditions to the data set, ranging from 1 [53] to 52 [39].

The experimental methodologies varied widely among publications, exploring gene expression changes using different experimental platforms that sampled either across the entire or various sub-sections of An. gambiae transcriptome, using total RNA collected from various life stages, tissues, and physiological conditions. Expression data obtained with single-channel microarrays represented 62% of the data set [33,34,35, 37,38,39,40,41, 44,45,46,47, 50], of which 84% were obtained with the Affymetrix GeneChip Plasmodium/Anopheles Genome Array [33, 35, 37,38,39, 41, 44,45,46,47]. Expression data for the remainder of the conditions were assessed with either dual-channel microarrays (23% of conditions, [32, 36, 42, 43, 48, 50,51,52,53,54,55, 58, 63,64,65,66,67]) or by RNAseq (15% of conditions, [56,57,58,59,60]). With regards to life stages, 84% of all conditions were analyzed using samples derived from adults, 91% of those from female mosquitoes, a bias that is easily explained by the fact that only adult female mosquitoes transmit vector-borne disease pathogens [11]. Similarly, tissue-specific analyses focused most commonly on the adult female midgut (33% of the conditions that sampled individual tissues [32, 33, 36, 38, 42, 43, 57, 65]), as it constitutes a major bottle-neck for vector-borne pathogens after being ingested by the mosquito with a blood meal from an infected individual [68, 69]. Likewise, the conditions sampled a variety of physiologies that are integral to vector biology and its control, including blood feeding (9% of conditions, [33, 56, 59, 63]), parasite infection (14% of conditions, [36, 42, 43, 47, 48, 57, 65], and insecticide resistance (6% of conditions, [29, 51,52,53, 55, 64, 66, 67].

This initial data set was partially conflated as it contained 21 conditions [32, 34, 50] that were combined from other conditions in the same data set (e.g., a conflated condition presented a ratio of two other conditions in the data set). Therefore, these 21 conditions did not represent new data, and thus were removed from further analysis. In addition, one platform, the dual-channel microarray LIV A. gambiae DETOX 0.25k [30], interrogated the expression of only 226 genes or 1.7% of the An. gambiae transcriptome. The inclusion of the 13 conditions that used the LIV A. gambiae DETOX 0.25k microarray [63,64,65,66,67] led to a large number of missing values in the data set, and therefore were excluded from the analyses. After their removal, the final data set (Table S1, in Supplementary Materials https://github.com/KSUNetSE/AgGCN1.0) consisted of gene expression data across 257 conditions collected from 30 publications [29, 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60].

However, direct comparison of expression values across the 257 conditions in the data set was not possible, as the data distribution varied among conditions, not only due to the platforms used, but also due variations between experimental designs and procedures (Table S4, in Supplementary Materials https://github.com/KSUNetSE/AgGCN1.0/). Figure 1 shows the means and variances of expression values in each of the 257 conditions. The log2 mean gene expression values ranged from -0.9 to 10.9, with means for dual-channel microarray data usually around 0, single-channel microarray data ranging between 4.2 and 10.9, and RNAseq data ranging from -0.9 to 2.5 (Table S2, in Supplementary Materials https://github.com/KSUNetSE/AgGCN1.0/). The variance of log2-transformed expression values ranged between 0.01 and 16.5, with no correlation between mean and variance across the data set or data obtained by RNAseq. However, expression data obtained with dual-channel microarray platforms showed a strong positive correlation between expression mean and variance across conditions, while data from single-channel microarrays showed a strong negative correlation. This negative correlation can be largely explained by the data characteristics of individual dual-channel microarray platforms. Data obtained with both the OXFORD Anopheles gambiae Agilent 13k v1 and the Agilent A. gambiae 020449 44k v2 microarray had low means and high variance within conditions, data obtained with the Affymetrix GeneChip Plasmodium/Anopheles Genome Array showed means and variances in the middle range, and those obtained with the ND Anopheles gambiae Nimblegen 65k v1 microarray had high means and low variance. These observed differences in means and variance across experiments would result in gene expression correlation based on experimental design rather than on underlying gene regulation.

To equalize the means and variance of the data across all conditions, we performed a normalization step using the z-score, which is expressed as follows

where is the normalized expression value of gene k under condition i, is the raw expression value of gene k under condition i, and are the mean and the standard deviation of the expression value for all values in condition i, respectively.

Another property of the An. gambiae gene expression matrix was the highly variable number of paired elements between all possible gene pairs (Fig. 2). A paired element between gene a and b is defined as the expression value pair , where i is a specific condition, with . If either or is missing in the expression matrix, there is no paired element between gene a and b for condition i. The distribution of paired elements across the entire z-score-normalized data set was highly heterogeneous, displaying a bimodal distribution with a gap between 118-139 paired elements. This gap is explained by the properties of the platforms used to obtain the gene expression data. The transcripts of 2,813 annotated genes in the data set were not represented on the Affymetrix GeneChip Plasmodium/Anopheles Genome Array, which was used to analyze gene expression in 139 out of 257 total conditions [33, 35, 37,38,39, 41, 44,45,46,47]. Therefore, each of these 2,813 genes in the data set had an expression vector length that is 118, and the 10,433 genes represented on the array had a vector length 139. While z-score normalization addressed the differences in means and variance of the expression matrix, the variable paired element length remained and had important implications on edge selection and edge weight assignment, which are discussed below.

To determine the edges of the AgGCN1.0, we used the Pearson correlation coefficient (PCC), which is a co-expression measure used commonly to detect edges for nodes in gene co-expression networks [26, 31, 70]. We calculated PCCs for all possible gene pairs in the data set, and constructed an initial gene expression correlation matrix, which included all PCCs. From this matrix, we removed the PCCs from all gene pairs whose expression vectors had a paired element length of 4.

The PCC was calculated as

where n is the number of paired elements between gene x and y, is the normalized expression value of gene x under condition i, is the mean of gene x across all paired elements, is the normalized expression value of gene y under condition i, is the mean of gene y across all paired elements.

From Eq. (2), it is apparent that the range of the PCC is [-1, 1]. It is assumed that the two vectors are negatively and positively correlated when close to -1 or 1, respectively. If the two node-vectors are assumed independent, the density function of PCC is then expressed as

where is the gamma function, and is the degrees of freedom.

Figure 3 shows the probability density distribution of the PCC as a function of paired element length. The standard deviation of PCCs decreases with an increase in the number of paired elements. As an example, selecting the top 3rd percentile of all PCCs at a given paired element number lead to a threshold PCC of 0.61 for ten paired elements, and a threshold PCC of 0.12 for 250 paired elements (Fig. 3).

In GCN construction, an edge between a gene pair is included if their expression vectors are deemed to have a high PCC, most commonly using a fixed threshold, such as 0.8 [26]. However, a fixed threshold is only valid when the number of paired elements in the expression matrix is constant. Given the heterogeneous distribution of paired element length in the An. gambiae expression matrix, a fixed threshold of PCC would favor edges between gene pairs with fewer paired elements and not capture the edges with larger numbers of paired elements, which arguably have stronger experimental support.

To avoid this problem, we selected gene pairs among the top percentages of PCC values and used a sliding threshold based on the number of paired elements. To do so, we divided the PCCs into 26 groups according to the interval of the paired element length values, i.e., [4, 10], [11, 20], [21, 30], …, [251, 257]. The scattered points in Fig. 4 show the top 0.5%, 1%, 2%, and 3% of PCCs of all gene pairs in the 26 intervals. To assign a threshold of correlation between the expression vectors of any given paired element length, we used a curve that fitted the scattered points. The gene pairs with PCCs above the fitted curve were assigned edges in the GCN. The equation for the curve is

where , , , and are the four parameters that were fitted, and x is the number of paired elements. This equation provided a good trade-off between the accuracy of the fitting and the number of parameters to estimate. We optimized the four parameters , , , and iteratively, by (1) fixing parameters and , and optimizing and , and (2) fixing and , and optimizing and . We evaluated the performance of the four parameters by maximizing the coefficient of determination, . The solid lines in Fig. 5 show the fitted curves for the top 0.5%, 1%, 2%, and 3% points, and the optimized parameters , , , and are shown in Table 1. Optimized parameters did not change substantially when a higher interval number was chosen. For example, using 52 instead of 26 intervals to calculate the top 3% sliding threshold curve resulted in the same values for the , , and minimally changed , and (52 intervals: 7.6 and 21.1; 26 intervals: 7.5 and 21.2).

Edges were included in the network if their PCC (1) was above the sliding threshold, and (2) had a p-value smaller than , where and m are the significance level equal to 0.05, and the number of gene pairs of each interval, respectively (Bonferroni-correction, [71, 72]).

With the goal to provide a more informative network structure that identifies the relative strength of co-expression between gene pairs, we constructed a weighted network. To do so, we assigned a weight to every detected edge, representing the similarity score of co-expression based on PCC and the number of paired elements. As the edges detected with fewer paired elements tended to have higher PCCs than the edges with more paired elements (Fig. 4), we rescaled the PCCs to equalize the means and medians of the PCCs across the 26 intervals. Figure 5 shows the PCC-rescaling curves for the sliding thresholds. The dashed lines are the PCC-sliding threshold curves represented by Eq. 4, and the solid lines are the corresponding rescaling curves, given by the following equation

where , z is the range of the number of paired elements. This curve reduced the weight of those edges that were detected with a smaller number of paired elements. The edge weight was computed as

We assigned weights to all edges that were selected based on the sliding thresholds of top 0.5%, 1%, 2%, and 3% PCCs. Figure 6 confirms that the rescaling curve normalized the means and medians of edge weights across all intervals. As expected, the distribution of the PCC means and medians of the selected edges across the 26 intervals of paired element lengths followed a similar pattern to those observed for the PCC threshold curves (Figs. 6a and 4). After the rescaling of edge weights using Eq. (6), the means and medians of the edge weights in each interval were indeed similar, and thus comparable across the intervals (Fig. 6b), confirming the validity of the rescaling approach.

To determine the PCC threshold for final network construction, we built and analyzed four networks with the top 0.5%, 1%, 2%, and 3% sliding thresholds as introduced in Sect. 2.3. Table 2 shows the statistical properties of the four networks. The curated data set contained expression data for 13,080 genes (nodes) annotated in the An. gambiae genome. Table 2 shows the number of nodes connected with at least one edge, and the node percentage represents the corresponding percentage of genes present in the network. The LCC is the largest connected component of the network. The network density is defined as

where n is the number of nodes in the network.

All four networks contained more than 90% of nodes, with small node percentage increase with increasing thresholds. In contrast, the number of edges and network density nearly doubled with each threshold increase. Based on these results, we selected the network with the most stringent edge selection criterion (top 0.5% sliding threshold), as the AgGCN1.0 network. This network maintains its structure with a large number of nodes connected with the smallest number of edges. The AgGCN1.0 network, therefore, contains nearly all genes encoded in the Ag. gambiae genome and shows co-expression between pairs of genes only when their expression vectors have very high correlation.

abmc是什么A global ＼(Anopheles＼ gambiae＼) gene co-expression network constructed from hundreds of experimental conditions with missing values

相关推荐

作者介绍

聚慕医疗

热门文章

切换注册登录

切换登录注册