Abstract
Time-course gene expression datasets provide insight into the dynamics of complex biological processes, such as immune response and organ development. It is of interest to identify genes with similar temporal expression patterns because such genes are often biologically related. However, this task is challenging due to the high dimensionality of these datasets and the nonlinearity of gene expression time dynamics. We propose an empirical Bayes approach to estimating ordinary differential equation (ODE) models of gene expression, from which we derive a similarity metric between genes called the Bayesian lead-lag R2 (LLR2). Importantly, the calculation of the LLR2 leverages biological databases that document known interactions amongst genes; this information is automatically used to define informative prior distributions on the ODE model’s parameters. As a result, the LLR2 is a biologically-informed metric that can be used to identify clusters or networks of functionally-related genes with co-moving or time-delayed expression patterns. We then derive data-driven shrinkage parameters from Stein’s unbiased risk estimate that optimally balance the ODE model’s fit to both data and external biological information. Using real gene expression data, we demonstrate that our methodology allows us to recover interpretable gene clusters and sparse networks. These results reveal new insights about the dynamics of biological systems.
1. Introduction
Time-course gene expression datasets are an essential resource for querying the dynamics of complex biological processes, such as immune response, disease progression, and organ development (Bar-Joseph et al. Citation2012; Purvis and Lahav Citation2013; Yosef and Regev Citation2011). Such datasets, now abundantly available through techniques such as whole-genome RNA sequencing, consist of gene expression measurements for thousands of an organism’s genes at a few (typically 5-20) time points. Experimental evidence has revealed that groups of genes exhibiting similar temporal expression patterns are often biologically associated (Eisen et al. Citation1998). For instance, such genes may be co-regulated by the same transcription factors (Tavazoie et al. Citation1999): proteins that directly control gene expression, which ultimately contributes to changes in cellular function. Identifying clusters or networks of genes with related temporal dynamics, which is our objective in this study, can therefore uncover the regulators of dynamic biological processes. Doing so can also help generate testable hypotheses about the roles of orphan genes that exhibit similar expression patterns to ones that are better understood.
The complex, nonlinear time dynamics of gene expression pose a significant challenge for clustering and network analysis in genomics. Groups of interacting genes may be expressed with time lags or inverted patterns (Qian et al. Citation2001) due to delayed activation of underlying transcription factors, making it difficult to measure the “similarity” in two expression profiles. Ordinary differential equations (ODEs) or discrete-time difference equations have been successfully used to model the nonlinear expressions of a small number of genes (Chen et al. Citation1999; Bansal et al. Citation2006; D’haeseleer et al. Citation1999; De Jong Citation2002; Polynikis et al. Citation2009). It is possible to derive similarity metrics for the time dynamics of two genes from such ODEs, thus enabling putative identification of co-regulated genes and the reconstruction of regulatory networks (Farina et al. Citation2007, Citation2008; Wu et al. Citation2019). In particular, the approach proposed in Farina et al. (Citation2007) allows explicit modeling of lead-lag as well as contemporaneous associations between gene expression trajectories. We hence use it as the basis of the similarity calculations in our proposed clustering framework.
The high dimensionality (number of genes) and small sample sizes (number of time points) of time-course gene expression datasets pose another obstacle to identifying genes with similar expression dynamics. Due to the size of these datasets, the number of gene pairs receiving high similarity scores by any method can be overwhelmingly large. High similarity scores are typically validated for biological relevance using annotations provided by extensive public and commercial curated databases that assign genes to functional groups. For instance, Gene Ontology (GO) annotations are keywords that describe a gene’s molecular function, role in a biological process, or cellular localization (Ashburner et al. Citation2000). Other curated databases include KEGG (Kanehisa and Goto Citation2000), Reactome (Fabregat et al. Citation2018), BioCyc (Karp et al. Citation2019), and STRING (Szklarczyk et al. Citation2019). To ease the burden of manually validating a potentially vast number of gene-gene associations, we propose a Bayesian clustering technique that uses annotations as prior information to automatically validate these associations. Incorporating such information into a clustering method can encourage gene pairs with known biological associations to receive higher similarity scores, while filtering away those known to be unrelated. This also allows for knowledge gleaned from gene expression time series data to be contrasted with other knowledge bases; for instance, two genes with highly similar temporal expression patterns may not have been considered associated in previous cross-sectional (single time point) studies on which annotations are based, or vice versa.
There exist in the literature a few approaches to integrating biological knowledge with statistical measures of genetic association. One line of research considers Bayesian methods that use external data sources to determine prior distributions over genes or proteins that influence a biological response (Hill et al. Citation2012; Li and Zhang Citation2010; Lo et al. Citation2012; Peng et al. Citation2013; Stingo et al. Citation2011). Other studies develop biologically-informed regularization terms in graph-regularized methods for reconstructing gene networks (Li and Jackson Citation2015; Zhang et al. Citation2013). In another work, Nepomuceno et al. (Citation2015) propose an algorithm for biclustering gene expression data using gene ontology annotations. However, less attention has been given to using both data and prior biological knowledge to identify and model dominant patterns in the complex temporal dynamics of gene expression, e.g. with ODEs.
Our technical contribution in this work is a Bayesian method for constructing biologically-meaningful clusters and networks of genes from time-course expression data, using a new similarity measure between two genes called the Bayesian lead-lag R2 (LLR2). The Bayesian LLR2 is derived from ODE models of temporal gene expression, and is based on associations in both the time-course data and prior biological annotations. The balance between data and prior information is controlled by data-driven hyperparameters, making our approach an empirical Bayes method. As indicated by the name, the Bayesian LLR2 is based on the familiar R2 statistic (the coefficient of determination) and is simple and fast to compute for all gene pairs, where N is the number of genes under study. Importantly, external biological information regularizes the set of significant gene-gene associations found within a time-course dataset. In , for instance, we present a network of 1735 genes in Drosophila melanogaster (fruit fly) constructed both without external information, using an ordinary least-squares version of the LLR2 proposed by Farina et al. (Citation2008), and with external information, using our proposed Bayesian LLR2; the latter is a noticeably sparsified revision of the former, and retains only edges connecting genes with either known or highly plausible biological relationships.
The remainder of this paper is organized as follows. Section 2 describes the ODE model of temporal gene expression that we adopt. Section 3 details our empirical Bayes method for fitting the ODE model and obtaining our proposed LLR2 metric. Section 4 demonstrates the application of our method to real gene expression data collected by Schlamp et al. (Citation2021). We recover a tradeoff between immune response and metabolism that has been observed in several studies and present examples of biologically-meaningful gene clusters identified with the Bayesian LLR2. We also discuss the method’s potential to conjecture new data-driven hypotheses about gene-gene interactions. Sections 4.3 and Appendices D and E compare the Bayesian LLR2 to its non-Bayesian counterpart as well as the commonly used Pearson correlation between genes. Proofs and additional examples are provided in the Appendix.
2. Dynamic Models of Gene Expression
2.1. Model Derivation
We consider an ODE model of gene expression proposed by Farina et al. (Citation2007). Let denote the expression of some gene A at time t, measured for instance as the -fold change in mRNA levels relative to time 0. The model assumes the rate of change in gene A’s expression is given by some regulatory signal p(t): where κA denotes the mRNA decay rate for gene A, and ηA denotes natural and experimental noise. This model can be naturally extended to consider two genes A and B that might be associated with one another, i.e. that are governed by the same underlying p(t), yielding a pair of coupled differential equations: (1) (1) (2) (2)
The common signal p(t) accounts for the effect of one or more transcription factors that potentially regulate both genes A and B. The coefficients αA and αB measure the strength of p(t) in the expression patterns of genes A and B, respectively. βA and βB are affine coefficients that allow and to reach a non-zero steady state when p(t) = 0, i.e. in the absence of a regulatory signal.Footnote1
We now obtain a model of gene A’s expression in terms of gene B’s expression by first rearranging (2) to isolate p(t): (3) (3)
Substituting (3) into (1) yields
Integrating from 0 to t, we obtain: (4) (4) where and
Observe that (4) is linear in the parameters ck. Thus, given measurements and of the expression levels of genes A and B at time points we can express (4) as the linear model where (5) (5) and where denotes the n × n identity matrix; it is natural to assume that ηA and ηB are centered Gaussian processes with independent increments, in which case the integral defining ε above is normally distributed. Although only samples from the functions and are given, we can estimate the integral entries of the second and third columns of X by numerically integrating spline or polynomial interpolants fitted to these samples.Footnote2
In fitting the model (4) to the expression data of genes A and B, we obtain the ordinary least-squares (OLS) estimator The amount of variation in gene A’s expression that is captured by the estimated linear model is expressed as the familiar R2 value: where and denotes the vector of ones. Adopting the terminology in Farina et al. (Citation2008), we refer to this R2 as the lead-lag R2 between genes A and B. A high lead-lag R2 may indicate that the two genes are co-regulated in time by common transcription factors, or are at least associated with one another in some way. The term “lead-lag” comes from the lead-lag compensator in control theory. In this context, a “lead-lag relationship” between genes refers to the presence of a common regulatory signal (input) that, in conjunction with the process of mRNA decay, modulates the expression of genes with the same biological function (output).
2.2. Motivating the Bayesian Lead-Lag R2
Our primary contribution in this work is a biologically-informed method for clustering genes based on their temporal dynamics. Clustering involves measuring the similarity between two objects, which can also be thought of as defining an edge between two nodes in an undirected network. Our similarity measure is a Bayesian version of the lead-lag R2 that uses both temporal expression data for genes A and B as well as a prior indication of whether they are biologically associated.
We can motivate the Bayesian lead-lag R2 via , which shows examples of false positive gene pairs: genes that have a spuriously high lead-lag R2, but do not have similar expression patterns nor a biological relationship. The data comes from an experiment on fruit flies by Schlamp et al. (Citation2021) that aimed to profile the dynamics of genes involved in or affected by immune response immediately following an infection. More details on this dataset can be found in Section 4.2.
Spuriously high lead-lag R2 values are likely to arise in large datasets. For example, if gene A’s expression levels increase or decrease monotonically with time, the response vector Y in (5) will be highly correlated with the time integrals and time points in the third and fourth columns of X. The lead-lag R2 between genes A and B will be large, but not because the genes are associated either in time or biologically.
In contrast to gene pairs of the kind shown in , we can consider , which displays genes from two well-studied functional groups, known as pathways: circadian rhythms and immune response. Within each group, we expect to see high pairwise lead-lag R2 values (true positives).
Incorporating pathway membership or protein-protein interaction networks into the lead-lag R2 enables us to encourage genes in the same pathways to receive higher pairwise similarity scores, thus separating true positives from false positives. Importantly, we can also suggest possible pathways for previously uncharacterized genes. In our method, external biological information is used to determine the locations of normal prior distributions on the parameters in the model (4). Upon obtaining the posterior estimates of these parameters, we recompute the lead-lag R2 to obtain the Bayesian lead-lag R2 (LLR2). In doing so, we observe a desirable shrinkage effect in the distribution of the Bayesian lead-lag R2 values that pares down the number of significant associations. We next detail the hierarchical model and its hyperparameters in Section 3.
3. Empirical Bayes Methodology
In this section, we propose our empirical Bayes approach to deriving biologically-informed similarity metrics between genes for clustering and network analysis. The components of our method are: (1) encoding external biological information into a prior adjacency matrix, (2) defining a normal-inverse gamma prior, specifically Zellner’s g-prior, on the parameters of the ODE-based model of gene expression (4), (3) optimally selecting the hyperparameter g in Zellner’s g-prior, and (4) calculating the Bayesian lead-lag R2. Note that parts 2–4 lead to the computation of the Bayesian lead-lag R2 for a single gene pair; a summary of the full algorithm for all pairs is provided in Appendix B.
3.1. Part 1: Leveraging Biological Priors
There exist numerous databases that extensively document known or predicted interactions between genes as well as their functional roles. For instance, Gene Ontology (GO) terms are keywords that describe a gene’s molecular function, role in a biological process (e.g. “lipid metabolism”), or cellular localization (e.g. “nucleus”) (Ashburner et al. Citation2000). Semantic similarity methods, such as the R package GOSemSim (Yu et al. Citation2010), have been developed to determine how related genes are based on their associated GO terms. Other curated databases that similarly assign genes to pathways include KEGG (Kanehisa and Goto Citation2000), Reactome (Fabregat et al. Citation2018), and BioCyc (Karp et al. Citation2019). The STRING database (Szklarczyk et al. Citation2019) aggregates multiple sources of information to generate a more holistic measurement of the association between two genes. For each pair of genes in an organism, STRING provides a score between 0 and 1 indicating how likely the two genes’ encoded proteins are to interact physically based on experimental evidence, belong to the same pathways according to the aforementioned databases, be conserved across species, or be mentioned in the same studies.
Regardless of which sources of external biological information one employs, the first step of our method is to encode this information into a matrix W of size N × N, where N is the total number of genes under study. The entries of W are: Intuitively, W can be viewed as an adjacency matrix for a network that reflects the known relationships amongst the genes. Our proposed method uses this network as an informed basis for the network constructed from the time-course gene expression data itself. Importantly, the data can indicate what the “NA” entries of W ought to be, as well as confirm (or refute) the “0” or “1” entries.
In the event that W consists largely of “NA” entries, one can also use time-course gene expression data from previous studies to fill in some of them. Such datasets often consist of multiple replicates of the expression measurements, possibly gathered under the same experimental conditions to account for sampling variation, or different conditions to assess the effect of a treatment.
3.2. Part 2: The Normal-Inverse Gamma Model and Zellner’s g-Prior
Recall from Section 2.1 that given measurements and of the expressions of two genes A and B at times the model we aim to fit is where are unknown parameters, and the integrals and are estimated by numerically integrating spline interpolants of the given data. As seen in (5), this model can be represented in matrix form as where
Since c1 and c2 link the expressions of genes A and B, we intuitively ought to place priors of non-zero mean on these two parameters if W indicates that the genes are associated. To do this, we adopt the normal-inverse gamma model for and which is used frequently in Bayesian regression and allows for flexible modeling of the prior mean and covariance matrix of The normal-inverse gamma model specifies and where is a positive semidefinite matrix, and It is then said that jointly follow a normal-inverse gamma distribution with parameters This is a conjugate prior, so the posterior distribution of is also normal-inverse gamma with parameters defined as (6) (6) (7) (7) (8) (8) (9) (9) That is, the conditional posterior of given and the posterior of are and The posterior mean can be used as the estimated regression parameters. This estimator has the desirable properties of consistency and asymptotic efficiency in large samples and admissibility in finite samples (Giles and Rayner Citation1979).
The hyperparameters and are of particular interest as they allow us to incorporate biological information into our model. In defining recall that we wish to place priors with non-zero mean on the parameters c1 and c2 when external sources suggest that genes A and B are co-regulated or at least associated. We noted in Section 2.1 that c1 represents the ratio where αA and αB denote the strength of a common regulatory signal in the first-order dynamics of the two genes. If the genes are associated, it is reasonable to believe a priori that implying Then we could also say a priori that since c2 represents a parameter that is proportional to c1. When prior information about genes A and B is lacking or suggests that they are unrelated, a prior mean of zero for c1 and c2 is appropriate. Supposing genes A and B are the and genes in the dataset, we can thus set the prior mean as follows:
As for the prior covariance matrix of we first note that for the linear model where has the covariance matrix the least-squares estimator has the covariance matrix A popular choice for is therefore where g > 0. This choice of yields a particularly tractable special case of the normal-inverse gamma model known as Zellner’s g-prior (Zellner Citation1986). Substituting this choice of into the posterior mean in (6) and covariance matrix in (7), we obtain (10) (10) (11) (11) (10) Reveals that under Zellner’s g-prior, the posterior mean is a convex combination of the prior mean and the least-squares estimator The parameter g balances the weights placed on external information encoded by and on the data used to compute so the selection of g is an important component of our modeling strategy. We next describe our data-driven approach to choosing it.
3.3. Part 3: Optimal Selection of g in Zellner’s g-Prior
Several methods for choosing g in Zellner’s g-prior have been proposed previously. For instance, George and Foster (Citation2000) discuss an empirical Bayes method in which one selects the value of g that maximizes the marginal likelihood of Y. Liang et al. (Citation2008) provide a closed form expression for this maximizing value of g that is nearly identical to the F-statistic for testing the hypothesis that They show that this maximum marginal likelihood approach has the desirable property that the Bayes factor for comparing the full model to the null (intercept-only) model diverges as the R2 approaches 1. For our application, one concern is that the F-statistic defining this particular estimate of g is likely to be overwhelmingly large for many gene pairs. If g is large, then will be very close to according to (10). As a result, any biological evidence of association captured by will have a negligible impact on the model.
A fully Bayesian approach to selecting g involves placing a prior distribution on g that is then integrated out when defining the prior on This method is motivated by Zellner and Siow (Citation1980), who propose placing a Cauchy prior on to derive a closed-form posterior odds ratio for hypothesis testing purposes. These priors can be represented as a mixture of g-priors with an inverse gamma prior on g, although a closed-form posterior estimate of g is unavailable. Cui and George (Citation2008) and Liang et al. (Citation2008) alternatively consider a class of priors of the form for a > 2, known as the hyper-g priors. Under these priors, the posterior mean of g can be expressed in closed form in terms of the Gaussian hypergeometric function.
Because our application involves fitting regression models for potentially thousands of gene pairs, the computational cost of fully Bayesian methods for selecting g requires us to consider alternative approaches. One idea is to select the value of g that minimizes the sum of squared residuals where is the vector of fitted values: (12) (12) where and However, we found that there are no analytical solutions to Instead, we can minimize Stein’s unbiased risk estimate (SURE), which is an unbiased estimate of Below is a rephrased version of Theorem 8.7 in Fourdrinier et al. (Citation2018), which defines SURE for the general problem of estimating using a linear estimator of the form This theorem statement differs from its original version in that we have rewritten the divergence of with respect to Y using the generalized degrees of freedom developed by Efron (Citation2004).
Theorem 3.1
(SURE for linear models). Let where the dimensions of X are n × p, and let be a weakly differentiable function of the least squares estimator such that can be written in the form for some vector a and matrix S. Let Then, (13) (13) is an unbiased estimator of
From (12), observe that we can write as where Therefore the matrix S in Theorem 3.1 is whose trace is SURE in (13) then becomes (14) (14) where we have also substituted the posterior mean in (10) for
We next present the value of g that minimizes SURE.
Theorem 3.2
(SURE minimization with respect to g). The value of g that minimizes SURE in (14) is
The proof of Theorem 3.2 is provided in Appendix A.
It is quite possible that is small, resulting in a large value of (i.e. ) in Theorem 3.2. In this case, in (10) will be largely influenced by the data via rather than by prior information via This is desirable when the relationship between the two genes is unknown (i.e. ), but not if the relationship is known to be unlikely (i.e. ). In the latter case, we prefer to shrink the regression coefficients towards the prior mean so as to yield a smaller lead-lag R2 value. To address this, we set g conditionally on as (15) (15) where is defined according to Theorem 3.2.
3.4. Part 4: Computing the R2 for Bayesian Regression Models
Once a posterior estimate of the model coefficients (10) has been obtained, with the parameter g selected optimally, we can compute the Bayesian lead-lag R2 between genes A and B.
Recall that for a linear model where is estimated by the least-squares estimator the R2 is defined as (16) (16) where In Bayesian regression, however, the standard decomposition of total sum of squares into residual and regression sums of squares no longer holds. Thus, when we replace with an estimator such as the posterior mean of the formula (16) can potentially yield As a remedy to this issue, we compute the R2 as the ratio of the variance of the model fit to itself plus the variance of the residuals. This ratio is within by construction, and is given by (17) (17) where, for a vector with mean we define This calculation is based on the approach to computing R2 for Bayesian regression models proposed in Gelman et al. (Citation2018). For our application, we refer to (17) as the Bayesian lead-lag R2 (LLR2).
3.5. Clustering and Empirical Analysis with the Bayesian Lead-Lag R2
Given a dataset of N genes whose expressions are measured at n time points, our objective is to cluster the genes based on their temporal expression patterns. To do this, we compute a N × N similarity matrix S where stores the Bayesian LLR2 in (17) between the and genes. We then apply hierarchical clustering to the distance matrix where J is the N × N matrix of ones.
Note that the Bayesian LLR2 is asymmetric: Here, denotes the Bayesian LLR2 where we treat the gene as the response (gene A) in the model (4). This asymmetry could potentially be informative. In particular, could possibly suggest that the expression of the gene is driven by that of the gene; that is, a model in which the gene is treated as the response is a better fit to the data than one in which the gene is the response. Further research is required to determine how this asymmetry can be efficiently used to detect causal or lagged co-regulation relationships in large genomic datasets. For our purpose of clustering many genes for empirical analysis using standard algorithms such as hierarchical clustering, we symmetrize the Bayesian LLR2 by setting:
Intuitively, using the maximum over the minimum or the average should improve the power of this metric to measure associations in either direction. In Section 4.1, we use simulated gene expression data to demonstrate that the false positive rate remains controlled under these symmetrization methods.
In practice, it is also common to use similarity measures such as the Bayesian LLR2 to produce a ranked list of gene-gene associations. To aid this procedure, we further propose two additional metrics that one could use in conjunction with the Bayesian LLR2. These metrics are derived from the following two sub-models of the original model (4): (18) (18) (19) (19)
The first sub-model describes the temporal expression of gene A as a function only of the potentially co-regulated gene B. The second sub-model is “autoregressive” in the sense that it describes gene A’s expression only in terms of its own past and linear time trends. We again apply our Bayesian approach to fitting these two sub-models and compute new variants of the Bayesian LLR2 from each, denoted and respectively. The value indicates the amount of variation in gene A’s temporal expression that can be explained by the dynamics of another gene B. indicates the amount of variation in gene A that is explained by its own past, via its time integrals, and linear time trends. We can view as the amount of additional variation in gene A’s temporal dynamics that can be explained by considering gene B, on top of the variation captured by gene A’s own past via sub-model 2. Intuitively, if LLR2 is large, then a large value of suggests that the LLR2 value is unlikely to mean a false positive relationship between the two genes. In Section 4.5, we demonstrate empirically that using and together can help identify gene pairs with highly similar time dynamics.
3.6. Significance of the Bayesian Lead-Lag R2
One may further desire a notion of statistical significance for the Bayesian LLR2. One option is to simulate the posterior distribution of the LLR2 using draws from the posterior distribution of as described in Gelman et al. (Citation2018). Recall from Section 3.2 that the posterior distribution of is normal-inverse gamma with parameters defined respectively in (10), (11), (8), and (9). To draw samples from this posterior distribution, we can first sample from the distribution, and then sample from the distribution. In particular, if we may wish to simulate the posterior distribution of the Bayesian LLR2 under a null hypothesis of no relationship between genes A and B. This can be reflected in the sampling procedure by calculating with set to
Alternatively, one could use the Bayes factors presented in Liang et al. (Citation2008) to corroborate the Bayesian lead-lag R2. Let denote the model (4) of gene expression, which has an intercept and p = 4 “covariates” with coefficients c1 through c4. Let denote the null (intercept-only) model. Then the Bayes factor for comparing to is where R2 is the usual coefficient of determination in (16). Kass and Raftery (Citation1995) interpret a value between 1 and 2 as “strong” evidence in favor of or above 2 as “decisive” evidence.
In Appendix B, we give a summary of our methodology in the form of a generic algorithm that can be run on any time-course gene expression dataset.
3.7. Possible Modifications to Prior Hyperparameters
In Section 3.2, we set the prior mean of the parameters of the ODE model to when i.e. there is prior evidence suggesting genes A and B are associated. To make the method even more data-driven, one could alternatively set in the case, where is chosen adaptively from the data. The following theorem presents the values of ξ and g that simultaneously minimize SURE in (14) in this setting.
Theorem 3.3
(SURE minimization with respect to ξ and g). Assume the entries of the least-squares estimator are all distinct and the expression of gene B is non-zero for at least one time point, i.e. for at least one i. Let in the case that Then the values of ξ and g that minimize SURE in (14) are where is the element-wise sum of the first two columns of X.
The proof of Theorem 3.3 is provided in Appendix A.
4. Results
4.1. Analysis of Simulated Data
We first validate our methodology on a simulated dataset of temporal gene expression trajectories. We generate data from the model discussed in Section 2.1: where we set p(t) = 0 to assume that there is no shared regulatory signal between genes. We also assume that observations are taken in the space, and thus noise is additive to rather than This leaves us with
which is a linear ODE with the solution
We generate 100 gene expression trajectories from this model, using parameters and at 17 time pointsFootnote3 ranging from t = 0 to t = 48. We then add N(0, 1) noise to each point.
We apply our methodology to this simulated dataset by computing the Bayesian LLR2 between each pair of simulated genes. We set the prior adjacency matrix for all gene pairs, i.e. we assume that prior knowledge about each pair’s relationship is unavailable. Owing to the process of generating data with “independent” trajectories in the absence of shared regulatory signals, we expect to see low Bayesian LLR2 values between the vast majority of gene pairs. Observing only a small number of gene pairs with LLR2 values above reasonable thresholds confirms that the false positive rateFootnote4 remains controlled; it is also possible that some gene pairs do indeed exhibit similar temporal patterns through this data generating process, which we will demonstrate in .
In Section 3.5, it is mentioned that the Bayesian LLR2 is symmetrized by taking its value to be where denotes the lead-lag R2 calculated with the gene treated as the response. In , we show the distribution of Bayesian LLR2 values between all simulated gene pairs under three symmetrization methods: the minimum, mean, and maximum of and We indicate where the and percentiles of each distribution fall, and we also mark falls on the plot, as this threshold was used to define network edges between genes in the analysis of real gene expression data that we present in subsequent sections.
No simulated gene pairs have a Bayesian LLR2 above 0.9 under any symmetrization method, and at most 24 pairs (under the maximum symmetrization) have a LLR2 above 0.8. This demonstrates that the false positive rate of the method is controlled in the absence of prior knowledge about the relationships between genes, and when the genes do not share common regulatory signals. Of the gene pairs whose LLR2 values were above the percentile in each distribution, 48% were shared across at least two symmetrization methods. In , we present four examples of groups of simulated genes whose values were above the percentile under all three symmetrization methods; each group of trajectories exhibits visually similar temporal behavior.
4.2. Collecting a Time-Course Gene Expression Dataset for Drosophila melanogaster
The expression of a gene is typically measured via RNA sequencing (RNA-seq) as the count of a particular type of messenger RNA found in a tissue sample. These counts can be normalized to library size and transformed using the limma-voom transformation (Law et al. Citation2014). This transformation produces near-normally distributed gene expression measurements, making them more amenable to analysis with linear models such as those described in Section 2.1.
Our primary time-course gene expression dataset, introduced in Schlamp et al. (Citation2021), profiles the expression dynamics of 12,657 genes in Drosophila melanogaster (fruit fly) in response to an immune challenge. Immune responses were triggered in flies by injecting them with commercial lipopolysaccharide, which contains peptidoglycan (cell wall molecules) derived from the bacterium E. coli. Following injection, the flies were sampled via RNA-seq at 20 time points over five days. The data was normalized by the aforementioned limma-voom transformation and expressed as the -fold change with respect to the first time point, which was sampled pre-injection as a control. We focus on the first 17 time points, ranging from zero to 48 h post-injection, as this is when most of the variation in expression occurs.
Differential expression analysis is typically used to identify genes exhibiting significant expression changes, and thus reduce a set of thousands of genes into a smaller set meriting further study. In Appendix C, we provide further details on how we use such methods to reduce our set of 12,657 genes into a set of 1735 genes of interest. We also describe therein how we define our prior adjacency matrix W, introduced in Section 3.1, using the STRING database.
4.3. Small-Scale Case Study: Immunity and Metabolism
We now validate our methodology on a small set of genes whose behavior exhibits a known interplay between immunity and metabolism. Schlamp et al. (Citation2021) observe that exposure to bacterial peptidoglycan has an effect not only on the time dynamics of immune response, but also on the expression of genes involved in metabolism. In particular, some genes involved in immune response are up-regulated immediately following peptidoglycan injection, while other genes associated with metabolic processes are down-regulated more slowly. Interestingly, the metabolic genes return to their pre-infection levels of expression well before the immune response has subsided. This phenomenon can be observed in , which shows the expression patterns of two immune response genes (IM1, IM2) and four metabolic process genes (FASN1, UGP, mino, fbp).
show the prior adjacency matrix W, the Bayesian LLR2 values, and the non-Bayesian LLR2 values (computed via ordinary least-squares regression) corresponding to these six genes.
Within this set of six genes, indicates that there is prior evidence of association between the immune response genes IM1 and IM2, as well as between the metabolic genes mino and UGP. The off-diagonal “NA” entries in signify that the relationships between the immune and metabolic genes are uncharacterized in the STRING database. However, the Bayesian LLR2 values between the metabolic gene FASN1 and both immune response genes are high, as shown in , indicating that the relationship identified by Schlamp et al. (Citation2021) between these gene groups is automatically uncovered by our proposed Bayesian method. Indeed, shows that the temporal expression pattern of FASN1 resembles a vertically-reflected copy of that of either IM1 or IM2. shows the non-Bayesian LLR2 values for each gene pair and demonstrates that computing the LLR2 without biologically-informed priors yields inflated scores that make it difficult to discern either within- or between-group associations. For further validation of these results, we present in Appendix E. These tables display the values of the metric which we introduced in Section 3.5 as a means of assessing whether associations indicated by the LLR2 alone are false positives.
4.4. Biologically-Meaningful Clustering with the Bayesian
We now apply hierarchical clustering using Ward’s method (Ward Citation1963) to the N × N distance matrix where J is a matrix of ones and is the similarity matrix containing the symmetrized Bayesian LLR2 values between all gene pairs. These genes come from the dataset described in Section 4.2, so N = 1735. Cutting the resulting dendrogram at a height of ten yields 12 clusters, which we display in . A robustness analysis of the within-cluster sum of squares was also performed to confirm that this selection of dendrogram height was appropriate.
From each cluster, we can also construct a network in which an edge is defined between two genes if their corresponding Bayesian LLR2 exceeds a certain threshold. In our analysis, we choose this threshold to be 0.9, which is also the percentile of the empirical distribution of the Bayesian LLR2 for this dataset.
To understand the biological processes that are represented by these clusters, we make use of the Gene Ontology (GO) resource (Ashburner et al. Citation2000; Carbon et al. Citation2021). GO links genes with standardized terms that describe their functions. To determine whether a GO term is significantly enriched within a cluster, i.e. whether the cluster contains significantly more genes associated with that term than expected by chance, we perform a hypergeometric test using the R package ClusterProfiler (Yu et al. Citation2012). We use Benjamini-Hochberg (B–H) corrected p-values below 0.05 (Benjamini and Hochberg Citation1995) from this test to determine “significant” enrichment.
Our analysis shows that with the exception of cluster 8, all clusters are significantly enriched for specific biological functions. Recall that our dataset profiles the dynamics of immune response, which is an energetically costly process that is also associated with metabolic changes (DiAngelo et al. Citation2009). The interplay between immunity and metabolism, which we briefly explored in Section 4.3, is represented particularly well in these clusters. Clusters 1, 4, 6, and 7 are significantly enriched for metabolic processes; cluster 10 is significantly enriched for immune response; and cluster 5 is significantly enriched for both metabolic processes and immune responses. Below, we highlight biologically relevant findings from one cluster, and we discuss three additional clusters in Appendix F. These examples show that clustering with the Bayesian LLR2 allows genes with similar but lagged expression patterns to be grouped together, even in the absence of known prior information. Finally, the Bayesian LLR2 is not influenced by the direction of gene expression changes (i.e. positive or negative changes), making it easier to detect tradeoffs or negative regulatory interactions between genes.
We compare our methodology to other gene clustering techniques in the Appendix. In Appendix D, we show clustering results obtained using the non-Bayesian LLR2 as well as Pearson correlation between genes. In Appendix G, we present clusters derived from weighted correlation network analysis (WGCNA) (Langfelder and Horvath Citation2008). However, we find that these methods do not group together genes with similarly-shaped temporal trajectories as effectively as the Bayesian LLR2.
4.4.1. Analysis of Cluster 10
In cluster 10, which consists of 44 genes, one of the most significantly enriched GO terms is “defense response”. This GO term is supported by 24 genes in the cluster, within which there are two distinct groups that we display in : eight genes that are known to respond to “Imd” signaling and eight genes that respond to “Toll” signaling. The Imd and Toll signaling pathways represent well-studied molecular responses to infection in flies. The Imd pathway is tailored to fight off infections from gram-negative bacteria Kaneko et al. Citation2004; Zaidman-Rémy et al. Citation2011; Hanson and Lemaitre Citation2020, while the Toll pathway fights off infections from gram-positive bacteria and fungi (Gobert et al. Citation2003; Hanson and Lemaitre Citation2020).
Since the flies profiled in this gene expression dataset were injected with peptidoglycan derived from E. coli, a gram-negative bacterium, we expect to see an activation of Imd-regulated genes. Indeed, as seen in , the eight Imd-regulated genes in this cluster immediately underwent strong up-regulation and reached their highest expression one to two hours after peptidoglycan injection. By contrast, Toll-regulated genes underwent a delayed up-regulation of smaller magnitude, and reached their highest expression two to four hours after injection. Overall, the Bayesian LLR2 method successfully grouped together functionally related genes with distinct activation kinetics in this cluster.
In addition to recovering known dynamics of immune response pathways in cluster 10, the Bayesian LLR2 metric identified several new relationships between genes. In this cluster, four genes (CG44404, also known as IBIN; CG43236, also known as Mtk-like; CG43202, also known as BomT1; and CG43920) had no prior information available in the STRING database to link them with Imd-regulated genes. However, these four genes exhibit similar expression patterns to Imd-regulated genes, as seen in , although previous studies examine CG44404/IBIN and CG43202/BomT1 expression downstream of Toll signaling (Clemmons et al. Citation2015; Valanne et al. Citation2019). This suggests that these four genes are not exclusively controlled by Toll signaling, and that they can also respond to Imd signaling after a gram-negative immune challenge. While Imd-regulation of CG43236/Mtk-like and CG43920 has not been experimentally validated, their co-clustering pattern with Imd-regulated genes was also observed by Schlamp et al. (Citation2021).
In , we show a network consisting of the four aforementioned genes (CG44404/IBIN, CG43236/Mtk-like, CG43202/BomT1, CG43920) and their neighbors, i.e. the genes with which they have a Bayesian LLR2 of at least 0.9. Red edges in connect genes that were known to be associated according to prior information, i.e. for these pairs. Blue edges connect genes with an uncharacterized relationship in the STRING database, i.e. for these pairs.
4.4.2. Transcription Factor Targets Cluster Together
As mentioned in Sections 1 and 2, genes can be co-regulated in time by the same transcription factor(s), which are proteins that control gene expression. The ODE model of gene expression we employ in EquationEquations (1)(1) (1) and Equation(2)(2) (2) assumes that two genes that are co-regulated by a common transcription factor will have similar time dynamics; in particular, the rates of change in the expressions of both genes will be governed by the same regulatory signal p(t). In light of this model, we expect that co-regulated genes ought to cluster together, i.e. that the target genes of a transcription factor would not be dispersed across many different clusters. displays a heatmap showing the proportion of each transcription factor’s targets appearing in each of the clusters identified by our Bayesian LLR2 technique, and demonstrates that target genes do indeed appear in the same cluster for many transcription factors. Targets of transcription factors in each cluster were identified using the RcisTarget R package (Aibar et al. Citation2017).
4.5. The Bayesian LLR2 Produces a Sparse Set of Associations
The lead-lag R2 can be computed with biological information via our proposed Bayesian methodology, or without such information via ordinary least squares regression. In Section 2.2, we discuss the challenge of observing many spuriously high lead-lag R2 values under the least squares approach, which does not incorporate existing knowledge of the biological relationship between genes. In , we show the empirical distributions of the non-Bayesian and Bayesian LLR2 values in our Drosophila gene expression dataset. The non-Bayesian LLR2 distribution is heavily left-skewed, which complicates the definition of suitable LLR2 thresholds above which two genes can be considered to be associated. The Bayesian approach effectively “regularizes” the LLR2 values, yielding a distribution that reflects the reality of only a relatively small set of noteworthy pairwise associations existing within the dataset.
In particular, the distribution of the Bayesian LLR2 is conducive to identifying pairs or groups of genes with highly similar time dynamics. In Section 3.5, we introduced the metrics and As described therein, the former indicates how much variation in gene A’s expression can be explained only through the dynamics of another gene B. The latter indicates how much additional variation in gene A can be explained by considering gene B, on top of considering gene A’s own past trajectory. Intuitively, both of these quantities should be large if the two genes are indeed associated in a way that manifests in highly similar temporal dynamics.
In , we randomly select 150 genes and place all 1,175 pairs on two scatterplots whose horizontal axes display the values and vertical axes display the values. These R2 values are computed via our Bayesian method in one scatterplot and via ordinary least squares regression in the other. Gene pairs of particular interest fall into the upper right quadrant of the scatterplot.
shows that when we use the ordinary least-squares approach, i.e. without incorporating external biological information, we obtain an overwhelmingly large number of gene pairs with high scores. Many are those that are unlikely to be associated, according to our chosen sources of prior information. By contrast, the Bayesian approach leverages this information to shift the distribution of the R2 values noticeably, yielding a smaller set of gene pairs that are worth examining further. This distributional shift is due to both the estimator in (10) for the coefficients in the model (4), as well as the way in which the parameter g is set in (15). In particular, g controls how much is influenced by either the data or prior information.
Importantly, shows that gene pairs with previously uncharacterized relationships but highly similar time dynamics are more easily identified with the Bayesian LLR2. A few of these gene pairs, which fall in the fairly sparse upper-right region of the left-hand scatterplot, are shown in in Appendix D. shows gene pairs in the same region of the scatterplot with well-known relationships, further demonstrating that the proposed method successfully recovers familiar associations.
5. Discussion
Time-course gene expression datasets are a valuable resource for understanding the complex dynamics of interconnected biological systems. An important statistical task in the analysis of such data is to identify clusters or networks of genes that exhibit similar temporal expression patterns. These patterns yield systems-level insight into the roles of biological pathways and processes such as disease progression and recovery.
The main statistical challenges in detecting associations within time-course gene expression datasets stem from high dimensionality and small sample sizes, combined with the nonlinearity of gene expression time dynamics. To overcome these difficulties, we proposed a method for identifying potentially associated genes that treats temporal gene expression as a dynamical system governed by ODEs, whose parameters are determined in a Bayesian way using gene networks curated a priori from biological databases. The ODE model is fit to a pair of genes via Bayesian regression and is used to derive the biologically-informed Bayesian lead-lag R2 similarity measure. The Bayesian regression procedure leverages Zellner’s g-prior and ideas from shrinkage estimation, namely minimization of Stein’s unbiased risk estimate (SURE), to balance the posterior ODE model’s fit to the data and to prior knowledge of the relationship between the genes. As a result, we automatically encourage gene pairs with known associations to receive higher Bayesian lead-lag R2 scores, while reducing the scores of gene pairs that are unlikely to be related and allowing new relationships to be discovered. Future extensions of this work could adapt this procedure to groups of genes by defining a similar m-dimensional system of ODEs that share a common regulatory signal. Dimensionality reduction or regularization may be required to fit the resulting linear model in this setting.
In Section 4, we analyzed clusters and networks of genes that were identified by our method as having similar temporal dynamics. In particular, the clusters highlighted the known interplay between immune response and metabolism, and suggested roles for uncharacterized genes displaying remarkably similar temporal patterns to more well-studied ones. We contrasted our results to those obtained by using only the ordinary least-squares version of the lead-lag R2 and demonstrated how the inclusion of prior biological information greatly aids the identification of biologically relevant gene groups.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Additional information
Funding
Notes
1 Suppose p(t) = 0. Then we have which has the solution for some As we have (a non-zero steady state, if ).
2 In practice, Farina et al. (Citation2008) employ piecewise cubic Hermite polynomials to interpolate as well as add additional samples to each time interval. The same authors use the trapezoidal rule for numerical integration. Our software employs default R methods for interpolation and integration, which are natural cubic splines (via the splinefun function) and Gauss-Kronrod quadrature (via the integrate function) respectively.
3 We use the same time points that are used in our analysis of real gene expression data beginning in Section 4.2. These points are 0, 1, 2, 4, 5, 6, 8, 10, 12, 14, 16, 20, 24, 30, 36, 42, 48.
4 As mentioned in Section 2.2, we use the term “false positives” in this context to refer to gene pairs whose lead-lag R2 value is spuriously high but that have neither similar expression patterns nor a biological relationship.
References
- Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, et al. 2017. Scenic: single-cell regulatory network inference and clustering. Nat Methods. 14(11):1083–1086.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene ontology: tool for the unification of biology. Nat Genet. 25(1):25–29.
- Bansal M, Gatta GD, di Bernardo D. 2006. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 22(7):815–822.
- Bar-Joseph Z, Gitter A, Simon I. 2012. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet. 13(8):552–564.
- Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Statist Soc. 57(1):289–300.
- Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, Basu S, Chisholm RL, Dodson RJ, Hartline E, et al. 2021. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 49(D1):D325–D334.
- Chen T, He HL, Church GM. 1999. Modeling gene expression with differential equations. In: Altman RB, Lauderdale K, Dunker AK, Hunter L, Klein TE, eds. Biocomputing’99. Proceedings of the Pacific Symposium; 1999 January 4–9; Mauna Lani, Hawaii. Singapore: World Scientific, p. 29–40.
- Clemmons AW, Lindsay SA, Wasserman SA. 2015. An effector peptide family required for Drosophila toll-mediated immunity. PLOS Pathog. 11(4):e1004876.
- Cui W, George EI. 2008. Empirical Bayes vs. fully Bayes variable selection. J Statist Plan Infer. 138(4):888–900.
- Cyran SA, Buchsbaum AM, Reddy KL, Lin M-C, Glossop NR, Hardin PE, Young MW, Storti RV, Blau J. 2003. vrille, Pdp1, and dClock form a second feedback loop in the Drosophila circadian clock. Cell. 112(3):329–341.
- De Jong H. 2002. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 9(1):67–103.
- D’haeseleer P, Wen X, Fuhrman S, Somogyi R. 1999. Linear modeling of mRNA expression levels during CNS development and injury. In: Altman RB, Lauderdale K, Dunker AK, Hunter L, Klein TE, eds. Biocomputing’99. Proceedings of the Pacific Symposium; 1999 January 4–9; Mauna Lani, Hawaii. Singapore: World Scientific, p. 41–52.
- DiAngelo JR, Bland ML, Bambina S, Cherry S, Birnbaum MJ. 2009. The immune response attenuates growth and nutrient storage in Drosophila by reducing insulin signaling. Proc Natl Acad Sci USA. 106(49):20853–20858.
- Efron B. 2004. The estimation of prediction error: covariance penalties and cross-validation. J Am Stat Assoc. 99(467):619–632.
- Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 95(25):14863–14868.
- Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, Haw R, Jassal B, Korninger F, May B, et al. 2018. The Reactome pathway knowledgebase. Nucleic Acids Res. 46(D1):D649–D655.
- Farina L, De Santis A, Morelli G, Ruberti I. 2007. Dynamic measure of gene co-regulation. IET Syst Biol. 1(1):10–17.
- Farina L, De Santis A, Salvucci S, Morelli G, Ruberti I. 2008. Embedding mRNA stability in correlation analysis of time-series gene expression data. PLOS Comput Biol. 4(8):e1000141.
- Fourdrinier D, Strawderman WE, Wells MT. 2018. Shrinkage estimation. Switzerland: Springer.
- Gaudet P, Livstone MS, Lewis SE, Thomas PD. 2011. Phylogenetic-based propagation of functional annotations within the gene ontology consortium. Brief Bioinform. 12(5):449–462.
- Gelman A, Goodrich B, Gabry J, Vethari A. 2018. R-squared for Bayesian regression models. American Statistician. 73: 307–309.
- George EI, Foster DP. 2000. Calibration and empirical Bayes variable selection. Biometrika. 87(4):731–747.
- Giles D, Rayner A. 1979. The mean squared errors of the maximum likelihood and natural-conjugate Bayes regression estimators. J Economet. 11(2–3):319–334.
- Gobert V, Gottar M, Matskevich AA, Rutschmann S, Royet J, Belvin M, Hoffmann JA, Ferrandon D. 2003. Dual activation of the Drosophila toll pathway by two pattern recognition receptors. Science. 302(5653):2126–2130.
- Gonzalez EA, Garg A, Tang J, Nazario-Toole AE, Wu LP. 2013. A glutamate-dependent redox system in blood cells is integral for phagocytosis in Drosophila melanogaster. Curr Biol. 23(22):2319–2324.
- Goto A, Kadowaki T, Kitagawa Y. 2003. Drosophila hemolectin gene is expressed in embryonic and larval hemocytes and its knock down causes bleeding defects. Dev Biol. 264(2):582–591.
- Hanson MA, Lemaitre B. 2020. New insights on Drosophila antimicrobial peptide function in host defense and beyond. Curr Opin Immunol. 62:22–30.
- Hill SM, Neve RM, Bayani N, Kuo WL, Ziyad S, Spellman PT, Gray JW, Mukherjee S. 2012. Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology. BMC Bioinf. 13(1):1–15.
- Kanehisa M, Goto S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1):27–30.
- Kaneko T, Goldman WE, Mellroth P, Steiner H, Fukase K, Kusumoto S, Harley W, Fox A, Golenbock D, Silverman N. 2004. Monomeric and polymeric gram-negative peptidoglycan but not purified LPS stimulate the Drosophila IMD pathway. Immunity. 20(5):637–649.
- Karp PD, Billington R, Caspi R, Fulcher CA, Latendresse M, Kothari A, Keseler IM, Krummenacker M, Midford PE, Ong Q, et al. 2019. The BioCyc collection of microbial genomes and metabolic pathways. Brief Bioinform. 20(4):1085–1093.
- Kass RE, Raftery AE. 1995. Bayes factors. J Am Stat Assoc. 90(430):773–795.
- Krejčová G, Danielová A, Nedbalová P, Kazek M, Strych L, Chawla G, Tennessen JM, Lieskovská J, Jindra M, Doležal T, et al. 2019. Drosophila macrophages switch to aerobic glycolysis to mount effective antibacterial defense. Elife. 8:e50414.
- Langfelder P, Horvath S. 2008. Wgcna: an r package for weighted correlation network analysis. BMC Bioinf. 9(1):1–13.
- Larkin A, Marygold SJ, Antonazzo G, Attrill H, Dos Santos G, Garapati PV, Goodman JL, Gramates LS, Millburn G, Strelets VB, et al. 2021. FlyBase: updates to the Drosophila melanogaster knowledge base. Nucleic Acids Res. 49(D1):D899–D907.
- Law CW, Chen Y, Shi W, Smyth GK. 2014. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15(2):R29.
- Li F, Zhang NR. 2010. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J Am Stat Assoc. 105(491):1202–1214.
- Li Y, Jackson SA. 2015. Gene network reconstruction by integration of prior biological knowledge. G3. 5(6):1075–1079.
- Liang F, Paulo R, Molina G, Clyde MA, Berger JO. 2008. Mixtures of g priors for Bayesian variable selection. J Am Stat Assoc. 103(481):410–423.
- Lo K, Raftery AE, Dombek KM, Zhu J, Schadt EE, Bumgarner RE, Yeung KY. 2012. Integrating external biological knowledge in the construction of regulatory networks from time-series expression data. BMC Syst Biol. 6(1):17520509.
- Nepomuceno JA, Troncoso A, Nepomuceno-Chamorro IA, Aguilar-Ruiz JS. 2015. Integrating biological knowledge based on functional annotations for biclustering of gene expression data. Comput Methods Programs Biomed. 119(3):163–180.
- Peng B, Zhu D, Ander BP, Zhang X, Xue F, Sharp FR, Yang X. 2013. An integrative framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLOS One. 8(7):e67672.
- Polynikis A, Hogan S, di Bernardo M. 2009. Comparing different ODE modelling approaches for gene regulatory networks. J Theor Biol. 261(4):511–530.
- Purvis JE, Lahav G. 2013. Encoding and decoding cellular information through signaling dynamics. Cell. 152(5):945–956.
- Qian J, Dolled-Filhart M, Lin J, Yu H, Gerstein M. 2001. Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol. 314(5):1053–1066.
- W. Research. 2019. Mathematica, version 12.0. Champaign, IL: Wolfram Research, Inc.
- Schlamp F, Delbare SY, Early AM, Wells MT, Basu S, Clark AG. 2021. Dense time-course gene expression profiling of the Drosophila melanogaster innate immune response. BMC Genomics. 22(1):1–22.
- Stingo FC, Chen YA, Tadesse MG, Vannucci M. 2011. Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 5(3):1978–2002.
- Sun J, Liu C, Bai X, Li X, Li J, Zhang Z, Zhang Y, Guo J, Li Y. 2017. Drosophila FIT is a protein-specific satiety hormone essential for feeding control. Nat Commun. 8(1):1–13.
- Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, et al. 2019. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47(D1):D607–D613.
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. 1999. Systematic determination of genetic network architecture. Nat Genet. 22(3):281–285.
- Valanne S, Salminen TS, Järvelä-Stölting M, Vesala L, Rämet M. 2019. Immune-inducible non-coding RNA molecule lincRNA-IBIN connects immunity and metabolism in Drosophila melanogaster. PLOS Pathog. 15(1):e1007504.
- Wang X, Wang T, Jiao Y, von Lintig J, Montell C. 2010. Requirement for an enzymatic visual cycle in Drosophila. Curr Biol. 20(2):93–102.
- Ward JH. Jr. 1963. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 58(301):236–244.
- Wu L, Qiu X, Xiang Yuan Y, Wu H. 2019. Parameter estimation and variable selection for big systems of linear ordinary differential equations: a matrix-based approach. J Am Stat Assoc. 114(526):657–667.
- Yosef N, Regev A. 2011. Impulse control: temporal dynamics in gene transcription. Cell. 144(6):886–896.
- Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. 2010. GoSemSim: an R package for measuring semantic similarity among go terms and gene products. Bioinformatics. 26(7):976–978.
- Yu G, Wang L-G, Han Y, He Q-Y. 2012. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 16(5):284–287.
- Zaidman-Rémy A, Poidevin M, Hervé M, Welchman DP, Paredes JC, Fahlander C, Steiner H, Mengin-Lecreulx D, Lemaitre B. 2011. Drosophila immunity: analysis of PGRP-SB1 expression, enzymatic activity and function. PLOS One. 6(2):e17231.
- Zellner A. 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel P, Zellner A, eds. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. New York, NY: Elsevier. p. 233–243.
- Zellner A, Siow A. 1980. Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa. 31(1):585–603.
- Zhang B, Horvath S. 2005. A general framework for weighted gene co-expression network analysis. Statist Appl Genet Mol Biol. 4(1): Article 17.
- Zhang B, Huang Y, McDermott JE, Posey RH, Xu H, Zhao Z. 2013. Molecular pathway identification using biological network-regularized logistic models. BMC Genomics. 14(Suppl 8):S1.
Appendix A.
Proofs
In this section, we provide proofs of Theorems 3.2 and 3.3.
Proof of Theorem 3.2.
We write as treating Y as fixed. Expanding the expression for we obtain (A1) (A1)
Next, observe that because (A2) (A2)
Furthermore, observe that because (A3) (A3) where the last equality follows from the fact that i.e. is in the column span of X, which is the space onto which H projects. The identities (A2) and (A3) can now be used to write in (A1) as where and Differentiating with respect to g, we obtain
Setting this derivative to zero and rearranging yields:
We now substitute the definitions of b, c, and d back into this expression to obtain (A4) (A4)
The numerator of (A4) can be simplified by observing that
Therefore, (A4) becomes (A5) (A5)
When we have In this case, (A5) becomes
Finally, the second derivative of evaluated at in (A5) is equal to
which is positive, thus confirming that is indeed minimized at This second derivative calculation was verified in Mathematica.
Proof of Theorem 3.3.
We write as treating Y as fixed. First, if then where is the element-wise sum of the first two columns of X. We now proceed similarly to the proof of Theorem 3.2 by expanding (A6) (A6)
Next, observe that because (A7) (A7) where the last equality follows from the fact that is in the column span of X, which is the space onto which projects. The identities (A2) and (A7) can now be used to write in (A6) as where and Differentiating with respect to ξ, we obtain
Setting this derivative to zero and rearranging yields:
We now substitute the definitions of b and d back into this expression to obtain (A8) (A8)
Next, we differentiate with respect to g:
Setting this derivative to zero and rearranging yields:
We now substitute the definitions of b, c, and d back into this expression to obtain
Substituting ξ in (A8) into this expression for g yields (A9) (A9)
Finally, the determinant of the Hessian matrix of denoted evaluated at in (A9) and in (A8) is (A10) (A10)
We must now verify that i.e. that is indeed an extremum of δ0. First, we observe that (A11) (A11) by direct application of the Cauchy-Schwarz inequality (recall that from (A7)). The inequality is strict because of the assumption that the entries of are distinct, which prevents and from being linearly dependent. Thus, (A11) implies that the denominator in (A10) is strictly negative. The numerator is also strictly negative, because the assumption that gene B is not zero everywhere ensures that (recall that X is defined in (5), and its first two columns consist of gene B’s expression measurements over time and its time integrals). Therefore, (A10) is strictly positive. To verify that is indeed a minimizer of δ0, we now check that as well. We have:
which is strictly positive. This second derivative calculation, as well as those in (A10), were verified in Mathematica.
Appendix B.
Bayesian lead-lag R2 algorithm
Following is an algorithm for computing the Bayesian lead-lag R2 for all gene pairs in a time-course gene expression dataset, consisting of N genes measured at n time points
Algorithm 1:
Bayesian lead-lag R2 calculations for all gene pairs
Note that this algorithm produces an upper-triangular similarity matrix S resulting from regressing gene i on gene j, for all i < j, and storing the resulting Bayesian value. In the empirical analysis presented in this paper, we set
To instead compute the Bayesian from sub-model 1 in (18), it suffices to change the definition of X to and to define as if and otherwise. To compute the Bayesian from sub-model 2 in (19), we change X to and regardless of the value of An optimized version of this algorithm runs in 21.8 min for N = 1735 genes on a 2017 3.1 GHz Intel Core i5 MacBook Pro.
Appendix C.
Additional dataset details
In this section, we provide further details on how our primary time-course gene expression dataset was constructed.
We first reduce our set of 12,657 genes into a set of 951 “differentially-expressed” (DE) genes identified in Schlamp et al. (Citation2021), defined as genes satisfying either of the following criteria: 1) there is at least one time point at which the gene’s expression undergoes a -fold change of at least two, or 2) a spline-based method for differential expression analysis returns a significant result. The latter method involves fitting a cubic spline to the temporal expression measurements of a gene under treatment and control settings, and testing whether the difference in the resulting two sets of coefficients is significant. We then add back a set of 784 genes that are not DE by these criteria, but that are “neighbors” of at least one DE gene. We define a neighbor of a DE gene as a non-DE gene that has a STRING score of at least 0.95 with the DE gene. The purpose of adding such neighbors back into the dataset, now consisting of genes, is to enable more complete biological pathways to be reconstructed from our cluster analysis.
In Section 3.1, we describe several sources of biological information that can be encoded into a prior adjacency matrix W. We choose to use the STRING database, and we mark two genes as “associated” (i.e. if their STRING score is greater than 0.5. We additionally use replicate information from the time-course dataset to fill in some of the unknown STRING scores. Specifically, if two genes have entries in the STRING database but have an unknown STRING score, we set if the correlation between their replicated temporal expressions is greater than 0.8. We keep for the gene pairs that do not have entries in STRING.
Appendix D.
Additional figures
Appendix E.
Additional tables
Appendix F.
Additional results
We continue our analysis in Section 4.4 of the results of hierarchical clustering with the Bayesian lead-lag R2 (LLR2).
F.1 Analysis of Cluster 2
Cluster 2 contains both up- and down-regulated genes with circadian rhythms, according to GO term enrichment. Several of these genes are displayed in . Among the up-regulated genes are three regulators of the circadian clock (per, vri, Pdp1). A fourth regulator of the circadian clock, Clk, is down-regulated. Pdp1 has been reported to reach its peak expression three to six hours after vri’s peak expression (Cyran et al. Citation2003), a pattern that is visible in this cluster. Cluster 2 further contains genes that are involved in visual perception: two genes encoding rhodopsins (Rh5, Rh6) (Gaudet et al. Citation2011) and Pdh, which encodes a retinal pigment dehydrogenase (Wang et al. Citation2010). Similar to Schlamp et al. (Citation2021), we also found that Smvt and salt, which encode sodium transporters (Gaudet et al. Citation2011), are under circadian control.
F.2 Analysis of Cluster 4
Genes in cluster 4, some of which are displayed in , are characterized by a transient decrease in expression during the first 24 h after peptidoglycan injection. This cluster was significantly enriched for “carbohydrate metabolic process” (B-H corrected p-value of ). A highly-connected gene involved in carbohydrate metabolism is fbp, which encodes the enzyme fructose-1,6-biphosphatase and has a degree of 102 in our reconstructed network shown in .
All of the genes to which fbp is connected had NA values in our prior adjacency matrix, meaning that their relationship to fbp is unknown according to our chosen sources of information. Some of these genes include Gale, AGBE, and Gba1b, which have known roles in carbohydrate metabolism. Twenty-two other genes connected to fbp in our network are not well-studied. These connections suggest roles for these uncharacterized genes in carbohydrate metabolism or energy homeostasis. Another gene connected to fbp is fit, which has a similar expression profile as fbp but experiences a much stronger and sharper down- and up-regulation. fit is not directly involved in carbohydrate processing but encodes a secreted protein that stimulates insulin signaling, which in turn regulates the expression of genes involved in carbohydrate metabolism, such as fbp (Sun et al. Citation2017). A previous study also showed that an immune response reduces insulin signaling in Drosphila (DiAngelo et al. Citation2009).
F.3 Analysis of Cluster 6
Cluster 6 is significantly enriched for GO terms related to metabolic processes, particularly the terms “cellular lipid catabolic process” and “carbohydrate metabolic process” (B–H corrected p-values of for both). In addition to these metabolic GO terms, there is significant enrichment of genes involved in “phagocytosis” (B–H corrected p-value of 0.03). During an immune response, phagocytosis is the process by which an immune cell engulfs and digests bacteria and apoptotic cells as a way to fight the infection. In Drosophila, phagocytosis is carried out by specialized hemocytes (Drosophila blood cells).
Among the genes in cluster 6 involved in carbohydrate metabolism are five mannosidases (LManI, LManIII, LManIV, LManV, LManVI) and three maltases (Mal-A2, Mal-A3, Mal-A4). These genes encode enzymes that break down complex sugars into simple sugars like glucose. Six genes in cluster 6 are expressed in hemocytes and are involved in phagocytosis. These include four genes that belong to the Nimrod gene family (NimB4, NimC1, NimC2, eater); Hml, which is involved in the clotting reaction in larvae (Goto et al. Citation2003); and Gs1. Gs1 encodes a glutamine synthetase, an enzyme whose action is not unique to hemocytes, but that has been shown to support hemocyte function (Gonzalez et al. Citation2013).
shows that the genes involved in metabolic processes and in phagocytosis exhibit similar expression patterns, with coordinated up- and down-regulation. This coordinated expression is sensible in the context of known hemocyte biology. After an infection, hemocytes undergo a metabolic switch, whereby their energy production is sustained mostly by aerobic glycolysis rather than oxidative phosphorylation (Krejčová et al. Citation2019). Since aerobic glycolysis is dependent on glucose, the simultaneous up-regulation of glucose-producing enzymes and genes needed for phagocytosis is aligned with our expectations. It is also worth noting that the fold changes of genes involved in phagocytosis are generally small, e.g. less than two-fold up-regulation, which is often used as a minimal cutoff in RNA-seq analyses. However, the coordinated expression changes detected by the Bayesian LLR2 suggest that these are biologically relevant patterns.
Appendix G.
Comparison to WGCNA
We now compare our methodology to the well-known weighted correlation network analysis (WGCNA) (Langfelder and Horvath Citation2008), which is often used to identify co-expression networks from gene expression data. To construct such a network, WGCNA uses pairwise correlations between genes to define a weighted adjacency matrix as where is the expression profile of the gene and is a soft thresholding power; this β can be chosen according to a scale-free topology criterion proposed by Zhang and Horvath (Citation2005). Modules (clusters of densely connected genes) are then identified by a number of possible methods, the default of which is hierarchical clustering applied to The biological significance of each module can be assessed by functional enrichment analysis.
We apply WGCNA to our Drosophila melanogaster gene expression dataset, choosing a soft thresholding parameter β = 9. This results in 20 clusters ranging from 22 to 333 genes, shown in .
shows that many of the WGCNA modules do not have an easily identifiable pattern of gene expression. Several modules consist predominantly of genes with mostly flat trajectories, while shows that the Bayesian LLR2 largely grouped such genes into at most two clusters (clusters 1 and 5). The genes that the Bayesian LLR2 originally placed into the visually distinctive clusters 2 (circadian rhythm genes), 10 (Imd- and Toll-regulated immune response genes) and 12 (ribosomal proteins) also appear to have been split across multiple WGCNA modules. Ultimately, while WGCNA is a helpful resource for co-expression analysis, it does not account for temporal dependence in gene expression measurements. The Bayesian LLR2 is more suitable for such data because it is derived from a temporal model of gene expression and leverages prior information about pairwise relationships.