605
Views
0
CrossRef citations to date
0
Altmetric
Discussion

An Empirical Bayes Approach to Estimating Dynamic Models of Co-Regulated Gene Expression

ORCID Icon, ORCID Icon, ORCID Icon, ORCID Icon, & ORCID Icon
Article: 2219707 | Received 21 Jul 2022, Accepted 08 May 2023, Published online: 21 Aug 2023

Abstract

Time-course gene expression datasets provide insight into the dynamics of complex biological processes, such as immune response and organ development. It is of interest to identify genes with similar temporal expression patterns because such genes are often biologically related. However, this task is challenging due to the high dimensionality of these datasets and the nonlinearity of gene expression time dynamics. We propose an empirical Bayes approach to estimating ordinary differential equation (ODE) models of gene expression, from which we derive a similarity metric between genes called the Bayesian lead-lag R2 (LLR2). Importantly, the calculation of the LLR2 leverages biological databases that document known interactions amongst genes; this information is automatically used to define informative prior distributions on the ODE model’s parameters. As a result, the LLR2 is a biologically-informed metric that can be used to identify clusters or networks of functionally-related genes with co-moving or time-delayed expression patterns. We then derive data-driven shrinkage parameters from Stein’s unbiased risk estimate that optimally balance the ODE model’s fit to both data and external biological information. Using real gene expression data, we demonstrate that our methodology allows us to recover interpretable gene clusters and sparse networks. These results reveal new insights about the dynamics of biological systems.

1. Introduction

Time-course gene expression datasets are an essential resource for querying the dynamics of complex biological processes, such as immune response, disease progression, and organ development (Bar-Joseph et al. Citation2012; Purvis and Lahav Citation2013; Yosef and Regev Citation2011). Such datasets, now abundantly available through techniques such as whole-genome RNA sequencing, consist of gene expression measurements for thousands of an organism’s genes at a few (typically 5-20) time points. Experimental evidence has revealed that groups of genes exhibiting similar temporal expression patterns are often biologically associated (Eisen et al. Citation1998). For instance, such genes may be co-regulated by the same transcription factors (Tavazoie et al. Citation1999): proteins that directly control gene expression, which ultimately contributes to changes in cellular function. Identifying clusters or networks of genes with related temporal dynamics, which is our objective in this study, can therefore uncover the regulators of dynamic biological processes. Doing so can also help generate testable hypotheses about the roles of orphan genes that exhibit similar expression patterns to ones that are better understood.

The complex, nonlinear time dynamics of gene expression pose a significant challenge for clustering and network analysis in genomics. Groups of interacting genes may be expressed with time lags or inverted patterns (Qian et al. Citation2001) due to delayed activation of underlying transcription factors, making it difficult to measure the “similarity” in two expression profiles. Ordinary differential equations (ODEs) or discrete-time difference equations have been successfully used to model the nonlinear expressions of a small number of genes (Chen et al. Citation1999; Bansal et al. Citation2006; D’haeseleer et al. Citation1999; De Jong Citation2002; Polynikis et al. Citation2009). It is possible to derive similarity metrics for the time dynamics of two genes from such ODEs, thus enabling putative identification of co-regulated genes and the reconstruction of regulatory networks (Farina et al. Citation2007, Citation2008; Wu et al. Citation2019). In particular, the approach proposed in Farina et al. (Citation2007) allows explicit modeling of lead-lag as well as contemporaneous associations between gene expression trajectories. We hence use it as the basis of the similarity calculations in our proposed clustering framework.

The high dimensionality (number of genes) and small sample sizes (number of time points) of time-course gene expression datasets pose another obstacle to identifying genes with similar expression dynamics. Due to the size of these datasets, the number of gene pairs receiving high similarity scores by any method can be overwhelmingly large. High similarity scores are typically validated for biological relevance using annotations provided by extensive public and commercial curated databases that assign genes to functional groups. For instance, Gene Ontology (GO) annotations are keywords that describe a gene’s molecular function, role in a biological process, or cellular localization (Ashburner et al. Citation2000). Other curated databases include KEGG (Kanehisa and Goto Citation2000), Reactome (Fabregat et al. Citation2018), BioCyc (Karp et al. Citation2019), and STRING (Szklarczyk et al. Citation2019). To ease the burden of manually validating a potentially vast number of gene-gene associations, we propose a Bayesian clustering technique that uses annotations as prior information to automatically validate these associations. Incorporating such information into a clustering method can encourage gene pairs with known biological associations to receive higher similarity scores, while filtering away those known to be unrelated. This also allows for knowledge gleaned from gene expression time series data to be contrasted with other knowledge bases; for instance, two genes with highly similar temporal expression patterns may not have been considered associated in previous cross-sectional (single time point) studies on which annotations are based, or vice versa.

There exist in the literature a few approaches to integrating biological knowledge with statistical measures of genetic association. One line of research considers Bayesian methods that use external data sources to determine prior distributions over genes or proteins that influence a biological response (Hill et al. Citation2012; Li and Zhang Citation2010; Lo et al. Citation2012; Peng et al. Citation2013; Stingo et al. Citation2011). Other studies develop biologically-informed regularization terms in graph-regularized methods for reconstructing gene networks (Li and Jackson Citation2015; Zhang et al. Citation2013). In another work, Nepomuceno et al. (Citation2015) propose an algorithm for biclustering gene expression data using gene ontology annotations. However, less attention has been given to using both data and prior biological knowledge to identify and model dominant patterns in the complex temporal dynamics of gene expression, e.g. with ODEs.

Our technical contribution in this work is a Bayesian method for constructing biologically-meaningful clusters and networks of genes from time-course expression data, using a new similarity measure between two genes called the Bayesian lead-lag R2 (LLR2). The Bayesian LLR2 is derived from ODE models of temporal gene expression, and is based on associations in both the time-course data and prior biological annotations. The balance between data and prior information is controlled by data-driven hyperparameters, making our approach an empirical Bayes method. As indicated by the name, the Bayesian LLR2 is based on the familiar R2 statistic (the coefficient of determination) and is simple and fast to compute for all (N2) gene pairs, where N is the number of genes under study. Importantly, external biological information regularizes the set of significant gene-gene associations found within a time-course dataset. In , for instance, we present a network of 1735 genes in Drosophila melanogaster (fruit fly) constructed both without external information, using an ordinary least-squares version of the LLR2 proposed by Farina et al. (Citation2008), and with external information, using our proposed Bayesian LLR2; the latter is a noticeably sparsified revision of the former, and retains only edges connecting genes with either known or highly plausible biological relationships.

Figure 1. Networks of 1735 genes profiled in a time-course gene expression dataset collected by Schlamp et al. (Citation2021). Vertices represent genes and edges connect two genes if their lead-lag R2 exceeds 0.9. Left: lead-lag R2 is computed using ordinary least squares regression according to Farina et al. (Citation2008), without any external biological information. All 1735 genes form a single connected component (599,896 edges). Right: lead-lag R2 is computed using our proposed Bayesian approach, which leverages external sources of biological information about gene-gene relationships. Red edges (11,380 edges) connect genes known to be associated. Blue edges (2830 edges) connect genes whose relationship is unknown but is supported by the data.

Figure 1. Networks of 1735 genes profiled in a time-course gene expression dataset collected by Schlamp et al. (Citation2021). Vertices represent genes and edges connect two genes if their lead-lag R2 exceeds 0.9. Left: lead-lag R2 is computed using ordinary least squares regression according to Farina et al. (Citation2008), without any external biological information. All 1735 genes form a single connected component (599,896 edges). Right: lead-lag R2 is computed using our proposed Bayesian approach, which leverages external sources of biological information about gene-gene relationships. Red edges (11,380 edges) connect genes known to be associated. Blue edges (2830 edges) connect genes whose relationship is unknown but is supported by the data.

The remainder of this paper is organized as follows. Section 2 describes the ODE model of temporal gene expression that we adopt. Section 3 details our empirical Bayes method for fitting the ODE model and obtaining our proposed LLR2 metric. Section 4 demonstrates the application of our method to real gene expression data collected by Schlamp et al. (Citation2021). We recover a tradeoff between immune response and metabolism that has been observed in several studies and present examples of biologically-meaningful gene clusters identified with the Bayesian LLR2. We also discuss the method’s potential to conjecture new data-driven hypotheses about gene-gene interactions. Sections 4.3 and Appendices D and E compare the Bayesian LLR2 to its non-Bayesian counterpart as well as the commonly used Pearson correlation between genes. Proofs and additional examples are provided in the Appendix.

2. Dynamic Models of Gene Expression

2.1. Model Derivation

We consider an ODE model of gene expression proposed by Farina et al. (Citation2007). Let mA(t) denote the expression of some gene A at time t, measured for instance as the log2-fold change in mRNA levels relative to time 0. The model assumes the rate of change in gene A’s expression is given by some regulatory signal p(t): dmA(t)dt=p(t)κAmA(t)+ηA, where κA denotes the mRNA decay rate for gene A, and ηA denotes natural and experimental noise. This model can be naturally extended to consider two genes A and B that might be associated with one another, i.e. that are governed by the same underlying p(t), yielding a pair of coupled differential equations: (1) dmA(t)dt=αAp(t)+βAκAmA(t)+ηA,(1) (2) dmB(t)dt=αBp(t)+βBκBmB(t)+ηB.(2)

The common signal p(t) accounts for the effect of one or more transcription factors that potentially regulate both genes A and B. The coefficients αA and αB measure the strength of p(t) in the expression patterns of genes A and B, respectively. βA and βB are affine coefficients that allow mA(t) and mB(t) to reach a non-zero steady state when p(t) = 0, i.e. in the absence of a regulatory signal.Footnote1

We now obtain a model of gene A’s expression in terms of gene B’s expression by first rearranging (2) to isolate p(t): (3) p(t)=1αB(dmB(t)dtβB+κBmB(t)ηB).(3)

Substituting (3) into (1) yields dmA(t)dt=αAαBdmB(t)dt+αAκBαBmB(t)κAmA(t)+βAαAβBαB+ηAαAηBαB.

Integrating from 0 to t, we obtain: (4) mA(t)=αAαBmB(t)+αAκBαB0tmB(s)dsκA0tmA(s)ds+(βAαAβBαB)t+0t(ηAαAηBαB)ds+c=c1mB(t)+c20tmB(s)ds+c30tmA(s)ds+c4t+c5+ε,(4) where c1=αAαB, c2=αAκBαB, c3=κA, c4=βAαAβBαB,c5=c, and ε=0t(ηAαAηBαB)d​s.

Observe that (4) is linear in the parameters ck. Thus, given measurements {mA(t1),,mA(tn)} and {mB(t1),,mB(tn)} of the expression levels of genes A and B at time points t1,,tn, we can express (4) as the linear model Y=Xβ+ε, where (5) Y=[mA(t1)mA(t2)mA(tn)],X=[mB(t1)0t1mB(s)ds0t1mA(s)dst11mB(t2)0t2mB(s)ds 0t2mA(s)dst21mB(tn)0tnmB(s)ds0tnmA(s)dstn1],β=[c1c2c3c4c5],(5) and εN(0,σ2In), where In denotes the n × n identity matrix; it is natural to assume that ηA and ηB are centered Gaussian processes with independent increments, in which case the integral defining ε above is normally distributed. Although only samples from the functions mA(t) and mB(t) are given, we can estimate the integral entries of the second and third columns of X by numerically integrating spline or polynomial interpolants fitted to these samples.Footnote2

In fitting the model (4) to the expression data of genes A and B, we obtain the ordinary least-squares (OLS) estimator β̂=(XTX)1XTY. The amount of variation in gene A’s expression that is captured by the estimated linear model is expressed as the familiar R2 value: R2=||Xβ̂Y¯1n||2/||YY¯1n||2, where Y¯=1nYT1n and 1n denotes the n×1 vector of ones. Adopting the terminology in Farina et al. (Citation2008), we refer to this R2 as the lead-lag R2 between genes A and B. A high lead-lag R2 may indicate that the two genes are co-regulated in time by common transcription factors, or are at least associated with one another in some way. The term “lead-lag” comes from the lead-lag compensator in control theory. In this context, a “lead-lag relationship” between genes refers to the presence of a common regulatory signal (input) that, in conjunction with the process of mRNA decay, modulates the expression of genes with the same biological function (output).

2.2. Motivating the Bayesian Lead-Lag R2

Our primary contribution in this work is a biologically-informed method for clustering genes based on their temporal dynamics. Clustering involves measuring the similarity between two objects, which can also be thought of as defining an edge between two nodes in an undirected network. Our similarity measure is a Bayesian version of the lead-lag R2 that uses both temporal expression data for genes A and B as well as a prior indication of whether they are biologically associated.

We can motivate the Bayesian lead-lag R2 via , which shows examples of false positive gene pairs: genes that have a spuriously high lead-lag R2, but do not have similar expression patterns nor a biological relationship. The data comes from an experiment on fruit flies by Schlamp et al. (Citation2021) that aimed to profile the dynamics of genes involved in or affected by immune response immediately following an infection. More details on this dataset can be found in Section 4.2.

Figure 2. Plots of the temporal expression profiles of three gene pairs for which the lead-lag R2 is spuriously high. Spline interpolants were fit through the observed points, which are indicated by solid dots. Functional annotations for these genes in Flybase (Larkin et al. Citation2021) do not suggest a clear link within each pair. By contrast, the lower Bayesian lead-lag R2 values more accurately reflect the degree of these associations.

Figure 2. Plots of the temporal expression profiles of three gene pairs for which the lead-lag R2 is spuriously high. Spline interpolants were fit through the observed points, which are indicated by solid dots. Functional annotations for these genes in Flybase (Larkin et al. Citation2021) do not suggest a clear link within each pair. By contrast, the lower Bayesian lead-lag R2 values more accurately reflect the degree of these associations.

Spuriously high lead-lag R2 values are likely to arise in large datasets. For example, if gene A’s expression levels increase or decrease monotonically with time, the response vector Y in (5) will be highly correlated with the time integrals and time points in the third and fourth columns of X. The lead-lag R2 between genes A and B will be large, but not because the genes are associated either in time or biologically.

In contrast to gene pairs of the kind shown in , we can consider , which displays genes from two well-studied functional groups, known as pathways: circadian rhythms and immune response. Within each group, we expect to see high pairwise lead-lag R2 values (true positives).

Figure 3. Left: The uncharacterized gene CG33511 exhibits similar time dynamics to known circadian rhythm genes with 24-hour cyclic temporal expressions. Right: Uncharacterized genes CG43920 and CG45045 exhibit similar temporal expression patterns to known immune response genes, which are up-regulated in response to infection.

Figure 3. Left: The uncharacterized gene CG33511 exhibits similar time dynamics to known circadian rhythm genes with 24-hour cyclic temporal expressions. Right: Uncharacterized genes CG43920 and CG45045 exhibit similar temporal expression patterns to known immune response genes, which are up-regulated in response to infection.

Incorporating pathway membership or protein-protein interaction networks into the lead-lag R2 enables us to encourage genes in the same pathways to receive higher pairwise similarity scores, thus separating true positives from false positives. Importantly, we can also suggest possible pathways for previously uncharacterized genes. In our method, external biological information is used to determine the locations of normal prior distributions on the parameters c1,,c5 in the model (4). Upon obtaining the posterior estimates of these parameters, we recompute the lead-lag R2 to obtain the Bayesian lead-lag R2 (LLR2). In doing so, we observe a desirable shrinkage effect in the distribution of the Bayesian lead-lag R2 values that pares down the number of significant associations. We next detail the hierarchical model and its hyperparameters in Section 3.

3. Empirical Bayes Methodology

In this section, we propose our empirical Bayes approach to deriving biologically-informed similarity metrics between genes for clustering and network analysis. The components of our method are: (1) encoding external biological information into a prior adjacency matrix, (2) defining a normal-inverse gamma prior, specifically Zellner’s g-prior, on the parameters of the ODE-based model of gene expression (4), (3) optimally selecting the hyperparameter g in Zellner’s g-prior, and (4) calculating the Bayesian lead-lag R2. Note that parts 2–4 lead to the computation of the Bayesian lead-lag R2 for a single gene pair; a summary of the full algorithm for all pairs is provided in Appendix B.

3.1. Part 1: Leveraging Biological Priors

There exist numerous databases that extensively document known or predicted interactions between genes as well as their functional roles. For instance, Gene Ontology (GO) terms are keywords that describe a gene’s molecular function, role in a biological process (e.g. “lipid metabolism”), or cellular localization (e.g. “nucleus”) (Ashburner et al. Citation2000). Semantic similarity methods, such as the R package GOSemSim (Yu et al. Citation2010), have been developed to determine how related genes are based on their associated GO terms. Other curated databases that similarly assign genes to pathways include KEGG (Kanehisa and Goto Citation2000), Reactome (Fabregat et al. Citation2018), and BioCyc (Karp et al. Citation2019). The STRING database (Szklarczyk et al. Citation2019) aggregates multiple sources of information to generate a more holistic measurement of the association between two genes. For each pair of genes in an organism, STRING provides a score between 0 and 1 indicating how likely the two genes’ encoded proteins are to interact physically based on experimental evidence, belong to the same pathways according to the aforementioned databases, be conserved across species, or be mentioned in the same studies.

Regardless of which sources of external biological information one employs, the first step of our method is to encode this information into a matrix W of size N × N, where N is the total number of genes under study. The entries of W are: Wij={1if prior evidence suggests that the ith and jth genes are associatedNAif the relationship between the ith and    jth genes is unknown0if the ith and                             jth genes are unlikely to be associated. Intuitively, W can be viewed as an adjacency matrix for a network that reflects the known relationships amongst the genes. Our proposed method uses this network as an informed basis for the network constructed from the time-course gene expression data itself. Importantly, the data can indicate what the “NA” entries of W ought to be, as well as confirm (or refute) the “0” or “1” entries.

In the event that W consists largely of “NA” entries, one can also use time-course gene expression data from previous studies to fill in some of them. Such datasets often consist of multiple replicates of the expression measurements, possibly gathered under the same experimental conditions to account for sampling variation, or different conditions to assess the effect of a treatment.

3.2. Part 2: The Normal-Inverse Gamma Model and Zellner’s g-Prior

Recall from Section 2.1 that given measurements {mA(t1),,mA(tn)} and {mB(t1),,mB(tn)} of the expressions of two genes A and B at times t1,,tn, the model we aim to fit is mA(t)=c1mB(t)+c20tmB(s)ds+c30tmA(s)ds+c4t+c5, where c1,,c5 are unknown parameters, and the integrals 0tmA(s)d​s and 0tmB(s)d​s are estimated by numerically integrating spline interpolants of the given data. As seen in (5), this model can be represented in matrix form as Y=Xβ+ε, where εN(0,σ2In).

Since c1 and c2 link the expressions of genes A and B, we intuitively ought to place priors of non-zero mean on these two parameters if W indicates that the genes are associated. To do this, we adopt the normal-inverse gamma model for β and σ2, which is used frequently in Bayesian regression and allows for flexible modeling of the prior mean and covariance matrix of β. The normal-inverse gamma model specifies β|σ2N(β0,σ2V0) and σ2Γ1(a,b) where β0Rp×1,V0Rp×p is a positive semidefinite matrix, and a,b>0. It is then said that (β,σ2) jointly follow a normal-inverse gamma distribution with parameters (β0,V0,a,b). This is a conjugate prior, so the posterior distribution of (β,σ2) is also normal-inverse gamma with parameters (β*,V*,a*,b*) defined as (6) β*=(V01+XTX)1(V01β0+XTY),(6) (7) V*=(V01+XTX)1,(7) (8) a*=a+n2,(8) (9) b*=b+12(β0TV01β0+YTYβ*TV*1β*).(9) That is, the conditional posterior of β given σ2 and the posterior of σ2 are β|σ2,YN(β*,σ2V*) and σ2|YΓ1(a*,b*). The posterior mean β* can be used as the estimated regression parameters. This estimator has the desirable properties of consistency and asymptotic efficiency in large samples and admissibility in finite samples (Giles and Rayner Citation1979).

The hyperparameters β0 and V0 are of particular interest as they allow us to incorporate biological information into our model. In defining β0, recall that we wish to place priors with non-zero mean on the parameters c1 and c2 when external sources suggest that genes A and B are co-regulated or at least associated. We noted in Section 2.1 that c1 represents the ratio αA/αB, where αA and αB denote the strength of a common regulatory signal in the first-order dynamics of the two genes. If the genes are associated, it is reasonable to believe a priori that αA=αB, implying c1=1. Then we could also say a priori that c2=1, since c2 represents a parameter that is proportional to c1. When prior information about genes A and B is lacking or suggests that they are unrelated, a prior mean of zero for c1 and c2 is appropriate. Supposing genes A and B are the ith and jth genes in the dataset, we can thus set the prior mean β0 as follows: β0={[1,1,0,0,0]Tif Wij=1[0,0,0,0,0]Tif Wij=0 or NA.

As for the prior covariance matrix σ2V0 of β, we first note that for the linear model Y=Xβ+ε where ε has the covariance matrix σ2In, the least-squares estimator β̂OLS=(XTX)1XTY has the covariance matrix σ2(XTX)1. A popular choice for V0 is therefore g(XTX)1, where g > 0. This choice of V0 yields a particularly tractable special case of the normal-inverse gamma model known as Zellner’s g-prior (Zellner Citation1986). Substituting this choice of V0 into the posterior mean β* in (6) and covariance matrix V* in (7), we obtain (10) β*=11+gβ0+g1+gβ̂OLS,(10) (11) V*=g1+g(XTX)1.(11) (10) Reveals that under Zellner’s g-prior, the posterior mean β* is a convex combination of the prior mean β0 and the least-squares estimator β̂OLS. The parameter g balances the weights placed on external information encoded by β0 and on the data used to compute β̂OLS, so the selection of g is an important component of our modeling strategy. We next describe our data-driven approach to choosing it.

3.3. Part 3: Optimal Selection of g in Zellner’s g-Prior

Several methods for choosing g in Zellner’s g-prior have been proposed previously. For instance, George and Foster (Citation2000) discuss an empirical Bayes method in which one selects the value of g that maximizes the marginal likelihood of Y. Liang et al. (Citation2008) provide a closed form expression for this maximizing value of g that is nearly identical to the F-statistic for testing the hypothesis that β=0. They show that this maximum marginal likelihood approach has the desirable property that the Bayes factor for comparing the full model to the null (intercept-only) model diverges as the R2 approaches 1. For our application, one concern is that the F-statistic defining this particular estimate of g is likely to be overwhelmingly large for many gene pairs. If g is large, then β* will be very close to β̂OLS according to (10). As a result, any biological evidence of association captured by β0 will have a negligible impact on the model.

A fully Bayesian approach to selecting g involves placing a prior distribution on g that is then integrated out when defining the prior on β. This method is motivated by Zellner and Siow (Citation1980), who propose placing a Cauchy prior on β to derive a closed-form posterior odds ratio for hypothesis testing purposes. These priors can be represented as a mixture of g-priors with an inverse gamma prior on g, although a closed-form posterior estimate of g is unavailable. Cui and George (Citation2008) and Liang et al. (Citation2008) alternatively consider a class of priors of the form π(g)=a22(1+g)a/2 for a > 2, known as the hyper-g priors. Under these priors, the posterior mean of g can be expressed in closed form in terms of the Gaussian hypergeometric function.

Because our application involves fitting regression models for potentially thousands of gene pairs, the computational cost of fully Bayesian methods for selecting g requires us to consider alternative approaches. One idea is to select the value of g that minimizes the sum of squared residuals ||YY||2, where Y=Xβ* is the vector of fitted values: (12) Y=Xβ*=X(11+gβ0+g1+gβ̂OLS)=11+gY0+g1+gYOLS,(12) where Y0=Xβ0 and YOLS=Xβ̂OLS. However, we found that there are no analytical solutions to g*=argming>0||YY||2=0. Instead, we can minimize Stein’s unbiased risk estimate (SURE), which is an unbiased estimate of ||YXβ||2. Below is a rephrased version of Theorem 8.7 in Fourdrinier et al. (Citation2018), which defines SURE for the general problem of estimating E(Y) using a linear estimator of the form Y=a+SY. This theorem statement differs from its original version in that we have rewritten the divergence of Xβ̂ with respect to Y using the generalized degrees of freedom developed by Efron (Citation2004).

Theorem 3.1

(SURE for linear models). Let YN(Xβ,σ2In), where the dimensions of X are n × p, and let β̂=β̂(Y) be a weakly differentiable function of the least squares estimator β̂OLS such that Y=Xβ̂ can be written in the form Y=a+SY for some vector a and matrix S. Let σ̂2=||YXβ̂OLS||2/(np). Then, (13) δ0(Y)=||YXβ̂||2+(2Tr(S)n)σ̂2(13) is an unbiased estimator of ||YXβ||2.

From (12), observe that we can write Y as 11+gY0+g1+gHY, where H=X(XTX)1XT. Therefore the matrix S in Theorem 3.1 is gH/(1+g), whose trace is gp/(1+g). SURE in (13) then becomes (14) δ0(Y)=||YXβ*||2+(2gp1+gn)σ̂2,(14) where we have also substituted the posterior mean β* in (10) for β̂.

We next present the value of g that minimizes SURE.

Theorem 3.2

(SURE minimization with respect to g). The value of g that minimizes SURE in (14) is g*=||YOLSY0||2pσ̂21.

The proof of Theorem 3.2 is provided in Appendix A.

It is quite possible that pσ̂2 is small, resulting in a large value of g* (i.e. g*1) in Theorem 3.2. In this case, β* in (10) will be largely influenced by the data via β̂OLS, rather than by prior information via β0. This is desirable when the relationship between the two genes is unknown (i.e. Wij=NA), but not if the relationship is known to be unlikely (i.e. Wij=0). In the latter case, we prefer to shrink the regression coefficients towards the prior mean β0=0 so as to yield a smaller lead-lag R2 value. To address this, we set g conditionally on Wij as (15) g={g*if Wij=NA or 11if Wij=0,(15) where g* is defined according to Theorem 3.2.

3.4. Part 4: Computing the R2 for Bayesian Regression Models

Once a posterior estimate of the model coefficients (10) has been obtained, with the parameter g selected optimally, we can compute the Bayesian lead-lag R2 between genes A and B.

Recall that for a linear model Y=Xβ+ε where β is estimated by the least-squares estimator β̂OLS=(XTX)1XTY, the R2 is defined as (16) R2=||Xβ̂OLSY¯1n||2||YY¯1n||2,(16) where Y¯=(1/n)YT1n. In Bayesian regression, however, the standard decomposition of total sum of squares into residual and regression sums of squares no longer holds. Thus, when we replace β̂OLS with an estimator such as the posterior mean of β, the formula (16) can potentially yield R2>1. As a remedy to this issue, we compute the R2 as the ratio of the variance of the model fit to itself plus the variance of the residuals. This ratio is within [0,1] by construction, and is given by (17) R2=Var̂(Xβ*)Var̂(Xβ*)+Var̂(YXβ*),(17) where, for a vector Z=[z1zn]T with mean z¯, we define Var̂(Z)=1n1i=1n(ziz¯)2. This calculation is based on the approach to computing R2 for Bayesian regression models proposed in Gelman et al. (Citation2018). For our application, we refer to (17) as the Bayesian lead-lag R2 (LLR2).

3.5. Clustering and Empirical Analysis with the Bayesian Lead-Lag R2

Given a dataset of N genes whose expressions are measured at n time points, our objective is to cluster the genes based on their temporal expression patterns. To do this, we compute a N × N similarity matrix S where Si,j stores the Bayesian LLR2 in (17) between the ith and jth genes. We then apply hierarchical clustering to the distance matrix JS, where J is the N × N matrix of ones.

Note that the Bayesian LLR2 is asymmetric: LLR2(i,j)LLR2(j,i). Here, LLR2(i,j) denotes the Bayesian LLR2 where we treat the ith gene as the response (gene A) in the model (4). This asymmetry could potentially be informative. In particular, LLR2(i,j)>LLR2(j,i) could possibly suggest that the expression of the ith gene is driven by that of the jth gene; that is, a model in which the ith gene is treated as the response is a better fit to the data than one in which the jth gene is the response. Further research is required to determine how this asymmetry can be efficiently used to detect causal or lagged co-regulation relationships in large genomic datasets. For our purpose of clustering many genes for empirical analysis using standard algorithms such as hierarchical clustering, we symmetrize the Bayesian LLR2 by setting: Si,j=max{LLR2(i,j),LLR2(j,i)}.

Intuitively, using the maximum over the minimum or the average should improve the power of this metric to measure associations in either direction. In Section 4.1, we use simulated gene expression data to demonstrate that the false positive rate remains controlled under these symmetrization methods.

In practice, it is also common to use similarity measures such as the Bayesian LLR2 to produce a ranked list of gene-gene associations. To aid this procedure, we further propose two additional metrics that one could use in conjunction with the Bayesian LLR2. These metrics are derived from the following two sub-models of the original model (4): (18) mA(t)=c1mB(t)+c20tmB(s)ds+c5,(Sub-model 1)(18) (19) mA(t)=c30tmA(s)ds+c4t+c5. (Sub-model 2)(19)

The first sub-model describes the temporal expression of gene A as a function only of the potentially co-regulated gene B. The second sub-model is “autoregressive” in the sense that it describes gene A’s expression only in terms of its own past and linear time trends. We again apply our Bayesian approach to fitting these two sub-models and compute new variants of the Bayesian LLR2 from each, denoted LLRother2 and LLRown2 respectively. The LLRother2 value indicates the amount of variation in gene A’s temporal expression that can be explained by the dynamics of another gene B. LLRown2 indicates the amount of variation in gene A that is explained by its own past, via its time integrals, and linear time trends. We can view LLR2LLRown2 as the amount of additional variation in gene A’s temporal dynamics that can be explained by considering gene B, on top of the variation captured by gene A’s own past via sub-model 2. Intuitively, if LLR2 is large, then a large value of LLR2LLRown2 suggests that the LLR2 value is unlikely to mean a false positive relationship between the two genes. In Section 4.5, we demonstrate empirically that using LLRother2 and LLR2LLRown2 together can help identify gene pairs with highly similar time dynamics.

3.6. Significance of the Bayesian Lead-Lag R2

One may further desire a notion of statistical significance for the Bayesian LLR2. One option is to simulate the posterior distribution of the LLR2 using draws from the posterior distribution of β, as described in Gelman et al. (Citation2018). Recall from Section 3.2 that the posterior distribution of (β,σ2) is normal-inverse gamma with parameters (β*,V*,a*,b*), defined respectively in (10), (11), (8), and (9). To draw samples from this posterior distribution, we can first sample σ2 from the Γ1(a*,b*) distribution, and then sample β from the N(β*,gσ2V*) distribution. In particular, if Wij=1, we may wish to simulate the posterior distribution of the Bayesian LLR2 under a null hypothesis of no relationship between genes A and B. This can be reflected in the sampling procedure by calculating β* with β0 set to 0.

Alternatively, one could use the Bayes factors presented in Liang et al. (Citation2008) to corroborate the Bayesian lead-lag R2. Let MF denote the model (4) of gene expression, which has an intercept and p = 4 “covariates” with coefficients c1 through c4. Let MN denote the null (intercept-only) model. Then the Bayes factor for comparing MF to MN is BF(MF:MN)=(1+g)(np1)/21(1+g(1R2))(n1)/2, where R2 is the usual coefficient of determination in (16). Kass and Raftery (Citation1995) interpret a log10(BF(MF:MN)) value between 1 and 2 as “strong” evidence in favor of MF, or above 2 as “decisive” evidence.

In Appendix B, we give a summary of our methodology in the form of a generic algorithm that can be run on any time-course gene expression dataset.

3.7. Possible Modifications to Prior Hyperparameters

In Section 3.2, we set the prior mean β0 of the parameters of the ODE model to [1,1,0,0,0]T when Wij=1, i.e. there is prior evidence suggesting genes A and B are associated. To make the method even more data-driven, one could alternatively set β0=[ξ,ξ,0,0,0]T in the Wij=1 case, where ξ0 is chosen adaptively from the data. The following theorem presents the values of ξ and g that simultaneously minimize SURE in (14) in this setting.

Theorem 3.3

(SURE minimization with respect to ξ and g). Assume the entries of the least-squares estimator β̂OLS are all distinct and the expression of gene B is non-zero for at least one time point, i.e. mB(ti)0 for at least one i. Let β0=[ξ,ξ,0,0,0]T in the case that Wij=1. Then the values of ξ and g that minimize SURE in (14) are ξ*=YTX12||X12||2  and  g*=||YOLS||2||X12||2(YTX12)2||X12||2pσ̂21, where X12Rn×1 is the element-wise sum of the first two columns of X.

The proof of Theorem 3.3 is provided in Appendix A.

4. Results

4.1. Analysis of Simulated Data

We first validate our methodology on a simulated dataset of temporal gene expression trajectories. We generate data from the model discussed in Section 2.1: dmA(t)dt=αAp(t)+βAκAmA(t)+ηA where we set p(t) = 0 to assume that there is no shared regulatory signal between genes. We also assume that observations are taken in the mA(t) space, and thus noise is additive to mA(t) rather than dmA(t)/dt. This leaves us with dmA(t)dt=βAκAmA(t),

which is a linear ODE with the solution mA(t)=βAκA+ceκAt, cR.

We generate 100 gene expression trajectories from this model, using parameters βAUniform(0,2),κAUniform(0,1), and cUniform(0,1), at 17 time pointsFootnote3 ranging from t = 0 to t = 48. We then add N(0, 1) noise to each point.

We apply our methodology to this simulated dataset by computing the Bayesian LLR2 between each pair of simulated genes. We set the prior adjacency matrix Wij=NA for all gene pairs, i.e. we assume that prior knowledge about each pair’s relationship is unavailable. Owing to the process of generating data with “independent” trajectories in the absence of shared regulatory signals, we expect to see low Bayesian LLR2 values between the vast majority of gene pairs. Observing only a small number of gene pairs with LLR2 values above reasonable thresholds confirms that the false positive rateFootnote4 remains controlled; it is also possible that some gene pairs do indeed exhibit similar temporal patterns through this data generating process, which we will demonstrate in .

Figure 4. Groups of simulated gene expression trajectories whose pairwise Bayesian lead-lag R2 values fell above the 99th percentile of the three empirical LLR2 distributions shown in . Each group exhibits visually similar temporal trajectories despite having been generated independently, demonstrating that the LLR2 remains able to detect such pairs even in the absence of prior information and shared regulatory signals.

Figure 4. Groups of simulated gene expression trajectories whose pairwise Bayesian lead-lag R2 values fell above the 99th percentile of the three empirical LLR2 distributions shown in Figure 5. Each group exhibits visually similar temporal trajectories despite having been generated independently, demonstrating that the LLR2 remains able to detect such pairs even in the absence of prior information and shared regulatory signals.

In Section 3.5, it is mentioned that the Bayesian LLR2 is symmetrized by taking its value to be max{LLR2(i,j),LLR2(j,i)}, where LLR2(i,j) denotes the lead-lag R2 calculated with the ith gene treated as the response. In , we show the distribution of Bayesian LLR2 values between all simulated gene pairs under three symmetrization methods: the minimum, mean, and maximum of LLR2(i,j) and LLR2(j,i). We indicate where the 95th and 99th percentiles of each distribution fall, and we also mark LLR2=0.9 falls on the plot, as this threshold was used to define network edges between genes in the analysis of real gene expression data that we present in subsequent sections.

Figure 5. Distributions of the Bayesian lead-lag R2 when it is symmetrized by taking the maximum, mean, or minimum of LLR2(i,j) and LLR2(j,i), where LLR2(i,j) denotes the lead-lag R2 obtained by treating the ith gene as the response in model (4). Gene trajectories are simulated from the model mA(t)=βA/κA+ceκAt, to which standard Gaussian noise is added at each of 17 time points.

Figure 5. Distributions of the Bayesian lead-lag R2 when it is symmetrized by taking the maximum, mean, or minimum of LLR2(i,j) and LLR2(j,i), where LLR2(i,j) denotes the lead-lag R2 obtained by treating the ith gene as the response in model (4). Gene trajectories are simulated from the model mA(t)=βA/κA+ce−κAt, to which standard Gaussian noise is added at each of 17 time points.

No simulated gene pairs have a Bayesian LLR2 above 0.9 under any symmetrization method, and at most 24 pairs (under the maximum symmetrization) have a LLR2 above 0.8. This demonstrates that the false positive rate of the method is controlled in the absence of prior knowledge about the relationships between genes, and when the genes do not share common regulatory signals. Of the gene pairs whose LLR2 values were above the 99th percentile in each distribution, 48% were shared across at least two symmetrization methods. In , we present four examples of groups of simulated genes whose LLR2 values were above the 99th percentile under all three symmetrization methods; each group of trajectories exhibits visually similar temporal behavior.

4.2. Collecting a Time-Course Gene Expression Dataset for Drosophila melanogaster

The expression of a gene is typically measured via RNA sequencing (RNA-seq) as the count of a particular type of messenger RNA found in a tissue sample. These counts can be normalized to library size and transformed using the limma-voom transformation (Law et al. Citation2014). This transformation produces near-normally distributed gene expression measurements, making them more amenable to analysis with linear models such as those described in Section 2.1.

Our primary time-course gene expression dataset, introduced in Schlamp et al. (Citation2021), profiles the expression dynamics of 12,657 genes in Drosophila melanogaster (fruit fly) in response to an immune challenge. Immune responses were triggered in flies by injecting them with commercial lipopolysaccharide, which contains peptidoglycan (cell wall molecules) derived from the bacterium E. coli. Following injection, the flies were sampled via RNA-seq at 20 time points over five days. The data was normalized by the aforementioned limma-voom transformation and expressed as the log2-fold change with respect to the first time point, which was sampled pre-injection as a control. We focus on the first 17 time points, ranging from zero to 48 h post-injection, as this is when most of the variation in expression occurs.

Differential expression analysis is typically used to identify genes exhibiting significant expression changes, and thus reduce a set of thousands of genes into a smaller set meriting further study. In Appendix C, we provide further details on how we use such methods to reduce our set of 12,657 genes into a set of 1735 genes of interest. We also describe therein how we define our prior adjacency matrix W, introduced in Section 3.1, using the STRING database.

4.3. Small-Scale Case Study: Immunity and Metabolism

We now validate our methodology on a small set of genes whose behavior exhibits a known interplay between immunity and metabolism. Schlamp et al. (Citation2021) observe that exposure to bacterial peptidoglycan has an effect not only on the time dynamics of immune response, but also on the expression of genes involved in metabolism. In particular, some genes involved in immune response are up-regulated immediately following peptidoglycan injection, while other genes associated with metabolic processes are down-regulated more slowly. Interestingly, the metabolic genes return to their pre-infection levels of expression well before the immune response has subsided. This phenomenon can be observed in , which shows the expression patterns of two immune response genes (IM1, IM2) and four metabolic process genes (FASN1, UGP, mino, fbp).

Figure 6. Temporal expression patterns of selected genes involved in immune response (IM1, IM2) and metabolic processes (FASN1, UGP, mino, fbp). Upon infection, immune response genes are immediately up-regulated. At the same time, metabolic genes are down-regulated but return to pre-infection expression levels after 12–24 h, which is before the immune response is resolved.

Figure 6. Temporal expression patterns of selected genes involved in immune response (IM1, IM2) and metabolic processes (FASN1, UGP, mino, fbp). Upon infection, immune response genes are immediately up-regulated. At the same time, metabolic genes are down-regulated but return to pre-infection expression levels after 12–24 h, which is before the immune response is resolved.

show the prior adjacency matrix W, the Bayesian LLR2 values, and the non-Bayesian LLR2 values (computed via ordinary least-squares regression) corresponding to these six genes.

Table 1. Portion of the prior adjacency matrix W corresponding to the six genes in .

Table 2. Bayesian LLR2 values corresponding to each gene pair in .

Table 3. Non-Bayesian LLR2 values corresponding to each gene pair in .

Within this set of six genes, indicates that there is prior evidence of association between the immune response genes IM1 and IM2, as well as between the metabolic genes mino and UGP. The off-diagonal “NA” entries in signify that the relationships between the immune and metabolic genes are uncharacterized in the STRING database. However, the Bayesian LLR2 values between the metabolic gene FASN1 and both immune response genes are high, as shown in , indicating that the relationship identified by Schlamp et al. (Citation2021) between these gene groups is automatically uncovered by our proposed Bayesian method. Indeed, shows that the temporal expression pattern of FASN1 resembles a vertically-reflected copy of that of either IM1 or IM2. shows the non-Bayesian LLR2 values for each gene pair and demonstrates that computing the LLR2 without biologically-informed priors yields inflated scores that make it difficult to discern either within- or between-group associations. For further validation of these results, we present in Appendix E. These tables display the values of the metric LLR2LLRown2, which we introduced in Section 3.5 as a means of assessing whether associations indicated by the LLR2 alone are false positives.

4.4. Biologically-Meaningful Clustering with the Bayesian LLR2

We now apply hierarchical clustering using Ward’s method (Ward Citation1963) to the N × N distance matrix JS, where J is a matrix of ones and Si,j=max{LLR2(i,j),LLR2(j,i)} is the similarity matrix containing the symmetrized Bayesian LLR2 values between all gene pairs. These genes come from the dataset described in Section 4.2, so N = 1735. Cutting the resulting dendrogram at a height of ten yields 12 clusters, which we display in . A robustness analysis of the within-cluster sum of squares was also performed to confirm that this selection of dendrogram height was appropriate.

Figure 7. Plots of the temporal expression patterns of genes in each cluster obtained by applying hierarchical clustering to the distance matrix JS, where S contains the Bayesian LLR2 for all gene pairs. Cluster sizes range from 40 to 727 genes, with a mean of 145 and a median of 76. These plots visually demonstrate that the Bayesian LLR2 is capable of capturing similarities in the overall shapes of two temporal expression patterns, even if one gene is a time-delayed copy of the other or is reflected across a horizontal axis, for instance.

Figure 7. Plots of the temporal expression patterns of genes in each cluster obtained by applying hierarchical clustering to the distance matrix J−S, where S contains the Bayesian LLR2 for all gene pairs. Cluster sizes range from 40 to 727 genes, with a mean of 145 and a median of 76. These plots visually demonstrate that the Bayesian LLR2 is capable of capturing similarities in the overall shapes of two temporal expression patterns, even if one gene is a time-delayed copy of the other or is reflected across a horizontal axis, for instance.

From each cluster, we can also construct a network in which an edge is defined between two genes if their corresponding Bayesian LLR2 exceeds a certain threshold. In our analysis, we choose this threshold to be 0.9, which is also the 99th percentile of the empirical distribution of the Bayesian LLR2 for this dataset.

To understand the biological processes that are represented by these clusters, we make use of the Gene Ontology (GO) resource (Ashburner et al. Citation2000; Carbon et al. Citation2021). GO links genes with standardized terms that describe their functions. To determine whether a GO term is significantly enriched within a cluster, i.e. whether the cluster contains significantly more genes associated with that term than expected by chance, we perform a hypergeometric test using the R package ClusterProfiler (Yu et al. Citation2012). We use Benjamini-Hochberg (B–H) corrected p-values below 0.05 (Benjamini and Hochberg Citation1995) from this test to determine “significant” enrichment.

Our analysis shows that with the exception of cluster 8, all clusters are significantly enriched for specific biological functions. Recall that our dataset profiles the dynamics of immune response, which is an energetically costly process that is also associated with metabolic changes (DiAngelo et al. Citation2009). The interplay between immunity and metabolism, which we briefly explored in Section 4.3, is represented particularly well in these clusters. Clusters 1, 4, 6, and 7 are significantly enriched for metabolic processes; cluster 10 is significantly enriched for immune response; and cluster 5 is significantly enriched for both metabolic processes and immune responses. Below, we highlight biologically relevant findings from one cluster, and we discuss three additional clusters in Appendix F. These examples show that clustering with the Bayesian LLR2 allows genes with similar but lagged expression patterns to be grouped together, even in the absence of known prior information. Finally, the Bayesian LLR2 is not influenced by the direction of gene expression changes (i.e. positive or negative changes), making it easier to detect tradeoffs or negative regulatory interactions between genes.

We compare our methodology to other gene clustering techniques in the Appendix. In Appendix D, we show clustering results obtained using the non-Bayesian LLR2 as well as Pearson correlation between genes. In Appendix G, we present clusters derived from weighted correlation network analysis (WGCNA) (Langfelder and Horvath Citation2008). However, we find that these methods do not group together genes with similarly-shaped temporal trajectories as effectively as the Bayesian LLR2.

4.4.1. Analysis of Cluster 10

In cluster 10, which consists of 44 genes, one of the most significantly enriched GO terms is “defense response”. This GO term is supported by 24 genes in the cluster, within which there are two distinct groups that we display in : eight genes that are known to respond to “Imd” signaling and eight genes that respond to “Toll” signaling. The Imd and Toll signaling pathways represent well-studied molecular responses to infection in flies. The Imd pathway is tailored to fight off infections from gram-negative bacteria Kaneko et al. Citation2004; Zaidman-Rémy et al. Citation2011; Hanson and Lemaitre Citation2020, while the Toll pathway fights off infections from gram-positive bacteria and fungi (Gobert et al. Citation2003; Hanson and Lemaitre Citation2020).

Figure 8. Temporal expression patterns of selected genes in cluster 10 during the first ten hours following peptidoglycan injection. The eight red lines correspond to Imd-regulated genes. The eight blue lines correspond to Toll-regulated genes, which show a smaller and delayed up-regulation. The four black lines correspond to genes that exhibited high Bayesian LLR2 values (> 0.9) with Imd-regulated genes; while there is no prior information in STRING linking them with the Imd pathway, their co-clustering with Imd-regulated genes was also observed in Schlamp et al. (Citation2021).

Figure 8. Temporal expression patterns of selected genes in cluster 10 during the first ten hours following peptidoglycan injection. The eight red lines correspond to Imd-regulated genes. The eight blue lines correspond to Toll-regulated genes, which show a smaller and delayed up-regulation. The four black lines correspond to genes that exhibited high Bayesian LLR2 values (> 0.9) with Imd-regulated genes; while there is no prior information in STRING linking them with the Imd pathway, their co-clustering with Imd-regulated genes was also observed in Schlamp et al. (Citation2021).

Since the flies profiled in this gene expression dataset were injected with peptidoglycan derived from E. coli, a gram-negative bacterium, we expect to see an activation of Imd-regulated genes. Indeed, as seen in , the eight Imd-regulated genes in this cluster immediately underwent strong up-regulation and reached their highest expression one to two hours after peptidoglycan injection. By contrast, Toll-regulated genes underwent a delayed up-regulation of smaller magnitude, and reached their highest expression two to four hours after injection. Overall, the Bayesian LLR2 method successfully grouped together functionally related genes with distinct activation kinetics in this cluster.

In addition to recovering known dynamics of immune response pathways in cluster 10, the Bayesian LLR2 metric identified several new relationships between genes. In this cluster, four genes (CG44404, also known as IBIN; CG43236, also known as Mtk-like; CG43202, also known as BomT1; and CG43920) had no prior information available in the STRING database to link them with Imd-regulated genes. However, these four genes exhibit similar expression patterns to Imd-regulated genes, as seen in , although previous studies examine CG44404/IBIN and CG43202/BomT1 expression downstream of Toll signaling (Clemmons et al. Citation2015; Valanne et al. Citation2019). This suggests that these four genes are not exclusively controlled by Toll signaling, and that they can also respond to Imd signaling after a gram-negative immune challenge. While Imd-regulation of CG43236/Mtk-like and CG43920 has not been experimentally validated, their co-clustering pattern with Imd-regulated genes was also observed by Schlamp et al. (Citation2021).

In , we show a network consisting of the four aforementioned genes (CG44404/IBIN, CG43236/Mtk-like, CG43202/BomT1, CG43920) and their neighbors, i.e. the genes with which they have a Bayesian LLR2 of at least 0.9. Red edges in connect genes that were known to be associated according to prior information, i.e. Wij=1 for these pairs. Blue edges connect genes with an uncharacterized relationship in the STRING database, i.e. Wij=NA for these pairs.

Figure 9. Network of genes formed from CG44404/IBIN, CG43236/Mtk-like, CG43202/BomT1, CG43920, and their neighbors. Two genes are connected by an edge if their Bayesian LLR2>0.9. Red and blue edges connect genes with known and unknown relationships, respectively. Darkened edges connect genes within cluster 10. Blue edges connect the four genes of interest with genes known to be regulated by the Imd signaling pathway, suggesting a possible role for them in fighting gram-negative infections.

Figure 9. Network of genes formed from CG44404/IBIN, CG43236/Mtk-like, CG43202/BomT1, CG43920, and their neighbors. Two genes are connected by an edge if their Bayesian LLR2>0.9. Red and blue edges connect genes with known and unknown relationships, respectively. Darkened edges connect genes within cluster 10. Blue edges connect the four genes of interest with genes known to be regulated by the Imd signaling pathway, suggesting a possible role for them in fighting gram-negative infections.

4.4.2. Transcription Factor Targets Cluster Together

As mentioned in Sections 1 and 2, genes can be co-regulated in time by the same transcription factor(s), which are proteins that control gene expression. The ODE model of gene expression we employ in EquationEquations (1) and Equation(2) assumes that two genes that are co-regulated by a common transcription factor will have similar time dynamics; in particular, the rates of change in the expressions of both genes will be governed by the same regulatory signal p(t). In light of this model, we expect that co-regulated genes ought to cluster together, i.e. that the target genes of a transcription factor would not be dispersed across many different clusters. displays a heatmap showing the proportion of each transcription factor’s targets appearing in each of the clusters identified by our Bayesian LLR2 technique, and demonstrates that target genes do indeed appear in the same cluster for many transcription factors. Targets of transcription factors in each cluster were identified using the RcisTarget R package (Aibar et al. Citation2017).

Figure 10. Heatmap displaying the proportion of each transcription factor’s targets that are assigned to each of the 12 clusters. Transcription factors are named on the right-hand side. The number of target genes they have ranges from three to 1090, with a median of 21. Many transcription factors have their targets grouped in the same cluster, demonstrating that the Bayesian LLR2 successfully identifies clusters containing co-regulated genes.

Figure 10. Heatmap displaying the proportion of each transcription factor’s targets that are assigned to each of the 12 clusters. Transcription factors are named on the right-hand side. The number of target genes they have ranges from three to 1090, with a median of 21. Many transcription factors have their targets grouped in the same cluster, demonstrating that the Bayesian LLR2 successfully identifies clusters containing co-regulated genes.

4.5. The Bayesian LLR2 Produces a Sparse Set of Associations

The lead-lag R2 can be computed with biological information via our proposed Bayesian methodology, or without such information via ordinary least squares regression. In Section 2.2, we discuss the challenge of observing many spuriously high lead-lag R2 values under the least squares approach, which does not incorporate existing knowledge of the biological relationship between genes. In , we show the empirical distributions of the non-Bayesian and Bayesian LLR2 values in our Drosophila gene expression dataset. The non-Bayesian LLR2 distribution is heavily left-skewed, which complicates the definition of suitable LLR2 thresholds above which two genes can be considered to be associated. The Bayesian approach effectively “regularizes” the LLR2 values, yielding a distribution that reflects the reality of only a relatively small set of noteworthy pairwise associations existing within the dataset.

Figure 11. Empirical distributions of the non-Bayesian (left) and Bayesian (right) lead-lag R2 values calculated on the Drosophila gene expression dataset under consideration in this study. The 95th and 99th empirical percentiles are marked in each plot.

Figure 11. Empirical distributions of the non-Bayesian (left) and Bayesian (right) lead-lag R2 values calculated on the Drosophila gene expression dataset under consideration in this study. The 95th and 99th empirical percentiles are marked in each plot.

In particular, the distribution of the Bayesian LLR2 is conducive to identifying pairs or groups of genes with highly similar time dynamics. In Section 3.5, we introduced the metrics LLRother2 and LLR2LLRown2. As described therein, the former indicates how much variation in gene A’s expression can be explained only through the dynamics of another gene B. The latter indicates how much additional variation in gene A can be explained by considering gene B, on top of considering gene A’s own past trajectory. Intuitively, both of these quantities should be large if the two genes are indeed associated in a way that manifests in highly similar temporal dynamics.

In , we randomly select 150 genes and place all (1502)=11,175 pairs on two scatterplots whose horizontal axes display the LLRother2 values and vertical axes display the LLR2LLRown2 values. These R2 values are computed via our Bayesian method in one scatterplot and via ordinary least squares regression in the other. Gene pairs of particular interest fall into the upper right quadrant of the scatterplot.

Figure 12. Scatterplots of LLRother2 (horizontal axis) and LLR2LLRown2 (vertical axis) for a random selection of 150 genes, resulting in 11,175 gene pairs. Left: LLR2 values are computed via ordinary least squares, without external biological information. Points are colored according to how prior information characterizes the corresponding two genes: unlikely to be associated (gray); uncharacterized association (blue); or known association (red). Right: LLR2 values are computed with the proposed Bayesian approach.

Figure 12. Scatterplots of LLRother2 (horizontal axis) and LLR2−LLRown2 (vertical axis) for a random selection of 150 genes, resulting in 11,175 gene pairs. Left: LLR2 values are computed via ordinary least squares, without external biological information. Points are colored according to how prior information characterizes the corresponding two genes: unlikely to be associated (gray); uncharacterized association (blue); or known association (red). Right: LLR2 values are computed with the proposed Bayesian approach.

shows that when we use the ordinary least-squares approach, i.e. without incorporating external biological information, we obtain an overwhelmingly large number of gene pairs with high LLRother2 scores. Many are those that are unlikely to be associated, according to our chosen sources of prior information. By contrast, the Bayesian approach leverages this information to shift the distribution of the R2 values noticeably, yielding a smaller set of gene pairs that are worth examining further. This distributional shift is due to both the estimator β* in (10) for the coefficients in the model (4), as well as the way in which the parameter g is set in (15). In particular, g controls how much β* is influenced by either the data or prior information.

Importantly, shows that gene pairs with previously uncharacterized relationships but highly similar time dynamics are more easily identified with the Bayesian LLR2. A few of these gene pairs, which fall in the fairly sparse upper-right region of the left-hand scatterplot, are shown in in Appendix D. shows gene pairs in the same region of the scatterplot with well-known relationships, further demonstrating that the proposed method successfully recovers familiar associations.

5. Discussion

Time-course gene expression datasets are a valuable resource for understanding the complex dynamics of interconnected biological systems. An important statistical task in the analysis of such data is to identify clusters or networks of genes that exhibit similar temporal expression patterns. These patterns yield systems-level insight into the roles of biological pathways and processes such as disease progression and recovery.

The main statistical challenges in detecting associations within time-course gene expression datasets stem from high dimensionality and small sample sizes, combined with the nonlinearity of gene expression time dynamics. To overcome these difficulties, we proposed a method for identifying potentially associated genes that treats temporal gene expression as a dynamical system governed by ODEs, whose parameters are determined in a Bayesian way using gene networks curated a priori from biological databases. The ODE model is fit to a pair of genes via Bayesian regression and is used to derive the biologically-informed Bayesian lead-lag R2 similarity measure. The Bayesian regression procedure leverages Zellner’s g-prior and ideas from shrinkage estimation, namely minimization of Stein’s unbiased risk estimate (SURE), to balance the posterior ODE model’s fit to the data and to prior knowledge of the relationship between the genes. As a result, we automatically encourage gene pairs with known associations to receive higher Bayesian lead-lag R2 scores, while reducing the scores of gene pairs that are unlikely to be related and allowing new relationships to be discovered. Future extensions of this work could adapt this procedure to groups of m2 genes by defining a similar m-dimensional system of ODEs that share a common regulatory signal. Dimensionality reduction or regularization may be required to fit the resulting linear model in this setting.

In Section 4, we analyzed clusters and networks of genes that were identified by our method as having similar temporal dynamics. In particular, the clusters highlighted the known interplay between immune response and metabolism, and suggested roles for uncharacterized genes displaying remarkably similar temporal patterns to more well-studied ones. We contrasted our results to those obtained by using only the ordinary least-squares version of the lead-lag R2 and demonstrated how the inclusion of prior biological information greatly aids the identification of biologically relevant gene groups.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by National Institutes of Health and National Science Foundation.

Notes

1 Suppose p(t) = 0. Then we have dmA(t)dt=βAκAmA(t), which has the solution mA(t)=βAκA+ceκAt for some cR. As t, we have mA(t)βAκA (a non-zero steady state, if βA0).

2 In practice, Farina et al. (Citation2008) employ piecewise cubic Hermite polynomials to interpolate as well as add additional samples to each time interval. The same authors use the trapezoidal rule for numerical integration. Our software employs default R methods for interpolation and integration, which are natural cubic splines (via the splinefun function) and Gauss-Kronrod quadrature (via the integrate function) respectively.

3 We use the same time points that are used in our analysis of real gene expression data beginning in Section 4.2. These points are 0, 1, 2, 4, 5, 6, 8, 10, 12, 14, 16, 20, 24, 30, 36, 42, 48.

4 As mentioned in Section 2.2, we use the term “false positives” in this context to refer to gene pairs whose lead-lag R2 value is spuriously high but that have neither similar expression patterns nor a biological relationship.

References

  • Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, et al. 2017. Scenic: single-cell regulatory network inference and clustering. Nat Methods. 14(11):1083–1086.
  • Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene ontology: tool for the unification of biology. Nat Genet. 25(1):25–29.
  • Bansal M, Gatta GD, di Bernardo D. 2006. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 22(7):815–822.
  • Bar-Joseph Z, Gitter A, Simon I. 2012. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet. 13(8):552–564.
  • Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Statist Soc. 57(1):289–300.
  • Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, Basu S, Chisholm RL, Dodson RJ, Hartline E, et al. 2021. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 49(D1):D325–D334.
  • Chen T, He HL, Church GM. 1999. Modeling gene expression with differential equations. In: Altman RB, Lauderdale K, Dunker AK, Hunter L, Klein TE, eds. Biocomputing’99. Proceedings of the Pacific Symposium; 1999 January 4–9; Mauna Lani, Hawaii. Singapore: World Scientific, p. 29–40.
  • Clemmons AW, Lindsay SA, Wasserman SA. 2015. An effector peptide family required for Drosophila toll-mediated immunity. PLOS Pathog. 11(4):e1004876.
  • Cui W, George EI. 2008. Empirical Bayes vs. fully Bayes variable selection. J Statist Plan Infer. 138(4):888–900.
  • Cyran SA, Buchsbaum AM, Reddy KL, Lin M-C, Glossop NR, Hardin PE, Young MW, Storti RV, Blau J. 2003. vrille, Pdp1, and dClock form a second feedback loop in the Drosophila circadian clock. Cell. 112(3):329–341.
  • De Jong H. 2002. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 9(1):67–103.
  • D’haeseleer P, Wen X, Fuhrman S, Somogyi R. 1999. Linear modeling of mRNA expression levels during CNS development and injury. In: Altman RB, Lauderdale K, Dunker AK, Hunter L, Klein TE, eds. Biocomputing’99. Proceedings of the Pacific Symposium; 1999 January 4–9; Mauna Lani, Hawaii. Singapore: World Scientific, p. 41–52.
  • DiAngelo JR, Bland ML, Bambina S, Cherry S, Birnbaum MJ. 2009. The immune response attenuates growth and nutrient storage in Drosophila by reducing insulin signaling. Proc Natl Acad Sci USA. 106(49):20853–20858.
  • Efron B. 2004. The estimation of prediction error: covariance penalties and cross-validation. J Am Stat Assoc. 99(467):619–632.
  • Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 95(25):14863–14868.
  • Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, Haw R, Jassal B, Korninger F, May B, et al. 2018. The Reactome pathway knowledgebase. Nucleic Acids Res. 46(D1):D649–D655.
  • Farina L, De Santis A, Morelli G, Ruberti I. 2007. Dynamic measure of gene co-regulation. IET Syst Biol. 1(1):10–17.
  • Farina L, De Santis A, Salvucci S, Morelli G, Ruberti I. 2008. Embedding mRNA stability in correlation analysis of time-series gene expression data. PLOS Comput Biol. 4(8):e1000141.
  • Fourdrinier D, Strawderman WE, Wells MT. 2018. Shrinkage estimation. Switzerland: Springer.
  • Gaudet P, Livstone MS, Lewis SE, Thomas PD. 2011. Phylogenetic-based propagation of functional annotations within the gene ontology consortium. Brief Bioinform. 12(5):449–462.
  • Gelman A, Goodrich B, Gabry J, Vethari A. 2018. R-squared for Bayesian regression models. American Statistician. 73: 307–309.
  • George EI, Foster DP. 2000. Calibration and empirical Bayes variable selection. Biometrika. 87(4):731–747.
  • Giles D, Rayner A. 1979. The mean squared errors of the maximum likelihood and natural-conjugate Bayes regression estimators. J Economet. 11(2–3):319–334.
  • Gobert V, Gottar M, Matskevich AA, Rutschmann S, Royet J, Belvin M, Hoffmann JA, Ferrandon D. 2003. Dual activation of the Drosophila toll pathway by two pattern recognition receptors. Science. 302(5653):2126–2130.
  • Gonzalez EA, Garg A, Tang J, Nazario-Toole AE, Wu LP. 2013. A glutamate-dependent redox system in blood cells is integral for phagocytosis in Drosophila melanogaster. Curr Biol. 23(22):2319–2324.
  • Goto A, Kadowaki T, Kitagawa Y. 2003. Drosophila hemolectin gene is expressed in embryonic and larval hemocytes and its knock down causes bleeding defects. Dev Biol. 264(2):582–591.
  • Hanson MA, Lemaitre B. 2020. New insights on Drosophila antimicrobial peptide function in host defense and beyond. Curr Opin Immunol. 62:22–30.
  • Hill SM, Neve RM, Bayani N, Kuo WL, Ziyad S, Spellman PT, Gray JW, Mukherjee S. 2012. Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology. BMC Bioinf. 13(1):1–15.
  • Kanehisa M, Goto S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1):27–30.
  • Kaneko T, Goldman WE, Mellroth P, Steiner H, Fukase K, Kusumoto S, Harley W, Fox A, Golenbock D, Silverman N. 2004. Monomeric and polymeric gram-negative peptidoglycan but not purified LPS stimulate the Drosophila IMD pathway. Immunity. 20(5):637–649.
  • Karp PD, Billington R, Caspi R, Fulcher CA, Latendresse M, Kothari A, Keseler IM, Krummenacker M, Midford PE, Ong Q, et al. 2019. The BioCyc collection of microbial genomes and metabolic pathways. Brief Bioinform. 20(4):1085–1093.
  • Kass RE, Raftery AE. 1995. Bayes factors. J Am Stat Assoc. 90(430):773–795.
  • Krejčová G, Danielová A, Nedbalová P, Kazek M, Strych L, Chawla G, Tennessen JM, Lieskovská J, Jindra M, Doležal T, et al. 2019. Drosophila macrophages switch to aerobic glycolysis to mount effective antibacterial defense. Elife. 8:e50414.
  • Langfelder P, Horvath S. 2008. Wgcna: an r package for weighted correlation network analysis. BMC Bioinf. 9(1):1–13.
  • Larkin A, Marygold SJ, Antonazzo G, Attrill H, Dos Santos G, Garapati PV, Goodman JL, Gramates LS, Millburn G, Strelets VB, et al. 2021. FlyBase: updates to the Drosophila melanogaster knowledge base. Nucleic Acids Res. 49(D1):D899–D907.
  • Law CW, Chen Y, Shi W, Smyth GK. 2014. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15(2):R29.
  • Li F, Zhang NR. 2010. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J Am Stat Assoc. 105(491):1202–1214.
  • Li Y, Jackson SA. 2015. Gene network reconstruction by integration of prior biological knowledge. G3. 5(6):1075–1079.
  • Liang F, Paulo R, Molina G, Clyde MA, Berger JO. 2008. Mixtures of g priors for Bayesian variable selection. J Am Stat Assoc. 103(481):410–423.
  • Lo K, Raftery AE, Dombek KM, Zhu J, Schadt EE, Bumgarner RE, Yeung KY. 2012. Integrating external biological knowledge in the construction of regulatory networks from time-series expression data. BMC Syst Biol. 6(1):17520509.
  • Nepomuceno JA, Troncoso A, Nepomuceno-Chamorro IA, Aguilar-Ruiz JS. 2015. Integrating biological knowledge based on functional annotations for biclustering of gene expression data. Comput Methods Programs Biomed. 119(3):163–180.
  • Peng B, Zhu D, Ander BP, Zhang X, Xue F, Sharp FR, Yang X. 2013. An integrative framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLOS One. 8(7):e67672.
  • Polynikis A, Hogan S, di Bernardo M. 2009. Comparing different ODE modelling approaches for gene regulatory networks. J Theor Biol. 261(4):511–530.
  • Purvis JE, Lahav G. 2013. Encoding and decoding cellular information through signaling dynamics. Cell. 152(5):945–956.
  • Qian J, Dolled-Filhart M, Lin J, Yu H, Gerstein M. 2001. Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol. 314(5):1053–1066.
  • W. Research. 2019. Mathematica, version 12.0. Champaign, IL: Wolfram Research, Inc.
  • Schlamp F, Delbare SY, Early AM, Wells MT, Basu S, Clark AG. 2021. Dense time-course gene expression profiling of the Drosophila melanogaster innate immune response. BMC Genomics. 22(1):1–22.
  • Stingo FC, Chen YA, Tadesse MG, Vannucci M. 2011. Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 5(3):1978–2002.
  • Sun J, Liu C, Bai X, Li X, Li J, Zhang Z, Zhang Y, Guo J, Li Y. 2017. Drosophila FIT is a protein-specific satiety hormone essential for feeding control. Nat Commun. 8(1):1–13.
  • Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, et al. 2019. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47(D1):D607–D613.
  • Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. 1999. Systematic determination of genetic network architecture. Nat Genet. 22(3):281–285.
  • Valanne S, Salminen TS, Järvelä-Stölting M, Vesala L, Rämet M. 2019. Immune-inducible non-coding RNA molecule lincRNA-IBIN connects immunity and metabolism in Drosophila melanogaster. PLOS Pathog. 15(1):e1007504.
  • Wang X, Wang T, Jiao Y, von Lintig J, Montell C. 2010. Requirement for an enzymatic visual cycle in Drosophila. Curr Biol. 20(2):93–102.
  • Ward JH. Jr. 1963. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 58(301):236–244.
  • Wu L, Qiu X, Xiang Yuan Y, Wu H. 2019. Parameter estimation and variable selection for big systems of linear ordinary differential equations: a matrix-based approach. J Am Stat Assoc. 114(526):657–667.
  • Yosef N, Regev A. 2011. Impulse control: temporal dynamics in gene transcription. Cell. 144(6):886–896.
  • Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. 2010. GoSemSim: an R package for measuring semantic similarity among go terms and gene products. Bioinformatics. 26(7):976–978.
  • Yu G, Wang L-G, Han Y, He Q-Y. 2012. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 16(5):284–287.
  • Zaidman-Rémy A, Poidevin M, Hervé M, Welchman DP, Paredes JC, Fahlander C, Steiner H, Mengin-Lecreulx D, Lemaitre B. 2011. Drosophila immunity: analysis of PGRP-SB1 expression, enzymatic activity and function. PLOS One. 6(2):e17231.
  • Zellner A. 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel P, Zellner A, eds. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. New York, NY: Elsevier. p. 233–243.
  • Zellner A, Siow A. 1980. Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa. 31(1):585–603.
  • Zhang B, Horvath S. 2005. A general framework for weighted gene co-expression network analysis. Statist Appl Genet Mol Biol. 4(1): Article 17.
  • Zhang B, Huang Y, McDermott JE, Posey RH, Xu H, Zhao Z. 2013. Molecular pathway identification using biological network-regularized logistic models. BMC Genomics. 14(Suppl 8):S1.

Appendix A.

Proofs

In this section, we provide proofs of Theorems 3.2 and 3.3.

Proof of Theorem 3.2.

We write δ0(Y) as δ0(g), treating Y as fixed. Expanding the expression for δ0(g), we obtain (A1) δ0(g)=||YXβ̂||2+2gpσ̂21+gnσ̂2=(Y11+gY0g1+gYOLS)T(Y11+gY0g1+gYOLS)+2gpσ̂21+gnσ̂2=||Y||221+gYTY02g1+gYTYOLS+1(1+g)2||Y0||2+2g(1+g)2YOLSTY0+g2(1+g)2||YOLS||2+2gpσ̂21+gnσ̂2.(A1)

Next, observe that YTŶOLS=||YOLS||2, because (A2) ||YOLS||2=(HY)T(HY)=YTHY=YTYOLS.(A2)

Furthermore, observe that YOLSTY0=YTY0, because (A3) YOLSTY0=(HY)TY0=YTHY0=YTY0,(A3) where the last equality follows from the fact that Y0=Xβ0, i.e. Y0 is in the column span of X, which is the space onto which H projects. The identities (A2) and (A3) can now be used to write δ0(g) in (A1) as δ0(g)=||Y||221+gYTY02g1+g||YOLS||2+1(1+g)2||Y0||2+2g(1+g)2YTY0+g2(1+g)2||YOLS||2+2gpσ̂21+gnσ̂2=a2b1+g2gc1+g+d(1+g)2+2gb(1+g)2+g2c(1+g)2+2gpσ̂21+gnσ̂2, where a=||Y||2,b=YTY0,c=||YOLS||2, and d=||Y0||2. Differentiating δ0(g) with respect to g, we obtain dδ0(g)d​g=2b(1+g)22c1+g+2gc(1+g)22d(1+g)3+2b(1+g)24gb(1+g)3+2gc(1+g)22g2c(1+g)3+2pσ̂21+g2gpσ̂2(1+g)2=2pσ̂22c1+g+4b+4gc2gpσ̂2(1+g)22d+4gb+2g2c(1+g)3=2pσ̂22c+2gpσ̂22gc(1+g)2+4b+4gc2gpσ̂2(1+g)22d+4gb+2g2c(1+g)3=4b+2gc+2pσ̂22c(1+g)22d+4gb+2g2c(1+g)3.

Setting this derivative to zero and rearranging yields: 2d+4gb+2g2c(1+g)3=4b+2gc+2pσ̂22c(1+g)22d+4gb+2g2c=(1+g)(4b+2gc+2pσ̂22c)=4b+2gc+2pσ̂22c+4gb+2g2c+2gpσ̂22gc2d=4b+2pσ̂22c+2gpσ̂2g*=c+d2bpσ̂21.

We now substitute the definitions of b, c, and d back into this expression to obtain (A4) g*=||YOLS||2+||Y0||22YTY0pσ̂21.(A4)

The numerator of (A4) can be simplified by observing that ||YOLS||2+||Y0||22YTY0=||YOLS||2+||Y0||22YOLSTY0=(YOLSY0)T(YOLSY0)T=||YOLSY0||2.

Therefore, (A4) becomes (A5) g*=||YOLSY0||2pσ̂21.(A5)

When β0=0, we have Y0=Xβ0=0. In this case, (A5) becomes g*=||YOLS||2pσ̂21.

Finally, the second derivative of δ0(g) evaluated at g=g* in (A5) is equal to d2δ0(g)dg2|g=g*=2p4σ̂8||YOLSY0||6,

which is positive, thus confirming that δ0(g) is indeed minimized at g=g*. This second derivative calculation was verified in Mathematica.

Proof of Theorem 3.3.

We write δ0(Y) as δ0(g,ξ), treating Y as fixed. First, if β0=[ξ,ξ,0,0,0]T, then Xβ0=ξX12, where X12 is the element-wise sum of the first two columns of X. We now proceed similarly to the proof of Theorem 3.2 by expanding δ0(g,ξ): (A6) δ0(g,ξ)=||YXβ̂||2+2gpσ̂21+gnσ̂2=YX(11+gβ0+g1+gβ̂OLS)2+2gpσ̂21+gnσ̂2=Yξ1+gX12g1+gYOLS2+2gpσ̂21+gnσ̂2=(Yξ1+gX12g1+gYOLS)T(Yξ1+gX12g1+gYOLS)+2gpσ̂21+gnσ̂2=||Y||22ξ1+gYTX122g1+gYTYOLS+ξ2(1+g)2||X12||2+2ξg(1+g)2YOLSTX12+g2(1+g)2||ŶOLS||2+2gpσ̂21+gnσ̂2.(A6)

Next, observe that YOLSTX12=YTX12, because (A7) YOLSTX12=(HY)TX12=YTHX12=YTX12,(A7) where the last equality follows from the fact that X12 is in the column span of X, which is the space onto which H projects. The identities (A2) and (A7) can now be used to write δ0(g,ξ) in (A6) as δ0(g,ξ)=||Y||22ξ1+gYTX122g1+g||YOLS||2+ξ2(1+g)2||X12||2+2ξg(1+g)2YTX12+g2(1+g)2||YOLS||2+2gpσ̂21+gnσ̂2=a2ξb1+g2gc1+g+ξ2d(1+g)2+2ξgb(1+g)2+g2c(1+g)2+2gpσ̂21+gnσ̂2, where a=||Y||2,b=YTX12,c=||YOLS||2, and d=||X12||2. Differentiating δ0(g,ξ) with respect to ξ, we obtain δ0(g,ξ)ξ=2b1+g+2ξd(1+g)2+2gb(1+g)2.

Setting this derivative to zero and rearranging yields: 2b1+g=2ξd+2gb(1+g)2b(1+g)=ξd+gbb=ξdξ*=bd.

We now substitute the definitions of b and d back into this expression to obtain (A8) ξ*=YTX12||X12||2.(A8)

Next, we differentiate δ0(g,ξ) with respect to g: δ0(Y)g=2ξb(1+g)22c1+g+2gc(1+g)22ξ2d(1+g)3+2ξb(1+g)24ξgb(1+g)3+2gc(1+g)22g2c(1+g)3+2pσ̂21+g2gpσ̂2(1+g)2=2pσ̂22c1+g+4ξb+4gc2gpσ̂2(1+g)22ξ2d+4ξgb+2g2c(1+g)3=2pσ̂22c+2gpσ̂22gc(1+g)2+4ξb+4gc2gpσ̂2(1+g)22ξ2d+4ξgb+2g2c(1+g)3=4ξb+2gc+2pσ̂22c(1+g)22ξ2d+4ξgb+2g2c(1+g)3.

Setting this derivative to zero and rearranging yields: 2ξ2d+2ξgb+2g2c(1+g)3=4ξb+2gc+2pσ̂22c(1+g)22ξ2d+4ξgb+2g2c=(1+g)(4ξb+2gc+2pσ̂22c)=2ξb+2gc+2pσ̂22c+4ξgb+2g2c+2gpσ̂22gc2ξ2d=4ξb+2pσ̂22c+2gpσ̂2g*=c+ξ2d2ξbpσ̂21.

We now substitute the definitions of b, c, and d back into this expression to obtain g*=||YOLS||2+ξ2||X12||22ξYTX12pσ̂21.

Substituting ξ in (A8) into this expression for g yields (A9) g*=||YOLS||2+(YTX12||X12||2)2||X12||22(YTX12||X12||2)YTX12pσ̂21=||YOLS||2+(YTX12)2||X12||22(YTX12)2||X12||2pσ̂21=||YOLS||2||X12||2(YTX12)2||X12||2pσ̂21.(A9)

Finally, the determinant of the Hessian matrix of δ0(g,ξ), denoted 2δ0, evaluated at g=g* in (A9) and ξ=ξ* in (A8) is (A10) det2δ0|(g*,ξ*)=2δ0g2|(g*,ξ*)2δ0ξ2|(g*,ξ*)(2δ0gξ|(g*,ξ*))2=4||X12||12p6σ̂12((YTX12)2||YOLS||2||X12||2)5.(A10)

We must now verify that det2δ0|(g*,ξ*)>0, i.e. that (g*,ξ*) is indeed an extremum of δ0. First, we observe that (A11) ||YOLS||2||X12||2>(YTX12)2,(A11) by direct application of the Cauchy-Schwarz inequality (recall that YTX12=YOLSTX12 from (A7)). The inequality is strict because of the assumption that the entries of β̂OLS are distinct, which prevents YOLS and X12 from being linearly dependent. Thus, (A11) implies that the denominator in (A10) is strictly negative. The numerator is also strictly negative, because the assumption that gene B is not zero everywhere ensures that ||X12||20 (recall that X is defined in (5), and its first two columns consist of gene B’s expression measurements over time and its time integrals). Therefore, (A10) is strictly positive. To verify that (g*,ξ*) is indeed a minimizer of δ0, we now check that 2δ0ξ2|(g*,ξ*)>0 as well. We have: 2δ0ξ2|(g*,ξ*)=2||X12||6p2σ̂4((YTX12)2||YOLS||2||X12||2)2

which is strictly positive. This second derivative calculation, as well as those in (A10), were verified in Mathematica.

Appendix B.

Bayesian lead-lag R2 algorithm

Following is an algorithm for computing the Bayesian lead-lag R2 for all gene pairs in a time-course gene expression dataset, consisting of N genes measured at n time points t1,,tn.

Algorithm 1:

Bayesian lead-lag R2 calculations for all gene pairs

Note that this algorithm produces an upper-triangular similarity matrix S resulting from regressing gene i on gene j, for all i < j, and storing the resulting Bayesian LLR2(i,j) value. In the empirical analysis presented in this paper, we set Si,j=max{LLR2(i,j),LLR2(j,i)}.

To instead compute the Bayesian LLRother2 from sub-model 1 in (18), it suffices to change the definition of X to [xjs˜j1n], and to define β0 as [1,1,0]T if Wij=1 and [0,0,0]T otherwise. To compute the Bayesian LLRown2 from sub-model 2 in (19), we change X to [s˜it1n], and β0=[0,0,0] regardless of the value of Wij. An optimized version of this algorithm runs in 21.8 min for N = 1735 genes on a 2017 3.1 GHz Intel Core i5 MacBook Pro.

Appendix C.

Additional dataset details

In this section, we provide further details on how our primary time-course gene expression dataset was constructed.

We first reduce our set of 12,657 genes into a set of 951 “differentially-expressed” (DE) genes identified in Schlamp et al. (Citation2021), defined as genes satisfying either of the following criteria: 1) there is at least one time point at which the gene’s expression undergoes a log2-fold change of at least two, or 2) a spline-based method for differential expression analysis returns a significant result. The latter method involves fitting a cubic spline to the temporal expression measurements of a gene under treatment and control settings, and testing whether the difference in the resulting two sets of coefficients is significant. We then add back a set of 784 genes that are not DE by these criteria, but that are “neighbors” of at least one DE gene. We define a neighbor of a DE gene as a non-DE gene that has a STRING score of at least 0.95 with the DE gene. The purpose of adding such neighbors back into the dataset, now consisting of 951+784=1735 genes, is to enable more complete biological pathways to be reconstructed from our cluster analysis.

In Section 3.1, we describe several sources of biological information that can be encoded into a prior adjacency matrix W. We choose to use the STRING database, and we mark two genes as “associated” (i.e. Wij=1) if their STRING score is greater than 0.5. We additionally use replicate information from the time-course dataset to fill in some of the unknown STRING scores. Specifically, if two genes have entries in the STRING database but have an unknown STRING score, we set Wij=1 if the correlation between their replicated temporal expressions is greater than 0.8. We keep Wij=NA for the gene pairs that do not have entries in STRING.

Appendix D.

Additional figures

Figure D1. Sets of genes in the upper-right region of the Bayesian LLR2 scatterplot in , several of which have uncharacterized pairwise associations. Left: AttD is involved in immune response against Gram-negative bacteria; it exhibits similar patterns to CR42868 and CG9616, neither of which have known molecular functions according to FlyBase (Larkin et al. Citation2021). Middle: Spn28Dc is involved in response to stimuli and protein metabolism, and scb is involved in cell death and organ development. These two genes display similar expression patterns that are also nearly identical in shape to those of the less well-understood RNAs CR43364 and CR42715. Right: Genes ACC, Idh, and GstE9 are involved in a variety of metabolic processes, but not all of their pairwise interactions are known in STRING.

Figure D1. Sets of genes in the upper-right region of the Bayesian LLR2 scatterplot in Figure 12, several of which have uncharacterized pairwise associations. Left: AttD is involved in immune response against Gram-negative bacteria; it exhibits similar patterns to CR42868 and CG9616, neither of which have known molecular functions according to FlyBase (Larkin et al. Citation2021). Middle: Spn28Dc is involved in response to stimuli and protein metabolism, and scb is involved in cell death and organ development. These two genes display similar expression patterns that are also nearly identical in shape to those of the less well-understood RNAs CR43364 and CR42715. Right: Genes ACC, Idh, and GstE9 are involved in a variety of metabolic processes, but not all of their pairwise interactions are known in STRING.

Figure D2. Sets of genes appearing in the upper-right region of the Bayesian LLR2 scatterplot in , all of which have known pairwise associations. Left: Genes known to be involved in proteolysis. Middle: Genes that encode ribosomal proteins. Right: Genes known to be involved in RNA binding.

Figure D2. Sets of genes appearing in the upper-right region of the Bayesian LLR2 scatterplot in Figure 12, all of which have known pairwise associations. Left: Genes known to be involved in proteolysis. Middle: Genes that encode ribosomal proteins. Right: Genes known to be involved in RNA binding.

Figure D3. Clusters of genes obtained using the non-Bayesian LLR2 as the similarity metric between genes. Each sub-plot shows the temporal expressions of genes in the corresponding cluster. Clusters were obtained via hierarchical clustering with Ward’s method. The dominant pattern in many clusters is either flat or not immediately discernible, indicating that the non-Bayesian LLR2 is alone insufficient for identifying groups of genes whose temporal patterns are of similar shapes over time.

Figure D3. Clusters of genes obtained using the non-Bayesian LLR2 as the similarity metric between genes. Each sub-plot shows the temporal expressions of genes in the corresponding cluster. Clusters were obtained via hierarchical clustering with Ward’s method. The dominant pattern in many clusters is either flat or not immediately discernible, indicating that the non-Bayesian LLR2 is alone insufficient for identifying groups of genes whose temporal patterns are of similar shapes over time.

Figure D4. Clusters of genes obtained using Pearson correlation as the similarity metric between genes. Each sub-plot shows the temporal expressions of genes in the corresponding cluster. In cluster and network analyses of time-course gene expression data, Pearson correlation is one of the more commonly used metrics of association. Clusters were obtained via hierarchical clustering with Ward’s method. The dominant pattern in many clusters is either flat or not immediately discernible, indicating that correlation is alone insufficient for identifying groups of genes whose temporal patterns are of similar shapes over time.

Figure D4. Clusters of genes obtained using Pearson correlation as the similarity metric between genes. Each sub-plot shows the temporal expressions of genes in the corresponding cluster. In cluster and network analyses of time-course gene expression data, Pearson correlation is one of the more commonly used metrics of association. Clusters were obtained via hierarchical clustering with Ward’s method. The dominant pattern in many clusters is either flat or not immediately discernible, indicating that correlation is alone insufficient for identifying groups of genes whose temporal patterns are of similar shapes over time.

Appendix E.

Additional tables

Table E1. Values of the Bayesian LLR2LLRown2 corresponding to each gene pair in .

Table E2. Values of the non-Bayesian LLR2LLRown2 corresponding to each gene pair in .

Appendix F.

Additional results

We continue our analysis in Section 4.4 of the results of hierarchical clustering with the Bayesian lead-lag R2 (LLR2).

F.1 Analysis of Cluster 2

Cluster 2 contains both up- and down-regulated genes with circadian rhythms, according to GO term enrichment. Several of these genes are displayed in . Among the up-regulated genes are three regulators of the circadian clock (per, vri, Pdp1). A fourth regulator of the circadian clock, Clk, is down-regulated. Pdp1 has been reported to reach its peak expression three to six hours after vri’s peak expression (Cyran et al. Citation2003), a pattern that is visible in this cluster. Cluster 2 further contains genes that are involved in visual perception: two genes encoding rhodopsins (Rh5, Rh6) (Gaudet et al. Citation2011) and Pdh, which encodes a retinal pigment dehydrogenase (Wang et al. Citation2010). Similar to Schlamp et al. (Citation2021), we also found that Smvt and salt, which encode sodium transporters (Gaudet et al. Citation2011), are under circadian control.

Figure F1. Temporal expression patterns of selected genes in cluster 2. Black, red, and orange lines correspond to regulators of the circadian clock (black: per, Clk; orange: vri; red: Pdp1). Pdp1 is known to reach its peak expression after vri does. Pink and green lines correspond to genes that are involved in visual perception (pink: rhodopsins Rh5, Rh6; green: retinal pigment dehydrogenase Pdh). Purple lines correspond to genes that encode sodium transporters (Smvt, salt).

Figure F1. Temporal expression patterns of selected genes in cluster 2. Black, red, and orange lines correspond to regulators of the circadian clock (black: per, Clk; orange: vri; red: Pdp1). Pdp1 is known to reach its peak expression after vri does. Pink and green lines correspond to genes that are involved in visual perception (pink: rhodopsins Rh5, Rh6; green: retinal pigment dehydrogenase Pdh). Purple lines correspond to genes that encode sodium transporters (Smvt, salt).

F.2 Analysis of Cluster 4

Genes in cluster 4, some of which are displayed in , are characterized by a transient decrease in expression during the first 24 h after peptidoglycan injection. This cluster was significantly enriched for “carbohydrate metabolic process” (B-H corrected p-value of 2×1023). A highly-connected gene involved in carbohydrate metabolism is fbp, which encodes the enzyme fructose-1,6-biphosphatase and has a degree of 102 in our reconstructed network shown in .

Figure F2. Temporal expression patterns of selected genes in cluster 4. Light gray lines in the background correspond to genes with which the gene fbp has a Bayesian LLR2>0.9. fbp, shown in red, is known to be involved in carbohydrate metabolism. The three green lines correspond to genes Gale, AGBE, and Gba1b, which also have known roles in carbohydrate metabolism but have an unknown relationship to fbp according to the STRING database. The dark blue line corresponds to the gene fit, whose expression pattern is similar to that of fbp but with much more pronounced down-regulation. fit is known to encode a protein that stimulates insulin signaling, a process that regulates the expression of genes involved in carbohydrate metabolism.

Figure F2. Temporal expression patterns of selected genes in cluster 4. Light gray lines in the background correspond to genes with which the gene fbp has a Bayesian LLR2>0.9. fbp, shown in red, is known to be involved in carbohydrate metabolism. The three green lines correspond to genes Gale, AGBE, and Gba1b, which also have known roles in carbohydrate metabolism but have an unknown relationship to fbp according to the STRING database. The dark blue line corresponds to the gene fit, whose expression pattern is similar to that of fbp but with much more pronounced down-regulation. fit is known to encode a protein that stimulates insulin signaling, a process that regulates the expression of genes involved in carbohydrate metabolism.

All of the genes to which fbp is connected had NA values in our prior adjacency matrix, meaning that their relationship to fbp is unknown according to our chosen sources of information. Some of these genes include Gale, AGBE, and Gba1b, which have known roles in carbohydrate metabolism. Twenty-two other genes connected to fbp in our network are not well-studied. These connections suggest roles for these uncharacterized genes in carbohydrate metabolism or energy homeostasis. Another gene connected to fbp is fit, which has a similar expression profile as fbp but experiences a much stronger and sharper down- and up-regulation. fit is not directly involved in carbohydrate processing but encodes a secreted protein that stimulates insulin signaling, which in turn regulates the expression of genes involved in carbohydrate metabolism, such as fbp (Sun et al. Citation2017). A previous study also showed that an immune response reduces insulin signaling in Drosphila (DiAngelo et al. Citation2009).

Figure F3. Network of genes formed from the gene fbp and its neighbors. Two genes are connected by an edge if their Bayesian LLR2>0.9. Red and blue edges connect genes with known and unknown relationships, respectively. Darkened edges connect genes within cluster 4. Selected genes with uncharacterized relationships to fbp are highlighted: Gale, AGBE, Gba1b.

Figure F3. Network of genes formed from the gene fbp and its neighbors. Two genes are connected by an edge if their Bayesian LLR2>0.9. Red and blue edges connect genes with known and unknown relationships, respectively. Darkened edges connect genes within cluster 4. Selected genes with uncharacterized relationships to fbp are highlighted: Gale, AGBE, Gba1b.

F.3 Analysis of Cluster 6

Cluster 6 is significantly enriched for GO terms related to metabolic processes, particularly the terms “cellular lipid catabolic process” and “carbohydrate metabolic process” (B–H corrected p-values of 210 for both). In addition to these metabolic GO terms, there is significant enrichment of genes involved in “phagocytosis” (B–H corrected p-value of 0.03). During an immune response, phagocytosis is the process by which an immune cell engulfs and digests bacteria and apoptotic cells as a way to fight the infection. In Drosophila, phagocytosis is carried out by specialized hemocytes (Drosophila blood cells).

Among the genes in cluster 6 involved in carbohydrate metabolism are five mannosidases (LManI, LManIII, LManIV, LManV, LManVI) and three maltases (Mal-A2, Mal-A3, Mal-A4). These genes encode enzymes that break down complex sugars into simple sugars like glucose. Six genes in cluster 6 are expressed in hemocytes and are involved in phagocytosis. These include four genes that belong to the Nimrod gene family (NimB4, NimC1, NimC2, eater); Hml, which is involved in the clotting reaction in larvae (Goto et al. Citation2003); and Gs1. Gs1 encodes a glutamine synthetase, an enzyme whose action is not unique to hemocytes, but that has been shown to support hemocyte function (Gonzalez et al. Citation2013).

shows that the genes involved in metabolic processes and in phagocytosis exhibit similar expression patterns, with coordinated up- and down-regulation. This coordinated expression is sensible in the context of known hemocyte biology. After an infection, hemocytes undergo a metabolic switch, whereby their energy production is sustained mostly by aerobic glycolysis rather than oxidative phosphorylation (Krejčová et al. Citation2019). Since aerobic glycolysis is dependent on glucose, the simultaneous up-regulation of glucose-producing enzymes and genes needed for phagocytosis is aligned with our expectations. It is also worth noting that the fold changes of genes involved in phagocytosis are generally small, e.g. less than two-fold up-regulation, which is often used as a minimal cutoff in RNA-seq analyses. However, the coordinated expression changes detected by the Bayesian LLR2 suggest that these are biologically relevant patterns.

Figure F4. Temporal expression patterns of selected genes in cluster 6 during the first 16 h after peptidoglycan injection. The five red lines and three yellow lines correspond to genes involved in carbohydrate metabolism: respectively, mannosidases (LMan1, LManIII, LManIV, LManV, LManVI) and maltases (Mal-A2, Mal-A3, Mal-A4). The remaining lines correspond to genes that are expressed in hemocytes and are involved in phagocytosis: in blue are genes belonging to the Nimrod family (NimB4, NimC1, NimC3, eater), in purple is Hml, and in green is Gs1.

Figure F4. Temporal expression patterns of selected genes in cluster 6 during the first 16 h after peptidoglycan injection. The five red lines and three yellow lines correspond to genes involved in carbohydrate metabolism: respectively, mannosidases (LMan1, LManIII, LManIV, LManV, LManVI) and maltases (Mal-A2, Mal-A3, Mal-A4). The remaining lines correspond to genes that are expressed in hemocytes and are involved in phagocytosis: in blue are genes belonging to the Nimrod family (NimB4, NimC1, NimC3, eater), in purple is Hml, and in green is Gs1.

Appendix G.

Comparison to WGCNA

We now compare our methodology to the well-known weighted correlation network analysis (WGCNA) (Langfelder and Horvath Citation2008), which is often used to identify co-expression networks from gene expression data. To construct such a network, WGCNA uses pairwise correlations between genes to define a weighted adjacency matrix A as Aij=|Corr(xi,xj)|β, where xiRn is the expression profile of the ith gene and β1 is a soft thresholding power; this β can be chosen according to a scale-free topology criterion proposed by Zhang and Horvath (Citation2005). Modules (clusters of densely connected genes) are then identified by a number of possible methods, the default of which is hierarchical clustering applied to A. The biological significance of each module can be assessed by functional enrichment analysis.

We apply WGCNA to our Drosophila melanogaster gene expression dataset, choosing a soft thresholding parameter β = 9. This results in 20 clusters ranging from 22 to 333 genes, shown in .

Figure G1. Modules (clusters) of genes identified with the WGCNA method. The dominant pattern in many clusters is either flat or not immediately discernible, as is the case when either the non-Bayesian LLR2 or Pearson correlation is used as a similarity metric between genes. Note that WGCNA is not explicitly designed for temporal gene expression data.

Figure G1. Modules (clusters) of genes identified with the WGCNA method. The dominant pattern in many clusters is either flat or not immediately discernible, as is the case when either the non-Bayesian LLR2 or Pearson correlation is used as a similarity metric between genes. Note that WGCNA is not explicitly designed for temporal gene expression data.

shows that many of the WGCNA modules do not have an easily identifiable pattern of gene expression. Several modules consist predominantly of genes with mostly flat trajectories, while shows that the Bayesian LLR2 largely grouped such genes into at most two clusters (clusters 1 and 5). The genes that the Bayesian LLR2 originally placed into the visually distinctive clusters 2 (circadian rhythm genes), 10 (Imd- and Toll-regulated immune response genes) and 12 (ribosomal proteins) also appear to have been split across multiple WGCNA modules. Ultimately, while WGCNA is a helpful resource for co-expression analysis, it does not account for temporal dependence in gene expression measurements. The Bayesian LLR2 is more suitable for such data because it is derived from a temporal model of gene expression and leverages prior information about pairwise relationships.