Full article: An Empirical Bayes Approach to Estimating Dynamic Models of Co-Regulated Gene Expression

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

Time-course gene expression datasets provide insight into the dynamics of complex biological processes, such as immune response and organ development. It is of interest to identify genes with similar temporal expression patterns because such genes are often biologically related. However, this task is challenging due to the high dimensionality of these datasets and the nonlinearity of gene expression time dynamics. We propose an empirical Bayes approach to estimating ordinary differential equation (ODE) models of gene expression, from which we derive a similarity metric between genes called the Bayesian lead-lag R² (LLR²). Importantly, the calculation of the LLR² leverages biological databases that document known interactions amongst genes; this information is automatically used to define informative prior distributions on the ODE model’s parameters. As a result, the LLR² is a biologically-informed metric that can be used to identify clusters or networks of functionally-related genes with co-moving or time-delayed expression patterns. We then derive data-driven shrinkage parameters from Stein’s unbiased risk estimate that optimally balance the ODE model’s fit to both data and external biological information. Using real gene expression data, we demonstrate that our methodology allows us to recover interpretable gene clusters and sparse networks. These results reveal new insights about the dynamics of biological systems.

Keywords:

1. Introduction

Time-course gene expression datasets are an essential resource for querying the dynamics of complex biological processes, such as immune response, disease progression, and organ development (Bar-Joseph et al. Citation2012; Purvis and Lahav Citation2013; Yosef and Regev Citation2011). Such datasets, now abundantly available through techniques such as whole-genome RNA sequencing, consist of gene expression measurements for thousands of an organism’s genes at a few (typically 5-20) time points. Experimental evidence has revealed that groups of genes exhibiting similar temporal expression patterns are often biologically associated (Eisen et al. Citation1998). For instance, such genes may be co-regulated by the same transcription factors (Tavazoie et al. Citation1999): proteins that directly control gene expression, which ultimately contributes to changes in cellular function. Identifying clusters or networks of genes with related temporal dynamics, which is our objective in this study, can therefore uncover the regulators of dynamic biological processes. Doing so can also help generate testable hypotheses about the roles of orphan genes that exhibit similar expression patterns to ones that are better understood.

The complex, nonlinear time dynamics of gene expression pose a significant challenge for clustering and network analysis in genomics. Groups of interacting genes may be expressed with time lags or inverted patterns (Qian et al. Citation2001) due to delayed activation of underlying transcription factors, making it difficult to measure the “similarity” in two expression profiles. Ordinary differential equations (ODEs) or discrete-time difference equations have been successfully used to model the nonlinear expressions of a small number of genes (Chen et al. Citation1999; Bansal et al. Citation2006; D’haeseleer et al. Citation1999; De Jong Citation2002; Polynikis et al. Citation2009). It is possible to derive similarity metrics for the time dynamics of two genes from such ODEs, thus enabling putative identification of co-regulated genes and the reconstruction of regulatory networks (Farina et al. Citation2007, Citation2008; Wu et al. Citation2019). In particular, the approach proposed in Farina et al. (Citation2007) allows explicit modeling of lead-lag as well as contemporaneous associations between gene expression trajectories. We hence use it as the basis of the similarity calculations in our proposed clustering framework.

The high dimensionality (number of genes) and small sample sizes (number of time points) of time-course gene expression datasets pose another obstacle to identifying genes with similar expression dynamics. Due to the size of these datasets, the number of gene pairs receiving high similarity scores by any method can be overwhelmingly large. High similarity scores are typically validated for biological relevance using annotations provided by extensive public and commercial curated databases that assign genes to functional groups. For instance, Gene Ontology (GO) annotations are keywords that describe a gene’s molecular function, role in a biological process, or cellular localization (Ashburner et al. Citation2000). Other curated databases include KEGG (Kanehisa and Goto Citation2000), Reactome (Fabregat et al. Citation2018), BioCyc (Karp et al. Citation2019), and STRING (Szklarczyk et al. Citation2019). To ease the burden of manually validating a potentially vast number of gene-gene associations, we propose a Bayesian clustering technique that uses annotations as prior information to automatically validate these associations. Incorporating such information into a clustering method can encourage gene pairs with known biological associations to receive higher similarity scores, while filtering away those known to be unrelated. This also allows for knowledge gleaned from gene expression time series data to be contrasted with other knowledge bases; for instance, two genes with highly similar temporal expression patterns may not have been considered associated in previous cross-sectional (single time point) studies on which annotations are based, or vice versa.

There exist in the literature a few approaches to integrating biological knowledge with statistical measures of genetic association. One line of research considers Bayesian methods that use external data sources to determine prior distributions over genes or proteins that influence a biological response (Hill et al. Citation2012; Li and Zhang Citation2010; Lo et al. Citation2012; Peng et al. Citation2013; Stingo et al. Citation2011). Other studies develop biologically-informed regularization terms in graph-regularized methods for reconstructing gene networks (Li and Jackson Citation2015; Zhang et al. Citation2013). In another work, Nepomuceno et al. (Citation2015) propose an algorithm for biclustering gene expression data using gene ontology annotations. However, less attention has been given to using both data and prior biological knowledge to identify and model dominant patterns in the complex temporal dynamics of gene expression, e.g. with ODEs.

Our technical contribution in this work is a Bayesian method for constructing biologically-meaningful clusters and networks of genes from time-course expression data, using a new similarity measure between two genes called the Bayesian lead-lag R² (LLR²). The Bayesian LLR² is derived from ODE models of temporal gene expression, and is based on associations in both the time-course data and prior biological annotations. The balance between data and prior information is controlled by data-driven hyperparameters, making our approach an empirical Bayes method. As indicated by the name, the Bayesian LLR² is based on the familiar R² statistic (the coefficient of determination) and is simple and fast to compute for all $(\begin{matrix} N \\ 2 \end{matrix})$ gene pairs, where N is the number of genes under study. Importantly, external biological information regularizes the set of significant gene-gene associations found within a time-course dataset. In , for instance, we present a network of 1735 genes in Drosophila melanogaster (fruit fly) constructed both without external information, using an ordinary least-squares version of the LLR² proposed by Farina et al. (Citation2008), and with external information, using our proposed Bayesian LLR²; the latter is a noticeably sparsified revision of the former, and retains only edges connecting genes with either known or highly plausible biological relationships.

Figure 1. Networks of 1735 genes profiled in a time-course gene expression dataset collected by Schlamp et al. (Citation2021). Vertices represent genes and edges connect two genes if their lead-lag R² exceeds 0.9. Left: lead-lag R² is computed using ordinary least squares regression according to Farina et al. (Citation2008), without any external biological information. All 1735 genes form a single connected component (599,896 edges). Right: lead-lag R² is computed using our proposed Bayesian approach, which leverages external sources of biological information about gene-gene relationships. Red edges (11,380 edges) connect genes known to be associated. Blue edges (2830 edges) connect genes whose relationship is unknown but is supported by the data.

The remainder of this paper is organized as follows. Section 2 describes the ODE model of temporal gene expression that we adopt. Section 3 details our empirical Bayes method for fitting the ODE model and obtaining our proposed LLR² metric. Section 4 demonstrates the application of our method to real gene expression data collected by Schlamp et al. (Citation2021). We recover a tradeoff between immune response and metabolism that has been observed in several studies and present examples of biologically-meaningful gene clusters identified with the Bayesian LLR². We also discuss the method’s potential to conjecture new data-driven hypotheses about gene-gene interactions. Sections 4.3 and Appendices D and E compare the Bayesian LLR² to its non-Bayesian counterpart as well as the commonly used Pearson correlation between genes. Proofs and additional examples are provided in the Appendix.

2. Dynamic Models of Gene Expression

2.1. Model Derivation

We consider an ODE model of gene expression proposed by Farina et al. (Citation2007). Let $m_{A} (t)$ denote the expression of some gene A at time t, measured for instance as the $\log_{2}$ -fold change in mRNA levels relative to time 0. The model assumes the rate of change in gene A’s expression is given by some regulatory signal p(t): $\frac{d m_{A} (t)}{d t} = p (t) - κ_{A} m_{A} (t) + η_{A},$ where κ_A denotes the mRNA decay rate for gene A, and η_A denotes natural and experimental noise. This model can be naturally extended to consider two genes A and B that might be associated with one another, i.e. that are governed by the same underlying p(t), yielding a pair of coupled differential equations: (1) $\frac{d m_{A} (t)}{d t} = α_{A} p (t) + β_{A} - κ_{A} m_{A} (t) + η_{A},$ (1) (2) $\frac{d m_{B} (t)}{d t} = α_{B} p (t) + β_{B} - κ_{B} m_{B} (t) + η_{B} .$ (2)

The common signal p(t) accounts for the effect of one or more transcription factors that potentially regulate both genes A and B. The coefficients α_A and α_B measure the strength of p(t) in the expression patterns of genes A and B, respectively. β_A and β_B are affine coefficients that allow $m_{A} (t)$ and $m_{B} (t)$ to reach a non-zero steady state when p(t) = 0, i.e. in the absence of a regulatory signal.Footnote¹

We now obtain a model of gene A’s expression in terms of gene B’s expression by first rearranging (2) to isolate p(t): (3) $p (t) = \frac{1}{α_{B}} (\frac{d m_{B} (t)}{d t} - β_{B} + κ_{B} m_{B} (t) - η_{B}) .$ (3)

Substituting (3) into (1) yields $\frac{d m_{A} (t)}{d t} = \frac{α_{A}}{α_{B}} \frac{d m_{B} (t)}{d t} + \frac{α_{A} κ_{B}}{α_{B}} m_{B} (t) - κ_{A} m_{A} (t) + β_{A} - \frac{α_{A} β_{B}}{α_{B}} + η_{A} - \frac{α_{A} η_{B}}{α_{B}} .$

Integrating from 0 to t, we obtain: (4) $\begin{matrix} m_{A} (t) = \frac{α_{A}}{α_{B}} m_{B} (t) + \frac{α_{A} κ_{B}}{α_{B}} \int_{0}^{t} m_{B} (s) d s - κ_{A} \int_{0}^{t} m_{A} (s) d s + (β_{A} - \frac{α_{A} β_{B}}{α_{B}}) t \\ + \int_{0}^{t} (η_{A} - \frac{α_{A} η_{B}}{α_{B}}) d s + c \\ = c_{1} m_{B} (t) + c_{2} \int_{0}^{t} m_{B} (s) d s + c_{3} \int_{0}^{t} m_{A} (s) d s + c_{4} t + c_{5} + ε, \end{matrix}$ (4) where $c_{1} = \frac{α_{A}}{α_{B}},$ $c_{2} = \frac{α_{A} κ_{B}}{α_{B}},$ $c_{3} = - κ_{A},$ $c_{4} = β_{A} - \frac{α_{A} β_{B}}{α_{B}}, c_{5} = c,$ and $ε = \int_{0}^{t} (η_{A} - \frac{α_{A} η_{B}}{α_{B}}) ds .$

Observe that (4) is linear in the parameters c_k. Thus, given measurements ${m_{A} (t_{1}), \dots, m_{A} (t_{n})}$ and ${m_{B} (t_{1}), \dots, m_{B} (t_{n})}$ of the expression levels of genes A and B at time points $t_{1}, \dots, t_{n},$ we can express (4) as the linear model $Y = X β + ε,$ where (5) $Y = [\begin{matrix} m_{A} (t_{1}) \\ m_{A} (t_{2}) \\ \dots \\ m_{A} (t_{n}) \end{matrix}], X = [\begin{matrix} m_{B} (t_{1}) & \int_{0}^{t_{1}} m_{B} (s) d s & \int_{0}^{t_{1}} m_{A} (s) d s & t_{1} & 1 \\ m_{B} (t_{2}) & \int_{0}^{t_{2}} m_{B} (s) d s & \int_{0}^{t_{2}} m_{A} (s) d s & t_{2} & 1 \\ \dots & \dots & \dots & \dots & \dots \\ m_{B} (t_{n}) & \int_{0}^{t_{n}} m_{B} (s) d s & \int_{0}^{t_{n}} m_{A} (s) d s & t_{n} & 1 \end{matrix}], β = [\begin{matrix} c_{1} \\ c_{2} \\ c_{3} \\ c_{4} \\ c_{5} \end{matrix}],$ (5) and $ε \sim N (0, σ^{2} I_{n}),$ where $I_{n}$ denotes the n × n identity matrix; it is natural to assume that η_A and η_B are centered Gaussian processes with independent increments, in which case the integral defining ε above is normally distributed. Although only samples from the functions $m_{A} (t)$ and $m_{B} (t)$ are given, we can estimate the integral entries of the second and third columns of X by numerically integrating spline or polynomial interpolants fitted to these samples.Footnote²

In fitting the model (4) to the expression data of genes A and B, we obtain the ordinary least-squares (OLS) estimator $\hat{β} = {(X^{T} X)}^{- 1} X^{T} Y .$ The amount of variation in gene A’s expression that is captured by the estimated linear model is expressed as the familiar R² value: $R^{2} = | | X \hat{β} - \bar{Y} 1_{n} | |^{2} / | | Y - \bar{Y} 1_{n} | |^{2},$ where $\bar{Y} = \frac{1}{n} Y^{T} 1_{n}$ and $1_{n}$ denotes the $n \times 1$ vector of ones. Adopting the terminology in Farina et al. (Citation2008), we refer to this R² as the lead-lag R² between genes A and B. A high lead-lag R² may indicate that the two genes are co-regulated in time by common transcription factors, or are at least associated with one another in some way. The term “lead-lag” comes from the lead-lag compensator in control theory. In this context, a “lead-lag relationship” between genes refers to the presence of a common regulatory signal (input) that, in conjunction with the process of mRNA decay, modulates the expression of genes with the same biological function (output).

2.2. Motivating the Bayesian Lead-Lag R²

Our primary contribution in this work is a biologically-informed method for clustering genes based on their temporal dynamics. Clustering involves measuring the similarity between two objects, which can also be thought of as defining an edge between two nodes in an undirected network. Our similarity measure is a Bayesian version of the lead-lag R² that uses both temporal expression data for genes A and B as well as a prior indication of whether they are biologically associated.

We can motivate the Bayesian lead-lag R² via , which shows examples of false positive gene pairs: genes that have a spuriously high lead-lag R², but do not have similar expression patterns nor a biological relationship. The data comes from an experiment on fruit flies by Schlamp et al. (Citation2021) that aimed to profile the dynamics of genes involved in or affected by immune response immediately following an infection. More details on this dataset can be found in Section 4.2.

Figure 2. Plots of the temporal expression profiles of three gene pairs for which the lead-lag R² is spuriously high. Spline interpolants were fit through the observed points, which are indicated by solid dots. Functional annotations for these genes in Flybase (Larkin et al. Citation2021) do not suggest a clear link within each pair. By contrast, the lower Bayesian lead-lag R² values more accurately reflect the degree of these associations.

Figure 2. Plots of the temporal expression profiles of three gene pairs for which the lead-lag R2 is spuriously high. Spline interpolants were fit through the observed points, which are indicated by solid dots. Functional annotations for these genes in Flybase (Larkin et al. Citation2021) do not suggest a clear link within each pair. By contrast, the lower Bayesian lead-lag R2 values more accurately reflect the degree of these associations.

Spuriously high lead-lag R² values are likely to arise in large datasets. For example, if gene A’s expression levels increase or decrease monotonically with time, the response vector Y in (5) will be highly correlated with the time integrals and time points in the third and fourth columns of X. The lead-lag R² between genes A and B will be large, but not because the genes are associated either in time or biologically.

In contrast to gene pairs of the kind shown in , we can consider , which displays genes from two well-studied functional groups, known as pathways: circadian rhythms and immune response. Within each group, we expect to see high pairwise lead-lag R² values (true positives).

Figure 3. Left: The uncharacterized gene CG33511 exhibits similar time dynamics to known circadian rhythm genes with 24-hour cyclic temporal expressions. Right: Uncharacterized genes CG43920 and CG45045 exhibit similar temporal expression patterns to known immune response genes, which are up-regulated in response to infection.

Incorporating pathway membership or protein-protein interaction networks into the lead-lag R² enables us to encourage genes in the same pathways to receive higher pairwise similarity scores, thus separating true positives from false positives. Importantly, we can also suggest possible pathways for previously uncharacterized genes. In our method, external biological information is used to determine the locations of normal prior distributions on the parameters $c_{1}, \dots, c_{5}$ in the model (4). Upon obtaining the posterior estimates of these parameters, we recompute the lead-lag R² to obtain the Bayesian lead-lag R² (LLR²). In doing so, we observe a desirable shrinkage effect in the distribution of the Bayesian lead-lag R² values that pares down the number of significant associations. We next detail the hierarchical model and its hyperparameters in Section 3.

3. Empirical Bayes Methodology

In this section, we propose our empirical Bayes approach to deriving biologically-informed similarity metrics between genes for clustering and network analysis. The components of our method are: (1) encoding external biological information into a prior adjacency matrix, (2) defining a normal-inverse gamma prior, specifically Zellner’s g-prior, on the parameters of the ODE-based model of gene expression (4), (3) optimally selecting the hyperparameter g in Zellner’s g-prior, and (4) calculating the Bayesian lead-lag R². Note that parts 2–4 lead to the computation of the Bayesian lead-lag R² for a single gene pair; a summary of the full algorithm for all pairs is provided in Appendix B.

3.1. Part 1: Leveraging Biological Priors

There exist numerous databases that extensively document known or predicted interactions between genes as well as their functional roles. For instance, Gene Ontology (GO) terms are keywords that describe a gene’s molecular function, role in a biological process (e.g. “lipid metabolism”), or cellular localization (e.g. “nucleus”) (Ashburner et al. Citation2000). Semantic similarity methods, such as the R package GOSemSim (Yu et al. Citation2010), have been developed to determine how related genes are based on their associated GO terms. Other curated databases that similarly assign genes to pathways include KEGG (Kanehisa and Goto Citation2000), Reactome (Fabregat et al. Citation2018), and BioCyc (Karp et al. Citation2019). The STRING database (Szklarczyk et al. Citation2019) aggregates multiple sources of information to generate a more holistic measurement of the association between two genes. For each pair of genes in an organism, STRING provides a score between 0 and 1 indicating how likely the two genes’ encoded proteins are to interact physically based on experimental evidence, belong to the same pathways according to the aforementioned databases, be conserved across species, or be mentioned in the same studies.

Regardless of which sources of external biological information one employs, the first step of our method is to encode this information into a matrix W of size N × N, where N is the total number of genes under study. The entries of W are: $W_{i j} = {\begin{matrix} 1 & if prior evidence suggests that the i^{th} and j^{th} genes are associated \\ NA & if the relationship between the i^{th} and j^{th} genes is unknown \\ 0 & if the i^{th} and j^{th} genes are unlikely to be associated . \end{matrix}$ Intuitively, W can be viewed as an adjacency matrix for a network that reflects the known relationships amongst the genes. Our proposed method uses this network as an informed basis for the network constructed from the time-course gene expression data itself. Importantly, the data can indicate what the “NA” entries of W ought to be, as well as confirm (or refute) the “0” or “1” entries.

In the event that W consists largely of “NA” entries, one can also use time-course gene expression data from previous studies to fill in some of them. Such datasets often consist of multiple replicates of the expression measurements, possibly gathered under the same experimental conditions to account for sampling variation, or different conditions to assess the effect of a treatment.

3.2. Part 2: The Normal-Inverse Gamma Model and Zellner’s g-Prior

Recall from Section 2.1 that given measurements ${m_{A} (t_{1}), \dots, m_{A} (t_{n})}$ and ${m_{B} (t_{1}), \dots, m_{B} (t_{n})}$ of the expressions of two genes A and B at times $t_{1}, \dots, t_{n},$ the model we aim to fit is $m_{A} (t) = c_{1} m_{B} (t) + c_{2} \int_{0}^{t} m_{B} (s) d s + c_{3} \int_{0}^{t} m_{A} (s) d s + c_{4} t + c_{5},$ where $c_{1}, \dots, c_{5}$ are unknown parameters, and the integrals $\int_{0}^{t} m_{A} (s) ds$ and $\int_{0}^{t} m_{B} (s) ds$ are estimated by numerically integrating spline interpolants of the given data. As seen in (5), this model can be represented in matrix form as $Y = X β + ε,$ where $ε \sim N (0, σ^{2} I_{n}) .$

Since c₁ and c₂ link the expressions of genes A and B, we intuitively ought to place priors of non-zero mean on these two parameters if W indicates that the genes are associated. To do this, we adopt the normal-inverse gamma model for $β$ and $σ^{2},$ which is used frequently in Bayesian regression and allows for flexible modeling of the prior mean and covariance matrix of $β .$ The normal-inverse gamma model specifies $β | σ^{2} \sim N (β_{0}, σ^{2} V_{0})$ and $σ^{2} \sim Γ^{- 1} (a, b)$ where $β_{0} \in R^{p \times 1}, V_{0} \in R^{p \times p}$ is a positive semidefinite matrix, and $a, b > 0 .$ It is then said that $(β, σ^{2})$ jointly follow a normal-inverse gamma distribution with parameters $(β_{0}, V_{0}, a, b) .$ This is a conjugate prior, so the posterior distribution of $(β, σ^{2})$ is also normal-inverse gamma with parameters $(β_{*}, V_{*}, a_{*}, b_{*})$ defined as (6) $β_{*} = {(V_{0}^{- 1} + X^{T} X)}^{- 1} (V_{0}^{- 1} β_{0} + X^{T} Y),$ (6) (7) $V_{*} = {(V_{0}^{- 1} + X^{T} X)}^{- 1},$ (7) (8) $a_{*} = a + \frac{n}{2},$ (8) (9) $b_{*} = b + \frac{1}{2} (β_{0}^{T} V_{0}^{- 1} β_{0} + Y^{T} Y - β_{*}^{T} V_{*}^{- 1} β_{*}) .$ (9) That is, the conditional posterior of $β$ given $σ^{2}$ and the posterior of $σ^{2}$ are $β | σ^{2}, Y \sim N (β_{*}, σ^{2} V_{*})$ and $σ^{2} | Y \sim Γ^{- 1} (a_{*}, b_{*}) .$ The posterior mean $β_{*}$ can be used as the estimated regression parameters. This estimator has the desirable properties of consistency and asymptotic efficiency in large samples and admissibility in finite samples (Giles and Rayner Citation1979).

The hyperparameters $β_{0}$ and $V_{0}$ are of particular interest as they allow us to incorporate biological information into our model. In defining $β_{0},$ recall that we wish to place priors with non-zero mean on the parameters c₁ and c₂ when external sources suggest that genes A and B are co-regulated or at least associated. We noted in Section 2.1 that c₁ represents the ratio $α_{A} / α_{B},$ where α_A and α_B denote the strength of a common regulatory signal in the first-order dynamics of the two genes. If the genes are associated, it is reasonable to believe a priori that $α_{A} = α_{B},$ implying $c_{1} = 1 .$ Then we could also say a priori that $c_{2} = 1,$ since c₂ represents a parameter that is proportional to c₁. When prior information about genes A and B is lacking or suggests that they are unrelated, a prior mean of zero for c₁ and c₂ is appropriate. Supposing genes A and B are the $i^{th}$ and $j^{th}$ genes in the dataset, we can thus set the prior mean $β_{0}$ as follows: $β_{0} = {\begin{matrix} {[1, 1, 0, 0, 0]}^{T} & if W_{i j} = 1 \\ {[0, 0, 0, 0, 0]}^{T} & if W_{i j} = 0 or NA . \end{matrix}$

As for the prior covariance matrix $σ^{2} V_{0}$ of $β,$ we first note that for the linear model $Y = X β + ε$ where $ε$ has the covariance matrix $σ^{2} I_{n},$ the least-squares estimator ${\hat{β}}_{OLS} = {(X^{T} X)}^{- 1} X^{T} Y$ has the covariance matrix $σ^{2} {(X^{T} X)}^{- 1} .$ A popular choice for $V_{0}$ is therefore $g {(X^{T} X)}^{- 1},$ where g > 0. This choice of $V_{0}$ yields a particularly tractable special case of the normal-inverse gamma model known as Zellner’s g-prior (Zellner Citation1986). Substituting this choice of $V_{0}$ into the posterior mean $β_{*}$ in (6) and covariance matrix $V_{*}$ in (7), we obtain (10) $β_{*} = \frac{1}{1 + g} β_{0} + \frac{g}{1 + g} {\hat{β}}_{OLS},$ (10) (11) $V_{*} = \frac{g}{1 + g} {(X^{T} X)}^{- 1} .$ (11) (10) Reveals that under Zellner’s g-prior, the posterior mean $β_{*}$ is a convex combination of the prior mean $β_{0}$ and the least-squares estimator ${\hat{β}}_{OLS} .$ The parameter g balances the weights placed on external information encoded by $β_{0}$ and on the data used to compute ${\hat{β}}_{OLS},$ so the selection of g is an important component of our modeling strategy. We next describe our data-driven approach to choosing it.

3.3. Part 3: Optimal Selection of g in Zellner’s g-Prior

Several methods for choosing g in Zellner’s g-prior have been proposed previously. For instance, George and Foster (Citation2000) discuss an empirical Bayes method in which one selects the value of g that maximizes the marginal likelihood of Y. Liang et al. (Citation2008) provide a closed form expression for this maximizing value of g that is nearly identical to the F-statistic for testing the hypothesis that $β = 0 .$ They show that this maximum marginal likelihood approach has the desirable property that the Bayes factor for comparing the full model to the null (intercept-only) model diverges as the R² approaches 1. For our application, one concern is that the F-statistic defining this particular estimate of g is likely to be overwhelmingly large for many gene pairs. If g is large, then $β_{*}$ will be very close to ${\hat{β}}_{OLS}$ according to (10). As a result, any biological evidence of association captured by $β_{0}$ will have a negligible impact on the model.

A fully Bayesian approach to selecting g involves placing a prior distribution on g that is then integrated out when defining the prior on $β .$ This method is motivated by Zellner and Siow (Citation1980), who propose placing a Cauchy prior on $β$ to derive a closed-form posterior odds ratio for hypothesis testing purposes. These priors can be represented as a mixture of g-priors with an inverse gamma prior on g, although a closed-form posterior estimate of g is unavailable. Cui and George (Citation2008) and Liang et al. (Citation2008) alternatively consider a class of priors of the form $π (g) = \frac{a - 2}{2} {(1 + g)}^{a / 2}$ for a > 2, known as the hyper-g priors. Under these priors, the posterior mean of g can be expressed in closed form in terms of the Gaussian hypergeometric function.

Because our application involves fitting regression models for potentially thousands of gene pairs, the computational cost of fully Bayesian methods for selecting g requires us to consider alternative approaches. One idea is to select the value of g that minimizes the sum of squared residuals $| | Y - \overset{⁁}{Y} | |^{2},$ where $\overset{⁁}{Y} = X β_{*}$ is the vector of fitted values: (12) $\overset{⁁}{Y} = X β_{*} = X (\frac{1}{1 + g} β_{0} + \frac{g}{1 + g} {\hat{β}}_{OLS}) = \frac{1}{1 + g} Y_{0} + \frac{g}{1 + g} {\overset{⁁}{Y}}_{OLS},$ (12) where $Y_{0} = X β_{0}$ and ${\overset{⁁}{Y}}_{OLS} = X {\hat{β}}_{OLS} .$ However, we found that there are no analytical solutions to $g_{*} = {argmin}_{g > 0} | | Y - \overset{⁁}{Y} | |^{2} = 0 .$ Instead, we can minimize Stein’s unbiased risk estimate (SURE), which is an unbiased estimate of $| | \overset{⁁}{Y} - X β | |^{2} .$ Below is a rephrased version of Theorem 8.7 in Fourdrinier et al. (Citation2018), which defines SURE for the general problem of estimating $E (Y)$ using a linear estimator of the form $\overset{⁁}{Y} = a + SY .$ This theorem statement differs from its original version in that we have rewritten the divergence of $X \hat{β}$ with respect to Y using the generalized degrees of freedom developed by Efron (Citation2004).

Theorem 3.1

(SURE for linear models). Let $Y \sim N (X β, σ^{2} I_{n}),$ where the dimensions of X are n × p, and let $\hat{β} = \hat{β} (Y)$ be a weakly differentiable function of the least squares estimator ${\hat{β}}_{OLS}$ such that $\overset{⁁}{Y} = X \hat{β}$ can be written in the form $\overset{⁁}{Y} = a + SY$ for some vector a and matrix S. Let ${\hat{σ}}^{2} = | | Y - X {\hat{β}}_{OLS} | |^{2} / (n - p) .$ Then, (13) $δ_{0} (Y) = | | Y - X \hat{β} | |^{2} + (2 Tr (S) - n) {\hat{σ}}^{2}$ (13) is an unbiased estimator of $| | \overset{⁁}{Y} - X β | |^{2} .$

From (12), observe that we can write $\overset{⁁}{Y}$ as $\frac{1}{1 + g} Y_{0} + \frac{g}{1 + g} H Y,$ where $H = X {(X^{T} X)}^{- 1} X^{T} .$ Therefore the matrix S in Theorem 3.1 is $g H / (1 + g),$ whose trace is $g p / (1 + g) .$ SURE in (13) then becomes (14) $δ_{0} (Y) = | | Y - X β_{*} | |^{2} + (\frac{2 g p}{1 + g} - n) {\hat{σ}}^{2},$ (14) where we have also substituted the posterior mean $β_{*}$ in (10) for $\hat{β} .$

We next present the value of g that minimizes SURE.

Theorem 3.2

(SURE minimization with respect to g). The value of g that minimizes SURE in (14) is $g_{*} = \frac{| | {\overset{⁁}{Y}}_{OLS} - Y_{0} | |^{2}}{p {\hat{σ}}^{2}} - 1.$

The proof of Theorem 3.2 is provided in Appendix A.

It is quite possible that $p {\hat{σ}}^{2}$ is small, resulting in a large value of $g_{*}$ (i.e. $g_{*} ≫ 1$ ) in Theorem 3.2. In this case, $β_{*}$ in (10) will be largely influenced by the data via ${\hat{β}}_{OLS},$ rather than by prior information via $β_{0} .$ This is desirable when the relationship between the two genes is unknown (i.e. $W_{i j} = NA$ ), but not if the relationship is known to be unlikely (i.e. $W_{i j} = 0$ ). In the latter case, we prefer to shrink the regression coefficients towards the prior mean $β_{0} = 0$ so as to yield a smaller lead-lag R² value. To address this, we set g conditionally on $W_{i j}$ as (15) $g = {\begin{matrix} g_{*} & if W_{i j} = NA or 1 \\ 1 & if W_{i j} = 0, \end{matrix}$ (15) where $g_{*}$ is defined according to Theorem 3.2.

3.4. Part 4: Computing the R² for Bayesian Regression Models

Once a posterior estimate of the model coefficients (10) has been obtained, with the parameter g selected optimally, we can compute the Bayesian lead-lag R² between genes A and B.

Recall that for a linear model $Y = X β + ε$ where $β$ is estimated by the least-squares estimator ${\hat{β}}_{OLS} = {(X^{T} X)}^{- 1} X^{T} Y,$ the R² is defined as (16) $R^{2} = \frac{| | X {\hat{β}}_{OLS} - \bar{Y} 1_{n} | |^{2}}{| | Y - \bar{Y} 1_{n} | |^{2}},$ (16) where $\bar{Y} = (1 / n) Y^{T} 1_{n} .$ In Bayesian regression, however, the standard decomposition of total sum of squares into residual and regression sums of squares no longer holds. Thus, when we replace ${\hat{β}}_{OLS}$ with an estimator such as the posterior mean of $β,$ the formula (16) can potentially yield $R^{2} > 1 .$ As a remedy to this issue, we compute the R² as the ratio of the variance of the model fit to itself plus the variance of the residuals. This ratio is within $[0, 1]$ by construction, and is given by (17) $R^{2} = \frac{\hat{Var} (X β_{*})}{\hat{Var} (X β_{*}) + \hat{Var} (Y - X β_{*})},$ (17) where, for a vector $Z = {[z_{1} \dots z_{n}]}^{T}$ with mean $\bar{z},$ we define $\hat{Var} (Z) = \frac{1}{n - 1} \sum_{i = 1}^{n} {(z_{i} - \bar{z})}^{2} .$ This calculation is based on the approach to computing R² for Bayesian regression models proposed in Gelman et al. (Citation2018). For our application, we refer to (17) as the Bayesian lead-lag R² (LLR²).

3.5. Clustering and Empirical Analysis with the Bayesian Lead-Lag R²

Given a dataset of N genes whose expressions are measured at n time points, our objective is to cluster the genes based on their temporal expression patterns. To do this, we compute a N × N similarity matrix S where $S_{i, j}$ stores the Bayesian LLR² in (17) between the $i^{th}$ and $j^{th}$ genes. We then apply hierarchical clustering to the distance matrix $J - S,$ where J is the N × N matrix of ones.

Note that the Bayesian LLR² is asymmetric: $LL R^{2} (i, j) \neq LL R^{2} (j, i) .$ Here, $LL R^{2} (i, j)$ denotes the Bayesian LLR² where we treat the $i^{th}$ gene as the response (gene A) in the model (4). This asymmetry could potentially be informative. In particular, $LL R^{2} (i, j) > LL R^{2} (j, i)$ could possibly suggest that the expression of the $i^{th}$ gene is driven by that of the $j^{th}$ gene; that is, a model in which the $i^{th}$ gene is treated as the response is a better fit to the data than one in which the $j^{th}$ gene is the response. Further research is required to determine how this asymmetry can be efficiently used to detect causal or lagged co-regulation relationships in large genomic datasets. For our purpose of clustering many genes for empirical analysis using standard algorithms such as hierarchical clustering, we symmetrize the Bayesian LLR² by setting: $S_{i, j} = \max {LL R^{2} (i, j), LL R^{2} (j, i)} .$

Intuitively, using the maximum over the minimum or the average should improve the power of this metric to measure associations in either direction. In Section 4.1, we use simulated gene expression data to demonstrate that the false positive rate remains controlled under these symmetrization methods.

In practice, it is also common to use similarity measures such as the Bayesian LLR² to produce a ranked list of gene-gene associations. To aid this procedure, we further propose two additional metrics that one could use in conjunction with the Bayesian LLR². These metrics are derived from the following two sub-models of the original model (4): (18) $m_{A} (t) = c_{1} m_{B} (t) + c_{2} \int_{0}^{t} m_{B} (s) d s + c_{5}, (Sub-model 1)$ (18) (19) $m_{A} (t) = c_{3} \int_{0}^{t} m_{A} (s) d s + c_{4} t + c_{5} . (Sub-model 2)$ (19)

The first sub-model describes the temporal expression of gene A as a function only of the potentially co-regulated gene B. The second sub-model is “autoregressive” in the sense that it describes gene A’s expression only in terms of its own past and linear time trends. We again apply our Bayesian approach to fitting these two sub-models and compute new variants of the Bayesian LLR² from each, denoted $LL R_{other}^{2}$ and $LL R_{own}^{2}$ respectively. The $LL R_{other}^{2}$ value indicates the amount of variation in gene A’s temporal expression that can be explained by the dynamics of another gene B. $LL R_{own}^{2}$ indicates the amount of variation in gene A that is explained by its own past, via its time integrals, and linear time trends. We can view $LL R^{2} - LL R_{own}^{2}$ as the amount of additional variation in gene A’s temporal dynamics that can be explained by considering gene B, on top of the variation captured by gene A’s own past via sub-model 2. Intuitively, if LLR² is large, then a large value of $LL R^{2} - LL R_{own}^{2}$ suggests that the LLR² value is unlikely to mean a false positive relationship between the two genes. In Section 4.5, we demonstrate empirically that using $LL R_{other}^{2}$ and $LL R^{2} - LL R_{own}^{2}$ together can help identify gene pairs with highly similar time dynamics.

3.6. Significance of the Bayesian Lead-Lag R²

One may further desire a notion of statistical significance for the Bayesian LLR². One option is to simulate the posterior distribution of the LLR² using draws from the posterior distribution of $β,$ as described in Gelman et al. (Citation2018). Recall from Section 3.2 that the posterior distribution of $(β, σ^{2})$ is normal-inverse gamma with parameters $(β_{*}, V_{*}, a_{*}, b_{*}),$ defined respectively in (10), (11), (8), and (9). To draw samples from this posterior distribution, we can first sample $σ^{2}$ from the $Γ^{- 1} (a_{*}, b_{*})$ distribution, and then sample $β$ from the $N (β_{*}, g σ^{2} V_{*})$ distribution. In particular, if $W_{i j} = 1,$ we may wish to simulate the posterior distribution of the Bayesian LLR² under a null hypothesis of no relationship between genes A and B. This can be reflected in the sampling procedure by calculating $β_{*}$ with $β_{0}$ set to $0 .$

Alternatively, one could use the Bayes factors presented in Liang et al. (Citation2008) to corroborate the Bayesian lead-lag R². Let $M_{F}$ denote the model (4) of gene expression, which has an intercept and p = 4 “covariates” with coefficients c₁ through c₄. Let $M_{N}$ denote the null (intercept-only) model. Then the Bayes factor for comparing $M_{F}$ to $M_{N}$ is $BF (M_{F} : M_{N}) = {(1 + g)}^{(n - p - 1) / 2} \frac{1}{{(1 + g (1 - R^{2}))}^{(n - 1) / 2}},$ where R² is the usual coefficient of determination in (16). Kass and Raftery (Citation1995) interpret a $\log_{10} (BF (M_{F} : M_{N}))$ value between 1 and 2 as “strong” evidence in favor of $M_{F},$ or above 2 as “decisive” evidence.

In Appendix B, we give a summary of our methodology in the form of a generic algorithm that can be run on any time-course gene expression dataset.

3.7. Possible Modifications to Prior Hyperparameters

In Section 3.2, we set the prior mean $β_{0}$ of the parameters of the ODE model to ${[1, 1, 0, 0, 0]}^{T}$ when $W_{i j} = 1,$ i.e. there is prior evidence suggesting genes A and B are associated. To make the method even more data-driven, one could alternatively set $β_{0} = {[ξ, ξ, 0, 0, 0]}^{T}$ in the $W_{i j} = 1$ case, where $ξ \neq 0$ is chosen adaptively from the data. The following theorem presents the values of ξ and g that simultaneously minimize SURE in (14) in this setting.

Theorem 3.3

(SURE minimization with respect to ξ and g). Assume the entries of the least-squares estimator ${\hat{β}}_{OLS}$ are all distinct and the expression of gene B is non-zero for at least one time point, i.e. $m_{B} (t_{i}) \neq 0$ for at least one i. Let $β_{0} = {[ξ, ξ, 0, 0, 0]}^{T}$ in the case that $W_{i j} = 1 .$ Then the values of ξ and g that minimize SURE in (14) are $ξ_{*} = \frac{Y^{T} X_{12}}{| | X_{12} | |^{2}} and g_{*} = \frac{| | {\overset{⁁}{Y}}_{OLS} | |^{2} | | X_{12} | |^{2} - {(Y^{T} X_{12})}^{2}}{| | X_{12} | |^{2} p {\hat{σ}}^{2}} - 1,$ where $X_{12} \in R^{n \times 1}$ is the element-wise sum of the first two columns of X.

The proof of Theorem 3.3 is provided in Appendix A.

4. Results

4.1. Analysis of Simulated Data

We first validate our methodology on a simulated dataset of temporal gene expression trajectories. We generate data from the model discussed in Section 2.1: $\frac{d m_{A} (t)}{d t} = α_{A} p (t) + β_{A} - κ_{A} m_{A} (t) + η_{A}$ where we set p(t) = 0 to assume that there is no shared regulatory signal between genes. We also assume that observations are taken in the $m_{A} (t)$ space, and thus noise is additive to $m_{A} (t)$ rather than $d m_{A} (t) / d t .$ This leaves us with $\frac{d m_{A} (t)}{d t} = β_{A} - κ_{A} m_{A} (t),$

which is a linear ODE with the solution $m_{A} (t) = \frac{β_{A}}{κ_{A}} + c e^{- κ_{A} t}, c \in R .$

We generate 100 gene expression trajectories from this model, using parameters $β_{A} \sim Uniform (0, 2), κ_{A} \sim Uniform (0, 1),$ and $c \sim Uniform (0, 1),$ at 17 time pointsFootnote³ ranging from t = 0 to t = 48. We then add N(0, 1) noise to each point.

We apply our methodology to this simulated dataset by computing the Bayesian LLR² between each pair of simulated genes. We set the prior adjacency matrix $W_{i j} = NA$ for all gene pairs, i.e. we assume that prior knowledge about each pair’s relationship is unavailable. Owing to the process of generating data with “independent” trajectories in the absence of shared regulatory signals, we expect to see low Bayesian LLR² values between the vast majority of gene pairs. Observing only a small number of gene pairs with LLR² values above reasonable thresholds confirms that the false positive rateFootnote⁴ remains controlled; it is also possible that some gene pairs do indeed exhibit similar temporal patterns through this data generating process, which we will demonstrate in .

Figure 4. Groups of simulated gene expression trajectories whose pairwise Bayesian lead-lag R² values fell above the $99 th$ percentile of the three empirical LLR² distributions shown in . Each group exhibits visually similar temporal trajectories despite having been generated independently, demonstrating that the LLR² remains able to detect such pairs even in the absence of prior information and shared regulatory signals.

Figure 4. Groups of simulated gene expression trajectories whose pairwise Bayesian lead-lag R2 values fell above the 99th percentile of the three empirical LLR2 distributions shown in Figure 5. Each group exhibits visually similar temporal trajectories despite having been generated independently, demonstrating that the LLR2 remains able to detect such pairs even in the absence of prior information and shared regulatory signals.

In Section 3.5, it is mentioned that the Bayesian LLR² is symmetrized by taking its value to be $\max {LL R^{2} (i, j), LL R^{2} (j, i)},$ where $LL R^{2} (i, j)$ denotes the lead-lag R² calculated with the $i^{th}$ gene treated as the response. In , we show the distribution of Bayesian LLR² values between all simulated gene pairs under three symmetrization methods: the minimum, mean, and maximum of $LL R^{2} (i, j)$ and $LL R^{2} (j, i) .$ We indicate where the $95 th$ and $99 th$ percentiles of each distribution fall, and we also mark $LL R^{2} = 0.9$ falls on the plot, as this threshold was used to define network edges between genes in the analysis of real gene expression data that we present in subsequent sections.

Figure 5. Distributions of the Bayesian lead-lag R² when it is symmetrized by taking the maximum, mean, or minimum of $LL R^{2} (i, j)$ and $LL R^{2} (j, i),$ where $LL R^{2} (i, j)$ denotes the lead-lag R² obtained by treating the $i^{th}$ gene as the response in model (4). Gene trajectories are simulated from the model $m_{A} (t) = β_{A} / κ_{A} + c e^{- κ_{A} t},$ to which standard Gaussian noise is added at each of 17 time points.

Figure 5. Distributions of the Bayesian lead-lag R2 when it is symmetrized by taking the maximum, mean, or minimum of LLR2(i,j) and LLR2(j,i), where LLR2(i,j) denotes the lead-lag R2 obtained by treating the ith gene as the response in model (4). Gene trajectories are simulated from the model mA(t)=βA/κA+ce−κAt, to which standard Gaussian noise is added at each of 17 time points.

No simulated gene pairs have a Bayesian LLR² above 0.9 under any symmetrization method, and at most 24 pairs (under the maximum symmetrization) have a LLR² above 0.8. This demonstrates that the false positive rate of the method is controlled in the absence of prior knowledge about the relationships between genes, and when the genes do not share common regulatory signals. Of the gene pairs whose LLR² values were above the $99 th$ percentile in each distribution, 48% were shared across at least two symmetrization methods. In , we present four examples of groups of simulated genes whose $LL R^{2}$ values were above the $99 th$ percentile under all three symmetrization methods; each group of trajectories exhibits visually similar temporal behavior.

4.2. Collecting a Time-Course Gene Expression Dataset for Drosophila melanogaster

The expression of a gene is typically measured via RNA sequencing (RNA-seq) as the count of a particular type of messenger RNA found in a tissue sample. These counts can be normalized to library size and transformed using the limma-voom transformation (Law et al. Citation2014). This transformation produces near-normally distributed gene expression measurements, making them more amenable to analysis with linear models such as those described in Section 2.1.

Our primary time-course gene expression dataset, introduced in Schlamp et al. (Citation2021), profiles the expression dynamics of 12,657 genes in Drosophila melanogaster (fruit fly) in response to an immune challenge. Immune responses were triggered in flies by injecting them with commercial lipopolysaccharide, which contains peptidoglycan (cell wall molecules) derived from the bacterium E. coli. Following injection, the flies were sampled via RNA-seq at 20 time points over five days. The data was normalized by the aforementioned limma-voom transformation and expressed as the $\log_{2}$ -fold change with respect to the first time point, which was sampled pre-injection as a control. We focus on the first 17 time points, ranging from zero to 48 h post-injection, as this is when most of the variation in expression occurs.

Differential expression analysis is typically used to identify genes exhibiting significant expression changes, and thus reduce a set of thousands of genes into a smaller set meriting further study. In Appendix C, we provide further details on how we use such methods to reduce our set of 12,657 genes into a set of 1735 genes of interest. We also describe therein how we define our prior adjacency matrix W, introduced in Section 3.1, using the STRING database.

4.3. Small-Scale Case Study: Immunity and Metabolism

We now validate our methodology on a small set of genes whose behavior exhibits a known interplay between immunity and metabolism. Schlamp et al. (Citation2021) observe that exposure to bacterial peptidoglycan has an effect not only on the time dynamics of immune response, but also on the expression of genes involved in metabolism. In particular, some genes involved in immune response are up-regulated immediately following peptidoglycan injection, while other genes associated with metabolic processes are down-regulated more slowly. Interestingly, the metabolic genes return to their pre-infection levels of expression well before the immune response has subsided. This phenomenon can be observed in , which shows the expression patterns of two immune response genes (IM1, IM2) and four metabolic process genes (FASN1, UGP, mino, fbp).

Figure 6. Temporal expression patterns of selected genes involved in immune response (IM1, IM2) and metabolic processes (FASN1, UGP, mino, fbp). Upon infection, immune response genes are immediately up-regulated. At the same time, metabolic genes are down-regulated but return to pre-infection expression levels after 12–24 h, which is before the immune response is resolved.

show the prior adjacency matrix W, the Bayesian LLR² values, and the non-Bayesian LLR² values (computed via ordinary least-squares regression) corresponding to these six genes.

Table 1. Portion of the prior adjacency matrix W corresponding to the six genes in .

Download CSV Display Table

Table 2. Bayesian LLR² values corresponding to each gene pair in .

Display Table

Table 3. Non-Bayesian LLR² values corresponding to each gene pair in .

Display Table

Within this set of six genes, indicates that there is prior evidence of association between the immune response genes IM1 and IM2, as well as between the metabolic genes mino and UGP. The off-diagonal “NA” entries in signify that the relationships between the immune and metabolic genes are uncharacterized in the STRING database. However, the Bayesian LLR² values between the metabolic gene FASN1 and both immune response genes are high, as shown in , indicating that the relationship identified by Schlamp et al. (Citation2021) between these gene groups is automatically uncovered by our proposed Bayesian method. Indeed, shows that the temporal expression pattern of FASN1 resembles a vertically-reflected copy of that of either IM1 or IM2. shows the non-Bayesian LLR² values for each gene pair and demonstrates that computing the LLR² without biologically-informed priors yields inflated scores that make it difficult to discern either within- or between-group associations. For further validation of these results, we present in Appendix E. These tables display the values of the metric $LL R^{2} - LL R_{own}^{2},$ which we introduced in Section 3.5 as a means of assessing whether associations indicated by the LLR² alone are false positives.

4.4. Biologically-Meaningful Clustering with the Bayesian $LL R^{2}$

We now apply hierarchical clustering using Ward’s method (Ward Citation1963) to the N × N distance matrix $J - S,$ where J is a matrix of ones and $S_{i, j} = \max {LL R^{2} (i, j), LL R^{2} (j, i)}$ is the similarity matrix containing the symmetrized Bayesian LLR² values between all gene pairs. These genes come from the dataset described in Section 4.2, so N = 1735. Cutting the resulting dendrogram at a height of ten yields 12 clusters, which we display in . A robustness analysis of the within-cluster sum of squares was also performed to confirm that this selection of dendrogram height was appropriate.

Figure 7. Plots of the temporal expression patterns of genes in each cluster obtained by applying hierarchical clustering to the distance matrix $J - S,$ where S contains the Bayesian LLR² for all gene pairs. Cluster sizes range from 40 to 727 genes, with a mean of 145 and a median of 76. These plots visually demonstrate that the Bayesian LLR² is capable of capturing similarities in the overall shapes of two temporal expression patterns, even if one gene is a time-delayed copy of the other or is reflected across a horizontal axis, for instance.

From each cluster, we can also construct a network in which an edge is defined between two genes if their corresponding Bayesian LLR² exceeds a certain threshold. In our analysis, we choose this threshold to be 0.9, which is also the $99 th$ percentile of the empirical distribution of the Bayesian LLR² for this dataset.

To understand the biological processes that are represented by these clusters, we make use of the Gene Ontology (GO) resource (Ashburner et al. Citation2000; Carbon et al. Citation2021). GO links genes with standardized terms that describe their functions. To determine whether a GO term is significantly enriched within a cluster, i.e. whether the cluster contains significantly more genes associated with that term than expected by chance, we perform a hypergeometric test using the R package ClusterProfiler (Yu et al. Citation2012). We use Benjamini-Hochberg (B–H) corrected p-values below 0.05 (Benjamini and Hochberg Citation1995) from this test to determine “significant” enrichment.

Our analysis shows that with the exception of cluster 8, all clusters are significantly enriched for specific biological functions. Recall that our dataset profiles the dynamics of immune response, which is an energetically costly process that is also associated with metabolic changes (DiAngelo et al. Citation2009). The interplay between immunity and metabolism, which we briefly explored in Section 4.3, is represented particularly well in these clusters. Clusters 1, 4, 6, and 7 are significantly enriched for metabolic processes; cluster 10 is significantly enriched for immune response; and cluster 5 is significantly enriched for both metabolic processes and immune responses. Below, we highlight biologically relevant findings from one cluster, and we discuss three additional clusters in Appendix F. These examples show that clustering with the Bayesian LLR² allows genes with similar but lagged expression patterns to be grouped together, even in the absence of known prior information. Finally, the Bayesian LLR² is not influenced by the direction of gene expression changes (i.e. positive or negative changes), making it easier to detect tradeoffs or negative regulatory interactions between genes.

We compare our methodology to other gene clustering techniques in the Appendix. In Appendix D, we show clustering results obtained using the non-Bayesian LLR² as well as Pearson correlation between genes. In Appendix G, we present clusters derived from weighted correlation network analysis (WGCNA) (Langfelder and Horvath Citation2008). However, we find that these methods do not group together genes with similarly-shaped temporal trajectories as effectively as the Bayesian LLR².

4.4.1. Analysis of Cluster 10

In cluster 10, which consists of 44 genes, one of the most significantly enriched GO terms is “defense response”. This GO term is supported by 24 genes in the cluster, within which there are two distinct groups that we display in : eight genes that are known to respond to “Imd” signaling and eight genes that respond to “Toll” signaling. The Imd and Toll signaling pathways represent well-studied molecular responses to infection in flies. The Imd pathway is tailored to fight off infections from gram-negative bacteria Kaneko et al. Citation2004; Zaidman-Rémy et al. Citation2011; Hanson and Lemaitre Citation2020, while the Toll pathway fights off infections from gram-positive bacteria and fungi (Gobert et al. Citation2003; Hanson and Lemaitre Citation2020).

Figure 8. Temporal expression patterns of selected genes in cluster 10 during the first ten hours following peptidoglycan injection. The eight red lines correspond to Imd-regulated genes. The eight blue lines correspond to Toll-regulated genes, which show a smaller and delayed up-regulation. The four black lines correspond to genes that exhibited high Bayesian LLR² values (> 0.9) with Imd-regulated genes; while there is no prior information in STRING linking them with the Imd pathway, their co-clustering with Imd-regulated genes was also observed in Schlamp et al. (Citation2021).

Since the flies profiled in this gene expression dataset were injected with peptidoglycan derived from E. coli, a gram-negative bacterium, we expect to see an activation of Imd-regulated genes. Indeed, as seen in , the eight Imd-regulated genes in this cluster immediately underwent strong up-regulation and reached their highest expression one to two hours after peptidoglycan injection. By contrast, Toll-regulated genes underwent a delayed up-regulation of smaller magnitude, and reached their highest expression two to four hours after injection. Overall, the Bayesian LLR² method successfully grouped together functionally related genes with distinct activation kinetics in this cluster.

In addition to recovering known dynamics of immune response pathways in cluster 10, the Bayesian LLR² metric identified several new relationships between genes. In this cluster, four genes (CG44404, also known as IBIN; CG43236, also known as Mtk-like; CG43202, also known as BomT1; and CG43920) had no prior information available in the STRING database to link them with Imd-regulated genes. However, these four genes exhibit similar expression patterns to Imd-regulated genes, as seen in , although previous studies examine CG44404/IBIN and CG43202/BomT1 expression downstream of Toll signaling (Clemmons et al. Citation2015; Valanne et al. Citation2019). This suggests that these four genes are not exclusively controlled by Toll signaling, and that they can also respond to Imd signaling after a gram-negative immune challenge. While Imd-regulation of CG43236/Mtk-like and CG43920 has not been experimentally validated, their co-clustering pattern with Imd-regulated genes was also observed by Schlamp et al. (Citation2021).

In , we show a network consisting of the four aforementioned genes (CG44404/IBIN, CG43236/Mtk-like, CG43202/BomT1, CG43920) and their neighbors, i.e. the genes with which they have a Bayesian LLR² of at least 0.9. Red edges in connect genes that were known to be associated according to prior information, i.e. $W_{i j} = 1$ for these pairs. Blue edges connect genes with an uncharacterized relationship in the STRING database, i.e. $W_{i j} = NA$ for these pairs.

Figure 9. Network of genes formed from CG44404/IBIN, CG43236/Mtk-like, CG43202/BomT1, CG43920, and their neighbors. Two genes are connected by an edge if their Bayesian $LL R^{2} > 0.9 .$ Red and blue edges connect genes with known and unknown relationships, respectively. Darkened edges connect genes within cluster 10. Blue edges connect the four genes of interest with genes known to be regulated by the Imd signaling pathway, suggesting a possible role for them in fighting gram-negative infections.

4.4.2. Transcription Factor Targets Cluster Together

As mentioned in Sections 1 and 2, genes can be co-regulated in time by the same transcription factor(s), which are proteins that control gene expression. The ODE model of gene expression we employ in EquationEquations (1)(1) $\frac{d m_{A} (t)}{d t} = α_{A} p (t) + β_{A} - κ_{A} m_{A} (t) + η_{A},$ (1) and Equation(2)(2) $\frac{d m_{B} (t)}{d t} = α_{B} p (t) + β_{B} - κ_{B} m_{B} (t) + η_{B} .$ (2) assumes that two genes that are co-regulated by a common transcription factor will have similar time dynamics; in particular, the rates of change in the expressions of both genes will be governed by the same regulatory signal p(t). In light of this model, we expect that co-regulated genes ought to cluster together, i.e. that the target genes of a transcription factor would not be dispersed across many different clusters. displays a heatmap showing the proportion of each transcription factor’s targets appearing in each of the clusters identified by our Bayesian LLR² technique, and demonstrates that target genes do indeed appear in the same cluster for many transcription factors. Targets of transcription factors in each cluster were identified using the RcisTarget R package (Aibar et al. Citation2017).

Figure 10. Heatmap displaying the proportion of each transcription factor’s targets that are assigned to each of the 12 clusters. Transcription factors are named on the right-hand side. The number of target genes they have ranges from three to 1090, with a median of 21. Many transcription factors have their targets grouped in the same cluster, demonstrating that the Bayesian LLR² successfully identifies clusters containing co-regulated genes.

4.5. The Bayesian LLR² Produces a Sparse Set of Associations

The lead-lag R² can be computed with biological information via our proposed Bayesian methodology, or without such information via ordinary least squares regression. In Section 2.2, we discuss the challenge of observing many spuriously high lead-lag R² values under the least squares approach, which does not incorporate existing knowledge of the biological relationship between genes. In , we show the empirical distributions of the non-Bayesian and Bayesian LLR² values in our Drosophila gene expression dataset. The non-Bayesian LLR² distribution is heavily left-skewed, which complicates the definition of suitable LLR² thresholds above which two genes can be considered to be associated. The Bayesian approach effectively “regularizes” the LLR² values, yielding a distribution that reflects the reality of only a relatively small set of noteworthy pairwise associations existing within the dataset.

Figure 11. Empirical distributions of the non-Bayesian (left) and Bayesian (right) lead-lag R² values calculated on the Drosophila gene expression dataset under consideration in this study. The $95 th$ and $99 th$ empirical percentiles are marked in each plot.

Figure 11. Empirical distributions of the non-Bayesian (left) and Bayesian (right) lead-lag R2 values calculated on the Drosophila gene expression dataset under consideration in this study. The 95th and 99th empirical percentiles are marked in each plot.

In particular, the distribution of the Bayesian LLR² is conducive to identifying pairs or groups of genes with highly similar time dynamics. In Section 3.5, we introduced the metrics $LL R_{other}^{2}$ and $LL R^{2} - LL R_{own}^{2} .$ As described therein, the former indicates how much variation in gene A’s expression can be explained only through the dynamics of another gene B. The latter indicates how much additional variation in gene A can be explained by considering gene B, on top of considering gene A’s own past trajectory. Intuitively, both of these quantities should be large if the two genes are indeed associated in a way that manifests in highly similar temporal dynamics.

In , we randomly select 150 genes and place all $(\begin{matrix} 150 \\ 2 \end{matrix}) = 1$ 1,175 pairs on two scatterplots whose horizontal axes display the $LL R_{other}^{2}$ values and vertical axes display the $LL R^{2} - LL R_{own}^{2}$ values. These R² values are computed via our Bayesian method in one scatterplot and via ordinary least squares regression in the other. Gene pairs of particular interest fall into the upper right quadrant of the scatterplot.

Figure 12. Scatterplots of $LL R_{other}^{2}$ (horizontal axis) and $LL R^{2} - LL R_{own}^{2}$ (vertical axis) for a random selection of 150 genes, resulting in 11,175 gene pairs. Left: LLR² values are computed via ordinary least squares, without external biological information. Points are colored according to how prior information characterizes the corresponding two genes: unlikely to be associated (gray); uncharacterized association (blue); or known association (red). Right: LLR² values are computed with the proposed Bayesian approach.

shows that when we use the ordinary least-squares approach, i.e. without incorporating external biological information, we obtain an overwhelmingly large number of gene pairs with high $LL R_{other}^{2}$ scores. Many are those that are unlikely to be associated, according to our chosen sources of prior information. By contrast, the Bayesian approach leverages this information to shift the distribution of the R² values noticeably, yielding a smaller set of gene pairs that are worth examining further. This distributional shift is due to both the estimator $β_{*}$ in (10) for the coefficients in the model (4), as well as the way in which the parameter g is set in (15). In particular, g controls how much $β_{*}$ is influenced by either the data or prior information.

Importantly, shows that gene pairs with previously uncharacterized relationships but highly similar time dynamics are more easily identified with the Bayesian LLR². A few of these gene pairs, which fall in the fairly sparse upper-right region of the left-hand scatterplot, are shown in in Appendix D. shows gene pairs in the same region of the scatterplot with well-known relationships, further demonstrating that the proposed method successfully recovers familiar associations.

5. Discussion

Time-course gene expression datasets are a valuable resource for understanding the complex dynamics of interconnected biological systems. An important statistical task in the analysis of such data is to identify clusters or networks of genes that exhibit similar temporal expression patterns. These patterns yield systems-level insight into the roles of biological pathways and processes such as disease progression and recovery.

The main statistical challenges in detecting associations within time-course gene expression datasets stem from high dimensionality and small sample sizes, combined with the nonlinearity of gene expression time dynamics. To overcome these difficulties, we proposed a method for identifying potentially associated genes that treats temporal gene expression as a dynamical system governed by ODEs, whose parameters are determined in a Bayesian way using gene networks curated a priori from biological databases. The ODE model is fit to a pair of genes via Bayesian regression and is used to derive the biologically-informed Bayesian lead-lag R² similarity measure. The Bayesian regression procedure leverages Zellner’s g-prior and ideas from shrinkage estimation, namely minimization of Stein’s unbiased risk estimate (SURE), to balance the posterior ODE model’s fit to the data and to prior knowledge of the relationship between the genes. As a result, we automatically encourage gene pairs with known associations to receive higher Bayesian lead-lag R² scores, while reducing the scores of gene pairs that are unlikely to be related and allowing new relationships to be discovered. Future extensions of this work could adapt this procedure to groups of $m \geq 2$ genes by defining a similar m-dimensional system of ODEs that share a common regulatory signal. Dimensionality reduction or regularization may be required to fit the resulting linear model in this setting.

In Section 4, we analyzed clusters and networks of genes that were identified by our method as having similar temporal dynamics. In particular, the clusters highlighted the known interplay between immune response and metabolism, and suggested roles for uncharacterized genes displaying remarkably similar temporal patterns to more well-studied ones. We contrasted our results to those obtained by using only the ordinary least-squares version of the lead-lag R² and demonstrated how the inclusion of prior biological information greatly aids the identification of biologically relevant gene groups.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by National Institutes of Health and National Science Foundation.

Notes

1 Suppose p(t) = 0. Then we have

\frac{d ​ m_{A} (t)}{d ​ t} = β_{A} - κ_{A} m_{A} (t),

which has the solution

m_{A} (t) = \frac{β_{A}}{κ_{A}} + c e^{- κ_{A} t}

for some

c \in R .

t \to \infty,

we have

m_{A} (t) \to \frac{β_{A}}{κ_{A}}

(a non-zero steady state, if

β_{A} \neq 0

2 In practice, Farina et al. (Citation2008) employ piecewise cubic Hermite polynomials to interpolate as well as add additional samples to each time interval. The same authors use the trapezoidal rule for numerical integration. Our software employs default R methods for interpolation and integration, which are natural cubic splines (via the splinefun function) and Gauss-Kronrod quadrature (via the integrate function) respectively.

3 We use the same time points that are used in our analysis of real gene expression data beginning in Section 4.2. These points are 0, 1, 2, 4, 5, 6, 8, 10, 12, 14, 16, 20, 24, 30, 36, 42, 48.

4 As mentioned in Section 2.2, we use the term “false positives” in this context to refer to gene pairs whose lead-lag R² value is spuriously high but that have neither similar expression patterns nor a biological relationship.

References

Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, et al. 2017. Scenic: single-cell regulatory network inference and clustering. Nat Methods. 14(11):1083–1086.
PubMed Web of Science ®Google Scholar
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene ontology: tool for the unification of biology. Nat Genet. 25(1):25–29.
PubMed Web of Science ®Google Scholar
Bansal M, Gatta GD, di Bernardo D. 2006. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 22(7):815–822.
PubMed Web of Science ®Google Scholar
Bar-Joseph Z, Gitter A, Simon I. 2012. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet. 13(8):552–564.
PubMed Web of Science ®Google Scholar
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Statist Soc. 57(1):289–300.
Google Scholar
Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, Basu S, Chisholm RL, Dodson RJ, Hartline E, et al. 2021. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 49(D1):D325–D334.
PubMed Web of Science ®Google Scholar
Chen T, He HL, Church GM. 1999. Modeling gene expression with differential equations. In: Altman RB, Lauderdale K, Dunker AK, Hunter L, Klein TE, eds. Biocomputing’99. Proceedings of the Pacific Symposium; 1999 January 4–9; Mauna Lani, Hawaii. Singapore: World Scientific, p. 29–40.
Google Scholar
Clemmons AW, Lindsay SA, Wasserman SA. 2015. An effector peptide family required for Drosophila toll-mediated immunity. PLOS Pathog. 11(4):e1004876.
PubMed Web of Science ®Google Scholar
Cui W, George EI. 2008. Empirical Bayes vs. fully Bayes variable selection. J Statist Plan Infer. 138(4):888–900.
Web of Science ®Google Scholar
Cyran SA, Buchsbaum AM, Reddy KL, Lin M-C, Glossop NR, Hardin PE, Young MW, Storti RV, Blau J. 2003. vrille, Pdp1, and dClock form a second feedback loop in the Drosophila circadian clock. Cell. 112(3):329–341.
PubMed Web of Science ®Google Scholar
De Jong H. 2002. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 9(1):67–103.
PubMed Web of Science ®Google Scholar
D’haeseleer P, Wen X, Fuhrman S, Somogyi R. 1999. Linear modeling of mRNA expression levels during CNS development and injury. In: Altman RB, Lauderdale K, Dunker AK, Hunter L, Klein TE, eds. Biocomputing’99. Proceedings of the Pacific Symposium; 1999 January 4–9; Mauna Lani, Hawaii. Singapore: World Scientific, p. 41–52.
Google Scholar
DiAngelo JR, Bland ML, Bambina S, Cherry S, Birnbaum MJ. 2009. The immune response attenuates growth and nutrient storage in Drosophila by reducing insulin signaling. Proc Natl Acad Sci USA. 106(49):20853–20858.
PubMed Web of Science ®Google Scholar
Efron B. 2004. The estimation of prediction error: covariance penalties and cross-validation. J Am Stat Assoc. 99(467):619–632.
Web of Science ®Google Scholar
Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 95(25):14863–14868.
PubMed Web of Science ®Google Scholar
Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, Haw R, Jassal B, Korninger F, May B, et al. 2018. The Reactome pathway knowledgebase. Nucleic Acids Res. 46(D1):D649–D655.
PubMed Web of Science ®Google Scholar
Farina L, De Santis A, Morelli G, Ruberti I. 2007. Dynamic measure of gene co-regulation. IET Syst Biol. 1(1):10–17.
PubMed Web of Science ®Google Scholar
Farina L, De Santis A, Salvucci S, Morelli G, Ruberti I. 2008. Embedding mRNA stability in correlation analysis of time-series gene expression data. PLOS Comput Biol. 4(8):e1000141.
PubMed Web of Science ®Google Scholar
Fourdrinier D, Strawderman WE, Wells MT. 2018. Shrinkage estimation. Switzerland: Springer.
Google Scholar
Gaudet P, Livstone MS, Lewis SE, Thomas PD. 2011. Phylogenetic-based propagation of functional annotations within the gene ontology consortium. Brief Bioinform. 12(5):449–462.
PubMed Web of Science ®Google Scholar
Gelman A, Goodrich B, Gabry J, Vethari A. 2018. R-squared for Bayesian regression models. American Statistician. 73: 307–309.
Google Scholar
George EI, Foster DP. 2000. Calibration and empirical Bayes variable selection. Biometrika. 87(4):731–747.
Web of Science ®Google Scholar
Giles D, Rayner A. 1979. The mean squared errors of the maximum likelihood and natural-conjugate Bayes regression estimators. J Economet. 11(2–3):319–334.
Google Scholar
Gobert V, Gottar M, Matskevich AA, Rutschmann S, Royet J, Belvin M, Hoffmann JA, Ferrandon D. 2003. Dual activation of the Drosophila toll pathway by two pattern recognition receptors. Science. 302(5653):2126–2130.
PubMed Web of Science ®Google Scholar
Gonzalez EA, Garg A, Tang J, Nazario-Toole AE, Wu LP. 2013. A glutamate-dependent redox system in blood cells is integral for phagocytosis in Drosophila melanogaster. Curr Biol. 23(22):2319–2324.
PubMed Web of Science ®Google Scholar
Goto A, Kadowaki T, Kitagawa Y. 2003. Drosophila hemolectin gene is expressed in embryonic and larval hemocytes and its knock down causes bleeding defects. Dev Biol. 264(2):582–591.
PubMed Web of Science ®Google Scholar
Hanson MA, Lemaitre B. 2020. New insights on Drosophila antimicrobial peptide function in host defense and beyond. Curr Opin Immunol. 62:22–30.
PubMed Web of Science ®Google Scholar
Hill SM, Neve RM, Bayani N, Kuo WL, Ziyad S, Spellman PT, Gray JW, Mukherjee S. 2012. Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology. BMC Bioinf. 13(1):1–15.
PubMedGoogle Scholar
Kanehisa M, Goto S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1):27–30.
PubMed Web of Science ®Google Scholar
Kaneko T, Goldman WE, Mellroth P, Steiner H, Fukase K, Kusumoto S, Harley W, Fox A, Golenbock D, Silverman N. 2004. Monomeric and polymeric gram-negative peptidoglycan but not purified LPS stimulate the Drosophila IMD pathway. Immunity. 20(5):637–649.
PubMed Web of Science ®Google Scholar
Karp PD, Billington R, Caspi R, Fulcher CA, Latendresse M, Kothari A, Keseler IM, Krummenacker M, Midford PE, Ong Q, et al. 2019. The BioCyc collection of microbial genomes and metabolic pathways. Brief Bioinform. 20(4):1085–1093.
PubMed Web of Science ®Google Scholar
Kass RE, Raftery AE. 1995. Bayes factors. J Am Stat Assoc. 90(430):773–795.
Web of Science ®Google Scholar
Krejčová G, Danielová A, Nedbalová P, Kazek M, Strych L, Chawla G, Tennessen JM, Lieskovská J, Jindra M, Doležal T, et al. 2019. Drosophila macrophages switch to aerobic glycolysis to mount effective antibacterial defense. Elife. 8:e50414.
PubMed Web of Science ®Google Scholar
Langfelder P, Horvath S. 2008. Wgcna: an r package for weighted correlation network analysis. BMC Bioinf. 9(1):1–13.
PubMedGoogle Scholar
Larkin A, Marygold SJ, Antonazzo G, Attrill H, Dos Santos G, Garapati PV, Goodman JL, Gramates LS, Millburn G, Strelets VB, et al. 2021. FlyBase: updates to the Drosophila melanogaster knowledge base. Nucleic Acids Res. 49(D1):D899–D907.
PubMed Web of Science ®Google Scholar
Law CW, Chen Y, Shi W, Smyth GK. 2014. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15(2):R29.
PubMed Web of Science ®Google Scholar
Li F, Zhang NR. 2010. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J Am Stat Assoc. 105(491):1202–1214.
Web of Science ®Google Scholar
Li Y, Jackson SA. 2015. Gene network reconstruction by integration of prior biological knowledge. G3. 5(6):1075–1079.
Web of Science ®Google Scholar
Liang F, Paulo R, Molina G, Clyde MA, Berger JO. 2008. Mixtures of g priors for Bayesian variable selection. J Am Stat Assoc. 103(481):410–423.
Web of Science ®Google Scholar
Lo K, Raftery AE, Dombek KM, Zhu J, Schadt EE, Bumgarner RE, Yeung KY. 2012. Integrating external biological knowledge in the construction of regulatory networks from time-series expression data. BMC Syst Biol. 6(1):17520509.
Google Scholar
Nepomuceno JA, Troncoso A, Nepomuceno-Chamorro IA, Aguilar-Ruiz JS. 2015. Integrating biological knowledge based on functional annotations for biclustering of gene expression data. Comput Methods Programs Biomed. 119(3):163–180.
PubMed Web of Science ®Google Scholar
Peng B, Zhu D, Ander BP, Zhang X, Xue F, Sharp FR, Yang X. 2013. An integrative framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLOS One. 8(7):e67672.
PubMed Web of Science ®Google Scholar
Polynikis A, Hogan S, di Bernardo M. 2009. Comparing different ODE modelling approaches for gene regulatory networks. J Theor Biol. 261(4):511–530.
PubMed Web of Science ®Google Scholar
Purvis JE, Lahav G. 2013. Encoding and decoding cellular information through signaling dynamics. Cell. 152(5):945–956.
PubMed Web of Science ®Google Scholar
Qian J, Dolled-Filhart M, Lin J, Yu H, Gerstein M. 2001. Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol. 314(5):1053–1066.
PubMed Web of Science ®Google Scholar
W. Research. 2019. Mathematica, version 12.0. Champaign, IL: Wolfram Research, Inc.
Google Scholar
Schlamp F, Delbare SY, Early AM, Wells MT, Basu S, Clark AG. 2021. Dense time-course gene expression profiling of the Drosophila melanogaster innate immune response. BMC Genomics. 22(1):1–22.
PubMed Web of Science ®Google Scholar
Stingo FC, Chen YA, Tadesse MG, Vannucci M. 2011. Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 5(3):1978–2002.
PubMed Web of Science ®Google Scholar
Sun J, Liu C, Bai X, Li X, Li J, Zhang Z, Zhang Y, Guo J, Li Y. 2017. Drosophila FIT is a protein-specific satiety hormone essential for feeding control. Nat Commun. 8(1):1–13.
PubMedGoogle Scholar
Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, et al. 2019. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47(D1):D607–D613.
PubMed Web of Science ®Google Scholar
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. 1999. Systematic determination of genetic network architecture. Nat Genet. 22(3):281–285.
PubMed Web of Science ®Google Scholar
Valanne S, Salminen TS, Järvelä-Stölting M, Vesala L, Rämet M. 2019. Immune-inducible non-coding RNA molecule lincRNA-IBIN connects immunity and metabolism in Drosophila melanogaster. PLOS Pathog. 15(1):e1007504.
PubMed Web of Science ®Google Scholar
Wang X, Wang T, Jiao Y, von Lintig J, Montell C. 2010. Requirement for an enzymatic visual cycle in Drosophila. Curr Biol. 20(2):93–102.
PubMed Web of Science ®Google Scholar
Ward JH. Jr. 1963. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 58(301):236–244.
Web of Science ®Google Scholar
Wu L, Qiu X, Xiang Yuan Y, Wu H. 2019. Parameter estimation and variable selection for big systems of linear ordinary differential equations: a matrix-based approach. J Am Stat Assoc. 114(526):657–667.
PubMed Web of Science ®Google Scholar
Yosef N, Regev A. 2011. Impulse control: temporal dynamics in gene transcription. Cell. 144(6):886–896.
PubMed Web of Science ®Google Scholar
Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. 2010. GoSemSim: an R package for measuring semantic similarity among go terms and gene products. Bioinformatics. 26(7):976–978.
PubMed Web of Science ®Google Scholar
Yu G, Wang L-G, Han Y, He Q-Y. 2012. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 16(5):284–287.
PubMed Web of Science ®Google Scholar
Zaidman-Rémy A, Poidevin M, Hervé M, Welchman DP, Paredes JC, Fahlander C, Steiner H, Mengin-Lecreulx D, Lemaitre B. 2011. Drosophila immunity: analysis of PGRP-SB1 expression, enzymatic activity and function. PLOS One. 6(2):e17231.
PubMed Web of Science ®Google Scholar
Zellner A. 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel P, Zellner A, eds. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. New York, NY: Elsevier. p. 233–243.
Google Scholar
Zellner A, Siow A. 1980. Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa. 31(1):585–603.
Google Scholar
Zhang B, Horvath S. 2005. A general framework for weighted gene co-expression network analysis. Statist Appl Genet Mol Biol. 4(1): Article 17.
PubMedGoogle Scholar
Zhang B, Huang Y, McDermott JE, Posey RH, Xu H, Zhao Z. 2013. Molecular pathway identification using biological network-regularized logistic models. BMC Genomics. 14(Suppl 8):S1.
PubMedGoogle Scholar

Appendix A.

Proofs

In this section, we provide proofs of Theorems 3.2 and 3.3.

Proof of Theorem 3.2.

We write $δ_{0} (Y)$ as $δ_{0} (g),$ treating Y as fixed. Expanding the expression for $δ_{0} (g),$ we obtain (A1) $\begin{matrix} δ_{0} (g) = | | Y - X \hat{β} | |^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = {(Y - \frac{1}{1 + g} Y_{0} - \frac{g}{1 + g} {\overset{⁁}{Y}}_{OLS})}^{T} (Y - \frac{1}{1 + g} Y_{0} - \frac{g}{1 + g} {\overset{⁁}{Y}}_{OLS}) + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = | | Y | |^{2} - \frac{2}{1 + g} Y^{T} Y_{0} - \frac{2 g}{1 + g} Y^{T} {\overset{⁁}{Y}}_{OLS} + \frac{1}{{(1 + g)}^{2}} | | Y_{0} | |^{2} + \frac{2 g}{{(1 + g)}^{2}} {\overset{⁁}{Y}}_{OLS}^{T} Y_{0} \\ + \frac{g^{2}}{{(1 + g)}^{2}} | | {\overset{⁁}{Y}}_{OLS} | |^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} . \end{matrix}$ (A1)

Next, observe that $Y^{T} {\hat{Y}}_{OLS} = | | {\overset{⁁}{Y}}_{OLS} | |^{2},$ because (A2) $| | {\overset{⁁}{Y}}_{OLS} | |^{2} = {(HY)}^{T} (HY) = Y^{T} H Y = Y^{T} {\overset{⁁}{Y}}_{OLS} .$ (A2)

Furthermore, observe that ${\overset{⁁}{Y}}_{OLS}^{T} Y_{0} = Y^{T} Y_{0},$ because (A3) ${\overset{⁁}{Y}}_{OLS}^{T} Y_{0} = {(HY)}^{T} Y_{0} = Y^{T} H Y_{0} = Y^{T} Y_{0},$ (A3) where the last equality follows from the fact that $Y_{0} = X β_{0},$ i.e. $Y_{0}$ is in the column span of X, which is the space onto which H projects. The identities (A2) and (A3) can now be used to write $δ_{0} (g)$ in (A1) as $\begin{matrix} δ_{0} (g) = | | Y | |^{2} - \frac{2}{1 + g} Y^{T} Y_{0} - \frac{2 g}{1 + g} | | {\overset{⁁}{Y}}_{OLS} | |^{2} + \frac{1}{{(1 + g)}^{2}} | | Y_{0} | |^{2} + \frac{2 g}{{(1 + g)}^{2}} Y^{T} Y_{0} \\ + \frac{g^{2}}{{(1 + g)}^{2}} | | {\overset{⁁}{Y}}_{OLS} | |^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = a - \frac{2 b}{1 + g} - \frac{2 g c}{1 + g} + \frac{d}{{(1 + g)}^{2}} + \frac{2 g b}{{(1 + g)}^{2}} + \frac{g^{2} c}{{(1 + g)}^{2}} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2}, \end{matrix}$ where $a = | | Y | |^{2}, b = Y^{T} Y_{0}, c = | | {\overset{⁁}{Y}}_{OLS} | |^{2},$ and $d = | | Y_{0} | |^{2} .$ Differentiating $δ_{0} (g)$ with respect to g, we obtain $\begin{matrix} \frac{d δ_{0} (g)}{dg} = \frac{2 b}{{(1 + g)}^{2}} - \frac{2 c}{1 + g} + \frac{2 g c}{{(1 + g)}^{2}} - \frac{2 d}{{(1 + g)}^{3}} + \frac{2 b}{{(1 + g)}^{2}} - \frac{4 g b}{{(1 + g)}^{3}} \\ + \frac{2 g c}{{(1 + g)}^{2}} - \frac{2 g^{2} c}{{(1 + g)}^{3}} + \frac{2 p {\hat{σ}}^{2}}{1 + g} - \frac{2 g p {\hat{σ}}^{2}}{{(1 + g)}^{2}} \\ = \frac{2 p {\hat{σ}}^{2} - 2 c}{1 + g} + \frac{4 b + 4 g c - 2 g p {\hat{σ}}^{2}}{{(1 + g)}^{2}} - \frac{2 d + 4 g b + 2 g^{2} c}{{(1 + g)}^{3}} \\ = \frac{2 p {\hat{σ}}^{2} - 2 c + 2 g p {\hat{σ}}^{2} - 2 g c}{{(1 + g)}^{2}} + \frac{4 b + 4 g c - 2 g p {\hat{σ}}^{2}}{{(1 + g)}^{2}} - \frac{2 d + 4 g b + 2 g^{2} c}{{(1 + g)}^{3}} \\ = \frac{4 b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c}{{(1 + g)}^{2}} - \frac{2 d + 4 g b + 2 g^{2} c}{{(1 + g)}^{3}} . \end{matrix}$

Setting this derivative to zero and rearranging yields: $\begin{matrix} \frac{2 d + 4 g b + 2 g^{2} c}{{(1 + g)}^{3}} = \frac{4 b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c}{{(1 + g)}^{2}} \\ \Rightarrow 2 d + 4 g b + 2 g^{2} c = (1 + g) (4 b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c) \\ = 4 b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c + 4 g b + 2 g^{2} c + 2 g p {\hat{σ}}^{2} - 2 g c \\ \Rightarrow 2 d = 4 b + 2 p {\hat{σ}}^{2} - 2 c + 2 g p {\hat{σ}}^{2} \\ \Rightarrow g_{*} = \frac{c + d - 2 b}{p {\hat{σ}}^{2}} - 1. \end{matrix}$

We now substitute the definitions of b, c, and d back into this expression to obtain (A4) $g_{*} = \frac{| | {\overset{⁁}{Y}}_{OLS} | |^{2} + | | Y_{0} | |^{2} - 2 Y^{T} Y_{0}}{p {\hat{σ}}^{2}} - 1.$ (A4)

The numerator of (A4) can be simplified by observing that $\begin{matrix} | | {\overset{⁁}{Y}}_{OLS} | |^{2} + | | Y_{0} | |^{2} - 2 Y^{T} Y_{0} = | | {\overset{⁁}{Y}}_{OLS} | |^{2} + | | Y_{0} | |^{2} - 2 {\overset{⁁}{Y}}_{OLS}^{T} Y_{0} \\ = {({\overset{⁁}{Y}}_{OLS} - Y_{0})}^{T} {({\overset{⁁}{Y}}_{OLS} - Y_{0})}^{T} \\ = | | {\overset{⁁}{Y}}_{OLS} - Y_{0} | |^{2} . \end{matrix}$

Therefore, (A4) becomes (A5) $g_{*} = \frac{| | {\overset{⁁}{Y}}_{OLS} - Y_{0} | |^{2}}{p {\hat{σ}}^{2}} - 1.$ (A5)

When $β_{0} = 0,$ we have $Y_{0} = X β_{0} = 0 .$ In this case, (A5) becomes $g_{*} = \frac{| | {\overset{⁁}{Y}}_{OLS} | |^{2}}{p {\hat{σ}}^{2}} - 1.$

Finally, the second derivative of $δ_{0} (g)$ evaluated at $g = g_{*}$ in (A5) is equal to $\frac{d ^{2} δ_{0} (g)}{d g^{2}} |_{g = g_{*}} = \frac{2 p^{4} {\hat{σ}}^{8}}{| | {\overset{⁁}{Y}}_{OLS} - Y_{0} | |^{6}},$

which is positive, thus confirming that $δ_{0} (g)$ is indeed minimized at $g = g_{*} .$ This second derivative calculation was verified in Mathematica.

Proof of Theorem 3.3.

We write $δ_{0} (Y)$ as $δ_{0} (g, ξ),$ treating Y as fixed. First, if $β_{0} = {[ξ, ξ, 0, 0, 0]}^{T},$ then $X β_{0} = ξ X_{12},$ where $X_{12}$ is the element-wise sum of the first two columns of X. We now proceed similarly to the proof of Theorem 3.2 by expanding $δ_{0} (g, ξ) :$ (A6) $\begin{matrix} δ_{0} (g, ξ) = | | Y - X \hat{β} | |^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = {‖ Y - X (\frac{1}{1 + g} β_{0} + \frac{g}{1 + g} {\hat{β}}_{OLS}) ‖}^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = {‖ Y - \frac{ξ}{1 + g} X_{12} - \frac{g}{1 + g} {\overset{⁁}{Y}}_{OLS} ‖}^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = {(Y - \frac{ξ}{1 + g} X_{12} - \frac{g}{1 + g} {\overset{⁁}{Y}}_{OLS})}^{T} (Y - \frac{ξ}{1 + g} X_{12} - \frac{g}{1 + g} {\overset{⁁}{Y}}_{OLS}) + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = | | Y | |^{2} - \frac{2 ξ}{1 + g} Y^{T} X_{12} - \frac{2 g}{1 + g} Y^{T} {\overset{⁁}{Y}}_{OLS} + \frac{ξ^{2}}{{(1 + g)}^{2}} | | X_{12} | |^{2} + \frac{2 ξ g}{{(1 + g)}^{2}} {\overset{⁁}{Y}}_{OLS}^{T} X_{12} \\ + \frac{g^{2}}{{(1 + g)}^{2}} | | {\hat{Y}}_{OLS} | |^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} . \end{matrix}$ (A6)

Next, observe that ${\overset{⁁}{Y}}_{OLS}^{T} X_{12} = Y^{T} X_{12},$ because (A7) ${\overset{⁁}{Y}}_{OLS}^{T} X_{12} = {(HY)}^{T} X_{12} = Y^{T} H X_{12} = Y^{T} X_{12},$ (A7) where the last equality follows from the fact that $X_{12}$ is in the column span of X, which is the space onto which $H$ projects. The identities (A2) and (A7) can now be used to write $δ_{0} (g, ξ)$ in (A6) as $\begin{matrix} δ_{0} (g, ξ) = | | Y | |^{2} - \frac{2 ξ}{1 + g} Y^{T} X_{12} - \frac{2 g}{1 + g} | | {\overset{⁁}{Y}}_{OLS} | |^{2} + \frac{ξ^{2}}{{(1 + g)}^{2}} | | X_{12} | |^{2} + \frac{2 ξ g}{{(1 + g)}^{2}} Y^{T} X_{12} \\ + \frac{g^{2}}{{(1 + g)}^{2}} | | {\overset{⁁}{Y}}_{OLS} | |^{2} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2} \\ = a - \frac{2 ξ b}{1 + g} - \frac{2 g c}{1 + g} + \frac{ξ^{2} d}{{(1 + g)}^{2}} + \frac{2 ξ g b}{{(1 + g)}^{2}} + \frac{g^{2} c}{{(1 + g)}^{2}} + \frac{2 g p {\hat{σ}}^{2}}{1 + g} - n {\hat{σ}}^{2}, \end{matrix}$ where $a = | | Y | |^{2}, b = Y^{T} X_{12}, c = | | {\overset{⁁}{Y}}_{OLS} | |^{2},$ and $d = | | X_{12} | |^{2} .$ Differentiating $δ_{0} (g, ξ)$ with respect to ξ, we obtain $\frac{\partial δ_{0} (g, ξ)}{\partial ξ} = - \frac{2 b}{1 + g} + \frac{2 ξ d}{{(1 + g)}^{2}} + \frac{2 g b}{{(1 + g)}^{2}} .$

Setting this derivative to zero and rearranging yields: $\begin{matrix} \frac{2 b}{1 + g} = \frac{2 ξ d + 2 g b}{{(1 + g)}^{2}} \\ \Rightarrow b (1 + g) = ξ d + g b \\ \Rightarrow b = ξ d \\ \Rightarrow ξ_{*} = \frac{b}{d} . \end{matrix}$

We now substitute the definitions of b and d back into this expression to obtain (A8) $ξ_{*} = \frac{Y^{T} X_{12}}{| | X_{12} | |^{2}} .$ (A8)

Next, we differentiate $δ_{0} (g, ξ)$ with respect to g: $\begin{matrix} \frac{\partial δ_{0} (Y)}{\partial g} = \frac{2 ξ b}{{(1 + g)}^{2}} - \frac{2 c}{1 + g} + \frac{2 g c}{{(1 + g)}^{2}} - \frac{2 ξ^{2} d}{{(1 + g)}^{3}} + \frac{2 ξ b}{{(1 + g)}^{2}} - \frac{4 ξ g b}{{(1 + g)}^{3}} \\ + \frac{2 g c}{{(1 + g)}^{2}} - \frac{2 g^{2} c}{{(1 + g)}^{3}} + \frac{2 p {\hat{σ}}^{2}}{1 + g} - \frac{2 g p {\hat{σ}}^{2}}{{(1 + g)}^{2}} \\ = \frac{2 p {\hat{σ}}^{2} - 2 c}{1 + g} + \frac{4 ξ b + 4 g c - 2 g p {\hat{σ}}^{2}}{{(1 + g)}^{2}} - \frac{2 ξ^{2} d + 4 ξ g b + 2 g^{2} c}{{(1 + g)}^{3}} \\ = \frac{2 p {\hat{σ}}^{2} - 2 c + 2 g p {\hat{σ}}^{2} - 2 g c}{{(1 + g)}^{2}} + \frac{4 ξ b + 4 g c - 2 g p {\hat{σ}}^{2}}{{(1 + g)}^{2}} - \frac{2 ξ^{2} d + 4 ξ g b + 2 g^{2} c}{{(1 + g)}^{3}} \\ = \frac{4 ξ b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c}{{(1 + g)}^{2}} - \frac{2 ξ^{2} d + 4 ξ g b + 2 g^{2} c}{{(1 + g)}^{3}} . \end{matrix}$

Setting this derivative to zero and rearranging yields: $\begin{matrix} \frac{2 ξ^{2} d + 2 ξ g b + 2 g^{2} c}{{(1 + g)}^{3}} = \frac{4 ξ b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c}{{(1 + g)}^{2}} \\ \Rightarrow 2 ξ^{2} d + 4 ξ g b + 2 g^{2} c = (1 + g) (4 ξ b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c) \\ = 2 ξ b + 2 g c + 2 p {\hat{σ}}^{2} - 2 c + 4 ξ g b + 2 g^{2} c + 2 g p {\hat{σ}}^{2} - 2 g c \\ \Rightarrow 2 ξ^{2} d = 4 ξ b + 2 p {\hat{σ}}^{2} - 2 c + 2 g p {\hat{σ}}^{2} \\ \Rightarrow g_{*} = \frac{c + ξ^{2} d - 2 ξ b}{p {\hat{σ}}^{2}} - 1. \end{matrix}$

We now substitute the definitions of b, c, and d back into this expression to obtain $g_{*} = \frac{| | {\overset{⁁}{Y}}_{OLS} | |^{2} + ξ^{2} | | X_{12} | |^{2} - 2 ξ Y^{T} X_{12}}{p {\hat{σ}}^{2}} - 1.$

Substituting ξ in (A8) into this expression for g yields (A9) $\begin{matrix} g_{*} = \frac{| | {\overset{⁁}{Y}}_{OLS} | |^{2} + {(\frac{Y^{T} X_{12}}{| | X_{12} | |^{2}})}^{2} | | X_{12} | |^{2} - 2 (\frac{Y^{T} X_{12}}{| | X_{12} | |^{2}}) Y^{T} X_{12}}{p {\hat{σ}}^{2}} - 1 \\ = \frac{| | {\overset{⁁}{Y}}_{OLS} | |^{2} + \frac{{(Y^{T} X_{12})}^{2}}{| | X_{12} | |^{2}} - 2 \frac{{(Y^{T} X_{12})}^{2}}{| | X_{12} | |^{2}}}{p {\hat{σ}}^{2}} - 1 \\ = \frac{| | {\overset{⁁}{Y}}_{OLS} | |^{2} | | X_{12} | |^{2} - {(Y^{T} X_{12})}^{2}}{| | X_{12} | |^{2} p {\hat{σ}}^{2}} - 1. \end{matrix}$ (A9)

Finally, the determinant of the Hessian matrix of $δ_{0} (g, ξ),$ denoted $\nabla^{2} δ_{0},$ evaluated at $g = g_{*}$ in (A9) and $ξ = ξ_{*}$ in (A8) is (A10) $\begin{matrix} \det \nabla^{2} δ_{0} |_{(g_{*}, ξ_{*})} = \frac{\partial^{2} δ_{0}}{\partial g^{2}} |_{(g_{*}, ξ_{*})} \frac{\partial^{2} δ_{0}}{\partial ξ^{2}} |_{(g_{*}, ξ_{*})} - {(\frac{\partial^{2} δ_{0}}{\partial g \partial ξ} |_{(g_{*}, ξ_{*})})}^{2} \\ = \frac{- 4 | | X_{12} | |^{12} p^{6} {\hat{σ}}^{12}}{{({(Y^{T} X_{12})}^{2} - | | {\overset{⁁}{Y}}_{OLS} | |^{2} | | X_{12} | |^{2})}^{5}} . \end{matrix}$ (A10)

We must now verify that $\det \nabla^{2} δ_{0} |_{(g_{*}, ξ_{*})} > 0,$ i.e. that $(g_{*}, ξ_{*})$ is indeed an extremum of δ₀. First, we observe that (A11) $| | {\overset{⁁}{Y}}_{OLS} | |^{2} | | X_{12} | |^{2} > {(Y^{T} X_{12})}^{2},$ (A11) by direct application of the Cauchy-Schwarz inequality (recall that $Y^{T} X_{12} = {\overset{⁁}{Y}}_{OLS}^{T} X_{12}$ from (A7)). The inequality is strict because of the assumption that the entries of ${\hat{β}}_{OLS}$ are distinct, which prevents ${\overset{⁁}{Y}}_{OLS}$ and $X_{12}$ from being linearly dependent. Thus, (A11) implies that the denominator in (A10) is strictly negative. The numerator is also strictly negative, because the assumption that gene B is not zero everywhere ensures that $| | X_{12} | |^{2} \neq 0$ (recall that X is defined in (5), and its first two columns consist of gene B’s expression measurements over time and its time integrals). Therefore, (A10) is strictly positive. To verify that $(g_{*}, ξ_{*})$ is indeed a minimizer of δ₀, we now check that $\frac{\partial^{2} δ_{0}}{\partial ξ^{2}} |_{(g_{*}, ξ_{*})} > 0$ as well. We have: $\frac{\partial^{2} δ_{0}}{\partial ξ^{2}} |_{(g_{*}, ξ_{*})} = \frac{2 | | X_{12} | |^{6} p^{2} {\hat{σ}}^{4}}{{({(Y^{T} X_{12})}^{2} - | | {\overset{⁁}{Y}}_{OLS} | |^{2} | | X_{12} | |^{2})}^{2}}$

which is strictly positive. This second derivative calculation, as well as those in (A10), were verified in Mathematica.

Appendix B.

Bayesian lead-lag R² algorithm

Following is an algorithm for computing the Bayesian lead-lag R² for all gene pairs in a time-course gene expression dataset, consisting of N genes measured at n time points $t_{1}, \dots, t_{n} .$

Algorithm 1:

Bayesian lead-lag R² calculations for all gene pairs

Note that this algorithm produces an upper-triangular similarity matrix S resulting from regressing gene i on gene j, for all i < j, and storing the resulting Bayesian $LL R^{2} (i, j)$ value. In the empirical analysis presented in this paper, we set $S_{i, j} = \max {LL R^{2} (i, j), LL R^{2} (j, i)} .$

To instead compute the Bayesian $LL R_{other}^{2}$ from sub-model 1 in (18), it suffices to change the definition of X to $[x_{j} {\tilde{s}}_{j} 1_{n}],$ and to define $β_{0}$ as ${[1, 1, 0]}^{T}$ if $W_{i j} = 1$ and ${[0, 0, 0]}^{T}$ otherwise. To compute the Bayesian $LL R_{own}^{2}$ from sub-model 2 in (19), we change X to $[{\tilde{s}}_{i} t 1_{n}],$ and $β_{0} = [0, 0, 0]$ regardless of the value of $W_{i j} .$ An optimized version of this algorithm runs in 21.8 min for N = 1735 genes on a 2017 3.1 GHz Intel Core i5 MacBook Pro.

Appendix C.

Additional dataset details

In this section, we provide further details on how our primary time-course gene expression dataset was constructed.

We first reduce our set of 12,657 genes into a set of 951 “differentially-expressed” (DE) genes identified in Schlamp et al. (Citation2021), defined as genes satisfying either of the following criteria: 1) there is at least one time point at which the gene’s expression undergoes a $\log_{2}$ -fold change of at least two, or 2) a spline-based method for differential expression analysis returns a significant result. The latter method involves fitting a cubic spline to the temporal expression measurements of a gene under treatment and control settings, and testing whether the difference in the resulting two sets of coefficients is significant. We then add back a set of 784 genes that are not DE by these criteria, but that are “neighbors” of at least one DE gene. We define a neighbor of a DE gene as a non-DE gene that has a STRING score of at least 0.95 with the DE gene. The purpose of adding such neighbors back into the dataset, now consisting of $951 + 784 = 1735$ genes, is to enable more complete biological pathways to be reconstructed from our cluster analysis.

In Section 3.1, we describe several sources of biological information that can be encoded into a prior adjacency matrix W. We choose to use the STRING database, and we mark two genes as “associated” (i.e. $W_{i j} = 1)$ if their STRING score is greater than 0.5. We additionally use replicate information from the time-course dataset to fill in some of the unknown STRING scores. Specifically, if two genes have entries in the STRING database but have an unknown STRING score, we set $W_{i j} = 1$ if the correlation between their replicated temporal expressions is greater than 0.8. We keep $W_{i j} = NA$ for the gene pairs that do not have entries in STRING.

Appendix D.

Additional figures

Figure D1. Sets of genes in the upper-right region of the Bayesian LLR² scatterplot in , several of which have uncharacterized pairwise associations. Left: AttD is involved in immune response against Gram-negative bacteria; it exhibits similar patterns to CR42868 and CG9616, neither of which have known molecular functions according to FlyBase (Larkin et al. Citation2021). Middle: Spn28Dc is involved in response to stimuli and protein metabolism, and scb is involved in cell death and organ development. These two genes display similar expression patterns that are also nearly identical in shape to those of the less well-understood RNAs CR43364 and CR42715. Right: Genes ACC, Idh, and GstE9 are involved in a variety of metabolic processes, but not all of their pairwise interactions are known in STRING.

Figure D1. Sets of genes in the upper-right region of the Bayesian LLR2 scatterplot in Figure 12, several of which have uncharacterized pairwise associations. Left: AttD is involved in immune response against Gram-negative bacteria; it exhibits similar patterns to CR42868 and CG9616, neither of which have known molecular functions according to FlyBase (Larkin et al. Citation2021). Middle: Spn28Dc is involved in response to stimuli and protein metabolism, and scb is involved in cell death and organ development. These two genes display similar expression patterns that are also nearly identical in shape to those of the less well-understood RNAs CR43364 and CR42715. Right: Genes ACC, Idh, and GstE9 are involved in a variety of metabolic processes, but not all of their pairwise interactions are known in STRING.

Figure D2. Sets of genes appearing in the upper-right region of the Bayesian LLR² scatterplot in , all of which have known pairwise associations. Left: Genes known to be involved in proteolysis. Middle: Genes that encode ribosomal proteins. Right: Genes known to be involved in RNA binding.

Figure D2. Sets of genes appearing in the upper-right region of the Bayesian LLR2 scatterplot in Figure 12, all of which have known pairwise associations. Left: Genes known to be involved in proteolysis. Middle: Genes that encode ribosomal proteins. Right: Genes known to be involved in RNA binding.

Figure D3. Clusters of genes obtained using the non-Bayesian LLR² as the similarity metric between genes. Each sub-plot shows the temporal expressions of genes in the corresponding cluster. Clusters were obtained via hierarchical clustering with Ward’s method. The dominant pattern in many clusters is either flat or not immediately discernible, indicating that the non-Bayesian LLR² is alone insufficient for identifying groups of genes whose temporal patterns are of similar shapes over time.

Figure D3. Clusters of genes obtained using the non-Bayesian LLR2 as the similarity metric between genes. Each sub-plot shows the temporal expressions of genes in the corresponding cluster. Clusters were obtained via hierarchical clustering with Ward’s method. The dominant pattern in many clusters is either flat or not immediately discernible, indicating that the non-Bayesian LLR2 is alone insufficient for identifying groups of genes whose temporal patterns are of similar shapes over time.

Figure D4. Clusters of genes obtained using Pearson correlation as the similarity metric between genes. Each sub-plot shows the temporal expressions of genes in the corresponding cluster. In cluster and network analyses of time-course gene expression data, Pearson correlation is one of the more commonly used metrics of association. Clusters were obtained via hierarchical clustering with Ward’s method. The dominant pattern in many clusters is either flat or not immediately discernible, indicating that correlation is alone insufficient for identifying groups of genes whose temporal patterns are of similar shapes over time.

Appendix E.

Additional tables

Table E1. Values of the Bayesian $LL R^{2} - LL R_{own}^{2}$ corresponding to each gene pair in .

Display Table

Table E2. Values of the non-Bayesian $LL R^{2} - LL R_{own}^{2}$ corresponding to each gene pair in .

Display Table

Appendix F.

Additional results

We continue our analysis in Section 4.4 of the results of hierarchical clustering with the Bayesian lead-lag R² (LLR²).

F.1 Analysis of Cluster 2

Cluster 2 contains both up- and down-regulated genes with circadian rhythms, according to GO term enrichment. Several of these genes are displayed in . Among the up-regulated genes are three regulators of the circadian clock (per, vri, Pdp1). A fourth regulator of the circadian clock, Clk, is down-regulated. Pdp1 has been reported to reach its peak expression three to six hours after vri’s peak expression (Cyran et al. Citation2003), a pattern that is visible in this cluster. Cluster 2 further contains genes that are involved in visual perception: two genes encoding rhodopsins (Rh5, Rh6) (Gaudet et al. Citation2011) and Pdh, which encodes a retinal pigment dehydrogenase (Wang et al. Citation2010). Similar to Schlamp et al. (Citation2021), we also found that Smvt and salt, which encode sodium transporters (Gaudet et al. Citation2011), are under circadian control.

Figure F1. Temporal expression patterns of selected genes in cluster 2. Black, red, and orange lines correspond to regulators of the circadian clock (black: per, Clk; orange: vri; red: Pdp1). Pdp1 is known to reach its peak expression after vri does. Pink and green lines correspond to genes that are involved in visual perception (pink: rhodopsins Rh5, Rh6; green: retinal pigment dehydrogenase Pdh). Purple lines correspond to genes that encode sodium transporters (Smvt, salt).

F.2 Analysis of Cluster 4

Genes in cluster 4, some of which are displayed in , are characterized by a transient decrease in expression during the first 24 h after peptidoglycan injection. This cluster was significantly enriched for “carbohydrate metabolic process” (B-H corrected p-value of $2 \times 10^{- 23}$ ). A highly-connected gene involved in carbohydrate metabolism is fbp, which encodes the enzyme fructose-1,6-biphosphatase and has a degree of 102 in our reconstructed network shown in .

Figure F2. Temporal expression patterns of selected genes in cluster 4. Light gray lines in the background correspond to genes with which the gene fbp has a Bayesian $LL R^{2} > 0.9 .$ fbp, shown in red, is known to be involved in carbohydrate metabolism. The three green lines correspond to genes Gale, AGBE, and Gba1b, which also have known roles in carbohydrate metabolism but have an unknown relationship to fbp according to the STRING database. The dark blue line corresponds to the gene fit, whose expression pattern is similar to that of fbp but with much more pronounced down-regulation. fit is known to encode a protein that stimulates insulin signaling, a process that regulates the expression of genes involved in carbohydrate metabolism.

All of the genes to which fbp is connected had NA values in our prior adjacency matrix, meaning that their relationship to fbp is unknown according to our chosen sources of information. Some of these genes include Gale, AGBE, and Gba1b, which have known roles in carbohydrate metabolism. Twenty-two other genes connected to fbp in our network are not well-studied. These connections suggest roles for these uncharacterized genes in carbohydrate metabolism or energy homeostasis. Another gene connected to fbp is fit, which has a similar expression profile as fbp but experiences a much stronger and sharper down- and up-regulation. fit is not directly involved in carbohydrate processing but encodes a secreted protein that stimulates insulin signaling, which in turn regulates the expression of genes involved in carbohydrate metabolism, such as fbp (Sun et al. Citation2017). A previous study also showed that an immune response reduces insulin signaling in Drosphila (DiAngelo et al. Citation2009).

Figure F3. Network of genes formed from the gene fbp and its neighbors. Two genes are connected by an edge if their Bayesian $LL R^{2} > 0.9 .$ Red and blue edges connect genes with known and unknown relationships, respectively. Darkened edges connect genes within cluster 4. Selected genes with uncharacterized relationships to fbp are highlighted: Gale, AGBE, Gba1b.

F.3 Analysis of Cluster 6

Cluster 6 is significantly enriched for GO terms related to metabolic processes, particularly the terms “cellular lipid catabolic process” and “carbohydrate metabolic process” (B–H corrected p-values of $2^{- 10}$ for both). In addition to these metabolic GO terms, there is significant enrichment of genes involved in “phagocytosis” (B–H corrected p-value of 0.03). During an immune response, phagocytosis is the process by which an immune cell engulfs and digests bacteria and apoptotic cells as a way to fight the infection. In Drosophila, phagocytosis is carried out by specialized hemocytes (Drosophila blood cells).

Among the genes in cluster 6 involved in carbohydrate metabolism are five mannosidases (LManI, LManIII, LManIV, LManV, LManVI) and three maltases (Mal-A2, Mal-A3, Mal-A4). These genes encode enzymes that break down complex sugars into simple sugars like glucose. Six genes in cluster 6 are expressed in hemocytes and are involved in phagocytosis. These include four genes that belong to the Nimrod gene family (NimB4, NimC1, NimC2, eater); Hml, which is involved in the clotting reaction in larvae (Goto et al. Citation2003); and Gs1. Gs1 encodes a glutamine synthetase, an enzyme whose action is not unique to hemocytes, but that has been shown to support hemocyte function (Gonzalez et al. Citation2013).

shows that the genes involved in metabolic processes and in phagocytosis exhibit similar expression patterns, with coordinated up- and down-regulation. This coordinated expression is sensible in the context of known hemocyte biology. After an infection, hemocytes undergo a metabolic switch, whereby their energy production is sustained mostly by aerobic glycolysis rather than oxidative phosphorylation (Krejčová et al. Citation2019). Since aerobic glycolysis is dependent on glucose, the simultaneous up-regulation of glucose-producing enzymes and genes needed for phagocytosis is aligned with our expectations. It is also worth noting that the fold changes of genes involved in phagocytosis are generally small, e.g. less than two-fold up-regulation, which is often used as a minimal cutoff in RNA-seq analyses. However, the coordinated expression changes detected by the Bayesian LLR² suggest that these are biologically relevant patterns.

Figure F4. Temporal expression patterns of selected genes in cluster 6 during the first 16 h after peptidoglycan injection. The five red lines and three yellow lines correspond to genes involved in carbohydrate metabolism: respectively, mannosidases (LMan1, LManIII, LManIV, LManV, LManVI) and maltases (Mal-A2, Mal-A3, Mal-A4). The remaining lines correspond to genes that are expressed in hemocytes and are involved in phagocytosis: in blue are genes belonging to the Nimrod family (NimB4, NimC1, NimC3, eater), in purple is Hml, and in green is Gs1.

Appendix G.

Comparison to WGCNA

We now compare our methodology to the well-known weighted correlation network analysis (WGCNA) (Langfelder and Horvath Citation2008), which is often used to identify co-expression networks from gene expression data. To construct such a network, WGCNA uses pairwise correlations between genes to define a weighted adjacency matrix $A$ as $A_{i j} = | Corr (x_{i}, x_{j}) |^{β},$ where $x_{i} \in R^{n}$ is the expression profile of the $i^{th}$ gene and $β \geq 1$ is a soft thresholding power; this β can be chosen according to a scale-free topology criterion proposed by Zhang and Horvath (Citation2005). Modules (clusters of densely connected genes) are then identified by a number of possible methods, the default of which is hierarchical clustering applied to $A .$ The biological significance of each module can be assessed by functional enrichment analysis.

We apply WGCNA to our Drosophila melanogaster gene expression dataset, choosing a soft thresholding parameter β = 9. This results in 20 clusters ranging from 22 to 333 genes, shown in .

Figure G1. Modules (clusters) of genes identified with the WGCNA method. The dominant pattern in many clusters is either flat or not immediately discernible, as is the case when either the non-Bayesian LLR² or Pearson correlation is used as a similarity metric between genes. Note that WGCNA is not explicitly designed for temporal gene expression data.

shows that many of the WGCNA modules do not have an easily identifiable pattern of gene expression. Several modules consist predominantly of genes with mostly flat trajectories, while shows that the Bayesian LLR² largely grouped such genes into at most two clusters (clusters 1 and 5). The genes that the Bayesian LLR² originally placed into the visually distinctive clusters 2 (circadian rhythm genes), 10 (Imd- and Toll-regulated immune response genes) and 12 (ribosomal proteins) also appear to have been split across multiple WGCNA modules. Ultimately, while WGCNA is a helpful resource for co-expression analysis, it does not account for temporal dependence in gene expression measurements. The Bayesian LLR² is more suitable for such data because it is derived from a temporal model of gene expression and leverages prior information about pairwise relationships.

An Empirical Bayes Approach to Estimating Dynamic Models of Co-Regulated Gene Expression

Abstract

1. Introduction

2. Dynamic Models of Gene Expression

2.1. Model Derivation

2.2. Motivating the Bayesian Lead-Lag R2

3. Empirical Bayes Methodology

3.1. Part 1: Leveraging Biological Priors

3.2. Part 2: The Normal-Inverse Gamma Model and Zellner’s g-Prior

3.3. Part 3: Optimal Selection of g in Zellner’s g-Prior

3.4. Part 4: Computing the R2 for Bayesian Regression Models

3.5. Clustering and Empirical Analysis with the Bayesian Lead-Lag R2

3.6. Significance of the Bayesian Lead-Lag R2

3.7. Possible Modifications to Prior Hyperparameters

4. Results

4.1. Analysis of Simulated Data

4.2. Collecting a Time-Course Gene Expression Dataset for Drosophila melanogaster

4.3. Small-Scale Case Study: Immunity and Metabolism

Table 1. Portion of the prior adjacency matrix W corresponding to the six genes in Figure 6.

Table 2. Bayesian LLR2 values corresponding to each gene pair in Figure 6.

Table 3. Non-Bayesian LLR2 values corresponding to each gene pair in Figure 6.

4.4. Biologically-Meaningful Clustering with the Bayesian LLR2

4.4.1. Analysis of Cluster 10

4.4.2. Transcription Factor Targets Cluster Together

4.5. The Bayesian LLR2 Produces a Sparse Set of Associations

5. Discussion

Disclosure statement

Additional information

Funding

Notes

References

Appendix A.

Proofs

Appendix B.

Bayesian lead-lag R2 algorithm

Appendix C.

Additional dataset details

Appendix D.

Additional figures

Appendix E.

Additional tables

Table E1. Values of the Bayesian LLR2−LLRown2 corresponding to each gene pair in Figure 6.

Table E2. Values of the non-Bayesian LLR2−LLRown2 corresponding to each gene pair in Figure 6.

Appendix F.

Additional results

F.1 Analysis of Cluster 2

F.2 Analysis of Cluster 4

F.3 Analysis of Cluster 6

Appendix G.

Comparison to WGCNA

Related research

To cite this article:

Download citation

Information for

Open access

Opportunities

Help and information

Keep up to date

Your download is now in progress and you may close this window

Login or register to access this feature

2.2. Motivating the Bayesian Lead-Lag R²

3.4. Part 4: Computing the R² for Bayesian Regression Models

3.5. Clustering and Empirical Analysis with the Bayesian Lead-Lag R²

3.6. Significance of the Bayesian Lead-Lag R²

Table 1. Portion of the prior adjacency matrix W corresponding to the six genes in .

Table 2. Bayesian LLR² values corresponding to each gene pair in .

Table 3. Non-Bayesian LLR² values corresponding to each gene pair in .

4.4. Biologically-Meaningful Clustering with the Bayesian $LL R^{2}$

4.5. The Bayesian LLR² Produces a Sparse Set of Associations

Bayesian lead-lag R² algorithm

Table E1. Values of the Bayesian $LL R^{2} - LL R_{own}^{2}$ corresponding to each gene pair in .

Table E2. Values of the non-Bayesian $LL R^{2} - LL R_{own}^{2}$ corresponding to each gene pair in .