Abstract
Large-scale observational health databases are increasingly popular for conducting comparative effectiveness and safety studies of medical products. However, increasing number of patients poses computational challenges when fitting survival regression models in such studies. In this article, we use Graphics Processing Units (GPUs) to parallelize the computational bottlenecks of massive sample-size survival analyses. Specifically, we develop and apply time- and memory-efficient single-pass parallel scan algorithms for Cox proportional hazards models and forward-backward parallel scan algorithms for Fine-Gray models for analysis with and without a competing risk using a cyclic coordinate descent optimization approach. We demonstrate that GPUs accelerate the computation of fitting these complex models in large databases by orders of magnitude as compared to traditional multi-core CPU parallelism. Our implementation enables efficient large-scale observational studies involving millions of patients and thousands of patient characteristics. The above implementation is available in the open-source R package Cyclops.
1 Introduction
Increasing accessibility of large-scale observational health data provides rich opportunities to study comparative effectiveness and safety of medical products, but also poses unprecedented challenges. Typical administrative claims and Electronic Health Record (EHR) databases now follow tens to hundreds of millions of individuals (Hripcsak et al. Citation2016) with tens of thousands of possible health conditions, drugs and procedures occurring over decades of patient lives (Suchard et al. Citation2013). The massive scales of these databases offer more power for statistical analyses to learn about the effects of these products on health outcomes but also bring taxing computational burden.
The increasing complexity of common statistical models further exacerbated this big-data problem. For instance, the Cox proportional hazards model and the Fine-Gray model are widely applied in comparative effectiveness and safety studies. The computational complexity of likelihood evaluation for the Cox model and the Fine-Gray model naively grows quadratically with sample size. In addition, some form of regularization (Madigan et al. Citation2010) is often needed to achieve parsimonious model selection when entertaining hundreds to thousands of patient characteristics. This regularization typically requires computationally intensive cross-validation to select the optimal regularization parameter(s), further straining limited computational resources.
One can distribute computationally intensive work to the cloud or dedicated clusters that house multiple Central Processing Units (CPUs) across separate compute nodes that are linked together loosely through an Ethernet or InfiniBand network (Holbrook et al. Citation2021). However, for problems that require communication between nodes, communication latency may become a severe bottleneck. Since fitting survival models generally requires iterative algorithms, the communication latency costs often overpower the gains from the parallelized work within each iteration. To minimized communications between CPUs, we often follow coarse-scale parallelism, where programs are split into small number of large tasks (Barney Citation2010). However, this may result in load imbalance and only achieve “embarrassingly parallel” benefits (Suchard et al. Citation2010). Furthermore, CPU clusters and the cloud can be costly and arcane for many clinical researchers.
Multi-core CPU parallelization is another choice for distributing intensive computational works, as modern CPU chip usually consists of 2–18 or more cores that can run independently. One limitation of this architecture is that all cores share a single “memory bandwidth,” the amount of data that can be written to or read from memory in a given period of time (Holbrook et al. Citation2021). This approach also suffers from the limited number of cores. Thus, multi-core CPU parallelization is often only useful for modest-scale problems.
In contrast, graphics processing units (GPUs) offer a relatively inexpensive and efficient approach for speeding up fine-scale parallel computation. In fine-scale parallelism, we decompose programs into a large number of small tasks to facilitate load balancing and achieve much higher level of parallelism than coarse grain approaches (Barney Citation2010). The GPU is an ideal device for fine-scale parallelism because (a) it consists of hundreds to thousands of compute cores and (b) the shared memory architecture of a GPU’s coupled thread block allows for threads to communicate and share data among each other at a very high speed. Finally, GPUs are often conveniently available on standard laptops and desktop computers and can be externally connected to a personal computer.
Accelerating statistical computing via GPUs is an emerging discipline. As examples, Zhou, Lange, and Suchard (Citation2010) attain 100-fold speedups with GPUs in high-dimensional optimization. Suchard et al. (Citation2013) demonstrate that GPU parallelization achieves one to two orders of magnitude improvement over CPUs for a Bayesian self-controlled case series model. Beam, Ghosh, and Doyle (Citation2016) accelerate Hamiltonian Monte Carlo using GPUs by efficient evaluation of their probability kernel and its gradient. Terenin, Dong, and Draper (Citation2019) demonstrate that Gibbs sampling can run orders of magnitude faster than on a CPU. Holbrook et al. (Citation2021) apply GPU computing to a Bayesian multidimensional scaling model and deliver more than 100-fold speedups over serial calculations. Ko et al. (Citation2022) explore the GPU parallelization for proximal gradient descent on modest-sized regularized dense Cox regression using PyTorch. In this article, we leverage GPU parallelization to the Cox model and the Fine-Gray model through innovative algorithmic mapping that play to the GPU’s strengths for accelerating observational studies using massive healthcare data. Specifically, we identify the computational bottleneck of the Cox model and the Fine-Gray model and take advantage of the cutting-edge GPU-accelerated library CUB (Merrill and Garland Citation2016) that navigate this bottleneck. We further implement our GPU advances in the easy-to-use R package Cyclops (Suchard et al. Citation2013). Our implementation supports a sparse data format, considering that observational healthcare data are generally sparse; the vast majority of patient characteristics are encoded as the presence or absence of some clinical condition, drug exposure, medical procedure or laboratory measurement above or below a cutoff point within given time-frames. We finally demonstrate that our GPU implementation accelerates the computation of fitting these complex models by order-of-magnitude compared to a similar CPU implementation on multiple GPUs and CPUs with different technical specifications.
2. Methods
2.1. Cox Proportional Hazards Model
We first establish notation under a typical survival analysis setting. Suppose there are N observed individuals available in a study. For individual , let represent their survival time, where Ti and Ci are the time-to-event time and right-censoring time, respectively. Let δi be the indicator variable such that if we observe the event occurrence of individual i and if the individual i is censored. Let be a P-dimensional vector of time-independent covariates for individual i. The survival data then consist of triplets .
The cumulative distribution function of survival times gives the probability that the event of interest has occurred by time t, that is, . The survival function gives the probability that the event has not occurred by time t, that is, . Then we define the hazard function of time-to-event time as (1) (1) where is the density function of random variable T.
Let be a P-dimensional vector of unknown, underlying model parameters. Assuming survival times are independent and identically distributed from density and parameterized the survival function , Cox (Citation1972) proposes a semiparametric hazard function as the product of an unspecified baseline hazard function and an exponential link function of covariates: (2) (2)
Parameter estimation of the Cox proportional hazards model follows from the log-partial likelihood (3) (3) where denotes the set of subjects who are “at risk” for event at time Yi. Then one often estimates by its maximum log-partial likelihood estimator .
This log-partial likelihood has a complicated form due to the repeated calculation of the risk sets, and thus brings a high computation burden. In practice, we need to keep track of the sum of many terms for each subject that usually requires number of operations and will explode quickly as N increases (Kawaguchi et al. Citation2020). This computational burden prevents traditional model fitting as there can be millions of observations available in observational health databases.
2.2 Fine-Gray Model
The Fine-Gray model (Fine and Gray Citation1999) generalizes the Cox proportional hazards model for competing risks time-to-event data that consist of more than one type of event. Unlike the standard survival analysis setting such as under the Cox model where individuals are only susceptible to one type of event during follow-up, competing risks arise when individuals can experience more than one type of event and the occurrence of one type of event will either prevent the occurrence or change the underlying risk of the others (Noordzij et al. Citation2013). For individual i, competing risks data inherit the definition of event time Ti, possible right censoring time Ci, event indicator δi, and covariates from the standard survival data setting, and additionally include an event type variable ϵi. Without loss of generality, we assume there are two types of events, where indicates that Ti refers to the time of primary event and indicates the competing risk event.
The Cumulative Incidence Function (CIF) for competing risks data describes the probability of failing from the event of interest before the other possible (competing) event. Under the above setting when ϵ = 1 indicating the event of interest, the CIF and hazards function are defined as (4) (4) (5) (5)
To model the covariate effects on , Fine and Gray (Citation1999) propose the proportional subdistribution hazards function: (6) (6) where is an unspecified baseline subdistribution hazard, and is a P-dimensional vector of model parameters.
Parameter estimation of the Fine-Gray subdistribution proportional hazards model follows from the log-pseudo likelihood (7) (7) where or and denotes the risk set at time Yi, and is a time-dependent weight based on an Inverse Probability of Censoring Weighting (IPCW) technique (Robins and Rotnitzky Citation1992). Assuming two event types exist, the risk set contains two disjoint set: and , where is the regular risk set that includes the observations that have an observed time equaling or after Yi and includes the observations that have observed the competing event before time Yi. Here we follow the design of weighted score functions for incomplete data with right censoring in Fine and Gray (Citation1999) for unbiased estimation from the complete-data partial likelihood. The time-dependent weights are defined as , where is the Kaplan-Meier estimate of G(t) and is the survival function of censoring variable C. Combined, one can estimate by its maximum log-partial likelihood estimator .
2.3 Statistical Regularization
Observational healthcare datasets often include a large number of patient characteristics. For example, administrative claims usually contain information on all drug prescriptions, medical procedures and diagnosis codes for patients, and EHRs generally further contain demographics, medical history notes, laboratory results, and other health status indications (Madigan et al. Citation2014). A statistical regularization approach is typical in such high-dimensional data analysis to avoid overfitting. We can conveniently add a penalty for to the log-partial likelihood of Cox model or the log-pseudo likelihood of Fine-Gray model and estimate through these joint penalized likelihoods to achieve regularization.
For regularization that shrinks many components of to be zero, we define a separable penalty for each dimension βj in through (8) (8) where the tuning parameters γj control the degree of regularization for each dimension. Similarly, one may employ an penalty on the dimensions of , such that (9) (9)
Usually one assumes and , and we choose γ or τ through cross-validation.
Note that statistical inference in the context of regularization remains a challenge. Various standard errors estimators based on the nonparametric bootstrap have been proposed (Casella et al. Citation2010; Chatterjee and Lahiri Citation2011), but an approach that is both computationally efficient and statistically valid still remains out of reach. Since we are focusing on computational bottleneck in this article, we decide to follow the standard practice of regularization in large-scale and high-dimensional observational health studies (Mueller-Using et al. Citation2016; Shortreed and Ertefaie Citation2017; Bramante et al. Citation2021) despite the limitations of regularization.
2.4 Maximum Likelihood Estimation Using Cyclic Coordinate Descent
Following Genkin, Lewis, and Madigan (Citation2007) and Mittal et al. (Citation2014), we exploit a Cyclic Coordinate Descent (CCD) algorithm to reduce the high-dimensional penalized likelihood optimization down to a large set of simple one-dimensional optimizations (Wu and Lange Citation2008). This method cycles through each covariate and updates it using a Newton step while holding all other covariates as constants. The advantage of CCD is it only requires the calculation of scalar gradients and Hessians and avoids the inversion of large Hessian matrices in high-dimensional regression.
Specifically, for each one-dimensional optimization problem, we pick the by maximizing while holding all other βj’s unchanged. The second-order Taylor approximation of the penalized log-likelihood at current βj is (10) (10)
Then the new estimate falls out (11) (11)
We employ a trust region approach similar to Genkin, Lewis, and Madigan (Citation2007) to restrict the step size so that the quadratic remains a reasonable approximation to the objective and improve convergence. In particular, we update βj during iteration k by (12) (12) (13) (13) where we update the trust region half-width as , starting with .
Note that both the negated log-likelihood of the Cox model and the Fine-Gray model are convex in , as well as the and penalty terms. Although the -norm is nondifferentiable at origin, we can follow the approach of Wu and Lange (Citation2008) to compute the directional derivatives along each forward and backward coordinate direction for our objective function. In particular, we compute the directional derivatives in both directions by plugging in and when , and only update βj in a direction when the directional derivative is negative, otherwise we keep . Since the objective function is convex, it is impossible for both directional derivatives to be negative, but either direction with a negative directional derivative will result in a successful update. Although we lack of rigorous proof of convergence when employing a trust region, as the induced step sizes fail to meet the strict convergence conditions for this optimization problem (Xu and Yin Citation2017), we have not observed this issue in our work.
2.5 Massive Parallelization on GPUs
Parallelization through clusters and multi-core CPUs exhibits a number of drawbacks that makes these devices ill-suited for massive survival analysis, as discussed in the Introduction. As such, this article exploits massive parallelization on GPUs through new fine-scale algorithm decomposition for speeding up large-scale computations. Here we begin with an overview on GPU computing and summarize its main strengths and weaknesses.
The modern GPU contains an array of multithreaded Streaming Multiprocessors (SMs), where hundreds to thousands of work threads execute simultaneously (Nickolls et al. Citation2008). Many threads group together as a thread block, in which threads communicate through shared memory and cooperate through barrier synchronization. Thread blocks are further grouped into kernel grids. The programmer specifies the number of threads per block and number of blocks forming the grid. In our code, we program this ensemble via CUDA, a parallel computing platform that allows general-purpose computing on GPUs (GPGPU) using a familiar C-like syntax.
Understanding the memory hierarchy of GPUs is important for achieving optimal performance for parallel programs. Each thread has its own set of processor registers and local memory for thread-private variables, which provide the fastest memory access. Each thread block has a limited shared memory pool that is only visible to the threads within this block. All threads also have access to a large high-bandwidth, but off-chip (global) memory embedded on the GPU card. Shared memory provide high-speed access, while accessing global memory is hundreds of times slower (Micikevicius Citation2009).
In our implementation, GPUs handle only the most computationally intensive tasks. When such a task is scheduled, relevant data are first copied from host CPU memory to the global memory on the GPU device. Then the GPU kernel is launched, which loads data to on-chip memory for defined operations and writes results to global memory. Finally, results are copied from the device back to the host.
This parallel programming model has several limitations that we should keep in mind. First, we should minimize data transfer between the host and device because the transfer is extremely slow. Second, accessing global memory on the GPU is also relatively slow, so we want to minimize the reads from global memory and the writes to global memory. When we do read or write from global memory, we want sequential threads to access sequential addresses in memory. In this manner, the GPU “coalesces” multiple memory requests into a smaller number of 128-byte transactions, and we want to over-subscribe each GPU core with multiple threads, so that the cores remain active through thread-context-switching and the latency in memory access can be hidden. Third, launching a kernel also has overhead on the order of microseconds, so it is preferred to combine a series of kernels into a larger “fused” one. Finally, contemporary GPUs issue single instructions to a “warp” of 32 threads simultaneously, such that all threads within a warp must execute the same instruction each clock cycle. When threads within a warp follow different data-dependent conditional branches, their execution becomes temporally serialized; this can cause a performance penalty. To avoid this issue, one attempts to minimize the number of diverging branches within a warp.
2.6 Tree-based Parallel Algorithms: Reduction and Scan
Here we review two useful building blocks for massively parallel algorithms: reduction and scan. Reductions convert an array of elements into a single result. For example, if the reduction operator is addition, then the reduction takes an array and returns a single value . Reductions are useful for implementing log-likelihood calculations, since independent samples contribute additively to the model log-likelihood. Taking an array , the scan operation returns the array . If we start from the beginning of the input array as above, the resulting array is called a prefix sum. While the resulting array is called a suffix sum if we start from the end and proceed toward the beginning. Scans are useful in accumulating statistics about individuals in the risk set in survival analysis. Implementing a sequential version of a reduction or scan both require ˜n additions on an array of length n.
The parallel versions of reduction and scan use a tree-based approach shown in . Note that effective parallelization of these types of binary tree traversals requires low latency sharing of partial sums across threads with appropriate synchronization, both aspects in which the large number of threads concurrently executing on the GPU greatly outperform multi-core CPU threads.
To obtain a parallel, work-efficient, and communication-avoiding prefix scan, we invoke the CUB library (Merrill and Garland Citation2016). The efficiency of their prefix-scan approaches a simple copy operation, as their prefix-scan requires the optimal ˜2n data movements: n reads and n writes to the global memory. The scan is constructed on two levels of organization: (a) a global device-wide scan and (b) a set of local block-wide scans within each thread block. The local block-wide scan use a reduce-then-scan strategy that can be visually resembled as an “hourglass” shape comprising an up-sweep and a down-sweep as shown in (c). In the up-sweep phase, we traverse the tree from leaves to root for computing the partial sums. Then the running prefixes are aggregated in the block-wide down-sweep traversing back up the tree from root using the partial sums computed in the up-sweep phase. The global scan implementation within CUB applies a single-pass chained-scan approach to achieve just ˜2n global data movements. The global scan further propose a decoupled look-back strategy by assigning each thread block a status flag indicating one of the three status: (a) aggregate (partial sum) of the block is available; (b) prefix of the block is available; and (c) no information about the block is available for other processor. Then each block will perform the computation conditional on its predecessor’s status flag. We refer readers to Merrill and Garland (Citation2016) for more details on this algorithm. In this article, we extend these parallel algorithms for Cox models and Fine-Gray models based on the implementation of reduction and scan from the CUB library.
2.7 GPU Massive Parallelization for Parameter Estimation
CCD is an inherently serial algorithm in which each iteration is based on the result of the last iteration. As we mentioned earlier, even within an iteration t, CCD cycles through each covariate j for one by one and the computation work for the next covariate cannot begin until the current update finishes. This serial algorithm however can still benefit greatly from parallelization by exploiting fine-grain problem decomposition within each iteration. Within each iterate’s covariate update, careful benchmarks reveal that over 95% of the runtime lies in computing the log-likelihood gradient and Hessian .
To understand the computational work, let be an N-dimensional column vector and M be an N × N loading matrix with entries (14) (14)
Recall that the risk set contains all observations that have an observed event time equaling or after Yi. Thus, if we arrange the observations by their observed time Yi in decreasing order, matrix M is clearly a lower triangular, binary matrix, and matrix-vector multiplication becomes a prefix sum over the elements in , where we define exponentiation (exp) as element-wise operation. For example, given , the set consists of all the observations from and the set , then . Making these substitutions in (3), we arrive at (15) (15) where we define forming the logarithm (log) as element-wise operation, and as the prefix sum of arbitrary vector . Then the unidimensional gradient and Hessians under the Cox proportional hazards model falls out as (16) (16) (17) (17) where (18) (18) (19) (19) and vector is the jth column of X. Note that we define here multiplication (×) and division () as element-wise operations as well.
While matrix-vector multiplication involving M takes operations, identifying the cumulative structure reduces the time complexity to linear by calculating prefix sum in series, and parallelization further reduces the time complexity to due to parallel scan’s tree-based structure (Harris, Sengupta, and Owens Citation2007). Finally, the vector-vector multiplication involving can be calculated as an element-wise multiplication in parallel in constant time and a reduction through a binary-tree in .
Under the Fine-Gray model, let be an N-dimensional column vector of precomputed censoring weights described in previous section, and N as an N × N loading matrix with entries (20) (20)
Recall that and and are disjointed, so N is an upper triangular, binary matrix and is a suffix sum if we arrange the observations by their observed time Yi in decreasing order. Then making the following substitutions in (16) and (17) yields the unidimensional gradient and Hessians under the Fine-Gray model (21) (21) (22) (22) where we define as suffix sum of vector .
It is worth noting that the risk set under the competing risk setting consists of two disjoint sets due to multiple types of event, thus, a single pass scan furnishes neither the numerator nor denominator. Instead, the numerator and denominator of (21) and (22) can be computed through a forward (prefix) scan plus a backward (suffix) scan together (Kawaguchi et al. Citation2020).
In summary, we can decompose the calculation of the gradient and Hessian for βj in the following four sequential steps:
1. Read in nonzero xij for and update as
Then perform three element-wise embarrassingly parallel transformations that read from the new estimate of , and output , and . Here we should keep in mind that X is generally sparse such that many elements xij are zeros, and exponentiation is an expensive operation in floating-point.
Scans:
Under the Cox model, we perform three forward scans that take the three output vectors of (1) as input, and return , and .
Under the Fine-Gray model, we perform three forward scans and three backward scans that take the three output vectors of (1) as input, and return , , , , and .
An element-wise transformation that takes the output vectors of (2) as well as the indicator vector as input, and outputs two new vectors and .
Two reductions that take the output vectors of (3) as input, and output two double summations and . Note that the first term in gradient can be precomputed and the value does not change during CCD.
Although parallel computing of the above numerical operations is much faster than serial evaluation, one crucial limitation is that a memory transaction involving reading or writing from global memory may take up to two orders of magnitude more time than a regular numerical operation applied to the value sitting in the limited number of on-chip registers within a GPU (or CPU for that matter) (Holbrook et al. Citation2021). In order to minimize memory transactions, we fuse several of our operations together for both our serial and parallel implementations. First, we fuse the multiple scans in Step 2 together as a tuple-scan, which takes a tuple of three vectors as input, and outputs a tuple of three (or six) vectors. Note that in Fine-Gray model, we perform the forward tuple-scan and the backward tuple-scan on the same tuple of three vectors, but read the input in two opposite directions simultaneously. Similarly, we can fuse the multiple reductions in Step 4 as a tuple-reduction. Since the element-wise transformations in Step 3 take time with GPU parallelization, Step 3 and Step 4 can be regarded as a transformation-reduction. Finally, since both scan and reduction parallelization share the same binary-tree structure, we further fuse Steps 2–4 together into a single kernel. Through fusion, for example, the output tuple from the scans never need to be written to global memory, nor read back for the later transformation-reductions. The fused kernel saves 2/3 of the reads and writes than executing three separated kernels. It is also worth noting that the computational work of the gradient and Hessian evaluation share a similar structure and even some component such as G, such that we have fused the computation of gradient and Hessian together to circumvent unnecessary kernel overhead and facilitate data reuse of these intermediate terms.
To exploit the sparsity of the design matrix X , we parallelize the transformation in Step 1 using a sparse CUDA kernel, which only reads in and processes the nonzero entries while keeping other entries as zeros all the time during CCD updates. This sparse kernel saves data movement as well as reduces memory bandwidth requirements significantly when X is sparse, which is common in real-world scenarios.
illustrates our fused kernel for evaluating the log-likelihood gradient and Hessian on the GPU. In this kernel, we spawn threads, where B is the number of concurrent threads forming a thread block, IPT is the number of work-items (elements of input) evaluated per thread, and denotes the number of thread blocks. Both of the block size B and the thread grain size IPT are constrained by the hardware and are tunable constants. In practice, we follow the parameter settings in CUB with B = 128 and IPT = 15 for the fused kernel. For the sparse kernel we wrote, we choose B = 256 and IPT = 1. In the figure, each box in dashed line represent a thread block and recall that all threads within a block can access a shared memory that is a low-latency and on-chip (Nickolls et al. Citation2008) and is useful for performing the efficient scan in step (2) and reduction in step (4). The threads in parallel first read the values of tuple for the current covariate column j from global memory and then conduct the single-pass adaptive look-back tuple-scan (Merrill and Garland Citation2016) using a reduce-then-scan strategy. Next, the threads read in the values of δ and perform the transformation with the on-chip cumulative sums and δ. The threads then perform a binary-tree tuple-reduction using shared memory, and one thread from each block writes its partial aggregates to global memory. Finally, a single-block reduction kernel sums over the G partial aggregates and writes the gradient and Hessian back to global memory.
When CCD processes the jth covariate, we only need to update βj and corresponding vector if . Recall that the computation of requires and when regularization applies. Since both gradient and Hessian are computed on the GPU, we avoid P data transfers between GPU and CPU in one CCD cycle by using a GPU kernel to check if and then update βj and if needed. details the workflow to implement CCD using GPU parallelization.
It is worth noting that thread-divergent branches in a CUDA kernel can substantially impact performance as execution gets temporally serialized. However, this is not an issue in our implementation because such branch divergence penalties only occur when threads within the same warp, but not across warps, need to execute alternative instructions. The branches in scan and reduction kernel in CUB are mainly due to slightly different instructions for the first processing tile and the last processing tile. Thus, the divergence occurs across warps and does not have an impact on performance penalty.
2.8 Multi-Stream Cross-Validation
We use k-fold cross-validation to search for the optimum tuning parameters γ or τ that controls the strength of regularization. We search for different values for the tuning parameter. For each value, the procedures are as follows:
Randomly split data into k partitions.
For each of the k partitions, we fit survival models via CCD on the remaining k – 1 partitions and compute the predictive log-likelihood of this partition using the estimated .
Average out-of-sample likelihood across all k folds.
Repeat Step 1–Step 3 r times to reduce spurious effects from random data partitioning.
Finally, we select the tuning parameter with the smallest average out-of-sample likelihood as the desired optimal value.
We further improve the efficiency of cross-validation using multi-stream GPU and multi-threaded CPU approaches. Here, a stream denotes a sequence of operations (kernels) that execute in issue-order on a GPU (Rennich Citation2011). Instead of the fitting k partitioned models serially in a single (default) stream, we fit the k models across s streams in parallel, where and each GPU stream is scheduled by an independent CPU thread. Likewise, for the multi-threaded CPU approach, s CPU threads evaluate the k models in parallel.
2.9 Comparison with an Alternative Massive Parallelization of Cox Models
In Ko et al. (Citation2022), the authors presented sample PyTorch code for parallelizing proximal gradient descent on modest-sized regularized dense Cox regression. Here we provide a qualitative comparison of our method with this alternative approach, focusing on the per-cycle cost of cyclic coordinate descent and proximal gradient descent.
A single iteration of proximal gradient descent on -regularized Cox regression requires the following steps:
A matrix-vector multiplication .
An element-wise transformation on the output vector of (1).
A scan on the output vector of (2).
Two matrix-vector multiplications of and , where the matrix is defined as .
Without considering the specific parallel library, the above steps contain operations, respectively. Meanwhile, one sweep of coordinate descent implemented in Cyclops requires no more than operations (the worst case is that the data is dense), as each of the four steps outlined in Section 2.7 only at most requires operations. Additionally, step (4) in the proximal gradient descent approach described earlier can also be reduced to by using the same trick discussed in Section 2.7.
It is important to note that the number of iterations required to achieve an equivalent termination criterion is highly dependent on the data and is beyond the scope of our article.
We would also like to emphasize that one of the main feature of our implementation is that it supports and benefits from a sparse data format when available, as discussed in Section 2.7.
3 Results
We examine the performance of GPU versus CPU computing for fitting our massive sample-size survival models. To accomplish this, we conduct a series of synthetic experiments to investigate the relative compute-time of our parallelization across different sample sizes. We then reproduce a real-world study using our GPU implementation under a Cox model and extend the study to the competing risks setting. If not specified, we use a system equipped with a 10 core 3.3 GHz Intel(R) Xeon(R) W-2155 CPU and an NVIDIA Quadro GV100 with 5120 CUDA cores and 32GB RAM that can achieve up to 7.4 Tflops double-precision point performance.
3.1 Kernel Fusion
Kernel fusion is one of the main strategy we apply for achieving optimal performance. Taking the Cox model as an example, we compare three different implementations for Steps 2–4 in Section 2.7:
Separated kernels
Three separated scans using cub::DeviceScan:: InclusiveSum,
One transformation operation using thrust:: transform, and
Two separated reductions using cub::DeviceReduce:: Sum.
Partially-fused kernels
A tuple-scan using cub::DeviceScan::InclusiveScan, and
A transformed tuple-reduction using cub::TransformInputIterator and cub::DeviceReduce::Reduce.
3. Single fused kernel
We simulate the input vectors with to from . shows the speedup of the fused kernel and the partially fused kernels over the separated kernels. The partially fused kernels are 8–31 times faster than the separated kernels. The single fused kernel further generates up to a 42-fold speedup compared with the separated kernels. shows the runtimes (in milliseconds) in details when . Interestingly, we find performing a tuple-scan on three vectors is almost three times faster than performing three separated scans (0.0036 ms vs. 0.0098 ms), demonstrating the values of fusion.
3.2 Synthetic Experiments
We simulate indicator data X with to samples and P = 1000 covariates with sparsity of 95%, where we randomly choose 5% of the entries uniformly to be 1s. We then draw where the specifies that, on average, 80% of the βj are set to 0 to induce sparsity, and Ti represent the time-to-event time for individual i. For generating competing risk times, we first draw: where is the regression parameter of primary event and is the regression parameter of competing event (Kawaguchi et al. Citation2020). Then we follow the design of Fine and Gray (Citation1999), where the cumulative incidence function (CIF) of primary event is a unit exponential mixture with mass at when : and draw survival time of competing event using an exponential distribution with rate . We set p = 0.5 in practice.
We fit these simulants under a fixed penalty with on a single CPU core and the default CUDA stream. To provide a comprehensive comparison for the performance gains, we conduct the experiments on two systems with different technical specifications. System 1 is equipped with a 3.3 GHz Intel(R) Xeon(R) W-2155 CPU (launch date 2017) and an NVIDIA Quadro GV100 (2018) with 5120 CUDA cores and 32GB RAM that can achieve up to 7.4 Tflops double-precision point performance. System 2 is equipped with a 2.2 GHz Intel(R) Xeon(R) Silver 4214 CPU (2019) and an NVIDIA A100 (2020) with 6912 CUDA cores and 80GB RAM that can achieve up to 9.7 Tflops double-precision point performance. presents runtimes comparisons across computing devices with a fixed number of covariates (P = 1000) with 95% sparsity and varying sample size N. We report both the total model fitting time and the time for computing gradients and Hessians, where the latter is our target of parallelization. On system 1 which is equipped with a powerful CPU, we see that the GPU parallelization delivers up to a 42-fold speedup for both Cox and Fine-Gray models in terms of computing gradient and Hessians. Despite additional data transfer and device initialization, GPU parallelization is still 35 × faster for this Cox model and 39 × faster for this Fine-Gray model relatively to our CPU implementation with one million samples. On system 2 which is equipped with a more powerful GPU, we see that the GPU parallelization achieves up to a 52-fold speedup for this Cox model and a 70-fold speedup for this Fine-Gray model. The data transfer between host and device during CCD updates only accounts for of the total model fitting time. We can see a rapid increase of runtimes on the CPU with increasing sample size, while the GPU approach continues to yield relatively shorter runtimes across varying sample sizes, as the devices is still not fully occupied.
3.3 Multi-Stream Cross-Validated Experiments
We further use these synthetic experiments to explore the performance of our approach using multi-threaded CPU and multi-stream GPU computing by simultaneously searching for an optimal strength of regularization. Here, we use 10-fold cross-validation with 10 repetitions per fold, resulting in 100 cross-validation replicates to estimate an optimal γ under regularization. On the GPU, we distribute the 100 replicates to s CUDA streams driven by s CPU threads. We also allow each of the CPU threads to process the 100 replicates directly on the CPU to demonstrate the performance of corresponding multi-threaded CPU parallelization. shows the runtimes of regularized Cox regression on varying number s of threads or streams. Generally, our parallelization achieves more speedup through multi-stream GPU computing than through multi-core CPU threads alone. For example, the runtime of fitting a Cox model on samples reduces from nearly 9 hr across 8 CPU threads to 14 min across 8 CUDA streams. We also observe that the curves flatten out as number of CUDA streams and CPU threads increases, and this pattern is particularly obvious in the experiments on smaller sample sizes and on the CPU. This pattern suggests that there is both less computation to parallelize with relatively small sample sizes and that multi-core CPU parallelization remains limited by the smaller memory bandwidth available to the CPU.
3.4 Cardiovascular Effectiveness of Antihypertensive Drug Classes
The large-scale evidence generation and evaluation across a network of databases for hypertension (LEGEND-HTN) study (Suchard et al. Citation2019) provided real-world evidence on the comparative effectiveness and safety of five first-line antihypertensive drug classes using a retrospective, comparative new-user cohort design. Specifically, LEGEND-HTN studied the relative risk of 55 health outcomes of interest , including three major cardiovascular events (acute myocardial infarction, hospitalization for heart failure, and stroke), six secondary effectiveness outcomes, and 46 safety outcomes. Within each of nine observational health data sources, LEGEND-HTN employed propensity score matching or stratification for confounding adjustment and Cox proportional hazards modeling for hazard ratio (HR) estimation between new-users of each of the different drug classes. Interestingly, LEGEND-HTN identified that new-users of thiazide or thiazide-like diuretics (THZs) have a lower risk as compared to new-users of angiotensin-converting enzyme inhibitors (ACEIs) in terms of several effectiveness and safety outcomes, even though ACEIs are the most commonly initiated monotherapy across databases.
Here we examine patients initiating ACEIs and THZs, where the outcome is hospitalization for heart failure from the Optum® de-identified Electronic Health Record dataset (Optum EHR). Optum EHR represents data from 85 million individuals that are commercially or Medicare insured in the United States. The data contain medical claims, pharmacy claims, laboratory tests, vital signs, body measurements, and information derived from clinical notes. A total of patients are included in our study, 75.8% of whom initiated an ACEI and 24.2% of whom initiated a THZ. We consider the main treatment covariate (ACEI or THZ exposure), in addition to 9811 baseline patient characteristic covariates. presents a small selection of major patient baseline characteristics. From all baseline characteristics, we build a propensity score model and match ACEI and THZ new-users in a 1:1 ratio. A total of patients were kept for further analysis after propensity score matching. We find ACEI new-users are more likely to be male, have diabetes, hyperlipidemia or heart disease comparing with patients initiating a THZ before propensity score matching. The THZ and ACEi cohorts are well-balanced on all 9811 baseline patient characteristics after matching, identified through low standardized difference of population proportions. The average sparsity of patient characteristic covariates is 97.11%, which means 2.89% of the entries are nonzero, occupying about 2.96GB RAM in sparse matrix format. We first fit a Cox proportional hazards model to estimate the HR between THZ and ACEi initiation for the risk of hospitalization for heart failure. We also fit a Fine-Gray model in which we consider acute myocardial infarction as a competing risk of hospitalization for heart failure, given myocardial infarction substantially elevates the future risk of heart failure. We include all patient characteristics as additional covariates in the Cox and Fine-Gray models with regularization on all variables except the treatment variable to achieve a limited form of adjustment for possible residual confounding, and again employ a 10-fold 10-replicate cross-validation to search for optimal tuning parameters. The analyses require almost two days to fit the regularized Cox model and more than three days to fit the regularized Fine-Gray model using eight CPU cores, while only taking 3.87 hr and 8.57 hr for the regularized Cox model and the regularized Fine-Gray model with our GPU implementation. reports the runtimes for regularized Cox and Fine-Gray models on varying numbers of CUDA streams versus the corresponding multi-core CPU parallelization.
Through massive parallelization, we find that initializing with a THZ shows better effectiveness than initializing with an ACEI in terms of hospitalization for heart failure risk under both the Cox model (HR 0.83, 95% bootstrapped percentile interval [BPI] 0.72–0.94) and the Fine-Gray model (HR 0.83, 95% BPI 0.71–0.95), where we provide the BPIs as crude measures of sampling variability given the challenges in constructing nominal confidence intervals for regularized models (Casella et al. Citation2010; Chatterjee and Lahiri Citation2011). There are 215 and 134 out of 9811 baseline covariates with nonzero effect sizes in the Cox and Fine-Gray models, respectively; these covariates include age, gender, hyperlipidemia, diabetes, osteoarthritis, and heart disease. While our estimates remain in line with the LEGEND-HTN study, they are reassuring in two ways. First, massive parallelization has enabled us to provide additional adjustment for possible residual confounding due to unobserved imbalance after propensity score matching without overfitting through cross-validation. Second, a consistent estimate after controlling for an obvious competing risk reduces concern over bias from informative censoring. Neither including thousands of baseline covariates nor handling a competing risk at this scale were possible in the original LEGEND-HTN study due to their erroneous time requirements; both are now feasible.
4 Discussion
This article presents a time- and memory-efficient GPU implementation of regularized Cox and Fine-Gray regression models for analyzing large-scale time-to-event data with competing risks. We efficiently implement it in the open-source R package Cyclops (Suchard et al. Citation2013). In simulation studies, our GPU implementation is 35–70 times faster than the equivalent CPU implementation with up to 1 million samples. In our real-world example with ∼ 400,000 hypertension patients and ∼ 9000 covariates, massive parallelization reduces the total runtimes of both regularized Cox regression and regularized Fine-Gray regression with cross-validation from a few days on multi-core CPUs to few hours on a GPU.
The observed speed-up is a combination of algorithmic advances as well as hardware optimization. First, we observe that the cumulative structure of the risk set can be computed by a single-pass parallel scan algorithm, which can be very efficiently computed in parallel using a tree-based approach. This enables us to develop a work efficient and communication-avoiding tuple-scan algorithm in accumulating statistics about individuals in the risk set, dramatically speeding up likelihood, gradient, and Hessian evaluations. In particular, this tree-based scan algorithm takes great advantage of the high-speed shared memory of thread block on GPUs. Furthermore, as CCD is an inherently serial algorithm, we fuse multiple serial steps (a) transformations, (b) scans and (c) reductions together into a single vectorizable kernel. This fusion is possible because transformation is relatively lightweight operation, and scan and reduction algorithm share the same binary-tree structure. This kernel fusion reduces expensive memory transactions and kernel overheads which usually prevent the serial algorithm from benefiting by parallel computing. We also exploit the sparsity of the design matrix X through a sparse CUDA kernel to further save data movements. Finally, we only off-load the most computationally intensive routines to the GPU without rewriting the whole codebase.
There are numerous opportunities for improvement. For instance, our multi-stream implementation for k-fold cross-validation has not achieved full concurrency for > 1 replicates, especially for relatively small sample size (∼ 100,000) due to the low GPU utilization. The other potential problem is that we have to replicate the design matrix on each CUDA stream, which increases the memory bandwidth requirement. To overcome these limitations, one may use a single larger kernel to perform many repetitions of k-fold cross-validation on a single stream. By increasing the computational work in a single kernel, we can over-subscribe GPU threads to achieve a higher utilization and improve data reuse, but of course this creates the danger of exhausting register memory. In addition, we plan to add inference on parameter estimates via bootstrapping by synchronizing the computation of sub-samples. Both bootstrapping and cross-validation require independent computations on different sub-samples, which may benefit from parallel computing.
Finally, GPU hardware and libraries are rapidly maturing, and statisticians stand to benefit from GPU parallelization. We conclude this work with a few tips for educational purposes. First, simple building blocks such as scans and reductions are ubiquitous in statistical inference. One can easily apply cutting-edge parallel implementations of these building blocks to gain instant speed-up. Second, kernel fusions should be employed whenever possible for serial algorithms to reduce expensive memory transactions and kernel overheads. Finally, when optimize an existing program, one should consider off-loading the most computationally intensive routines to the GPU in stages, before rewriting the whole codebase.
Supplementary Materials
Cyclops:R-package Cyclops containing code to perform the analysis described in the article is available at https://github.com/OHDSI/Cyclops/tree/gpu_cox.
R scripts for executing the experiments:R scripts for executing the experiments in Section 3. are available at https://github.com/suchard-group/gpu_survival_analysis_supplement. Unfortunately, we are unable to provide access to the Optum de-identified Electronic Health Records due to data licensing constraints. However, we provide the cohorts definition that can reproduce our real-world analysis for reader who do license the Optum de-identified Electronic Health Records data source.
Disclosure Statement
MJS is an employee and share-holder of Johnson & Johnson. The remaining authors report there are no competing interests to declare.
Additional information
Funding
References
- Barney, B. (2010), “Introduction to Parallel Computing,” Lawrence Livermore National Laboratory, 6, 10.
- Beam, A.L., Ghosh, S. K., and Doyle, J. (2016), “Fast Hamiltonian Monte Carlo using GPU Computing,” Journal of Computational and Graphical Statistics, 25, 536–548. DOI: 10.1080/10618600.2015.1035724.
- Bramante, C.T., Ingraham, N.E., MurrayT.A., Marmor, S., Hovertsen, S., Gronski, J., McNeil C., Feng R., Guzman G., Abdelwahab, N., et al. (2021), “Metformin and Risk of Mortality in Patients Hospitalised with Covid-19: A Retrospective Cohort Analysis,” The Lancet Healthy Longevity, 2, e34–e41. DOI: 10.1016/S2666-7568(20)30033-7.
- Casella, G., Ghosh, M., Gill, J., and Kyung, M. (2010), “Penalized Regression, Standard Errors, and Bayesian Lassos,” Bayesian Analysis, 5, 369–411. DOI: 10.1214/10-BA607.
- Chatterjee, A., and Lahiri, S.N. (2011), “Bootstrapping Lasso Estimators,” Journal of the American Statistical Association, 106, 608–625. DOI: 10.1198/jasa.2011.tm10159.
- Cox, D.R. (1972), “Regression Models and Life-Tables,” Journal of the Royal Statistical Society, Series B, 34, 187–202. DOI: 10.1111/j.2517-6161.1972.tb00899.x.
- Fine, J.P., and Gray, R.J. (1999), “A Proportional Hazards Model for the Subdistribution of a Competing Risk,” Journal of the American Statistical Association, 94, 496–509. DOI: 10.1080/01621459.1999.10474144.
- Genkin, A., Lewis, D.D., and Madigan, D. (2007), “Large-Scale Bayesian Logistic Regression for Text Categorization,” Technometrics, 49, 291–304.
- Harris, M., Sengupta, S., and Owens, J.D. (2007), “Parallel Prefix Sum (Scan) with CUDA,” GPU Gems, 3, 851–876.
- Holbrook, A.J., Lemey, P., Baele, G., Dellicour, S., Brockmann, D., Rambaut, A., and Suchard, M.A. (2021), “Massive Parallelization Boosts Big Bayesian Multidimensional Scaling,” Journal of Computational and Graphical Statistics, 30, 11–24. DOI: 10.1080/10618600.2020.1754226.
- Hripcsak, G., Ryan, P.B., Duke, J.D., Shah, N.H., Park, R.W., Huser, V., Suchard, M.A., Schuemie, M.J., DeFalco, F.J., Perotte, A., et al. (2016), “Characterizing Treatment Pathways at Scale Using the OHDSI Network,” Proceedings of the National Academy of Sciences, 113, 7329–7336. DOI: 10.1073/pnas.1510502113.
- Kawaguchi, E.S., Shen, J.I., Suchard, M.A., and Li, G. (2020), “Scalable Algorithms for Large Competing Risks Data,” Journal of Computational and Graphical Statistics, 30, 685–693. DOI: 10.1080/10618600.2020.1841650.
- Ko, S., Zhou, H., Zhou, J.J., and Won, J.-H. (2022), “High-Performance Statistical Computing in the Computing Environments of the 2020s,” Statistical Science, 37, 494–518.
- Madigan, D., Ryan, P., Simpson, S., and Zorych, I. (2010), “Bayesian Methods in Pharmacovigilance,” Bayesian Statistics, 9, 421–438.
- Madigan, D., Stang, P.E., Berlin, J.A., Schuemie, M., Overhage, J.M., Suchard M.A., Dumouchel, B., Hartzema, A.G., and Ryan, P.B. (2014), “A Systematic Statistical Approach to Evaluating Evidence from Observational Studies,” Annual Review of Statistics and Its Application, 1, 11–39. DOI: 10.1146/annurev-statistics-022513-115645.
- Merrill, D., and Garland, M. (2016), “Single-Pass Parallel Prefix Scan with Decoupled Look-Back,” NVIDIA, Tech. Rep. NVR-2016-002.
- Micikevicius, P. (2009), “3D Finite Difference Computation on GPUs Using CUDA,” in Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp. 79–84. DOI: 10.1145/1513895.1513905.
- Mittal, S., Madigan, D., Burd, R.S., and Suchard, M.A. (2014), “High-Dimensional, Massive Sample-Size Cox Proportional Hazards Regression for Survival Analysis,” Biostatistics, 15, 207–221.
- Mueller-Using, S., Feldt, T., Sarfo, F.S., and Eberhardt, K.A. (2016), “Factors Associated with Performing Tuberculosis Screening of HIV-Positive Patients in Ghana: Lasso-based Predictor Selection in a Large Public Health Data Set,” BMC Public Health, 16, 1–8.
- Nickolls, J., Buck, I., Garland, M., and Skadron, K. (2008), “Scalable Parallel Programming with CUDA,” Queue, 6, 40–53. DOI: 10.1145/1365490.1365500.
- Noordzij, M., Leffondré, K., van Stralen, K.J., Zoccali, C., Dekker, F.W., and Jager, K.J. (2013), “When Do We Need Competing Risks Methods for Survival Analysis in Nephrology?” Nephrology Dialysis Transplantation, 28, 2670–2677. DOI: 10.1093/ndt/gft355.
- Rennich, S. (2011), “CUDA C/C++ Streams and Concurrency,” in GPU Technology Conference.
- Robins, J.M., and Rotnitzky, A. (1992), “Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers,” in AIDS Epidemiology, eds. N. P. Jewell, K. Dietz, and V. T. Farewell, pp. 297–331, Boston, MA: Springer.
- Shortreed, S.M., and Ertefaie, A. (2017), “Outcome-Adaptive Lasso: Variable Selection for Causal Inference,” Biometrics, 73, 1111–1122.
- Suchard, M.A., Wang, Q., Chan, C., Frelinger, J., Cron, A., and West, M. (2010), “Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures,” Journal of Computational and Graphical Statistics, 19, 419–438. DOI: 10.1198/jcgs.2010.10016.
- Suchard, M.A., Simpson, S.E., Zorych, I., Ryan, P., and Madigan, D. (2013), “Massive Parallelization of Serial Inference Algorithms for a Complex Generalized Linear Model,” ACM Transactions on Modeling and Computer Simulation (TOMACS), 23, 1–17. DOI: 10.1145/2414416.2414791.
- Suchard, M.A., Schuemie, M.J., Krumholz, H.M., You, S.C., Chen, R., Pratt, N., Reich, C.G., Duke, J., Madigan, D., Hripcsak, G., et al. (2019), “Comprehensive Comparative Effectiveness and Safety of First-Line Antihypertensive Drug Classes: A Systematic, Multinational, Large-Scale Analysis,” The Lancet, 394, 1816–1826.
- Terenin, A., Dong, S., and Draper, D. (2019), “GPU-Accelerated Gibbs Sampling: A Case Study of the Horseshoe Probit Model,” Statistics and Computing, 29, 301–310.
- Wu, T.T., and Lange, K. (2008), “Coordinate Descent Algorithms for Lasso Penalized Regression,” The Annals of Applied Statistics, 2, 224–244. DOI: 10.1214/07-AOAS147.
- Xu, Y., and Yin, W. (2017), “A Globally Convergent Algorithm for Nonconvex Optimization based on Block Coordinate Update,” Journal of Scientific Computing, 72, 700–734.
- Zhou, H., Lange, K., and Suchard, M.A. (2010), “Graphics Processing Units and High-Dimensional Optimization,” Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 25, 311–324. DOI: 10.1214/10-STS336.