
Massive Parallelization of Massive Sample-Size Survival Analysis

Pages 289-302 | Received 12 Apr 2022, Accepted 05 May 2023, Published online: 26 Jun 2023

Abstract

Large-scale observational health databases are increasingly popular for conducting comparative effectiveness and safety studies of medical products. However, the increasing number of patients poses computational challenges when fitting survival regression models in such studies. In this article, we use Graphics Processing Units (GPUs) to parallelize the computational bottlenecks of massive sample-size survival analyses. Specifically, we develop and apply time- and memory-efficient single-pass parallel scan algorithms for Cox proportional hazards models and forward-backward parallel scan algorithms for Fine-Gray models for analysis with and without a competing risk, using a cyclic coordinate descent optimization approach. We demonstrate that GPUs accelerate the computation of fitting these complex models in large databases by orders of magnitude as compared to traditional multi-core CPU parallelism. Our implementation enables efficient large-scale observational studies involving millions of patients and thousands of patient characteristics, and is available in the open-source R package Cyclops.

1 Introduction

The increasing accessibility of large-scale observational health data provides rich opportunities to study the comparative effectiveness and safety of medical products, but also poses unprecedented challenges. Typical administrative claims and Electronic Health Record (EHR) databases now follow tens to hundreds of millions of individuals (Hripcsak et al. 2016) with tens of thousands of possible health conditions, drugs and procedures occurring over decades of patient lives (Suchard et al. 2013). The massive scales of these databases offer more power for statistical analyses to learn about the effects of these products on health outcomes, but also bring a taxing computational burden.

The increasing complexity of common statistical models further exacerbates this big-data problem. For instance, the Cox proportional hazards model and the Fine-Gray model are widely applied in comparative effectiveness and safety studies. The computational complexity of likelihood evaluation for the Cox model and the Fine-Gray model naively grows quadratically with sample size. In addition, some form of regularization (Madigan et al. 2010) is often needed to achieve parsimonious model selection when entertaining hundreds to thousands of patient characteristics. This regularization typically requires computationally intensive cross-validation to select the optimal regularization parameter(s), further straining limited computational resources.

One can distribute computationally intensive work to the cloud or dedicated clusters that house multiple Central Processing Units (CPUs) across separate compute nodes linked together loosely through an Ethernet or InfiniBand network (Holbrook et al. 2021). However, for problems that require communication between nodes, communication latency may become a severe bottleneck. Since fitting survival models generally requires iterative algorithms, the communication latency costs often overpower the gains from the parallelized work within each iteration. To minimize communication between CPUs, we often follow coarse-scale parallelism, where programs are split into a small number of large tasks (Barney 2010). However, this may result in load imbalance and only achieve “embarrassingly parallel” benefits (Suchard et al. 2010). Furthermore, CPU clusters and the cloud can be costly and arcane for many clinical researchers.

Multi-core CPU parallelization is another choice for distributing intensive computational work, as a modern CPU chip usually consists of 2–18 or more cores that can run independently. One limitation of this architecture is that all cores share a single “memory bandwidth,” the amount of data that can be written to or read from memory in a given period of time (Holbrook et al. 2021). This approach also suffers from the limited number of cores. Thus, multi-core CPU parallelization is often only useful for modest-scale problems.

In contrast, graphics processing units (GPUs) offer a relatively inexpensive and efficient approach for speeding up fine-scale parallel computation. In fine-scale parallelism, we decompose programs into a large number of small tasks to facilitate load balancing and achieve a much higher level of parallelism than coarse-grain approaches (Barney 2010). The GPU is an ideal device for fine-scale parallelism because (a) it consists of hundreds to thousands of compute cores and (b) the shared-memory architecture of a GPU's thread blocks allows threads to communicate and share data with one another at very high speed. Finally, GPUs are often conveniently available on standard laptops and desktop computers and can be externally connected to a personal computer.

Accelerating statistical computing via GPUs is an emerging discipline. As examples, Zhou, Lange, and Suchard (2010) attain 100-fold speedups with GPUs in high-dimensional optimization. Suchard et al. (2013) demonstrate that GPU parallelization achieves one to two orders of magnitude improvement over CPUs for a Bayesian self-controlled case series model. Beam, Ghosh, and Doyle (2016) accelerate Hamiltonian Monte Carlo using GPUs through efficient evaluation of their probability kernel and its gradient. Terenin, Dong, and Draper (2019) demonstrate that Gibbs sampling can run orders of magnitude faster on a GPU than on a CPU. Holbrook et al. (2021) apply GPU computing to a Bayesian multidimensional scaling model and deliver more than 100-fold speedups over serial calculations. Ko et al. (2022) explore GPU parallelization of proximal gradient descent for modest-sized l1-regularized dense Cox regression using PyTorch. In this article, we bring GPU parallelization to the Cox model and the Fine-Gray model through an algorithmic mapping that plays to the GPU's strengths, accelerating observational studies that use massive healthcare data. Specifically, we identify the computational bottleneck of the Cox model and the Fine-Gray model and take advantage of the cutting-edge GPU-accelerated library CUB (Merrill and Garland 2016) to navigate this bottleneck. We further implement our GPU advances in the easy-to-use R package Cyclops (Suchard et al. 2013). Our implementation supports a sparse data format, considering that observational healthcare data are generally sparse; the vast majority of patient characteristics are encoded as the presence or absence of some clinical condition, drug exposure, medical procedure, or laboratory measurement above or below a cutoff point within given time-frames. We finally demonstrate that our GPU implementation accelerates the computation of fitting these complex models by orders of magnitude compared with a similar CPU implementation, across multiple GPUs and CPUs with different technical specifications.

2 Methods

2.1 Cox Proportional Hazards Model

We first establish notation under a typical survival analysis setting. Suppose there are N observed individuals available in a study. For individual $i = 1, \ldots, N$, let $Y_i = \min(T_i, C_i)$ represent their observed survival time, where $T_i$ and $C_i$ are the time-to-event time and right-censoring time, respectively. Let $\delta_i$ be the indicator variable such that $\delta_i = 1$ if we observe the event occurrence of individual i and $\delta_i = 0$ if individual i is censored. Let $x_i$ be a P-dimensional vector of time-independent covariates for individual i. The survival data then consist of the triplets $\{Y_i, \delta_i, x_i\}_{i=1}^{N}$.

The cumulative distribution function of survival times gives the probability that the event of interest has occurred by time t, that is, $F(t \mid x) = \Pr(T \le t \mid x)$. The survival function gives the probability that the event has not occurred by time t, that is, $S(t) = \Pr(T > t)$. Then we define the hazard function of the time-to-event time as
(1) $h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)},$
where $f(t) = \frac{d}{dt} F(t)$ is the density function of the random variable T.

Let $\beta = (\beta_1, \beta_2, \ldots, \beta_P)^\top$ be a P-dimensional vector of unknown, underlying model parameters. Assuming survival times $y_1, y_2, \ldots, y_N$ are independent and identically distributed from density $f(y \mid \beta)$ and that $\beta$ parameterizes the survival function $S(y \mid \beta)$, Cox (1972) proposes a semiparametric hazard function as the product of an unspecified baseline hazard function $h_0(y_i)$ and an exponential link function of covariates:
(2) $h(y_i \mid \beta) = h_0(y_i) \exp(x_i^\top \beta).$

Parameter estimation of the Cox proportional hazards model follows from the log-partial likelihood
(3) $\ell_{\text{partial}}(\beta) = \sum_{i=1}^{N} \delta_i \left\{ x_i^\top \beta - \log\left[ \sum_{r \in R_1(Y_i)} \exp(x_r^\top \beta) \right] \right\},$
where $R_1(Y_i) = \{ r : Y_r \ge Y_i \}$ denotes the set of subjects who are “at risk” for the event at time $Y_i$. One then often estimates $\beta$ by its maximum log-partial likelihood estimator $\hat{\beta}_{\text{MPLE}} = \arg\max_\beta \{\ell_{\text{partial}}(\beta)\}$.

This log-partial likelihood has a complicated form due to the repeated calculation of the risk sets, and thus brings a high computational burden. In practice, we need to keep track of a sum over the risk set for each subject, which usually requires $O(N^2)$ operations and explodes quickly as N increases (Kawaguchi et al. 2020). This computational burden prevents traditional model fitting, as there can be millions of observations available in observational health databases.
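To make the quadratic cost concrete, the short serial C++ sketch below (our own illustration, not the Cyclops source) evaluates (3) naively, assuming for simplicity that observations are pre-sorted by decreasing observed time; Section 2.7 shows how a cumulative-sum reformulation removes the inner loop.

```cpp
#include <cmath>
#include <vector>

// Naive O(N^2) evaluation of the Cox log-partial likelihood (3).
// xb[i] = x_i' * beta, delta[i] = event indicator; observations are sorted by
// decreasing observed time Y_i, so the risk set of subject i is {0, ..., i}.
double coxLogPartialNaive(const std::vector<double>& xb, const std::vector<int>& delta) {
    double logLik = 0.0;
    for (std::size_t i = 0; i < xb.size(); ++i) {
        if (delta[i] == 0) continue;
        double denominator = 0.0;
        for (std::size_t r = 0; r <= i; ++r) {   // re-sums the risk set R_1(Y_i) for every event
            denominator += std::exp(xb[r]);
        }
        logLik += xb[i] - std::log(denominator);
    }
    return logLik;
}
```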

2.2 Fine-Gray Model

The Fine-Gray model (Fine and Gray 1999) generalizes the Cox proportional hazards model to competing risks time-to-event data that consist of more than one type of event. Unlike the standard survival analysis setting, such as under the Cox model, where individuals are only susceptible to one type of event during follow-up, competing risks arise when individuals can experience more than one type of event and the occurrence of one type of event either prevents the occurrence or changes the underlying risk of the others (Noordzij et al. 2013). For individual i, competing risks data inherit the definition of event time $T_i$, possible right-censoring time $C_i$, event indicator $\delta_i$, and covariates $x_i$ from the standard survival data setting, and additionally include an event type variable $\epsilon_i$. Without loss of generality, we assume there are two types of events, where $\epsilon_i = 1$ indicates that $T_i$ refers to the time of the primary event and $\epsilon_i = 2$ indicates the competing event.

The Cumulative Incidence Function (CIF) for competing risks data describes the probability of failing from the event of interest before the other possible (competing) event. Under the above setting, with $\epsilon = 1$ indicating the event of interest, the CIF and the hazard function are defined as
(4) $F_1(t \mid x) = \Pr(T \le t, \epsilon = 1 \mid x)$, and
(5) $h_1(t \mid x) = \lim_{\Delta t \to 0} \frac{\Pr\{ t \le T < t + \Delta t, \epsilon = 1 \mid \{T \ge t\} \cup (\{T < t\} \cap \{\epsilon \ne 1\}), x \}}{\Delta t} = -\frac{d}{dt}\log\{1 - F_1(t \mid x)\}.$

To model the covariate effects on $F_1(t \mid x)$, Fine and Gray (1999) propose the proportional subdistribution hazards function:
(6) $h_1(y_i \mid \beta) = h_{10}(y_i)\exp(x_i^\top \beta),$
where $h_{10}(y_i)$ is an unspecified baseline subdistribution hazard, and $\beta$ is a P-dimensional vector of model parameters.

Parameter estimation of the Fine-Gray subdistribution proportional hazards model follows from the log-pseudo likelihood
(7) $\ell_{\text{pseudo}}(\beta) = \sum_{i=1}^{N} I(\delta_i \epsilon_i = 1) \left\{ x_i^\top \beta - \log\left[ \sum_{r \in R(Y_i)} \hat{w}_r(Y_i) \exp(x_r^\top \beta) \right] \right\},$
where $R(Y_i) = \{ r : (Y_r \ge Y_i) \text{ or } (Y_r < Y_i \text{ and } \epsilon_r \ne 1) \}$ denotes the risk set at time $Y_i$, and $\hat{w}_r(t)$ is a time-dependent weight based on an Inverse Probability of Censoring Weighting (IPCW) technique (Robins and Rotnitzky 1992). Assuming two event types exist, the risk set $R(Y_i)$ contains two disjoint sets: $R_1(Y_i) = \{ r : Y_r \ge Y_i \}$ and $R_2(Y_i) = \{ r : Y_r < Y_i, \epsilon_r = 2 \}$, where $R_1(Y_i)$ is the regular risk set that includes the observations with an observed time at or after $Y_i$, and $R_2(Y_i)$ includes the observations that have experienced the competing event before time $Y_i$. Here we follow the design of weighted score functions for incomplete data with right censoring in Fine and Gray (1999) for unbiased estimation from the complete-data partial likelihood. The time-dependent weights are defined as $\hat{w}_r(t) = I(C_r \ge \min(T_r, t))\, \hat{G}(t) / \hat{G}(\min(Y_r, t))$, where $\hat{G}(t)$ is the Kaplan-Meier estimate of $G(t)$ and $G(t) = \Pr(C > t)$ is the survival function of the censoring variable C. Combined, one can estimate $\beta$ by its maximum log-pseudo likelihood estimator $\hat{\beta}_{\text{MPLE}} = \arg\max_\beta \{\ell_{\text{pseudo}}(\beta)\}$.

2.3 Statistical Regularization

Observational healthcare datasets often include a large number of patient characteristics. For example, administrative claims usually contain information on all drug prescriptions, medical procedures and diagnosis codes for patients, and EHRs generally further contain demographics, medical history notes, laboratory results, and other health status indications (Madigan et al. 2014). A statistical regularization approach is typical in such high-dimensional data analysis to avoid overfitting. We can conveniently add a penalty $\pi(\beta)$ for $\beta$ to the log-partial likelihood of the Cox model or the log-pseudo likelihood of the Fine-Gray model and estimate $\beta$ through these joint penalized likelihoods to achieve regularization.

For l1 regularization that shrinks many components of $\beta$ to zero, we define a separable penalty for each dimension $\beta_j$ of $\beta$ through
(8) $\pi(\beta) = \sum_j \pi(\beta_j \mid \gamma_j) = \sum_j \gamma_j |\beta_j|,$
where the tuning parameters $\gamma_j$ control the degree of regularization for each dimension. Similarly, one may employ an l2 penalty on the dimensions of $\beta$, such that
(9) $\pi(\beta) = \sum_j \pi(\beta_j \mid \tau_j) = \sum_j \frac{\beta_j^2}{2\tau_j}.$

Usually one assumes $\gamma_j = \gamma$ for all j and $\tau_j = \tau$ for all j, and we choose $\gamma$ or $\tau$ through cross-validation.

Note that statistical inference in the context of regularization remains a challenge. Various standard error estimators based on the nonparametric bootstrap have been proposed (Casella et al. 2010; Chatterjee and Lahiri 2011), but an approach that is both computationally efficient and statistically valid still remains out of reach. Since we focus on the computational bottleneck in this article, we follow the standard practice of regularization in large-scale and high-dimensional observational health studies (Mueller-Using et al. 2016; Shortreed and Ertefaie 2017; Bramante et al. 2021) despite the limitations of regularization.

2.4 Maximum Likelihood Estimation Using Cyclic Coordinate Descent

Following Genkin, Lewis, and Madigan (2007) and Mittal et al. (2014), we exploit a Cyclic Coordinate Descent (CCD) algorithm to reduce the high-dimensional penalized likelihood optimization down to a large set of simple one-dimensional optimizations (Wu and Lange 2008). This method cycles through each covariate and updates it using a Newton step while holding all other covariates constant. The advantage of CCD is that it only requires the calculation of scalar gradients and Hessians and avoids the inversion of large Hessian matrices in high-dimensional regression.

Specifically, for each one-dimensional optimization problem, we pick the new $\beta_j$ by maximizing $g(\beta_j) = \ell(\beta_j) + \pi(\beta_j)$ while holding all other coordinates unchanged. The second-order Taylor approximation of the penalized log-likelihood at the current $\beta_j$ is
(10) $g(z) \approx g(\beta_j) + g'(\beta_j)(z - \beta_j) + \tfrac{1}{2} g''(\beta_j)(z - \beta_j)^2.$

Then the new estimate $\beta_j^{\text{(new)}}$ falls out as
(11) $\beta_j^{\text{(new)}} = \beta_j + \Delta\beta_j = \beta_j - \frac{g'(\beta_j)}{g''(\beta_j)}.$

We employ a trust region approach similar to Genkin, Lewis, and Madigan (2007) to restrict the step size so that the quadratic remains a reasonable approximation to the objective and to improve convergence. In particular, we update $\beta_j$ during iteration k by
(12) $\Delta\beta_j^{(k)} = -\frac{g'(\beta_j^{(k)})}{g''(\beta_j^{(k)})}$, and
(13) $\beta_j^{(k+1)} = \beta_j^{(k)} + \operatorname{sgn}(\Delta\beta_j^{(k)}) \min\left\{ |\Delta\beta_j^{(k)}|, \Delta_j^{(k)} \right\},$
where we update the trust region half-width as $\Delta_j^{(k+1)} = \max\{ 2|\Delta\beta_j^{(k)}|, \Delta_j^{(k)}/2 \}$, starting with $\Delta_j^{(0)} = 1$.
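As a minimal illustration of (11)–(13), the host-side C++ sketch below (our own, with illustrative names, not the Cyclops source) applies one trust-region-restricted Newton update to a single coordinate, given its scalar gradient and Hessian.

```cpp
#include <algorithm>
#include <cmath>

// One CCD coordinate update with a trust region, following (11)-(13).
// grad and hess are g'(beta_j) and g''(beta_j) at the current beta_j;
// trust is the half-width Delta_j, updated in place. Returns the applied step.
double ccdTrustRegionUpdate(double& beta_j, double grad, double hess, double& trust) {
    double newton = -grad / hess;                                             // Newton step (12)
    double step = std::copysign(std::min(std::abs(newton), trust), newton);   // clip to trust region (13)
    beta_j += step;
    trust = std::max(2.0 * std::abs(newton), trust / 2.0);                    // expand or shrink half-width
    return step;
}
```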

Note that the negated log-likelihoods of both the Cox model and the Fine-Gray model are convex in $\beta$, as are the l1 and l2 penalty terms. Although the l1-norm is nondifferentiable at the origin, we can follow the approach of Wu and Lange (2008) and compute the directional derivatives of our objective function along each forward and backward coordinate direction. In particular, when $\beta_j^{(k)} = 0$, we compute the directional derivatives in both directions by plugging in $\operatorname{sgn}(\beta_j) = +1$ and $\operatorname{sgn}(\beta_j) = -1$, and we only update $\beta_j$ in a direction whose directional derivative is negative; otherwise we keep $\beta_j^{(k+1)} = 0$. Since the objective function is convex, it is impossible for both directional derivatives to be negative, but either direction with a negative directional derivative will result in a successful update. Although we lack a rigorous proof of convergence when employing a trust region, as the induced step sizes fail to meet the strict convergence conditions for this optimization problem (Xu and Yin 2017), we have not observed convergence failures in our work.

2.5 Massive Parallelization on GPUs

Parallelization through clusters and multi-core CPUs exhibits a number of drawbacks that make these devices ill-suited for massive survival analysis, as discussed in the Introduction. As such, this article exploits massive parallelization on GPUs through a new fine-scale algorithmic decomposition for speeding up large-scale computations. Here we begin with an overview of GPU computing and summarize its main strengths and weaknesses.

The modern GPU contains an array of multithreaded Streaming Multiprocessors (SMs), where hundreds to thousands of work threads execute simultaneously (Nickolls et al. 2008). Many threads group together as a thread block, in which threads communicate through shared memory and cooperate through barrier synchronization. Thread blocks are further grouped into kernel grids. The programmer specifies the number of threads per block and the number of blocks forming the grid. In our code, we program this ensemble via CUDA, a parallel computing platform that allows general-purpose computing on GPUs (GPGPU) using a familiar C-like syntax.

Understanding the memory hierarchy of GPUs is important for achieving optimal performance of parallel programs. Each thread has its own set of processor registers and local memory for thread-private variables, which provide the fastest memory access. Each thread block has a limited shared memory pool that is only visible to the threads within that block. All threads also have access to a large, high-bandwidth, but off-chip (global) memory embedded on the GPU card. Shared memory provides high-speed access, while accessing global memory is hundreds of times slower (Micikevicius 2009).

In our implementation, GPUs handle only the most computationally intensive tasks. When such a task is scheduled, relevant data are first copied from host CPU memory to the global memory on the GPU device. Then the GPU kernel is launched, which loads data to on-chip memory for defined operations and writes results to global memory. Finally, results are copied from the device back to the host.

This parallel programming model has several limitations that we should keep in mind. First, we should minimize data transfer between the host and device because the transfer is extremely slow. Second, accessing global memory on the GPU is also relatively slow, so we want to minimize the reads from and the writes to global memory. When we do read or write global memory, we want sequential threads to access sequential addresses in memory; in this manner, the GPU “coalesces” multiple memory requests into a smaller number of 128-byte transactions. We also want to over-subscribe each GPU core with multiple threads, so that the cores remain active through thread context switching and the latency of memory access can be hidden. Third, launching a kernel also has overhead on the order of microseconds, so it is preferable to combine a series of kernels into a larger “fused” one. Finally, contemporary GPUs issue single instructions to a “warp” of 32 threads simultaneously, such that all threads within a warp must execute the same instruction each clock cycle. When threads within a warp follow different data-dependent conditional branches, their execution becomes temporally serialized, which can cause a performance penalty. To avoid this issue, one attempts to minimize the number of diverging branches within a warp.
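The minimal CUDA sketch below (illustrative only, not a Cyclops kernel) shows these principles for a simple element-wise transformation: sequential threads touch sequential addresses, so the accesses coalesce, and a grid-stride loop over-subscribes the cores with work to help hide memory latency.

```cuda
#include <cuda_runtime.h>

// Element-wise exponentiation y[i] = exp(x[i]). Consecutive threads read and
// write consecutive addresses, so the hardware coalesces the requests into a
// small number of wide transactions; the grid-stride loop keeps cores busy.
__global__ void expKernel(const double* x, double* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        y[i] = exp(x[i]);
    }
}

// Example launch: 128 threads per block, enough blocks to cover n elements.
// expKernel<<<(n + 127) / 128, 128>>>(d_x, d_y, n);
```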

2.6 Tree-based Parallel Algorithms: Reduction and Scan

Here we review two useful building blocks for massively parallel algorithms: reduction and scan. A reduction converts an array of elements into a single result. For example, if the reduction operator is addition, then the reduction takes an array $[a_0, a_1, \ldots, a_{n-1}]$ and returns the single value $\sum_{i=0}^{n-1} a_i$. Reductions are useful for implementing log-likelihood calculations, since independent samples contribute additively to the model log-likelihood. Taking an array $[a_0, a_1, \ldots, a_{n-1}]$, the scan operation returns the array $[a_0, a_0 + a_1, \ldots, \sum_{i=0}^{n-1} a_i]$. If we start from the beginning of the input array as above, the resulting array is called a prefix sum; if we instead start from the end and proceed toward the beginning, the resulting array is called a suffix sum. Scans are useful for accumulating statistics about individuals in the risk set in survival analysis. Sequential implementations of a reduction and of a scan each require ∼n additions on an array of length n.
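As a tiny concrete illustration (our own, using only standard-library routines), the snippet below computes the three operations on the array [1, 2, 3, 4], yielding the reduction 10, the prefix sum [1, 3, 6, 10], and the suffix sum [10, 9, 7, 4].

```cpp
#include <numeric>
#include <vector>

int main() {
    std::vector<int> a = {1, 2, 3, 4};
    int total = std::accumulate(a.begin(), a.end(), 0);        // reduction  -> 10
    std::vector<int> prefix(a.size()), suffix(a.size());
    std::partial_sum(a.begin(), a.end(), prefix.begin());      // prefix sum -> 1, 3, 6, 10
    std::partial_sum(a.rbegin(), a.rend(), suffix.rbegin());   // suffix sum -> 10, 9, 7, 4
    return total == 10 ? 0 : 1;
}
```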

The parallel versions of reduction and scan use the tree-based approach shown in Figure 1. Note that effective parallelization of these types of binary tree traversals requires low-latency sharing of partial sums across threads with appropriate synchronization, both aspects in which the large number of threads concurrently executing on the GPU greatly outperforms multi-core CPU threads.

Figure 1. Tree-based parallel implementation of reduction and scan. Each gray box (column) represents a thread. Each thread runs a series of steps, and some steps must wait for results from other threads. Subfigures (a) and (b) show the naive binary tree-based approach for reduction and scan, respectively. Subfigure (c) presents a work-efficient scan algorithm using reduce-then-scan strategy, which includes an up-sweep phase for accumulating the partial sums and a down-sweep phase for aggregating the prefix sums. The tree-based approach for reduction and scan generally requires much data communication between threads but this remains low latency in shared memory, and thus is suitable for GPU parallelization.


To obtain a parallel, work-efficient, and communication-avoiding prefix scan, we invoke the CUB library (Merrill and Garland 2016). The efficiency of its prefix scan approaches that of a simple copy operation, as the prefix scan requires the optimal ∼2n data movements: n reads from and n writes to global memory. The scan is constructed on two levels of organization: (a) a global device-wide scan and (b) a set of local block-wide scans within each thread block. The local block-wide scans use a reduce-then-scan strategy that visually resembles an “hourglass” shape comprising an up-sweep and a down-sweep, as shown in Figure 1(c). In the up-sweep phase, we traverse the tree from the leaves to the root to compute the partial sums. The running prefixes are then aggregated in the block-wide down-sweep, traversing back down the tree from the root using the partial sums computed in the up-sweep phase. The global scan implementation within CUB applies a single-pass chained-scan approach to achieve just ∼2n global data movements. The global scan further employs a decoupled look-back strategy, assigning each thread block a status flag indicating one of three states: (a) the aggregate (partial sum) of the block is available; (b) the prefix of the block is available; or (c) no information about the block is available to other processors. Each block then performs its computation conditionally on its predecessors' status flags. We refer readers to Merrill and Garland (2016) for more details on this algorithm. In this article, we extend these parallel algorithms to Cox models and Fine-Gray models, building on the implementation of reduction and scan from the CUB library.
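For reference, a minimal sketch of invoking CUB's device-wide inclusive scan follows (the wrapper name is ours); CUB's standard two-call pattern first queries the temporary-storage requirement and then performs the scan.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Device-wide inclusive prefix sum of d_in into d_out using CUB's single-pass
// chained scan. The first call only computes the temporary-storage size; the
// second call performs the actual scan.
void devicePrefixSum(const double* d_in, double* d_out, int num_items) {
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}
```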

2.7 GPU Massive Parallelization for Parameter Estimation

CCD is an inherently serial algorithm in which each iteration is based on the result of the last iteration. As we mentioned earlier, even within an iteration, CCD cycles through each covariate j for $j = 1, 2, \ldots, P$ one by one, and the computational work for the next covariate cannot begin until the current update finishes. This serial algorithm, however, can still benefit greatly from parallelization by exploiting fine-grain problem decomposition within each iteration. Within each covariate update, careful benchmarks reveal that over 95% of the runtime lies in computing the log-likelihood gradient $g'(\beta_j)$ and Hessian $g''(\beta_j)$.

To understand the computational work, let $\delta = (\delta_1, \ldots, \delta_N)^\top$ be the N-dimensional column vector of event indicators and M be an $N \times N$ loading matrix with entries
(14) $M_{ij} = \begin{cases} 1 & \text{for } j \in R_1(Y_i) \\ 0 & \text{otherwise.} \end{cases}$

Recall that the risk set $R_1(Y_i)$ contains all observations with an observed time at or after $Y_i$. Thus, if we arrange the observations by their observed times $Y_i$ in decreasing order, the matrix M is clearly a lower triangular, binary matrix, and the matrix-vector multiplication $M[\exp(X\beta)]$ becomes a prefix sum over the elements of $\exp(X\beta)$, where we define exponentiation (exp) as an element-wise operation. For example, given $Y_{i'} > Y_i$, the set $R_1(Y_i)$ consists of all the observations in $R_1(Y_{i'})$ together with the set $\{t : Y_t \in [Y_i, Y_{i'})\}$, so that $\sum_{r \in R_1(Y_i)} \exp(x_r^\top \beta) = \sum_{r \in R_1(Y_{i'})} \exp(x_r^\top \beta) + \sum_{r \in \{t : Y_t \in [Y_i, Y_{i'})\}} \exp(x_r^\top \beta)$. Making these substitutions in (3), we arrive at
(15) $\ell_{\text{partial}}(\beta) = \delta^\top X\beta - \delta^\top \log\{ S_{\text{pre}}[\exp(X\beta)] \},$
where we define the logarithm (log) as an element-wise operation and $S_{\text{pre}}[\nu]$ as the prefix sum of an arbitrary vector $\nu$. The unidimensional gradient and Hessian under the Cox proportional hazards model then fall out as
(16) $g'(\beta_j) = \delta^\top X_j - \delta^\top G$ and
(17) $g''(\beta_j) = -\delta^\top (H - G \times G),$
where
(18) $G = S_{\text{pre}}[\exp(X\beta) \times X_j] \,/\, S_{\text{pre}}[\exp(X\beta)]$ and
(19) $H = S_{\text{pre}}[\exp(X\beta) \times X_j \times X_j] \,/\, S_{\text{pre}}[\exp(X\beta)],$
and the vector $X_j$ is the jth column of X. Note that we define multiplication (×) and division (/) here as element-wise operations as well.

While the matrix-vector multiplication involving M takes $O(N^2)$ operations, identifying the cumulative structure reduces the time complexity to linear by calculating the prefix sum $S_{\text{pre}}[\nu]$ in series, and parallelization further reduces the time complexity to $O(\log_2 N)$ due to the parallel scan's tree-based structure (Harris, Sengupta, and Owens 2007). Finally, the vector-vector multiplications involving $\delta$ can be calculated as an element-wise multiplication in parallel in constant time followed by a reduction through a binary tree in $O(\log_2 N)$.
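To make the reformulation concrete, the serial C++ sketch below (our own illustration) evaluates the sums in (16)–(19) in a single pass, assuming observations are sorted by decreasing observed time so that the risk-set sums become running prefix sums; the GPU implementation parallelizes exactly these accumulations.

```cpp
#include <cmath>
#include <vector>

// Single-pass O(N) evaluation of the sums in (16)-(19) for one covariate j.
// xb[i] = x_i' * beta, xj[i] = x_ij, delta[i] = event indicator; observations
// are sorted by decreasing observed time. Returns delta'G and delta'(H - GxG),
// from which the gradient and Hessian follow.
void coxGradHessTerms(const std::vector<double>& xb, const std::vector<double>& xj,
                      const std::vector<int>& delta, double& sumG, double& sumH) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;    // running prefix sums
    sumG = 0.0; sumH = 0.0;
    for (std::size_t i = 0; i < xb.size(); ++i) {
        double w = std::exp(xb[i]);
        s0 += w;                            // S_pre[exp(X beta)]
        s1 += w * xj[i];                    // S_pre[exp(X beta) x X_j]
        s2 += w * xj[i] * xj[i];            // S_pre[exp(X beta) x X_j x X_j]
        if (delta[i]) {
            double G = s1 / s0, H = s2 / s0;
            sumG += G;                      // accumulates delta' G
            sumH += H - G * G;              // accumulates delta' (H - G x G)
        }
    }
}
```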

Under the Fine-Gray model, let $W = (w_1, \ldots, w_N)^\top$ be an N-dimensional column vector of the precomputed censoring weights described in the previous section, and let $\mathbf{N}$ be an $N \times N$ loading matrix with entries
(20) $\mathbf{N}_{ij} = \begin{cases} 1 & \text{for } j \in R_2(Y_i) \\ 0 & \text{otherwise.} \end{cases}$

Recall that $R_2(Y_i) = \{ r : Y_r < Y_i, \epsilon_r = 2 \}$ and that $R_1(Y_i)$ and $R_2(Y_i)$ are disjoint, so $\mathbf{N}$ is an upper triangular, binary matrix and $\mathbf{N}[\exp(X\beta) \times W]$ is a suffix sum if we arrange the observations by their observed times $Y_i$ in decreasing order. Making the corresponding substitutions in (16) and (17) then yields the unidimensional gradient and Hessian under the Fine-Gray model with
(21) $G = \dfrac{S_{\text{pre}}[\exp(X\beta) \times X_j] + S_{\text{suf}}[\exp(X\beta) \times X_j \times W]}{S_{\text{pre}}[\exp(X\beta)] + S_{\text{suf}}[\exp(X\beta) \times W]}$ and
(22) $H = \dfrac{S_{\text{pre}}[\exp(X\beta) \times X_j \times X_j] + S_{\text{suf}}[\exp(X\beta) \times X_j \times X_j \times W]}{S_{\text{pre}}[\exp(X\beta)] + S_{\text{suf}}[\exp(X\beta) \times W]},$
where we define $S_{\text{suf}}[\nu]$ as the suffix sum of vector $\nu$.

It is worth noting that the risk set under the competing risks setting consists of two disjoint sets due to the multiple event types; thus, a single forward pass furnishes neither the numerator nor the denominator. Instead, the numerators and denominators of (21) and (22) can be computed through a forward (prefix) scan $S_{\text{pre}}[\nu]$ together with a backward (suffix) scan $S_{\text{suf}}[\nu]$ (Kawaguchi et al. 2020).
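The serial sketch below (illustrative only) computes the shared denominator of (21)–(22), assuming observations are sorted by decreasing observed time and that the precomputed weight vector w is zero outside $R_2$, so boundary handling for tied times is elided.

```cpp
#include <cmath>
#include <vector>

// Fine-Gray denominator in (21)-(22): a forward (prefix) sum over exp(X beta)
// plus a backward (suffix) sum over exp(X beta) x W, each computed in one pass.
std::vector<double> fineGrayDenominator(const std::vector<double>& xb,
                                        const std::vector<double>& w) {
    std::size_t n = xb.size();
    std::vector<double> denom(n), suffix(n);
    double running = 0.0;
    for (std::size_t i = n; i-- > 0; ) {    // suffix sum S_suf[exp(X beta) x W]
        running += std::exp(xb[i]) * w[i];
        suffix[i] = running;
    }
    running = 0.0;
    for (std::size_t i = 0; i < n; ++i) {   // prefix sum S_pre[exp(X beta)]
        running += std::exp(xb[i]);
        denom[i] = running + suffix[i];
    }
    return denom;
}
```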

In summary, we can decompose the calculation of the gradient and Hessian for $\beta_j$ into the following four sequential steps:

1. Transformations: read in the nonzero $x_{ij}$ for $i = 1, 2, \ldots, N$ and update $[X\beta]_i$ as $[X\beta]_i^{\text{(new)}} = [X\beta]_i + x_{ij}\Delta\beta_j$. Then perform three element-wise, embarrassingly parallel transformations that read from the new estimate of $[X\beta]$ and output $[\exp(X\beta)]$, $[\exp(X\beta) \times X_j]$ and $[\exp(X\beta) \times X_j \times X_j]$. Here we should keep in mind that X is generally sparse, such that many elements $x_{ij}$ are zero, and that exponentiation is an expensive floating-point operation.

2. Scans:

    1. Under the Cox model, we perform three forward scans that take the three output vectors of Step 1 as input, and return $S_{\text{pre}}[\exp(X\beta)]$, $S_{\text{pre}}[\exp(X\beta) \times X_j]$ and $S_{\text{pre}}[\exp(X\beta) \times X_j \times X_j]$.

    2. Under the Fine-Gray model, we perform three forward scans and three backward scans that take the three output vectors of Step 1 as input, and return $S_{\text{pre}}[\exp(X\beta)]$, $S_{\text{pre}}[\exp(X\beta) \times X_j]$, $S_{\text{pre}}[\exp(X\beta) \times X_j \times X_j]$, $S_{\text{suf}}[\exp(X\beta) \times W]$, $S_{\text{suf}}[\exp(X\beta) \times X_j \times W]$ and $S_{\text{suf}}[\exp(X\beta) \times X_j \times X_j \times W]$.

3. An element-wise transformation that takes the output vectors of Step 2 as well as the indicator vector $\delta$ as input, and outputs two new vectors $\delta \times G$ and $\delta \times (H - G \times G)$.

4. Two reductions that take the output vectors of Step 3 as input, and output the two double summations $\delta^\top G$ and $\delta^\top (H - G \times G)$, which yield the gradient $g'(\beta_j)$ and Hessian $g''(\beta_j)$. Note that the first term $\delta^\top X_j$ in the gradient $g'(\beta_j)$ can be precomputed and its value does not change during CCD.

Although parallel computing of the above numerical operations is much faster than serial evaluation, one crucial limitation is that a memory transaction involving reading or writing global memory may take up to two orders of magnitude more time than a regular numerical operation applied to a value sitting in the limited number of on-chip registers within a GPU (or CPU, for that matter) (Holbrook et al. 2021). In order to minimize memory transactions, we fuse several of our operations together for both our serial and parallel implementations. First, we fuse the multiple scans in Step 2 together as a tuple-scan, which takes a tuple of three vectors as input and outputs a tuple of three (or six) vectors. Note that in the Fine-Gray model, we perform the forward tuple-scan and the backward tuple-scan on the same tuple of three vectors, but read the input in the two opposite directions simultaneously. Similarly, we can fuse the multiple reductions in Step 4 as a tuple-reduction. Since the element-wise transformations in Step 3 take O(1) time with GPU parallelization, Step 3 and Step 4 can be regarded as a single transformation-reduction. Finally, since scan and reduction parallelization share the same binary-tree structure, we further fuse Steps 2–4 together into a single kernel. Through fusion, for example, the output tuple from the scans never needs to be written to global memory, nor read back for the later transformation-reduction. The fused kernel saves two-thirds of the reads and writes compared with executing three separate kernels. It is also worth noting that the computational work of the gradient and Hessian evaluations shares a similar structure and even some components such as G, so we have fused the computation of the gradient and Hessian together to circumvent unnecessary kernel overhead and facilitate data reuse of these intermediate terms.
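The kernel below sketches this fusion idea with CUB's block-level primitives; it is a greatly simplified single-block version (one item per thread, n no larger than BLOCK_THREADS) rather than the production kernel, which tiles the input and uses the decoupled look-back scan described in Section 2.6.

```cuda
#include <cub/cub.cuh>

// Fused scan + transformation + reduction for one covariate: intermediate
// vectors never leave registers/shared memory. Inputs are the three transformed
// vectors from Step 1; outputs are the Step 4 summations delta'G and delta'(H - GxG).
template <int BLOCK_THREADS>
__global__ void fusedGradHessKernel(const double* expXb, const double* expXbXj,
                                    const double* expXbXjXj, const double* delta,
                                    int n, double* out) {
    typedef cub::BlockScan<double, BLOCK_THREADS> Scan;
    typedef cub::BlockReduce<double, BLOCK_THREADS> Reduce;
    __shared__ union {
        typename Scan::TempStorage scan;
        typename Reduce::TempStorage reduce;
    } temp;

    int i = threadIdx.x;
    double a = (i < n) ? expXb[i] : 0.0;
    double b = (i < n) ? expXbXj[i] : 0.0;
    double c = (i < n) ? expXbXjXj[i] : 0.0;

    // Step 2: inclusive prefix sums, kept on chip.
    Scan(temp.scan).InclusiveSum(a, a); __syncthreads();
    Scan(temp.scan).InclusiveSum(b, b); __syncthreads();
    Scan(temp.scan).InclusiveSum(c, c); __syncthreads();

    // Step 3: element-wise transformation delta x G and delta x (H - G x G).
    double G = (i < n) ? b / a : 0.0;
    double H = (i < n) ? c / a : 0.0;
    double d = (i < n) ? delta[i] : 0.0;

    // Step 4: block-wide reductions; results are valid in thread 0.
    double sumG = Reduce(temp.reduce).Sum(d * G); __syncthreads();
    double sumH = Reduce(temp.reduce).Sum(d * (H - G * G));
    if (i == 0) { out[0] = sumG; out[1] = sumH; }
}

// Example launch for n <= 128: fusedGradHessKernel<128><<<1, 128>>>(...);
```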

To exploit the sparsity of the design matrix X, we parallelize the transformation in Step 1 using a sparse CUDA kernel, which reads in and processes only the nonzero entries while leaving the remaining entries untouched throughout the CCD updates. This sparse kernel saves data movement and significantly reduces memory bandwidth requirements when X is sparse, which is common in real-world scenarios.
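A minimal sketch of such a sparse kernel follows (our own illustration, assuming column j of X is stored in compressed sparse column form); only the entries of $X\beta$ touched by a nonzero $x_{ij}$ are read and rewritten.

```cuda
// Sparse update of Step 1 for covariate j: `rows` holds the row indices of the
// nnz nonzero entries of column j and `vals` their values (all 1 for indicator
// data). Each thread updates one affected entry of the dense vector X*beta;
// distinct threads touch distinct rows, so no atomic operations are needed.
__global__ void sparseUpdateXBeta(const int* rows, const double* vals, int nnz,
                                  double deltaBeta, double* xBeta) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nnz) {
        xBeta[rows[k]] += vals[k] * deltaBeta;   // [X beta]_i += x_ij * delta beta_j
    }
}
```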

Figure 2 illustrates our fused kernel for evaluating the log-likelihood gradient and Hessian on the GPU. In this kernel, we spawn $S = B \times G$ threads that together process $B \times IPT \times G \ge N$ work-items, where B is the number of concurrent threads forming a thread block, IPT is the number of work-items (elements of input) evaluated per thread, and $G = \lceil N / (B \times IPT) \rceil$ denotes the number of thread blocks. Both the block size B and the thread grain size IPT are constrained by the hardware and are tunable constants. In practice, we follow the parameter settings in CUB with B = 128 and IPT = 15 for the fused kernel. For the sparse kernel we wrote, we choose B = 256 and IPT = 1. In the figure, each box drawn with a dashed line represents a thread block; recall that all threads within a block can access shared memory that is low-latency and on-chip (Nickolls et al. 2008) and is useful for performing the efficient scan in Step 2 and the reduction in Step 4. The threads in parallel first read the values of the tuple $\{\exp(x_i^\top\beta),\ x_{ij}\exp(x_i^\top\beta),\ x_{ij}^2\exp(x_i^\top\beta)\}$ for the current covariate column j from global memory and then conduct the single-pass adaptive look-back tuple-scan (Merrill and Garland 2016) using a reduce-then-scan strategy. Next, the threads read in the values of $\delta$ and perform the transformation with the on-chip cumulative sums and $\delta$. The threads then perform a binary-tree tuple-reduction using shared memory, and one thread from each block writes its partial aggregates to global memory. Finally, a single-block reduction kernel sums over the G partial aggregates and writes the gradient $g'(\beta_j)$ and Hessian $g''(\beta_j)$ back to global memory.

Figure 2. Fused kernel for evaluating the gradient and Hessian for βj. We fuse the single-pass scan with decoupled look-back (represented by the “hourglass”), the element-wise transformation (represented by the rectangle), and the reduction (represented by the upside-down triangle) together into a single kernel. Specifically, each of G blocks (shown as a dashed box) reads a batch of triplets from global memory, performs a single-pass adaptive tuple-scan and a transformation-reduction to compute local partial sums of gradients and Hessians in shared memory, then efficiently adds the pair of partial sums in a binary-tree tuple-reduction within a single block and writes the resulting pair of gradient and Hessian contributions to global memory.


When CCD processes the jth covariate, we only need to update $\beta_j$ and the corresponding vector $X\beta$ if $\Delta\beta_j \ne 0$. Recall that the computation of $\Delta\beta_j$ requires $g'(\beta_j)$ and $g''(\beta_j)$ when regularization applies. Since both the gradient $g'(\beta_j)$ and the Hessian $g''(\beta_j)$ are computed on the GPU, we avoid P data transfers between the GPU and CPU in one CCD cycle by using a GPU kernel to check whether $\Delta\beta_j \ne 0$ and then update $\beta_j$ and $X\beta$ only if needed. Figure 3 details the workflow to implement CCD using GPU parallelization.


Figure 3. The workflow of implementing CCD using GPU parallelization: for each j=1,…,P, a sparse kernel reads in the entries of [Xβ] for which xij≠0 and writes the updated tuple of vectors to the global memory of the GPU, then a fused kernel performs a tuple-scan and a transformation-reduction to compute the gradient and Hessian of the log-likelihood with respect to βj. Finally a conditional kernel computes Δβj and updates βj and Xβ if Δβj≠0. Blue arrows represent data transactions to or from global memory. No data transfer between the GPU and CPU is needed until CCD finishes a complete cycle.

It is worth noting that thread-divergent branches in a CUDA kernel can substantially impact performance, as execution becomes temporally serialized. However, this is not an issue in our implementation, because branch divergence penalties only occur when threads within the same warp, but not across warps, need to execute alternative instructions. The branches in the CUB scan and reduction kernels are mainly due to slightly different instructions for the first and last processing tiles. Thus, the divergence occurs across warps and does not incur a performance penalty.

2.8 Multi-Stream Cross-Validation

We use k-fold cross-validation to search for the optimal tuning parameter $\gamma$ or $\tau$ that controls the strength of regularization. We search over a grid of candidate values for the tuning parameter. For each value, the procedure is as follows:

  1. Randomly split data into k partitions.

  2. For each of the k partitions, we fit survival models via CCD on the remaining k – 1 partitions and compute the predictive log-likelihood of this partition using the estimated β̂.

  3. Average the out-of-sample log-likelihood across all k folds.

  4. Repeat Step 1–Step 3 r times to reduce spurious effects from random data partitioning.

Finally, we select the tuning parameter value with the largest average out-of-sample log-likelihood as the desired optimal value.

We further improve the efficiency of cross-validation using multi-stream GPU and multi-threaded CPU approaches. Here, a stream denotes a sequence of operations (kernels) that execute in issue-order on a GPU (Rennich 2011). Instead of fitting the k partitioned models serially in a single (default) stream, we fit the k models across s streams in parallel, where $1 < s \le k$ and each GPU stream is scheduled by an independent CPU thread. Likewise, for the multi-threaded CPU approach, s CPU threads evaluate the k models in parallel.
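The CUDA sketch below shows the basic multi-stream pattern (simplified to a single host thread issuing work round-robin; the hypothetical fitReplicateAsync stands in for the sequence of CCD kernels that fits one cross-validation replicate).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Distribute cross-validation replicates across s CUDA streams. Kernels issued
// on different streams may overlap on the device; in Cyclops each stream is
// driven by its own CPU thread, which a single host loop approximates here.
void runCrossValidation(int replicates, int s) {
    std::vector<cudaStream_t> streams(s);
    for (int i = 0; i < s; ++i) cudaStreamCreate(&streams[i]);

    for (int r = 0; r < replicates; ++r) {
        cudaStream_t stream = streams[r % s];
        (void) stream;
        // fitReplicateAsync(r, stream);  // hypothetical: enqueue this replicate's CCD kernels
    }

    for (int i = 0; i < s; ++i) {
        cudaStreamSynchronize(streams[i]);  // wait for all enqueued work to finish
        cudaStreamDestroy(streams[i]);
    }
}
```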

2.9 Comparison with an Alternative Massive Parallelization of Cox Models

In Ko et al. (2022), the authors present sample PyTorch code for parallelizing proximal gradient descent on modest-sized l1-regularized dense Cox regression. Here we provide a qualitative comparison of our method with this alternative approach, focusing on the per-cycle cost of cyclic coordinate descent and proximal gradient descent.

A single iteration of proximal gradient descent on l1-regularized Cox regression requires the following steps:

  1. A matrix-vector multiplication Xβ.

  2. An element-wise transformation exp(·) on the output vector of (1).

  3. A scan on the output vector of (2).

  4. Two matrix-vector multiplications, $P\delta$ and $X^\top[\delta - P\delta]$, where the matrix $P = (\pi_{ij})$ is defined by $\pi_{ij} = I(y_i \ge y_j)\exp(x_i^\top\beta) \,/\, \sum_{i: y_i \ge y_j}\exp(x_i^\top\beta)$.

Without considering the specific parallel library, the above steps contain $O(NP) + O(N) + O(N) + O(N^2)$ operations, respectively. Meanwhile, one sweep of coordinate descent implemented in Cyclops requires no more than $O(NP)$ operations (the worst case occurring when the data are dense), as each of the four steps outlined in Section 2.7 requires at most $O(N)$ operations per covariate. Additionally, step (4) in the proximal gradient descent approach described earlier can also be reduced to $O(N)$ by using the same trick discussed in Section 2.7.

It is important to note that the number of iterations required to achieve an equivalent termination criterion is highly dependent on the data and is beyond the scope of our article.

We would also like to emphasize that one of the main features of our implementation is that it supports, and benefits from, a sparse data format when available, as discussed in Section 2.7.

3 Results

We examine the performance of GPU versus CPU computing for fitting our massive sample-size survival models. To accomplish this, we conduct a series of synthetic experiments to investigate the relative compute time of our parallelization across different sample sizes. We then reproduce a real-world study using our GPU implementation under a Cox model and extend the study to the competing risks setting. Unless otherwise specified, we use a system equipped with a 10-core 3.3 GHz Intel(R) Xeon(R) W-2155 CPU and an NVIDIA Quadro GV100 with 5120 CUDA cores and 32GB RAM that can achieve up to 7.4 Tflops double-precision floating-point performance.

3.1 Kernel Fusion

Kernel fusion is one of the main strategies we apply for achieving optimal performance. Taking the Cox model as an example, we compare three different implementations of Steps 2–4 in Section 2.7:

  1. Separated kernels:

    1. Three separate scans using cub::DeviceScan::InclusiveSum,

    2. One transformation operation using thrust::transform, and

    3. Two separate reductions using cub::DeviceReduce::Sum.

  2. Partially-fused kernels:

    1. A tuple-scan using cub::DeviceScan::InclusiveScan, and

    2. A transformed tuple-reduction using cub::TransformInputIterator and cub::DeviceReduce::Reduce.

  3. Single fused kernel.

We simulate the input vectors with N=100,000 to 1,000,000 elements drawn from Uniform(1, 2). Figure 4 shows the speedup of the fused kernel and the partially fused kernels over the separated kernels. The partially fused kernels are 8–31 times faster than the separated kernels. The single fused kernel further generates up to a 42-fold speedup compared with the separated kernels. Table 1 shows the runtimes (in milliseconds) in detail when N=100,000. Interestingly, we find that performing a tuple-scan on three vectors is almost three times faster than performing three separate scans (0.0036 ms vs. 0.0098 ms), demonstrating the value of fusion.

Figure 4. Speedup of the fused kernel and the partially-fused kernels over separated kernels for Steps 2–4 in Section 2.7. The sample sizes range between N=105 to 106. Solid line shows the speedup of the fused kernel over separated kernels, and dashed line shows the speedup of the partially-fused kernel over separated kernels.


Table 1. Runtimes (in milliseconds) of the separated kernels, the partially fused kernels, and the fused kernel for Steps 2–4 in Section 2.7 when N=100,000.

3.2 Synthetic Experiments

We simulate indicator data X with N=100,000 to 1,000,000 samples and P = 1000 covariates with a sparsity of 95%, where we randomly choose 5% of the entries uniformly to be 1s. We then draw $\beta_j \sim N(0,1) \times \text{Bernoulli}(0.80)\ \forall j$ and $T_i \sim \text{Exponential}(x_i^\top \beta)\ \forall i$, where the Bernoulli(0.80) specifies that, on average, 80% of the $\beta_j$ are set to 0 to induce sparsity, and $T_i$ represents the time-to-event time for individual i. For generating competing risk times, we first draw $\beta_{1j} \sim N(0,1) \times \text{Bernoulli}(0.80)$ and set $\beta_{2j} = -\beta_{1j}\ \forall j$, where $\beta_1$ is the regression parameter of the primary event and $\beta_2$ is the regression parameter of the competing event (Kawaguchi et al. 2020). Then we follow the design of Fine and Gray (1999), where the cumulative incidence function (CIF) of the primary event is a unit exponential mixture with mass $1 - p$ at $\infty$ when $x_i = 0$: $\Pr(T_{1i} \le t_1 \mid x_i) = 1 - [1 - p\{1 - \exp(-t_1)\}]^{\exp(x_i^\top \beta_1)}$, and we draw the survival time of the competing event $T_{2i}$ from an exponential distribution with rate $\exp(x_i^\top \beta_2)$. We set p = 0.5 in practice.

We fit these simulated data under a fixed l1 penalty with γ = 2 on a single CPU core and the default CUDA stream. To provide a comprehensive comparison of the performance gains, we conduct the experiments on two systems with different technical specifications. System 1 is equipped with a 3.3 GHz Intel(R) Xeon(R) W-2155 CPU (launch date 2017) and an NVIDIA Quadro GV100 (2018) with 5120 CUDA cores and 32GB RAM that can achieve up to 7.4 Tflops double-precision floating-point performance. System 2 is equipped with a 2.2 GHz Intel(R) Xeon(R) Silver 4214 CPU (2019) and an NVIDIA A100 (2020) with 6912 CUDA cores and 80GB RAM that can achieve up to 9.7 Tflops double-precision floating-point performance. Figure 5 presents runtime comparisons across computing devices with a fixed number of covariates (P = 1000) with 95% sparsity and varying sample size N. We report both the total model fitting time and the time for computing gradients and Hessians, where the latter is our target of parallelization. On System 1, which is equipped with a powerful CPU, we see that the GPU parallelization delivers up to a 42-fold speedup for both Cox and Fine-Gray models in terms of computing gradients and Hessians. Despite additional data transfer and device initialization, GPU parallelization is still 35× faster for this Cox model and 39× faster for this Fine-Gray model relative to our CPU implementation with one million samples. On System 2, which is equipped with a more powerful GPU, we see that the GPU parallelization achieves up to a 52-fold speedup for this Cox model and a 70-fold speedup for this Fine-Gray model. The data transfer between host and device during CCD updates accounts for only about 1% of the total model fitting time. We see a rapid increase of runtimes on the CPU with increasing sample size, while the GPU approach continues to yield relatively short runtimes across varying sample sizes, as the device is still not fully occupied.


Figure 5. Runtimes (in seconds) and speedup of the GPU implementation relative to the CPU implementation for Cox and Fine-Gray models with a fixed l1 penalty using a single GPU stream or CPU thread. The sample sizes range between N=105 to 106 with a sparsity of 95%. Solid lines show the total model fitting time, and dashed lines show the time for computing gradients and Hessians. The gap between the GPU’s solid and dashed lines in the figures identify mostly computation that we did not port to the GPU, as data transfer between host and device during CCD updates accounts for less than 10% of this overhead and ∼1% of the total model fitting time.

3.3 Multi-Stream Cross-Validated Experiments

We further use these synthetic experiments to explore the performance of our approach using multi-threaded CPU and multi-stream GPU computing while searching for an optimal strength of regularization. Here, we use 10-fold cross-validation with 10 repetitions per fold, resulting in 100 cross-validation replicates, to estimate an optimal γ under l1 regularization. On the GPU, we distribute the 100 replicates to s CUDA streams driven by s CPU threads. We also allow the s CPU threads to process the 100 replicates directly on the CPU to demonstrate the performance of the corresponding multi-threaded CPU parallelization. Figure 6 shows the runtimes of l1-regularized Cox regression for a varying number s of threads or streams. Generally, our parallelization achieves more speedup through multi-stream GPU computing than through multi-core CPU threads alone. For example, the runtime of fitting a Cox model on 1,000,000 samples reduces from nearly 9 hr across 8 CPU threads to 14 min across 8 CUDA streams. We also observe that the curves flatten out as the number of CUDA streams or CPU threads increases, and this pattern is particularly obvious in the experiments on smaller sample sizes and on the CPU. This pattern suggests both that there is less computation to parallelize at relatively small sample sizes and that multi-core CPU parallelization remains limited by the smaller memory bandwidth available to the CPU.

Figure 6. Runtimes for l1 regularized Cox and Fine-Gray models with 10-fold 10-replicate cross-validation using multi-core CPU and multi-stream GPU computing. The sample sizes range between N=105 to 106 with a sparsity of 95%.


3.4 Cardiovascular Effectiveness of Antihypertensive Drug Classes

The large-scale evidence generation and evaluation across a network of databases for hypertension (LEGEND-HTN) study (Suchard et al. 2019) provided real-world evidence on the comparative effectiveness and safety of five first-line antihypertensive drug classes using a retrospective, comparative new-user cohort design. Specifically, LEGEND-HTN studied the relative risk of 55 health outcomes of interest, including three major cardiovascular events (acute myocardial infarction, hospitalization for heart failure, and stroke), six secondary effectiveness outcomes, and 46 safety outcomes. Within each of nine observational health data sources, LEGEND-HTN employed propensity score matching or stratification for confounding adjustment and Cox proportional hazards modeling for hazard ratio (HR) estimation between new-users of each of the different drug classes. Interestingly, LEGEND-HTN identified that new-users of thiazide or thiazide-like diuretics (THZs) have a lower risk than new-users of angiotensin-converting enzyme inhibitors (ACEIs) for several effectiveness and safety outcomes, even though ACEIs are the most commonly initiated monotherapy across databases.

Here we examine patients initiating ACEIs and THZs, where the outcome is hospitalization for heart failure, using the Optum® de-identified Electronic Health Record dataset (Optum EHR). Optum EHR represents data from 85 million individuals that are commercially or Medicare insured in the United States. The data contain medical claims, pharmacy claims, laboratory tests, vital signs, body measurements, and information derived from clinical notes. A total of 1,014,618 patients are included in our study, 75.8% of whom initiated an ACEI and 24.2% of whom initiated a THZ. We consider the main treatment covariate (ACEI or THZ exposure), in addition to 9811 baseline patient characteristic covariates. Table 2 presents a small selection of major patient baseline characteristics. From all baseline characteristics, we build a propensity score model and match ACEI and THZ new-users in a 1:1 ratio. A total of 434,866 patients are kept for further analysis after propensity score matching. Before propensity score matching, we find that ACEI new-users are more likely to be male and to have diabetes, hyperlipidemia, or heart disease compared with patients initiating a THZ. The THZ and ACEI cohorts are well-balanced on all 9811 baseline patient characteristics after matching, as identified through low standardized differences of population proportions. The average sparsity of the patient characteristic covariates is 97.11%, meaning that 2.89% of the entries are nonzero, occupying about 2.96GB of RAM in sparse matrix format. We first fit a Cox proportional hazards model to estimate the HR between THZ and ACEI initiation for the risk of hospitalization for heart failure. We also fit a Fine-Gray model in which we consider acute myocardial infarction as a competing risk of hospitalization for heart failure, given that myocardial infarction substantially elevates the future risk of heart failure. We include all patient characteristics as additional covariates in the Cox and Fine-Gray models with l1 regularization on all variables except the treatment variable to achieve a limited form of adjustment for possible residual confounding, and again employ 10-fold 10-replicate cross-validation to search for optimal tuning parameters. The analyses require almost two days to fit the regularized Cox model and more than three days to fit the regularized Fine-Gray model using eight CPU cores, while taking only 3.87 hr and 8.57 hr, respectively, with our GPU implementation. Figure 7 reports the runtimes for the regularized Cox and Fine-Gray models on varying numbers of CUDA streams versus the corresponding multi-core CPU parallelization.

Figure 7. Runtimes (in hours) for l1 regularized Cox and Fine-Gray models using multi-stream GPU and multi-core CPU computing. The data contain 434,866 new-users of THZs and ACEIs, each with 9811 baseline patient characteristic covariates. We add an l1 penalty on all baseline covariates and use a 10-fold 10-replicate cross-validation to search for optimal tuning parameters.


Table 2. Baseline hypertensive patient characteristics for new-users of THZs and ACEIs in the Optum EHR database.

Through massive parallelization, we find that initiating a THZ shows better effectiveness than initiating an ACEI in terms of hospitalization for heart failure risk under both the Cox model (HR 0.83, 95% bootstrapped percentile interval [BPI] 0.72–0.94) and the Fine-Gray model (HR 0.83, 95% BPI 0.71–0.95), where we provide the BPIs as crude measures of sampling variability given the challenges in constructing nominal confidence intervals for regularized models (Casella et al. 2010; Chatterjee and Lahiri 2011). There are 215 and 134 out of 9811 baseline covariates with nonzero effect sizes in the Cox and Fine-Gray models, respectively; these covariates include age, gender, hyperlipidemia, diabetes, osteoarthritis, and heart disease. While our estimates remain in line with the LEGEND-HTN study, they are reassuring in two ways. First, massive parallelization has enabled us to provide additional adjustment for possible residual confounding due to unobserved imbalance after propensity score matching, without overfitting, through cross-validation. Second, a consistent estimate after controlling for an obvious competing risk reduces concern over bias from informative censoring. Neither including thousands of baseline covariates nor handling a competing risk at this scale was possible in the original LEGEND-HTN study due to the enormous time requirements; both are now feasible.

4 Discussion

This article presents a time- and memory-efficient GPU implementation of regularized Cox and Fine-Gray regression models for analyzing large-scale time-to-event data with competing risks. We efficiently implement these advances in the open-source R package Cyclops (Suchard et al. 2013). In simulation studies, our GPU implementation is 35–70 times faster than the equivalent CPU implementation with up to 1 million samples. In our real-world example with ∼400,000 hypertension patients and ∼9000 covariates, massive parallelization reduces the total runtimes of both regularized Cox regression and regularized Fine-Gray regression with cross-validation from a few days on multi-core CPUs to a few hours on a GPU.

The observed speed-up is a combination of algorithmic advances as well as hardware optimization. First, we observe that the cumulative structure of the risk set can be captured by a single-pass scan, which can be evaluated very efficiently in parallel using a tree-based approach. This enables us to develop a work-efficient and communication-avoiding tuple-scan algorithm for accumulating statistics about individuals in the risk set, dramatically speeding up likelihood, gradient, and Hessian evaluations. In particular, this tree-based scan algorithm takes great advantage of the high-speed shared memory of thread blocks on GPUs. Furthermore, as CCD is an inherently serial algorithm, we fuse multiple serial steps, (a) transformations, (b) scans, and (c) reductions, together into a single vectorizable kernel. This fusion is possible because the transformations are relatively lightweight operations, and the scan and reduction algorithms share the same binary-tree structure. This kernel fusion reduces the expensive memory transactions and kernel overheads that usually prevent serial algorithms from benefiting from parallel computing. We also exploit the sparsity of the design matrix X through a sparse CUDA kernel to further save data movement. Finally, we only off-load the most computationally intensive routines to the GPU without rewriting the whole codebase.

There are numerous opportunities for improvement. For instance, our multi-stream implementation of k-fold cross-validation has not achieved full concurrency across replicates, especially for relatively small sample sizes (∼ 100,000), due to low GPU utilization. Another limitation is that we must replicate the design matrix on each CUDA stream, which increases memory bandwidth requirements. To overcome these limitations, one may use a single larger kernel to perform many repetitions of k-fold cross-validation on a single stream. By increasing the computational work in a single kernel, we can over-subscribe GPU threads to achieve higher utilization and improve data reuse, though this creates the danger of exhausting register memory. In addition, we plan to add inference on parameter estimates via bootstrapping by synchronizing the computation across sub-samples. Both bootstrapping and cross-validation require independent computations on different sub-samples and therefore stand to benefit from parallel computing.
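
The stream-per-fold layout discussed above can be sketched as follows; the fold-fitting kernel and its toy update are hypothetical stand-ins for the per-fold CCD work and are not taken from Cyclops. The per-stream allocation of the design matrix mirrors the replication issue noted above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Stand-in for the per-fold fitting work; a real fold runs many CCD
// iterations over its own copy of the design matrix.
__global__ void fit_fold_kernel(const double* X_fold, double* beta, int p) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < p) beta[j] = 0.1 * X_fold[j];  // placeholder update
}

int main() {
    const int k = 3, p = 4;  // folds and covariates (toy sizes)
    std::vector<cudaStream_t> streams(k);
    std::vector<double*> d_X(k), d_beta(k);
    double h_X[p] = {1.0, 2.0, 3.0, 4.0};

    for (int f = 0; f < k; ++f) {
        cudaStreamCreate(&streams[f]);
        // the design matrix is replicated per stream: the memory overhead
        // discussed in the text
        cudaMalloc(&d_X[f], p * sizeof(double));
        cudaMalloc(&d_beta[f], p * sizeof(double));
        cudaMemcpyAsync(d_X[f], h_X, p * sizeof(double),
                        cudaMemcpyHostToDevice, streams[f]);
    }

    // kernels issued on distinct streams may overlap on the device,
    // provided each fold leaves enough free resources
    const int threads = 128, blocks = (p + threads - 1) / threads;
    for (int f = 0; f < k; ++f)
        fit_fold_kernel<<<blocks, threads, 0, streams[f]>>>(d_X[f], d_beta[f], p);

    for (int f = 0; f < k; ++f) {
        double h_beta[p];
        cudaMemcpyAsync(h_beta, d_beta[f], p * sizeof(double),
                        cudaMemcpyDeviceToHost, streams[f]);
        cudaStreamSynchronize(streams[f]);
        printf("fold %d: beta[0] = %f\n", f, h_beta[0]);
        cudaFree(d_X[f]); cudaFree(d_beta[f]);
        cudaStreamDestroy(streams[f]);
    }
    return 0;
}
```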

Finally, GPU hardware and libraries are rapidly maturing, and statisticians stand to benefit from GPU parallelization. We conclude with a few practical tips. First, simple building blocks such as scans and reductions are ubiquitous in statistical inference, and one can readily apply cutting-edge parallel implementations of these building blocks to gain immediate speed-ups. Second, kernel fusion should be employed whenever possible for serial algorithms to reduce expensive memory transactions and kernel-launch overheads. Finally, when optimizing an existing program, one should consider off-loading the most computationally intensive routines to the GPU in stages, before rewriting the whole codebase.
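
As a minimal illustration of the first tip, the following Thrust snippet obtains the same exponentiate-then-prefix-sum pattern used earlier in this section from an off-the-shelf primitive, thrust::transform_inclusive_scan, instead of a hand-written kernel; the values are toy data and the snippet is not part of Cyclops.

```cuda
#include <cmath>
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_scan.h>

// Element-wise exp(), applied on the fly inside the library scan.
struct exp_functor {
    __host__ __device__ double operator()(double x) const { return exp(x); }
};

int main() {
    // toy linear predictors (illustrative values only)
    const double h_eta[8] = {0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2, 0.4};
    thrust::device_vector<double> eta(h_eta, h_eta + 8);
    thrust::device_vector<double> denom(8);

    // one library call fuses the transformation with the prefix sum
    thrust::transform_inclusive_scan(eta.begin(), eta.end(), denom.begin(),
                                     exp_functor(), thrust::plus<double>());

    for (int i = 0; i < 8; ++i)
        printf("denom[%d] = %f\n", i, static_cast<double>(denom[i]));
    return 0;
}
```

Swapping a hand-rolled loop for such a primitive typically requires only a functor and a single call, while inheriting the library's tuned, architecture-aware implementation.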

Supplementary Materials

Cyclops: The R package Cyclops, containing code to perform the analyses described in this article, is available at https://github.com/OHDSI/Cyclops/tree/gpu_cox.

R scripts for executing the experiments: R scripts for executing the experiments in Section 3 are available at https://github.com/suchard-group/gpu_survival_analysis_supplement. Unfortunately, we are unable to provide access to the Optum de-identified Electronic Health Records due to data licensing constraints. However, we provide the cohort definitions that reproduce our real-world analysis for readers who do license the Optum de-identified Electronic Health Records data source.

Disclosure Statement

MJS is an employee and shareholder of Johnson & Johnson. The remaining authors report that there are no competing interests to declare.

Additional information

Funding

This research was supported through US National Institutes of Health grant R01 AI153044. We gratefully acknowledge support from NVIDIA Corporation through the donation of parallel computing resources.

References

  • Barney, B. (2010), “Introduction to Parallel Computing,” Lawrence Livermore National Laboratory, 6, 10.
  • Beam, A.L., Ghosh, S. K., and Doyle, J. (2016), “Fast Hamiltonian Monte Carlo using GPU Computing,” Journal of Computational and Graphical Statistics, 25, 536–548. DOI: 10.1080/10618600.2015.1035724.
  • Bramante, C.T., Ingraham, N.E., Murray, T.A., Marmor, S., Hovertsen, S., Gronski, J., McNeil, C., Feng, R., Guzman, G., Abdelwahab, N., et al. (2021), “Metformin and Risk of Mortality in Patients Hospitalised with Covid-19: A Retrospective Cohort Analysis,” The Lancet Healthy Longevity, 2, e34–e41. DOI: 10.1016/S2666-7568(20)30033-7.
  • Casella, G., Ghosh, M., Gill, J., and Kyung, M. (2010), “Penalized Regression, Standard Errors, and Bayesian Lassos,” Bayesian Analysis, 5, 369–411. DOI: 10.1214/10-BA607.
  • Chatterjee, A., and Lahiri, S.N. (2011), “Bootstrapping Lasso Estimators,” Journal of the American Statistical Association, 106, 608–625. DOI: 10.1198/jasa.2011.tm10159.
  • Cox, D.R. (1972), “Regression Models and Life-Tables,” Journal of the Royal Statistical Society, Series B, 34, 187–202. DOI: 10.1111/j.2517-6161.1972.tb00899.x.
  • Fine, J.P., and Gray, R.J. (1999), “A Proportional Hazards Model for the Subdistribution of a Competing Risk,” Journal of the American Statistical Association, 94, 496–509. DOI: 10.1080/01621459.1999.10474144.
  • Genkin, A., Lewis, D.D., and Madigan, D. (2007), “Large-Scale Bayesian Logistic Regression for Text Categorization,” Technometrics, 49, 291–304.
  • Harris, M., Sengupta, S., and Owens, J.D. (2007), “Parallel Prefix Sum (Scan) with CUDA,” GPU Gems, 3, 851–876.
  • Holbrook, A.J., Lemey, P., Baele, G., Dellicour, S., Brockmann, D., Rambaut, A., and Suchard, M.A. (2021), “Massive Parallelization Boosts Big Bayesian Multidimensional Scaling,” Journal of Computational and Graphical Statistics, 30, 11–24. DOI: 10.1080/10618600.2020.1754226.
  • Hripcsak, G., Ryan, P.B., Duke, J.D., Shah, N.H., Park, R.W., Huser, V., Suchard, M.A., Schuemie, M.J., DeFalco, F.J., Perotte, A., et al. (2016), “Characterizing Treatment Pathways at Scale Using the OHDSI Network,” Proceedings of the National Academy of Sciences, 113, 7329–7336. DOI: 10.1073/pnas.1510502113.
  • Kawaguchi, E.S., Shen, J.I., Suchard, M.A., and Li, G. (2020), “Scalable Algorithms for Large Competing Risks Data,” Journal of Computational and Graphical Statistics, 30, 685–693. DOI: 10.1080/10618600.2020.1841650.
  • Ko, S., Zhou, H., Zhou, J.J., and Won, J.-H. (2022), “High-Performance Statistical Computing in the Computing Environments of the 2020s,” Statistical Science, 37, 494–518.
  • Madigan, D., Ryan, P., Simpson, S., and Zorych, I. (2010), “Bayesian Methods in Pharmacovigilance,” Bayesian Statistics, 9, 421–438.
  • Madigan, D., Stang, P.E., Berlin, J.A., Schuemie, M., Overhage, J.M., Suchard, M.A., Dumouchel, B., Hartzema, A.G., and Ryan, P.B. (2014), “A Systematic Statistical Approach to Evaluating Evidence from Observational Studies,” Annual Review of Statistics and Its Application, 1, 11–39. DOI: 10.1146/annurev-statistics-022513-115645.
  • Merrill, D., and Garland, M. (2016), “Single-Pass Parallel Prefix Scan with Decoupled Look-Back,” NVIDIA, Tech. Rep. NVR-2016-002.
  • Micikevicius, P. (2009), “3D Finite Difference Computation on GPUs Using CUDA,” in Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp. 79–84. DOI: 10.1145/1513895.1513905.
  • Mittal, S., Madigan, D., Burd, R.S., and Suchard, M.A. (2014), “High-Dimensional, Massive Sample-Size Cox Proportional Hazards Regression for Survival Analysis,” Biostatistics, 15, 207–221.
  • Mueller-Using, S., Feldt, T., Sarfo, F.S., and Eberhardt, K.A. (2016), “Factors Associated with Performing Tuberculosis Screening of HIV-Positive Patients in Ghana: Lasso-based Predictor Selection in a Large Public Health Data Set,” BMC Public Health, 16, 1–8.
  • Nickolls, J., Buck, I., Garland, M., and Skadron, K. (2008), “Scalable Parallel Programming with CUDA,” Queue, 6, 40–53. DOI: 10.1145/1365490.1365500.
  • Noordzij, M., Leffondré, K., van Stralen, K.J., Zoccali, C., Dekker, F.W., and Jager, K.J. (2013), “When Do We Need Competing Risks Methods for Survival Analysis in Nephrology?” Nephrology Dialysis Transplantation, 28, 2670–2677. DOI: 10.1093/ndt/gft355.
  • Rennich, S. (2011), “CUDA C/C++ Streams and Concurrency,” in GPU Technology Conference.
  • Robins, J.M., and Rotnitzky, A. (1992), “Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers,” in AIDS Epidemiology, eds. N. P. Jewell, K. Dietz, and V. T. Farewell, pp. 297–331, Boston, MA: Springer.
  • Shortreed, S.M., and Ertefaie, A. (2017), “Outcome-Adaptive Lasso: Variable Selection for Causal Inference,” Biometrics, 73, 1111–1122.
  • Suchard, M.A., Wang, Q., Chan, C., Frelinger, J., Cron, A., and West, M. (2010), “Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures,” Journal of Computational and Graphical Statistics, 19, 419–438. DOI: 10.1198/jcgs.2010.10016.
  • Suchard, M.A., Simpson, S.E., Zorych, I., Ryan, P., and Madigan, D. (2013), “Massive Parallelization of Serial Inference Algorithms for a Complex Generalized Linear Model,” ACM Transactions on Modeling and Computer Simulation (TOMACS), 23, 1–17. DOI: 10.1145/2414416.2414791.
  • Suchard, M.A., Schuemie, M.J., Krumholz, H.M., You, S.C., Chen, R., Pratt, N., Reich, C.G., Duke, J., Madigan, D., Hripcsak, G., et al. (2019), “Comprehensive Comparative Effectiveness and Safety of First-Line Antihypertensive Drug Classes: A Systematic, Multinational, Large-Scale Analysis,” The Lancet, 394, 1816–1826.
  • Terenin, A., Dong, S., and Draper, D. (2019), “GPU-Accelerated Gibbs Sampling: A Case Study of the Horseshoe Probit Model,” Statistics and Computing, 29, 301–310.
  • Wu, T.T., and Lange, K. (2008), “Coordinate Descent Algorithms for Lasso Penalized Regression,” The Annals of Applied Statistics, 2, 224–244. DOI: 10.1214/07-AOAS147.
  • Xu, Y., and Yin, W. (2017), “A Globally Convergent Algorithm for Nonconvex Optimization based on Block Coordinate Update,” Journal of Scientific Computing, 72, 700–734.
  • Zhou, H., Lange, K., and Suchard, M.A. (2010), “Graphics Processing Units and High-Dimensional Optimization,” Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 25, 311–324. DOI: 10.1214/10-STS336.