1,262
Views
5
CrossRef citations to date
0
Altmetric
Research Paper

CAMDA 2014: Making sense of RNA-Seq data: From low-level processing to functional analysis

, &
Pages 31-40 | Received 03 Oct 2014, Accepted 17 Jan 2015, Published online: 08 May 2015

Abstract

Numerous methods of RNA-Seq data analysis have been developed, and there are more under active development. In this paper, our focus is on evaluating the impact of each processing stage; from pre-processing of sequencing reads to alignment/counting to count normalization to differential expression testing to downstream functional analysis, on the inferred functional pattern of biological response. We assess the impact of 6,912 combinations of technical and biological factors on the resulting signature of transcriptomic functional response. Given the absence of the ground truth, we use 2 complementary evaluation criteria: a) consistency of the functional patterns identified in 2 similar comparisons, namely effects of a naturally-toxic medium and a medium with artificially reconstituted toxicity, and b) consistency of the results in RNA-Seq and microarray versions of the same study. Our results show that despite high variability at the low-level processing stage (read pre-processing, alignment and counting) and the differential expression calling stage, their impact on the inferred pattern of biological response was surprisingly low; they were instead overshadowed by the choice of the functional enrichment method. The latter have an impact comparable in magnitude to the impact of biological factors per se.

Abbreviations

DE=

differential expression

ANOVA=

analysis of variance

FDR=

false discovery rate

FPKM=

fragments per kilobase of exon per million mapped fragments

Introduction

The ultimate goal of any transcriptomic experiment is to discover functional patterns of biological response to conditions of interest (treatments, environmental influences, mutations etc). However, transcriptomic analytical pipelines consist of multiple processing stages, and one would reason that the propagation of results from analytical choices made at earlier stages to the downstream steps might lead to vastly different conclusions. To the best of our knowledge, systematic comparison of the results between different combinations of analytical methods within a pipeline in context of the resulting functional signature of cellular response has not been previously addressed.

Soon after its invention, RNA-SeqCitation1 started to aggressively replace microarrays in the field of transcriptomics,Citation2 and this shift is due to RNA-Seq's impressively larger (at least 2 orders of magnitude) dynamic range and its ability to detect novel transcripts.Citation1

Both microarray and RNA-Seq technologies have their specific biases: given the same concentration of target cDNA, number of probe-target duplexes on a microarray will be strongly sequence-dependentCitation3 and even physical position of the probe on the array may have an influence on the measurement by “fogging” the positions of low-intensity probes with the fluorescence emitted from their bright neighbors, resulting in artificial increase in estimated expression of numerous genes.Citation4 The observed phenomenon of “probe affinity effects” was addressed in low-level microarray processing algorithms such as RMA.Citation5 Recently, the phenomenon appeared to be in the focus of more mechanistic modeling efforts.Citation6 On the other hand, RNA-Seq has its own biases, starting from gene length bias that could be eliminated by normalizing for RNA length and for the total number of mapped readsCitation7 and as uniform sampling of mRNA poolCitation8 to biases that belong to next-generation sequencing technology itself such as nucleotide per cycle bias and mappability bias.Citation9

The nature and associated problems of hybridization- and sequencing-based transcriptomic technologies led to believe that microarrays and RNA-Seq will stay complementary;Citation10 this was supported by co-existence of reports that show RNA-Seq to be either more sensitive than microarraysCitation11 or surprisingly less sensitive for low-expressed genes.Citation12 Apparently, the difference in conclusions stems directly from the lack of consistent, well-established data processing routines. Long-term, one may expect prospering of RNA-Seq based on its objective advantages of high dynamic range and de novo transcriptome structure discovery, while all the technical issues are addressed by the analytical methods. However at present, the methods of RNA-Seq analysis are still rapidly evolving, and most importantly, detection of biologically meaningful responses remains a “bigger challenge” within both microarray and RNA-Seq worlds.Citation13 This makes investigating the interface between data processing stages and functional pattern detection essential.

We use 2 orthogonal approaches to evaluate the results of various combinations of RNA-Seq pipeline processing: gene-level consistency of DE results on microarray and RNA-Seq platforms and – the key of this study - conservation of functional signatures in 2 biologically related comparisons within the RNA-Seq platform.

The published results by SchwalbachCitation14 and KeatingCitation15 that compare 3 different conditions of E.coli growth: control (SynH, synthetic corn stover hydrolysate without the toxic components), naturally toxic medium (ACSH, corn stover hydrolysate containing toxic components of lignin decomposition) and artificially toxic medium (SynHLT, synthetic corn stover hydrolysate with added mixture of toxic components identified in ACSH) were used to help evaluate the relative impact of various stages of RNA-Seq processing on the inferred functional patterns of cellular response.

Our study shows that the lower level stages of the RNA-Seq analytical pipeline, while being highly diverse in nature and options, have surprisingly low influence on the derived functional pattern of the response. At the same time, choice of the functional enrichment methodology is found to be critical.

Results

Overview

Our approach to evaluating various methods within an RNA-Seq data analysis pipeline is outlined in . For every major stage of the pipeline (low-level processing, differential expression calling and functional pattern detection) we evaluate the results using both internal and external criteria whenever possible. The internal criteria are restricted to a particular processing stage and do not rely on external information, e.g. results from a different gene expression measuring platform, or biological knowledge. Every evaluation stage will be described in the respective subsection of the Results. The key external evaluation approach used in this study is correlation between vectors of inferred functionality for the 2 biologically related comparisons (). The level of similarity of the functional profiles derived for the 2 comparisons was selected as an external biological criterion because: a) the 2 toxic media are expected to cause similar profiles of cellular response because the composition of the artificially toxic medium was carefully matched to the composition of the naturally toxic medium and b) this expectation was recently supported experimentally.Citation15 In order to check the robustness of higher-level functional pattern detection, we have chosen 2 comparisons that are diverse enough to depart from checking gene-level signal reproducibility and similar enough to expect similar higher-level patterns of response.

Figure 1. Overview of multilevel RNA-Seq pipeline evaluation approach. Bold arrows, analysis pipeline; dashed gray arrows, tested influences of various analysis stages on the robustness of the resulting functional signature (the latter was measured as correspondence between the functional signatures inferred for cellular responses to natural and artificial media toxicity).

Figure 1. Overview of multilevel RNA-Seq pipeline evaluation approach. Bold arrows, analysis pipeline; dashed gray arrows, tested influences of various analysis stages on the robustness of the resulting functional signature (the latter was measured as correspondence between the functional signatures inferred for cellular responses to natural and artificial media toxicity).

Figure 2. Related biological comparisons used. Naturally toxic medium: ammonia-pretreated corn stover hydrolysate (ACSH); artificially toxic medium: synthetic hydrolysate with added cocktail of toxic compounds discovered in ACSH (“lignotoxins”); control medium: synthetic hydrolysate with lignotoxins omitted. SeeCitation15 for further details.

Figure 2. Related biological comparisons used. Naturally toxic medium: ammonia-pretreated corn stover hydrolysate (ACSH); artificially toxic medium: synthetic hydrolysate with added cocktail of toxic compounds discovered in ACSH (“lignotoxins”); control medium: synthetic hydrolysate with lignotoxins omitted. SeeCitation15 for further details.

In order to consider influences of both technical and biological nature on the same scale, the relative contributions of various processing stages were assessed in comparison with contribution from biological factors. For this purpose, all the combinations of the various methods were used to analyze time series data consisting of 3 growth stages (exponential, transitional and stationaryCitation15).

Exhaustive combining of 2 read pre-processing strategies, 2 alignment / counting pipelines, 3 count normalization methods, 2 pairwise comparisons, 4 DE calling methods, 3 functional overrepresentation strategies, 4 geneset types, 2 directions of expression changes, and 3 time points resulted in 6,912 vectors of functional enrichment results (approximately 1 million of individual gene set enrichment p-values). After removal of vectors with NA values, the pool of 3,145 pairwise correlations between the vectors representing the 2 biologically related comparisons was modeled as a function of all the technical processing factors plus biological factor (growth stage). This provided a big picture estimate of factor influence which complements the results of stage-specific evaluations.

Low-level processing and alignment/counting

We considered 2 options for read pre-processing: “RAW” (untrimmed) vs. “QC” (trimmed reads) in order to investigate the tradeoff between having more sequence data and having more confident base calls at every base position in a read. For alignment/counting, we considered “BWA-HTSeq” vs. “Bowtie-RSEM” (see Materials and Methods) to investigate the difference between genome alignment followed by hard-threshold counting and transcriptome alignment followed by probabilistic counting.

Despite the very different approaches for obtaining the count estimates of each gene, the pairwise Pearson correlation coefficients across all the libraries (Fig. S1) were surprisingly similar, with respective differences for each of the 210 pairwise correlation coefficients within 0.03 for both “RAW-Bowtie-RSEM minus RAW-BWA-HTSeq” and “QC-Bowtie-RSEM minus QC-BWA-HTSeq” comparisons. To detect a trend in interlibrary correlation, values across the 4 combinations of read pre-processing and alignment / counting pipelines, we counted for each combination - the numbers of times a particular method combination had the maximum correlation value across all 4 pipelines. shows that the RAW-Bowtie-RSEM pipeline was the overall winner, with 94 maximal values out of 210 comparisons and 12 out of 15 comparisons restricted to biological replicates. If we exclude the read trimming factor, the Bowtie-RSEM pipeline shows overall better performance as compared to BWA-HTSeq by having 130 winners in 210 comparisons (p = 0.00034) with the overall pool of libraries and 13 winners in 15 comparisons (p = 0.0037) between biological replicates. The effect of read pre-processing was opposite for the 2 alignment/counting methods: while it improved the interlibrary correlations for BWA-HTSeq, it affected the RSEM results in exactly the opposite way (). This is not surprising, considering the nature of the methods: while BWA-HTSeq takes the supplied sequence information as is and therefore heavily depends on the quality of the base calls, RSEM incorporates nucleotide-level quality scores in its modelCitation16 which allows the model to leverage the information in the lower-quality 3′-tail of the sequencing reads to construct better alignments. This shows the robustness of RSEM's model even with a bacterial genome, where the complications of alternative splicing and high share of non-coding genomic DNA (major issues addressed by RSEM) are not applicable.

Figure 3. Number of pairwise comparisons a particular combination of read preprocessing and alignment / counting method resulted in the maximal value of Pearson correlation between the genome-wide vectors of counts. (A), winning cases out of all the 210 pairs, irrespective to experimental conditions; (B), winning cases out of 15 interreplicate comparisons. Cyan, BWA-HTSeq pipeline; blue, Bowtie-RSEM pipeline.

Figure 3. Number of pairwise comparisons a particular combination of read preprocessing and alignment / counting method resulted in the maximal value of Pearson correlation between the genome-wide vectors of counts. (A), winning cases out of all the 210 pairs, irrespective to experimental conditions; (B), winning cases out of 15 interreplicate comparisons. Cyan, BWA-HTSeq pipeline; blue, Bowtie-RSEM pipeline.

Differential expression calling

Genome-wide profiles of significance values of DE calling

Figure S2 shows distributions of gene-level statistics for DE tests performed with 3 methods (DESeq, EBSeq, edgeR and voom / limma). The overall profiles of significance level assignments are found to be dramatically method-dependent. Based on biological reasoning, when one may expect genes to be either differentially expressed between the conditions or not, the ideal-world genome-wide profile of DE significance value is expected to be bimodal with peaks near the extreme (0 or 1) values, with low density in the intermediate zone that represents “class assignment hesitation” of an algorithm. Still, lack of discrimination against the “hesitation zone” was very obvious for 3 out of 4 algorithms (Fig. S2 A, C, D). On the contrary, EBSeq (Fig. S2B) has demonstrated a trend of “hesitation zone avoidance.” While the latter observation is encouraging, the distribution profile alone cannot be taken as an evidence for a better biological relevance of EBSeq-generated output without other, complementary evaluation strategies.

Individual library-level expression overlaps between the conditions

The majority of the transcriptomic experiments (both microarray and RNA-Seq) have a low number of biological replicates, with 3 or even 2 replicates dominating the overall pool of publically available repositories such as GEO.Citation17 Given that fact, which represents the de facto limited information on true data variability for any given gene, a biologist conducting the experiment needs to stay on conservative side and focus on genes that show a consistent trend of differential expression between the conditions at the individual replicate level. To formalize this selection, we introduced a “critical coefficient” (crt) filter defined and applied in our earlier microarray studies,Citation18,19,20,21,22 which helped to discover more consistent biological stories. Briefly, crt below 1 indicates a presence of unfavorable “data overlap” between the conditions being compared. The formal statistical properties of the critical coefficient will be reported elsewhere.

Here we check the relevance of the critical coefficient for DE analysis of RNA-Seq data. shows critical coefficients plotted for the genes called differentially expressed (i.e. with FDR < 0.05) in SynH+LT vs. SynH at timepoint T2 by 4 methods. Surprisingly, 2 out of 4 methods () demonstrate noticeable (6–8%) presence of genes with overlapping library-level expression values between the conditions (log10(crt) < 0). The phenomenon included pronounced cases like gene “b4354” (Fig. S3) which had extremely low (<0 .1) value for the critical coefficient indicating dramatic overlap of the expression values between the conditions. While 3 out of 4 methods associated high FDR values with this gene (EBSeq: 0.87; DESeq: 0.60; voom/limma: 0.97), edgeR selected this gene as highly differentially expressed (FDR = 0.0000082). This fact should not be treated as an evidence for a particular inaccuracy of edgeR but rather as an evidence for meaningfulness of critical coefficient filtering even on top of modern RNA-Seq differential expression calling methods. It should be noted that the genes selected as differentially expressed by voom / limma showed higher critical coefficient values overall, with a clear trend of those values to follow the statistical significance of the DE call; expression values for all the genes that were called differentially expressed by voom / limma at lower (< 0.005) FDR values were very well (at least 1.8 times) separated between the conditions at the level of individual libraries, while other methods assigned this strict level of statistical significance to a number of genes with weakly separated or even overlapping library-level values (). In fact, approximately half of the genes with crt < 1 and FDR < 0.05 according to EBSeq or edgeR, were selected at FDR below 0.005 by those methods (47 out of 96 for EBSeq and 27 out of 50 for edgeR) which follows the trend of generating densely populated clusters of genes with extremely high statistical significance of DE by those 2 methods (). The higher agreement between voom / limma output and biological common sense expectations suggests that this recent adaptation of a well-developed microarray analysis framework to RNA-Seq dataCitation23 deserves closer attention.

Figure 4. Critical Coefficients computed for genes that are called differentially expressed (FDR < 0.05) in artificially toxic medium at exponential growth phase by 4 DE calling methods. For EBSeq, posterior Probability of Equal Expression was used as a conservative estimate of FDR. (A), EBSeq; (B), DESeq; (C), edgeR; (D), voom / limma. Red, genes with critical coefficient values below 1 (corresponds to 0 in the log-scale applied). Black and red inline numbers are numbers of genes that are called DE by a particular method and show either absence (crt >=1, black) or presence (crt<1, red) of inconsistent sign of the transcript abundance changes between the 2 conditions at individual replicate level.

Figure 4. Critical Coefficients computed for genes that are called differentially expressed (FDR < 0.05) in artificially toxic medium at exponential growth phase by 4 DE calling methods. For EBSeq, posterior Probability of Equal Expression was used as a conservative estimate of FDR. (A), EBSeq; (B), DESeq; (C), edgeR; (D), voom / limma. Red, genes with critical coefficient values below 1 (corresponds to 0 in the log-scale applied). Black and red inline numbers are numbers of genes that are called DE by a particular method and show either absence (crt >=1, black) or presence (crt<1, red) of inconsistent sign of the transcript abundance changes between the 2 conditions at individual replicate level.

Consistency with microarray results

The same experimental design was used in the earlier studyCitation14 employing microarray platform. We have used this publically available dataset to assess the robustness of differential expression results across the platforms. Since the ground truth is not available and both microarray and RNA-Seq platforms possess a mixture of features of positive and negative connotation (e.g., proven technology with established analytical toolset but low dynamic range and cross-hybridization problems vs. new technology with potentially higher precision but immature analytical toolset), we applied very neutral / agnostic consistency criterion defined as a ratio between number of genes called DE by both platforms (with same directionality of change) to the size of the pool of genes called DE by any platform, for every combination of the pipeline parameters and biological factors. The value of this Relative Intersection (RI) statistic varied widely (Fig. S4), from almost no overlap (0.01) to about ½ overlap (0.48) at gene level, with median of 0.19 and mean 0.24. Surprisingly, both read pre-processing and the choice of alignment / counting pipeline had clearly no influence of the interplatform DE result consistency (Fig. S5). Moving downstream the processing pipeline, beyond count values generation, we start to see the influence of the processing options: count normalization method showed a borderline influence on the RI (Fig. S6A), and choice of the method for subsequent DE test (Fig. S6B) had significant effect on this statistic. Still, biological factors such as nature of the toxic medium (ACSH or SynH+LT – Fig. S7A), as well as growth stage (Fig. S7B) had the most pronounced effect on the RI.

Complementing the one-factor-at-a-time evaluation, we also took all the considered factors as parts of a generalized linear model that targets RI as a function of all the factors mentioned above plus the critical coefficient (the effect of the latter was not obvious during one factor at a time evaluation – not shown). The model confirmed the highest significance of biological factors, followed by DE calling choice and insignificance of the low-level processing options (). Artificially toxic (better controlled) medium resulted in better result correspondence across platforms, the latter decreased at stationary phase of growth when interference of multiple biological responses is observed. EBSeq, followed by edgeR, showed elevated RI values, compared to other 2 DE methods. Surprisingly, application of critical coefficient was neutral in terms of resulting interplatform agreement.

Table 1. Generalized Linear Model of the RNA-Seq - microarray DE results concordance (expressed as RI) as a function of the technical and biological factors combined. Levels of factors reported: QCRAW, no read pre-processing; alignRSEM, RSEM alignment / counting pipeline; normTMM, count normalization using TMM; normUPPER, count normalization using upper quartile; compsynHLT, biological comparison of SynH+LT vs. SynH; timeT3, transitional phase of cellular growth; timeT4, stationary phase of cellular growth; DEEBSeq, EBSeq as the DE calling algorithm; DEedgeR, edgeR as the DE calling algorithm; DEvoom, voom / limma as the DE calling algorithm; critTRUE, critical coefficient of 1.15 applied. Bold, variables retained after forward-backward regression and p-value filtering.

Overall, there is a trend of increasing the impact of the technical processing stages from upstream pre-processing to downstream differential expression testing. Keeping in mind the fact that the final result of the transcriptomic experiment is inferred pattern of functionality, evaluating the pipeline on the basis of functional pattern conservation is especially informative.

Relative impact of processing stages on the derived pattern of functionality

Overall model

We used 4 types of gene sets representing different dimensions of cellular functionality: 133 KEGG pathways, 400 species-specific pathways, 172 regulons and 333 transporters, combined with 3 gene set enrichment strategies (see Materials and Methods). For every combination of read pre-processing, count normalization, DE test, direction of expression change, gene set type, gene set enrichment strategy and cellular growth stage, we computed the correlation coefficient between the 2 (ACSH vs. SynH and SynH+LT vs. SynH) vectors of –log10(GS-FDR) within each gene set type, where GS-FDR is the FDR associated with call of involvement of a particular gene set in the transcriptional response. Hence, for KEGG pathways we computed correlations between 133-mer vectors, for species-specific pathways – between 400-mer vectors, and so forth.

The global overview of relative contribution of various processing stages on the resulting inferred functionality patterns was generated via the following Generalized Linear Model:

FunCor ∼ PreProcess + AlignCount + Normaliz + DEMethod + DEDirection + GeneSetType + FunSearchType + TimePoint

where FunCor is the Pearson correlation between vector of –log10(GS-FDR) for ACSH vs. SynH comparison and the respective vector for SynH+LT vs. SynH comparison, PreProcess is the read pre-processing option (RAW or QC), AlignCount is the alignment / counting pipeline (BWA-HTSeq or Bowtie-RSEM), Normaliz is the gene length-agnostic count normalization method for subsequent DE test (MED, UPPER, TMM), DEMethod is the algorithm for DE calling (EBSeq, DESEq, edgeR, voom / limma), DEDirection is the direction of the expression changes (up or down), GeneSetType is the type of gene set (KEGG, sPW, TF, Transporters), FunSearchType is the enrichment testing strategy (cluster, Fisher, Fold), TimePoint is the growth stage of the cells (T2, T3, T4).

The model summary is shown in . The contribution of various processing stages varies dramatically, with a clear trend of dramatically higher impact of biological factors (direction of the expression change and cell growth stage, absolute values of model coefficients 0.28 and 0.25) over technical factors (maximal model coefficient 0.13) and increasing the impact of the stage on preservation of the resulting signature of functionality toward the downstream analysis stages: the pre-processing and alignment / counting stages had no detectable influence of the functional signature preservation at all (“PreProcessRAW” and “AlignCountRSEM” lines in the ), normalization, DE testing options and the choice of gene set type had significant influences, however of low magnitude (model coefficients within 0.08), and the choice of functional overrepresentation strategy had the highest impact among all the technical factors.

Table 2. Generalized Linear Model of the correlation between vectors of functionality as a function of the technical and biological factors combined. Levels of factors reported: DEDirectionUP, genes upregulated in toxic media; GeneSetTypesPW, species-specific pathways as gene set type; GeneSetTypeTF, regulons as gene set type; GeneSetTypeTransp, Transporters as gene set type; FunSearchTypepiano, Fisher's gene-level p-values summarization as enrichment test; FunSearchTypepianoFOLD, gene-level summarization of fold changes as enrichment test; PreProcessRAW, nor read pre-processing; AlignCountRSEM, Bowtie-RSEM alignment / counting pipeline; NormalizTMM, TMM count normalization; NormalizUPPER, upper quartile count normalization; TimePointT3, transitional phase of cellular growth; TimePointT4, stationary phase of cellular growth; DEMethodEBSeq, EBSeq as the DE calling algorithm; DEMethodedgeR, edgeR as the DE calling algorithm; DEMethodvoom, voom / limma as the DE calling algorithm. Bold, variables retained after forward-backward regression and p-value filtering.

The two most pronounced effects detected – expression change directionality and cell growth stage – are illustrated of and , respectively. shows that the conservation of the inferred functional signature is dramatically higher in the case of downregulated genes, where more than half of the cases have Pearson correlation above 0.5. On the contrary, for upregulated genes we observe not only an overall left-shift of the distribution but also a substantial Gaussian component with peak around 0, representing cases of absence of any concordance between cellular responses to the 2 toxic media. Similar influence of the cell growth stage may be better illustrated when conditioned by the directionality of expression change (), where the gradual divergence between the 2 toxic media from T2 through T3 to T4 is clearly visible (), with main modes of the correlation values distribution for T2 being above 0.5 and for T4 around 0. With the direction of changes that is favorable for functional signature conservation, i.e., downregulation, all the 3 growth stages have the main peak of the correlation values distributions above 0.5, however the stationary stage (T4) still has a substantial tail at very low correlation values, with significant presence in the anti-correlation zone ().

Figure 5. Distributions of correlation coefficients between vectors of -log10(GS-FDR) values computed for ACSH vs. SynH and SynH+LT vs. SynH comparisons for up- and downregulated genes.

Figure 5. Distributions of correlation coefficients between vectors of -log10(GS-FDR) values computed for ACSH vs. SynH and SynH+LT vs. SynH comparisons for up- and downregulated genes.

Figure 6. Distributions of correlation coefficients between vectors of -log10(GS-FDR) values computed for ACSH vs. SynH and SynH+LT vs. SynH comparisons for different cell growth stages, with datasets restricted to upregulated (A) and downregulated (B) genes.

Figure 6. Distributions of correlation coefficients between vectors of -log10(GS-FDR) values computed for ACSH vs. SynH and SynH+LT vs. SynH comparisons for different cell growth stages, with datasets restricted to upregulated (A) and downregulated (B) genes.

Sub-models conditioned by biological factors

While the overall impact distribution between biological and technical factors is clear, one cannot exclude the possibility of masking more subtle technical influences by the overwhelming effect of biological factors. To reveal possible conditional dependences of the impact of technical factors, we constructed the 4 models with restriction of the data to a particular combination of the most (or the least) favorable directionality of expression change (i.e. downregulation or upregulation) and the most / the least favorable growth stage (i.e., T2 or T4).

TableS1 shows the structures of those models. A key observation there is association of geneset type significance with the direction of the expression change. For example, regulons are positively associated with preservation of functional signatures between the 2 media in the case of upregulated genes (Table S1AC), and no such association is visible for downregulated genes (Table S1BD). The phenomenon is even more pronounced with transporters, when a positive effect in the case of upregulated, and negative effect for downregulated genes is observed (the latter, however, is below statistical significance for growth stage T2). As to the choice of the enrichment method, summarization of fold changes has very positive effect for downregulated genes (Table S1BD), which may be attributed to generally more pronounced effect of downregulation than upregulation in the toxic media. The latter method was also beneficial (however, with less magnitude) for upregulated genes at T2 (Table S1A).

A noticeable phenomenon is the often-seen negative effect of TMM normalization on the functional signature conservation, visible in both the overall model () and in condition-restricted models (Table S1ACD). Visualizing the distributions of functional correlation values for upregulated genes at growth stage T2, for example () revealed that while all the 3 normalization methods have the main mode of distribution density above Pearson correlation of 0.5 and no simple scaling method (UPPER or MED) results in functional correlations below 0.12, TMM has a large low-end tail heavily exposed to the anticorrelation zone ().

Figure 7. Distributions of correlation coefficients between vectors of -log10(GS-FDR) values computed for ACSH vs. SynH and SynH+LT vs. SynH comparisons for different count normalization methods, with data sets restricted to upregulated genes at exponential (T2) growth stage.

Figure 7. Distributions of correlation coefficients between vectors of -log10(GS-FDR) values computed for ACSH vs. SynH and SynH+LT vs. SynH comparisons for different count normalization methods, with data sets restricted to upregulated genes at exponential (T2) growth stage.

Discussion

In this work, we approached a “bird view” evaluation of all the stages of RNA-Seq analysis pipeline in order to estimate relative contribution of option variations at every stage of the pipeline on the resulting functional signature of the cellular response.

The main observation of this study is the vanishing influence of low-level processing stages, such as read pre-processing and alignment / counting options on the functional signature derived downstream. While not all the possible combinations of read alignment and counting were checked, the results suggest that this level of detail is not necessary for our purpose; the two alignment / counting pipelines, despite very different nature of the underlying methods, had negligible effect on the inferred functional patterns. Still, analysis of interlibrary correlations within the alignment / counting stage and their combined dependence on upstream read pre-processing has demonstrated the advantage of probabilistic counting method (RSEM) that apparently can take advantage of lower-quality base calls to construct better alignments.

Checking relative agreement between RNA-Seq and microarray DE results on gene level discovered an overwhelming influence of biological factors on the agreement statistic, as compared to the technical parameters of the pipeline. Dependence of interplatform concordance on the magnitude of biological effect was reported recently.Citation24 Still, it was somewhat counterintuitive that processing options of different nature affect the interplatform concordance in much lesser extent. Interestingly, the work ofCitation24 compared 3 (limma, DESeq, edgeR) out of 4 DE methods applied to our study and found the results to be highly correlated. Our observation of different behavior of those methods in terms of genome-wide significance level distribution and preserving the consistency of the direction of expression changes provides another perspective and may motivate further developments of DE calling algorithms with consideration of biological expectations.

The main approach to method judgment applied here is conservation of functional signatures between 2 biologically related comparisons. While this approach is entirely different from microarray – RNA-Seq agreement evaluation, the bigger picture conclusions from the 2 were very similar: 1) biological factors have the most influence on the final result, 2) earlier stages of the processing pipeline have less impact on the final result than later stages, even if we vary the definition of the “final result” and consider either gene-level differential expression result across 2 platforms or geneset-level results within the same platform across 2 biologically related comparisons.

Examples of concordant conclusions from the 2 complementary types of evaluation include: 1) advantage of simple median normalization which demonstrated both better array intersection and better functional pattern reproducibility, 2) overwhelming importance of biological factors: later time point in the experiment (i.e. the stationary phase of cell growth) showed dramatically less agreement with arrays and dramatically less functional conservation between related comparisons within RNA-Seq platform, 3) negligible effect of low-level processing stages. Concerning the item 2, the stationary phase of bacterial cell growth, also known as conditional senescenceCitation25 is characterized by co-existence of dying and living cells and transcriptional responses to nutrient limitations. Those factors complicate the transcriptome profile of stationary-phase bacteria, and our observation of lower interplatform concordance at this stage is in agreement with the report that responses involving specific mechanisms show better interplatform concordance than the complex responses.Citation24 Beyond interplatform agreement issue at gene level, we show that the same principle is true when applied to within-platform agreement between biologically related comparisons at the level of inferred functionality signatures.

Occasionally, multiple evaluation procedures may result in more questions than answers. For example, DE detection method EBSeq demonstrated best significance values distribution profile, poor consistency of direction of the DE effect across replicates, best coherence with microarray results and none to a marginal advantage on the functionality signature conservation. This confirms that we are far from providing a unified solution which is optimal for all the possible cases. The idea of needing to adapt the analysis to a particular experimental setup is further confirmed by our results on modeling the impact of the processing factors conditioned by the most influential biological factors (Table S1), from which we see large interconnection of the best-performing analytical options and the biological context.

Our results suggest that RNA-Seq analysis should be extremely biology-aware, and special effort should be devoted to optimizing the last stage of the analysis, i.e., search for the functional patterns that form a unique signature of cellular response to the conditions of interest.

Materials and Methods

Data sets

RNA-Seq and microarray data sets used in the previous worksCitation14,15 are available from GEO (accession number GSE58927, more details in Supplementary Table S1). Microarray data was processed exactly as describedCitation14. For RNA-Seq data processing options, see below.

Reads pre-processing

The options included using reads as they come from the sequencing facility – referred to as “RAW” reads, and performing read trimming, resulting in “QC” reads. “RAW” reads were already free from adaptor and other artificial sequences and had length of 100 nt. To generate “QC” reads, we used Trimmomatic softwareCitation26 with the following rules: 1) remove the first 12 bases from 5′end, 2) remove any number of nucleotides from 3′end that have the average quality score < 30 in a 3-nt sliding window, keep the trimmed read if 36 or more nucleotides are left.

Alignment and counting

The genome alignment followed by hard-threshold counting pipeline (“BWA-HTSeq”) included alignment of reads with BWA version 0.7.9a-r786Citation27 and subsequent counting of genomic features’ coverage with HTSeq version 0.6.1p2Citation28 using default parameters. The transcriptome alignment followed by probabilistic counting pipeline (“RSEM”) included read alignment to the transcriptome with Bowtie version 0.12.7,Citation29 followed by counting with RSEM version 1.2.4.Citation16 For both pipelines, information of the type of strand – specificity of the libraries (generated with dUTP protocol) was supplied as “–stranded=reverse” or “–forward-prob 0” for BWA-HTSeq and RSEM pipelines, respectively. Fig. S8 shows our entire low-level processing pipeline (with read pre-processing and alignment / counting options). The pipeline code is available at https://github.com/scienceforever/GLSeq under GNU General Public License version 3.

Count normalization

The entire set of libraries was pre-normalized as a pool using one of the 3 methods: “MED” – median normalization from EBSeqCitation30 package, “UPPER” – upper quartile (75 percentile) normalization from the same package, or “TMM” – trimmed mean of M values normalizationCitation31 implemented in edgeRCitation32 package.

Tests for differential expression

The pre-normalized count data sets were used as input to the 4 differential expression testing algorithms: EBSeq,Citation30 DESeq,Citation33 edgeRCitation32 and voom / limma.Citation23 The default count normalization routines built in the packages were switched off.

Geneset enrichment analysis

Four types of gene sets – KEGG pathways, species-specific pathways, regulons and transporters – were used. KEGG gene-pathway assignments were downloaded via Bioconductor's KEGGRESTCitation34 package, other genesets were formatted from EcoCyc version 17.0Citation35 flat files. Cluster-based enrichment test with sets of responsive genes (FDR < 0.05) was performed with goseq package,Citation36 the 2 gene-level statistics summarization tests – Fisher's combined probability testCitation37 and summarization of median gene-level fold changes - using pianoCitation38 Bioconductor package. Statistical significance of the enrichment tests was estimated with 100,000 data permutations.

Generalized linear modeling

Generating of the models was performed in R statistical environment. To refine the condition-specific models (presented in Supplementary Tables S2A-S2D), stepwise forward-backward regression was applied with Bayesian Information Criterion penalty. The models were additionally cleaned up by removing variables with p > 0.05. The initial models ( and ) are presented before stepwise regression and p-value filtering, to demonstrate relative contributions of different processing factors explicitly.

Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

Supplemental material

1010923_Supplementary_Figures_1-8.zip

Download Zip (1 MB)

1010923_Supplementary_Tables.zip

Download Zip (428 KB)

Acknowledgments

Authors are thankful to Dr. Dana Wohlbach for advice on BWA processing options and Prof. Robert Landick for general support of the project.

Funding

This work was supported by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02–07ER64494).

Supplemental Material

Supplemental data for this article can be accessed on the publisher's website.

References

  • Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008; 320:1344-9; PMID:18451266; http://dx.doi.org/10.1126/science.1158441
  • Blow N. Transcriptomics: the digital generation. Nature 2009; 458:239-42; PMID:19279641; http://dx.doi.org/10.1038/458239a
  • Mei R, Hubbell E, Bekiranov S, Mittmann M, Christians FC, Shen MM, Lu G, Fang J, Liu WM, Ryder T, et al. Probe selection for high-density oligonucleotide arrays. Proc Nat Acad Sci U S A 2003; 100:11237-42; PMID:14500916; http://dx.doi.org/10.1073/pnas.1534744100
  • Upton GJ, Harrison AP. The detection of blur in affymetrix genechips. Stat Appl Gen Mol Biol 2010; 9:37; PMID:21044041; http://dx.doi.org/10.2202/1544-6115.1590
  • Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucl Acid Res 2003; 31:e15; PMID:12582260; http://dx.doi.org/10.1093/nar/gng015
  • Harrison A, Binder H, Buhot A, Burden CJ, Carlon E, Gibas C, Gamble LJ, Halperin A, Hooyberghs J, Kreil DP, et al. Physico-chemical foundations underpinning microarray and next-generation sequencing experiments. Nucl Acid Res 2013; 41:2779-96; PMID:23307556; http://dx.doi.org/10.1093/nar/gks1358
  • Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth 2008; 5:621-8; PMID:18516045; http://dx.doi.org/10.1038/nmeth.1226
  • Labaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, Kreil DP. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics 2011; 27:i383-91; PMID:21685096; http://dx.doi.org/10.1093/bioinformatics/btr247
  • Schwartz S, Oren R, Ast G. Detection and removal of biases in the analysis of next-generation sequencing reads. PloS one 2011; 6:e16685; PMID:21304912; http://dx.doi.org/10.1371/journal.pone.0016685
  • Liu F, Jenssen TK, Trimarchi J, Punzo C, Cepko CL, Ohno-Machado L, Hovig E, Kuo WP. Comparison of hybridization-based and sequencing-based gene expression technologies on biological replicates. BMC Genom 2007; 8:153; PMID:17555589; http://dx.doi.org/10.1186/1471-2164-8-153
  • Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008; 321:956-60; PMID:18599741; http://dx.doi.org/10.1126/science.1160342
  • Malone JH, Oliver B. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 2011; 9:34; PMID:21627854; http://dx.doi.org/10.1186/1741-7007-9-34
  • Shendure J. The beginning of the end for microarrays? Nat Meth 2008; 5:585-7; PMID:18587314; http://dx.doi.org/10.1038/nmeth0708-585
  • Schwalbach MS, Keating DH, Tremaine M, Marner WD, Zhang Y, Bothfeld W, Higbee A, Grass JA, Cotten C, Reed JL, et al. Complex physiology and compound stress responses during fermentation of alkali-pretreated corn stover hydrolysate by an escherichia coli ethanologen. Appl Environl Microbiol 2012; 78:3442-57; PMID:22389370; http://dx.doi.org/10.1128/AEM.07329-11
  • Keating DH, Zhang Y, Ong IM, McIlwain S, Morales EH, Grass JA, Tremaine M, Bothfeld W, Higbee A, Ulbrich A, et al. Aromatic inhibitors derived from ammonia-pretreated lignocellulose hinder bacterial ethanologenesis by activating regulatory circuits controlling inhibitor efflux and detoxification. Front Microbiol 2014; 5:402; PMID:25177315; http://dx.doi.org/10.3389/fmicb.2014.00402
  • Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformat 2011; 12:323; PMID:21816040; http://dx.doi.org/10.1186/1471-2105-12-323
  • Edgar R, Domrachev M, Lash AE. Gene Expression omnibus: NCBI gene expression and hybridization array data repository. Nucl Acid Res 2002; 30:207-10; PMID:11752295; http://dx.doi.org/10.1093/nar/30.1.207
  • Braatsch S, Moskvin OV, Klug G, Gomelsky M. Responses of the rhodobacter sphaeroides transcriptome to blue light under semiaerobic conditions. J Bacteriol 2004; 186:7726-35; PMID:15516587; http://dx.doi.org/10.1128/JB.186.22.7726-7735.2004
  • Moskvin OV, Gomelsky L, Gomelsky M. Transcriptome analysis of the rhodobacter sphaeroides PpsR regulon: PpsR as a master regulator of photosystem development. J Bacteriol 2005; 187:2148-56; PMID:15743963; http://dx.doi.org/10.1128/JB.187.6.2148-2156.2005
  • Zeller T, Moskvin OV, Li K, Klug G, Gomelsky M. Transcriptome and physiological responses to hydrogen peroxide of the facultatively phototrophic bacterium rhodobacter sphaeroides. J Bacteriol 2005; 187:7232-42; PMID:16237007; http://dx.doi.org/10.1128/JB.187.21.7232-7242.2005
  • Moskvin OV, Kaplan S, Gilles-Gonzalez MA, Gomelsky M. Novel heme-based oxygen sensor with a revealing evolutionary history. J Biol Chem 2007; 282:28740-8; PMID:17660296; http://dx.doi.org/10.1074/jbc.M703261200
  • Zeller T, Mraheil MA, Moskvin OV, Li K, Gomelsky M, Klug G. Regulation of hydrogen peroxide-dependent gene expression in rhodobacter sphaeroides: regulatory functions of OxyR. J Bacteriol 2007; 189:3784-92; PMID:17351037; http://dx.doi.org/10.1128/JB.01795-06
  • Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 2014; 15:R29; PMID:24485249; http://dx.doi.org/10.1186/gb-2014-15-2-r29
  • Wang C, Gong B, Bushel PR, Thierry-Mieg J, Thierry-Mieg D, Xu J, Fang H, Hong H, Shen J, Su Z, et al. The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat Biotechnol 2014; 32:926-32; PMID:25150839; http://dx.doi.org/10.1038/nbt.3001
  • Fredriksson A, Nystrom T. Conditional and replicative senescence in escherichia coli. Curr Opin Microbiol 2006; 9:612-8; PMID:17067847; http://dx.doi.org/10.1016/j.mib.2006.10.010
  • Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30:2114-20; PMID:24695404; http://dx.doi.org/10.1093/bioinformatics/btu170
  • Li H, Durbin R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 2009; 25:1754-60; PMID:19451168; http://dx.doi.org/10.1093/bioinformatics/btp324
  • Anders SP, Pyl PT, Huber W. HTSeq - A python framework to work with high-throughput sequencing data. BioRxiv 2014; 31(2):166-9; PMID:25260700; http://dx.doi.org/10.1093/bioinformatics/btu638
  • Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009; 10:R25; PMID:19261174; http://dx.doi.org/10.1186/gb-2009-10-3-r25
  • Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C. EBSeq: an empirical bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 2013; 29:1035-43; PMID:23428641; http://dx.doi.org/10.1093/bioinformatics/btt087
  • Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010; 11:R25; PMID:20196867; http://dx.doi.org/10.1186/gb-2010-11-3-r25
  • Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010; 26:139-40; PMID:19910308; http://dx.doi.org/10.1093/bioinformatics/btp616
  • Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol 2010; 11:R106; PMID:20979621; http://dx.doi.org/10.1186/gb-2010-11-10-r106
  • Tenenbaum D. KEGGREST: client-side REST access to KEGG. R package version 141. 2013
  • Keseler IM, Mackie A, Peralta-Gil M, Santos-Zavaleta A, Gama-Castro S, Bonavides-Martinez C, Fulcher C, Huerta AM, Kothari A, Krummenacker M, et al. EcoCyc: fusing model organism databases with systems biology. Nucl Res 2013; 41:D605-12; PMID:23143106; http://dx.doi.org/10.1093/nar/gks1027
  • Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 2010; 11:R14; PMID:20132535; http://dx.doi.org/10.1186/gb-2010-11-2-r14
  • Fisher RA. Statistical methods for research workers. Edinburgh: Oliver Boyd 1932.
  • Varemo L, Nielsen J, Nookaew I. Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining statistical hypotheses and methods. Nucl Acid Res 2013; 41:4378-91; PMID:23444143; http://dx.doi.org/10.1093/nar/gkt111