Research Articles

A spatial uncertainty metric for anthropogenic CO2 emissions

Pages 139-160 | Received 03 Jun 2014, Accepted 17 Dec 2014, Published online: 28 Jan 2015

Abstract

Large point sources account for as much as 60% of the anthropogenic carbon dioxide emissions for some countries. Because CO2 emissions are seldom measured directly, but are generally estimated from other data, we need to understand the uncertainty of these emissions estimates. Simply stated, for any given geographical and temporal location, we would like to quantify the emissions and the associated uncertainty with as fine a resolution as possible. While US data on point sources are largely assumed to be among the best available globally, the reported locations of these sources, based on the data set used in this analysis, are estimated to differ by 0.84 km on average from their actual locations. This paper presents a metric to quantify spatial uncertainty in point sources and explains why the uncertainty in point source data cannot be described with traditional methods. A Monte Carlo simulation is used to calculate expected emissions values for each point source and the associated spatial uncertainty is derived from these expected values. The uncertainty metric can be used to define and calibrate appropriate levels of resolution for regions with more or less reliable data sets. Gridded data are output to be incorporated into other data products reporting spatially explicit emissions estimates.

1. Introduction

For reasons ranging from understanding important human and biogeochemical processes to monitoring, reporting, and verifying international agreements, there is great interest in describing both the magnitude and the distribution of anthropogenic CO2 emissions around the globe. Because CO2 emissions are seldom measured directly, but are generally estimated from related, proxy, and re-purposed data, it is also important to understand the uncertainty of these emissions estimates. High spatial resolution is required to distinguish and understand human and biological processes in the global carbon cycle. As Sarmiento et al. (Citation2010) note, understanding and predicting how the terrestrial vegetation will behave in the future is ‘among the enduring problems in carbon cycle research’ and is constrained by our data on the uncertainty in fossil-fuel emissions. High-resolution observations help to constrain emissions estimates and resolve emissions activities. Also, attribution of responsibility, and the design and monitoring of mitigation activities all depend on understanding the magnitude and distribution of emissions. Concern with the spatial distribution of emissions is additionally driven by the desire to use ground-level data of emissions to evaluate satellite measurements, potentially enabling the satellites to remotely determine locations of sources and the magnitude of gas being emitted; and emissions estimates are needed at the same scale as satellite measurements. Can remote sensing methods be used to verify existing sources of emissions or to recognize and accurately attribute new, changed, or unreported sources? Inverse models of the sources, sinks, and transport of CO2 require high-resolution data on emissions. Before emissions data can be used for such applications, it is necessary to have an understanding of their accuracy. 
There is, of course, great interest in knowing both the spatial and the temporal distributions of emissions (see, for example, Ciais et al., Citation2013; Nassar et al., Citation2013; Peylin et al., Citation2011), but this analysis is focused on the spatial distribution. Although different applications have different needs for spatial and temporal resolution, Ciais et al. (Citation2013) suggest that we would like to eventually have emissions data at resolutions of 1 km and 1 h. The utility of high-resolution data will depend on their uncertainty.

The challenge is to understand where and when emissions are taking place, where and when gases are taken out of the atmosphere, and the dynamics of the flow between them. At each step, quantifying the uncertainty in the estimates allows us to gauge our understanding of the processes taking place. Nassar et al. (Citation2013) point out that fossil-fuel emissions are often treated as though they have zero uncertainty, but that the result of this approach is that ‘errors in the fossil-fuel emissions are hidden and instead cause systematic errors in the biospheric and oceanic CO2 flux estimates’. Uncertainty in this sense refers not only to the confidence in the estimate of the total emissions being produced, but also to the confidence in the locations where these processes occur. The latter is a quantity that has remained largely unaddressed in the literature and is of high concern due to substantial reporting discrepancies in the data that are available. The analysis here focuses on the locational uncertainty in anthropogenic emissions, particularly emissions from large point sources, using data of annual sums of carbon dioxide emissions from electric power generation facilities in the USA. While locational uncertainty is relatively small for large point sources of emissions in the USA, this uncertainty is nonetheless important at high spatial resolution and the US data provide a test bed on how to treat the uncertainty in other countries where preliminary analyses show that the uncertainty is much larger and the fraction of exactly correct locations smaller.

We classify anthropogenic point sources as human-caused, localized, stationary sources of emissions such as those from combustion of coal, biomass, natural gas, oil, and other fuels. The methodology presented here applies to all point sources; however, it is particularly relevant for large point sources, which are orders of magnitude above background emissions levels. Due to their localized nature and extremely high levels of emissions compared to other sources, errors in their location make a larger impact on overall emissions totals than does spatial error in reporting of other types of data such as traffic, homes, small businesses, or agricultural emissions estimates.

We focus our analysis on the USA, where the data sources are widely accepted as reliable and accurate. The methods developed here can then be applied more broadly where data may not be as consistently and carefully assessed. We recognize in this analysis that the US data are regularly revised and improved. We have chosen to capture a snapshot of this data for the analysis, recognizing that by the time this work is published, the data may have undergone revisions that improve the accuracy of the data. The focus of the paper is on the methodology and we realize that we are using data in a manner not anticipated by the data set producers.

This paper takes US data on the locations of large point sources of CO2 emissions and examines the uncertainty that characterizes the locational data. We then show that when emissions data are gridded, the uncertainty in the locations of the large point sources becomes very important in estimating the emissions from specific grid spaces. Because a large point source either is or is not in a specific grid space, the uncertainty for each grid space depends on the probability that a large point source is actually in the grid space. We show that traditional statistical measures of uncertainty are not useful for these binary data and we propose a new statistic that provides a quantitative measure of the uncertainty in each grid space for emissions data when there is a quantified potential for locational error in large point sources of emissions. We finish by demonstrating that for spatially explicit data, there is thus a relationship between resolution and uncertainty and that this can be quantified.

1.1. CO2 emissions from large point sources

Anthropogenic large point source emissions comprise a significant portion of total carbon dioxide emissions worldwide and much of these emissions come from a small number of very large sources (Singer et al., Citation2014). In the USA, they represent 53% of anthropogenic CO2 emissions (EPA, Citation2014), with a third of these emissions coming from only 311 very large point sources (EPA, Citation2013), emphasizing the significant impact of large point source data on the nation's total carbon dioxide output to the atmosphere. In any effort to characterize the spatial distribution of CO2 emissions, it is therefore important to accurately report both the magnitude and location of large point sources and to understand any unavoidable uncertainty so that it can be fairly quantified.

The Emissions and Generation Resource Integrated Database (eGRID) (EPA, Citation2014) is a comprehensive inventory of environmental attributes of electrical power systems in the USA produced by the Environmental Protection Agency (EPA). eGRID integrates many different data sources on power plants and power companies from four different agencies (the EPA, the Energy Information Administration (EIA), the North American Electric Reliability Corporation, and the Federal Energy Regulatory Commission) to produce a detailed emissions and resource profile. The reported point sources of emissions from eGRID include 3005 carbon dioxide emitting power generation facilities across the USA (Figure 1), with annual emissions sums given in metric tonnes of CO2. Because it includes most of the largest point sources in the USA and because it provides a model that can be enlarged globally, eGRID data (EPA, Citation2014) were used as an example for most of the analysis presented in this paper, though the methodology extends to other data sets of this type.

Figure 1. Large point sources of CO2 emissions in the USA in 2009 as reported by eGRID (EPA, Citation2014).


Most data sets reporting large point source emissions, including eGRID, are intended for use in reporting carbon emissions totals at various political levels, as well as providing detailed categorical information on each point source. To this end, plant locations emphasize geopolitical data and not necessarily the exact point of gaseous discharge. Spatial locations of the power plants have been self-reported by the facilities themselves and are generally allocated by default to the centroid of a county if street address or latitude and longitude coordinates were not given.

1.2. Uncertainty

Data inconsistencies and other errors affect the accuracy of a reported quantity. As a consequence, reported quantities are provided with a range of values (usually expressed as v ± x with y% probability) that suggests the probability that the true value lies in the interval around the reported value. This range of probable values where the true value occurs reflects the uncertainty of the reported value. The level of uncertainty in any particular value might vary by many orders of magnitude. This uncertainty might originate from many sources: a lack of information, disagreement over given data, measurement error, inherent variability, approximations, subjective judgements, or numerous other factors. Understanding and quantifying sources of uncertainty is essential to give a proper reflection of the range in which the true value could potentially fall (Schneider & Kuntz-Duriseti, Citation2002).

In considering emissions estimates of carbon dioxide, there are four important sources of uncertainty: (a) the magnitude of emissions from areal sources, (b) the spatial uncertainty in areal sources, (c) the magnitude of uncertainty associated with point sources, and (d) the spatial uncertainty in point sources. The last element is unique because of its binary character and it is the focus of this paper. While important, the spatial uncertainty from areal sources has smoother characteristics and can be ably handled with standard statistical methods and is therefore not specifically addressed here. There are also uncertainties associated with the calculation or measurement of emissions from the stacks of power generation sites or other facilities. Calculations cannot take into account every factor for each individual operation, resulting in approximate values, and the devices to measure emissions are limited by accuracy and precision, leaving uncertainty in the results.

For point sources, we have the binary condition that the source either is or is not in the space under consideration, and for the larger point sources, small locational errors can result in very large differences in the estimated emissions for two spaces – the space where the facility is reported and the space where it actually exists. The importance of discrepancies in spatial locations increases at finer spatial resolutions. Locational errors with large point source data arise due mostly to their self-reported nature and because the data are often being re-purposed from other applications. There are instances of lack of information where a power plant may not report any location at all and the data compilers will place the point source at a default location such as the centre of the political unit (county or city) in which it is known to be. This case results in large uncertainty for CO2 emissions as the point source could theoretically be anywhere within that political unit. In other cases, there is simply a lack of precision in the emissions data as power plants and other facilities do not always report the coordinates of point of release of the emissions, but instead an in-town office or street address.

An analysis of 500 randomly sampled points from eGRID (detailed in Section 2.1) found that these point sources are reported an average of 0.84 km away from their actual locations. A similar analysis of the top 81 emitters in eGRID, accounting for a full 30% of the CO2 emissions reported in the data set, reveals that even these hugely significant point sources have considerable uncertainty in representing the location of actual emissions. The geographical coordinates of emissions from these largest sources are incorrect 70% of the time, while 60% differ from their actual location by more than 1 km. The mean difference in reported and actual locations for these largest of point sources was 7.94 km, compared to the 0.84 km difference found in the random sample. We conclude that the discrepancies are not limited to small obscure power generators, but are a more widespread issue. Figure 2 shows the distribution of displacement values both for the sample of the 81 largest CO2 sources and for a statistically significant sample of 500 randomly chosen sources.

Figure 2. Histogram of the distance from the reported locations to the actual locations for (a) the top 81 emitters in the eGrid data set, and (b) a random sample of 500 emitters from the eGrid data set.


While large differences are clearly an issue, misallocation can still occur with very good data depending on the resolution of gridded data and the location of a point source within a given grid cell. The statistical frequency with which this misallocation could be expected can be suggested from basic geometry. Consider for a moment the implications of a 1.5 km difference between a reported and an actual location. Based on random placement in a grid cell of size 0.1 × 0.1 degrees (about 11 × 11 km at mid-latitude in the USA), we would expect 25.4% of power plants to be within 1.5 km of a cell boundary. Evenly distributed, this would represent slightly more than 165 million tonnes of CO2, greater than the emissions of the entire state of New York (EIA, Citation2014). This would put 20 of the top emitters in this border region. Since more than half of the top 81 emitters are located farther than 1.5 km from the actual discharge, the chance that one or more of them are placed in an incorrect grid cell is non-trivial. While 20 might seem like a small number and not all of the 20 would likely be placed in the incorrect grid cell, these would be among the largest emitters in the country (imagine the emissions from more than a million cars), accounting for a large fraction of the emissions from the region where they are located. Placing a plant in the wrong grid cell amounts to placing it 11 km away from its actual location (based on the centre, or reference point, of the grid cell). As many recent discussions have focused on reducing the operating scale of models and reporting, understanding the magnitude of the uncertainty is an important factor to consider in these decisions. It is also a widely held view that the US data are better than most, making these calculations a conservative case study for the rest of the world.
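The geometric estimate above can be checked with a short calculation. The sketch below assumes a square cell of side L and a uniformly random position inside it, and computes the fraction of the cell area lying within a margin d of any edge, 1 − (1 − 2d/L)². The function name `edge_fraction` and this reading of "within a distance of a cell boundary" are assumptions made here for illustration, not taken from the analysis itself.

```python
def edge_fraction(cell_km: float, margin_km: float) -> float:
    """Fraction of a square cell's area lying within margin_km of any edge.

    Assumes a square cell and a uniformly distributed position inside it.
    """
    if not 0.0 <= 2.0 * margin_km <= cell_km:
        raise ValueError("margin must be between 0 and half the cell size")
    # Side of the inner square that is farther than margin_km from every edge,
    # expressed as a fraction of the full cell side.
    inner = (cell_km - 2.0 * margin_km) / cell_km
    return 1.0 - inner ** 2


if __name__ == "__main__":
    for margin in (0.75, 1.5):
        print(f"margin {margin} km: {edge_fraction(11.0, margin):.1%}")
```

Under these assumptions, an 11 km cell with a margin of 0.75 km (a 1.5 km band centred on each cell boundary) gives about 25.4%, while a margin of 1.5 km on each side of the boundary gives about 47.1%.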

As an extreme example, a power plant found in South Africa (Figure 3) lies at almost exactly the corner of four grid cells, so that the actual stacks are in a separate grid cell from the main offices, which are in a different grid cell than the street address. Thus, depending on which is reported by the facility, the full value of that plant's emissions might be allocated to any of three different grid cells. This is a rare case, but it highlights the potential issues even for very accurate data sets.

Figure 3. The Majuba Power Station in South Africa (27°06′02″S; 29°46′17″E), shown with the Ozone Monitoring Instrument data, which potentially places the emissions in a different grid cell than the actual emissions stacks.


It therefore becomes relevant to determine the location of point sources with as much accuracy as possible, but even when the location is known to be within a few hundred metres, it is still crucial to have a means of reporting the uncertainty associated with each location so that other factors, such as issues of grid placement and resolution, can be properly accounted for.

However, there is currently no established methodology to deal with the spatial uncertainty of large point sources. In previous analyses, spatial uncertainty in the USA has largely been assumed to be zero, and globally it has remained unaddressed (Oda & Maksyutov, Citation2011), despite having significant influence on carbon accounting and policy decisions.

The following sections develop a method for comparing emissions data sets and evaluating their associated uncertainty. Within a single data set, the spatial uncertainty can be quantified and ways to reduce that uncertainty are discussed. The uncertainty in the total emissions value, hereafter referred to as magnitude uncertainty, is not part of this analysis. With two different types of uncertainty for each large data point, however, we do address effective means of reporting total uncertainty for various emissions. While atmospheric flux inversion models may be able to calculate resulting model uncertainties using these two uncertainty values separately, with their associated location or emissions value, the two separate numbers are difficult to present on a map and fail to give a clearly understandable picture of confidence in the data. Thus, a combined uncertainty measure has been developed to allow the reporting of a single value that describes the uncertainty in the data at each location based on uncertainty in both the emissions total and reported location.

2. Methods

2.1. Calculating spatial uncertainty

In order to characterize the spatial uncertainty in eGRID-reported locations, a sample of 500 random points was selected from the data set. Using Google Earth satellite imagery, the reported location of each point was found and the surrounding area was searched to visually identify the actual location of the power plant stacks. Where none were immediately apparent, common locations were targeted as first search areas and then verified with addresses and company information. These included landfill sites, outskirts of small towns, bodies of water, and rail lines. The actual location, once found and verified, was recorded, and the difference between the actual and original values was determined as a linear distance. The sample mean of the separation distance was 0.84 km, and this value was then used to derive the parameters used in further simulations.

As a beginning point, we assume that the actual locations are normally distributed about the reported location with radial symmetry (no angular preference). This is a simplification; the distribution is more complex than a single normal distribution. This is not readily apparent from the distance histograms in Figure 2 until the general shape of the histograms is compared with the shape of the condensed density plot in Figure 4(b), which shows the distance from the mean for a normal distribution. However, we feel that using the normal distribution is an appropriate place to begin because: (1) the methods we outline here apply regardless of the form of the distribution and (2) preliminary characterizations of subcategories within the point source data according to characteristics such as proximity to large water sources, centroids of counties and city centres, and rail lines indicate behaviour that appears more normal.

Figure 4. The radially symmetric, bivariate, normal distribution PD resulting from latitude = 46 and longitude = 90, and a mean distance of 0.84 km for the distance from the mean. (a) shows the full 2D distribution, while (b) shows the corresponding 1D distribution of distances from the mean. This is effectively a collapsed, accumulated view (not a slice) of the distribution in the radial direction.


The value of the mean distance (0.84 km) is a one-dimensional characterization of a two-dimensional distribution that characterizes the probable location of a point source relative to the reported location. In order to derive the two-dimensional distribution, we construct the correlation matrix using the following steps:

  1. For the reported latitude and longitude, we calculate the number of kilometres per degree in each direction according to standard conversions.

where a = 6378 km (equatorial radius) and e² = 0.00669438 (square of the eccentricity) are constants.

  2. For the diagonal elements of the matrix, we divide the mean distance in kilometres by the number of kilometres per degree in each direction and multiply by the square root of the dispersion coefficient (correcting for the increased circumference with increased radius).

Because of symmetry, the two diagonal elements would be the same except that the number of kilometres per degree typically differs between the two directions as a function of latitude.

  3. The off-diagonal elements are set to 0, assuming that the distribution is radially symmetric (no angular preference).

  4. If we use the example where latitude = 46°, longitude = 90°, and mean distance = 0.84 km, we get the correlation matrix,

  5. The corresponding bivariate normal distribution is produced by inserting the matrix constants with mean values for the distribution set to the reported location of the point source. In the case with the values above, the distribution PD is

These calculations take the easy-to-measure distance errors and map them to the more complicated ellipsoidal shape of the earth. The resulting radially symmetric, bivariate normal distribution then has the characteristic that the mean distance of points in the distribution from the mean of the distribution is 0.84 km, but functionally in units of degrees. This distribution can be tested by evaluating the mean value of the distance function (from the mean values of the distribution) with the PD as the weighting function,
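The construction described in the steps above can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it takes the "standard conversions" to be the usual meridional and prime-vertical radii of curvature of the ellipsoid, and it reads the "dispersion coefficient" correction as the Rayleigh relation between the mean radial distance and the per-axis standard deviation (sigma = mean × √(2/π)). All function names are hypothetical. The last function checks, by sampling, that the mean distance from the reported location comes back close to the input value.

```python
import math
import random

A_KM = 6378.0      # equatorial radius in km, as given in the text
E2 = 0.00669438    # square of the eccentricity, as given in the text


def km_per_degree(lat_deg: float) -> tuple[float, float]:
    """Return (km per degree of latitude, km per degree of longitude)."""
    phi = math.radians(lat_deg)
    w = 1.0 - E2 * math.sin(phi) ** 2
    meridional = A_KM * (1.0 - E2) / w ** 1.5   # radius along a meridian
    prime_vertical = A_KM / math.sqrt(w)        # radius across the meridian
    one_degree = math.radians(1.0)
    return one_degree * meridional, one_degree * prime_vertical * math.cos(phi)


def sigma_degrees(lat_deg: float, mean_dist_km: float) -> tuple[float, float]:
    """Per-axis standard deviations in degrees (latitude, longitude).

    Assumption: radial distances follow a Rayleigh distribution, so the
    per-axis sigma in km is mean_dist_km * sqrt(2 / pi).
    """
    sigma_km = mean_dist_km * math.sqrt(2.0 / math.pi)
    km_lat, km_lon = km_per_degree(lat_deg)
    return sigma_km / km_lat, sigma_km / km_lon


def sampled_mean_distance_km(lat_deg: float, mean_dist_km: float,
                             n: int = 100_000, seed: int = 0) -> float:
    """Sample offsets in degrees and return their mean length in km."""
    rng = random.Random(seed)
    s_lat, s_lon = sigma_degrees(lat_deg, mean_dist_km)
    km_lat, km_lon = km_per_degree(lat_deg)
    total = 0.0
    for _ in range(n):
        dy = rng.gauss(0.0, s_lat) * km_lat   # north-south offset in km
        dx = rng.gauss(0.0, s_lon) * km_lon   # east-west offset in km
        total += math.hypot(dx, dy)
    return total / n


if __name__ == "__main__":
    print(km_per_degree(46.0))
    print(sigma_degrees(46.0, 0.84))
    print(sampled_mean_distance_km(46.0, 0.84))
```

Because the off-diagonal elements are zero, the two axes can be sampled independently; whether the matrix entries are meant as standard deviations or variances is left open here, and the sketch works with standard deviations throughout.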

2.2. Monte Carlo simulation

The calculated spatial uncertainty for a data set provides a basis for investigating the confidence in the reported emissions values, but as data are typically aggregated into gridded formats, it is necessary to incorporate the dependence on grid resolution and the location of a point source within a grid cell into an uncertainty metric. In order to take these factors into consideration, a Monte Carlo simulation was used. The inputs to this simulation are the reported location and magnitude of emissions of a single point source and the calculated spatial uncertainty from the data set associated with that point source (0.84 km for our 500 point sample of eGRID) which was used to compute the correlation matrix for the distribution used in the simulation. The Monte Carlo simulation is then used to determine the proportion of the time the emissions would fall in the original or surrounding cells. The simulation computes average emissions values for each grid cell, adjusting these based on geographical and other characteristics to reduce and refine their associated uncertainty, and from these values, a final spatial uncertainty measure is calculated.

2.2.1. Computational algorithm

The Monte Carlo simulation takes as input the reported emissions value for a single power plant with magnitude M and the spatial uncertainty for that power plant, as well as its geographical coordinates (the calculated coordinates are in linear dimensions of kilometres, but the grid system is in degrees). The corresponding bivariate, radially symmetric, normal distribution is then used to generate 10,000 sample points placed on a grid of cells (for example, 0.1 × 0.1 degree) surrounding the cell in which the power plant was reported. A single run of the simulation is then a grid of values produced by one sample point. These runs are subsequently summed and divided by the number of sample points to give the simulated mean for each cell. In the calculations that follow, the grid coordinates are defined at the corner of the grid cell, but the calculations also have been done for grids defined by the centre of the grid cell. This makes the resulting data product compatible with multiple gridded emissions data sets for ease of use. If the reported emissions value for a given grid cell (i,j) is given by xij, then the simulated mean is computed through the simulation by

where N is the total number of simulation runs, n indexes each individual run, and sijn is the emissions value of cell (i,j) for run n.
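Written out from the definitions given here, the simulated mean is the average of the per-run cell values. The bar notation for the simulated mean is an assumption adopted for this sketch. Since each run is produced by a single sample point, the run places the full magnitude M in exactly one cell, so the sum reduces to M times the fraction of runs that land in cell (i,j).

```latex
\bar{x}_{ij} = \frac{1}{N}\sum_{n=1}^{N} s_{ijn},
\qquad
s_{ijn} =
\begin{cases}
M & \text{if the sample point of run } n \text{ falls in cell } (i,j),\\
0 & \text{otherwise.}
\end{cases}
```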

Points which fall outside of the county in which the original point source was reported are excluded, decreasing slightly the number of runs counted, because our analysis of eGRID suggested that the point sources were almost always placed within the correct political jurisdiction even when the latitude and longitude were uncertain. Additional restrictions are being implemented for a future version based on further analysis.

In summary form, our approach for the USA was to

  • generate a temporary grid around the reported location of each source,

  • generate sample points on the grid around the reported location,

  • exclude points falling outside of the reported county,

  • compute the simulated means by grid cell for all large point sources, and

  • calculate the uncertainty and combine with uncertainty values from other sources for every cell in the domain.
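The sampling and averaging steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the data product: the function name `simulated_means`, the `in_county` predicate standing in for the county boundary test, and the indexing of cells by their lower-left corner are all hypothetical. The location error is sampled independently in latitude and longitude, which corresponds to zero off-diagonal elements, and the standard deviations are supplied in degrees.

```python
import math
import random
from collections import defaultdict


def simulated_means(lat, lon, magnitude, sigma_lat, sigma_lon,
                    cell=0.1, n_samples=10_000, in_county=None, seed=0):
    """Spread one source's emissions over grid cells by Monte Carlo sampling.

    sigma_lat, sigma_lon: standard deviations of the location error (degrees).
    in_county: optional predicate (lat, lon) -> bool; samples failing it are
               excluded, which reduces the number of runs counted.
    Returns a dict {(row, col): simulated mean}, where row and col are the
    integer indices of the cell's lower-left corner in units of `cell`.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)
    kept = 0
    for _ in range(n_samples):
        s_lat = rng.gauss(lat, sigma_lat)
        s_lon = rng.gauss(lon, sigma_lon)
        if in_county is not None and not in_county(s_lat, s_lon):
            continue  # sample fell outside the reported county
        key = (math.floor(s_lat / cell), math.floor(s_lon / cell))
        counts[key] += 1
        kept += 1
    if kept == 0:
        raise ValueError("no sample points fell inside the permitted region")
    # Each run places the full magnitude in one cell, so the mean for a cell
    # is the magnitude times the share of counted runs landing in that cell.
    return {key: magnitude * c / kept for key, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical sources at 46 N, 90 W with per-axis sigmas in degrees.
    centre = simulated_means(46.05, -90.05, 100.0, 0.00603, 0.00865)
    corner = simulated_means(46.01, -90.01, 100.0, 0.00603, 0.00865)
    print(centre)
    print(corner)
```

Dividing by the number of counted runs, rather than by the number of samples drawn, keeps the total over all cells equal to the source magnitude after the county exclusion.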

The resultant grid of simulated means, as described above in the calculation, acts to distribute the original total carbon dioxide emissions from each point source over multiple surrounding cells based on the proportion of the sample points that fell in each cell, depending on the placement of the point source within the cell as well as the grid resolution. If any power plant is located near the centre of cell (i,j) at large enough resolution, all of the emissions will be allocated to cell (i,j) by the simulation because the probability that the point is actually in this reported cell is extremely high. However, for points near an edge of a grid space, the simulation redistributes the emissions to neighbouring cells to reflect the probability that the point might fall in an adjacent space based on its spatial uncertainty (Figure 5). Note that magnitude uncertainty is not incorporated into the simulated mean distribution and is computed separately.

The simulated means are then used to calculate a final spatial uncertainty measure of the reported data for each grid cell as well as a magnitude uncertainty, which is taken to be the simulated means multiplied by a measure of the magnitude uncertainty for the large point source (typically expressed as a percentage of the emissions total). The spatial uncertainty calculations are discussed in detail in the following section. It should also be noted that any single grid space may contain multiple large point sources, in which case, the simulated means are combined linearly, while the spatial uncertainties can be added in quadrature.
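The combination rules in this paragraph can be illustrated with a short sketch. The function names are hypothetical, and the code covers only what is stated here: the magnitude uncertainty as the simulated mean multiplied by a fractional magnitude uncertainty, simulated means summed linearly, and spatial uncertainties added in quadrature. How magnitude uncertainties from several sources in one cell should be combined is not specified, so it is left out.

```python
import math


def magnitude_uncertainty(simulated_mean: float, fraction: float) -> float:
    """Magnitude uncertainty for one cell: simulated mean times a fraction.

    `fraction` is the magnitude uncertainty of the source expressed as a
    fraction of its emissions total (for example 0.05 for 5%).
    """
    return simulated_mean * fraction


def combine_sources(means, spatial_uncertainties):
    """Combine several point sources that contribute to the same grid cell.

    Simulated means add linearly; spatial uncertainties add in quadrature.
    Returns (total simulated mean, combined spatial uncertainty).
    """
    total_mean = sum(means)
    combined = math.sqrt(sum(u * u for u in spatial_uncertainties))
    return total_mean, combined


if __name__ == "__main__":
    print(magnitude_uncertainty(80.0, 0.05))
    print(combine_sources([60.0, 40.0], [3.0, 4.0]))
```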

  • Inputs: calculated spatial uncertainty, point source magnitude (M), and reported location.

  • Simulated mean output: simulated mean emissions for each grid cell.

  • Uncertainty output: measures of spatial and magnitude uncertainty in reported values.

As a simplified example, consider two test points, one in the centre of a 0.1 × 0.1 degree grid cell, and the other near the corner, both with total emissions M = 100. Note that, for simplicity, we round the results to two decimal places for this paper; however, the full precision is used in the final data product. The first test point produces simulated means that place essentially all of the emissions in the original location and essentially none elsewhere. Since the mean spatial uncertainty is less than the distance to any side of the cell, this is reasonable. The second point gives simulated means that distribute the emissions into three neighbouring grid cells (Figure 5). This source lies a little over 1 km from one edge and a little over 1.5 km from the other. Since only spatial uncertainty is incorporated into the simulation, the total emissions of all cells should remain the same.

Figure 5. The simulated means that would be output by the Monte Carlo simulation for each test point in tonnes of CO2.


3. Statistical metric

To investigate in more depth, consider a single point source as described in the previous example with emissions of M = 100 tonnes of CO2, located near the corner of a grid cell as shown in Figure 6(a). We will call this centre grid cell (i,j).

Figure 6. Example calculation of simulated means for a sample point on a 0.1 × 0.1 degree grid. (a) A single source located near the corner of a grid cell. (b) The single point source from (a) shown with an example emissions value of 100 tonnes of CO2 and neighbouring grid cells, all with no emissions. (c) The simulated means of the emissions from the indicated point source based on the Monte Carlo simulation methodology described above.

A neighbourhood of grid cells with the reported emissions for a point source in cell (i,j) would then look like Figure 6(b), where the exact point location of the emissions is included in the figure for reference. As outlined above, a Monte Carlo simulation determines the simulated means of the emissions.

The result of the simulation is a grid of simulated means based on the provided spatial statistics. Intuitively, each simulated mean can be thought of as the probability that the emissions occur in a particular grid cell multiplied by the quantity of emissions. The result is shown in Figure 6(c). For reporting purposes, we will retain both the original reported values and the simulated means and construct a measure of uncertainty from those values.

3.1. Interval estimation

We would like to describe the uncertainty in the reported value of emissions in each grid cell in a style suggestive of a 95% probability interval. There are some sensitive issues related to reporting the uncertainty, and we propose a model for discussion. To help get a good handle on this basic issue, we refer to our sample case again. What does a 95% probability interval tell us? It tells us the interval for which there is at least a 95% probability that the true value lies within it. For our case, there is a 78.51% probability that the source actually lies in the centre grid cell, (i,j). This calculation is simplified since we used 100 as our total emissions, but we generalize below for an emissions magnitude M.

For the grid cell (i,j) where the emissions were reported, we can compute this percentage as follows. Since the emissions all must occur in the same location (it is, after all, a single point source), the values in the grid cells can be converted to the percentage of the emissions in the grid cell by dividing the simulated mean in that cell, $\bar{x}_{ij}$, by the magnitude of the emissions from the power plant, M. Then we assume that the percentage of emissions in (i,j) corresponds to the likelihood of the point source being there. So,

\[ P_{ij} = \frac{\bar{x}_{ij}}{M} \times 100\%. \]

The result gives a value of 78.51, indicating that the probability that the source is located in the reported grid cell falls short of 95%. If the source is actually located in another grid cell, the emissions in the central cell would be 0 since there is no source there, and this would be expected to occur 21.49% of the time. So, in order to create an interval in which there is a 95% or higher probability that the actual emissions lie in the interval, we must include 0. The uncertainty (the plus/minus value) must therefore be 100, making the reported value and associated uncertainty 100 ± 100. This range is deceptive, however, since the emissions value can only ever be 100 (the full value of the emissions) or 0, and could not fall anywhere in between.

An analogous calculation can be made for the cells in which the reported emissions value was zero, but the simulated means were non-zero. In our example, these are cells (i + 1,j), (i,j − 1), and (i + 1,j − 1). Since the reported value is zero, we need an interval around zero, which means that the probability that the reported value, 0, is correct, is one minus the fraction of the simulated means in that cell as shown:

\[ P_{ij} = \left(1 - \frac{\bar{x}_{ij}}{M}\right) \times 100\%, \]

where $\bar{x}_{ij}$ is the simulated mean in the cell and M is the magnitude of the emissions.

For cells (i + 1,j) and (i,j − 1), which had simulated means of 14.81 and 5.62, respectively, the per cent probabilities would be 85.19 and 94.38, so in both cases the probability that the emissions do not lie in the cell is below 95%. To allow an interval in which there is a 95% probability that the true value lies in that interval, we must again expand the interval to include the value 100. Cell (i + 1,j − 1), on the other hand, would have a probability of 98.94 and the 95% probability interval would simply be zero, since there is at least a 95% probability that the emissions in the cell are zero as reported. This produces the grid of 95% probability interval widths shown in Figure 7.
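The interval rule described above can be written as a short function. This is a sketch for the sample case only; the function name `interval_width` and the dictionary layout are chosen here for illustration.

```python
def interval_width(simulated_mean, reported, emissions, level=0.95):
    """Width of the probability interval around the reported cell value.

    The share of the simulated mean is read as the probability that the
    source lies in the cell. If the reported value (M or 0) is correct with
    probability >= level, the interval collapses to width 0; otherwise it
    must stretch across the whole range from 0 to M.
    """
    share = simulated_mean / emissions
    p_reported_correct = share if reported > 0 else 1.0 - share
    return 0.0 if p_reported_correct >= level else emissions


M = 100.0
# (simulated mean, reported value) for the four cells of the sample case
cells = {
    "(i, j)": (78.51, M),
    "(i+1, j)": (14.81, 0.0),
    "(i, j-1)": (5.62, 0.0),
    "(i+1, j-1)": (1.06, 0.0),
}
widths = {name: interval_width(mean, rep, M) for name, (mean, rep) in cells.items()}
print(widths)
```

Only the cell with a simulated mean of 1.06 clears the 95% level, so it is the only cell with a zero-width interval; every other cell receives the full width M.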

Figure 7. The 95% probability interval widths based on the example Monte Carlo simulation results.

Unfortunately, the results shown in Figure 7 are not particularly enlightening. With binary data such as these large point sources, the probability intervals will always be either 0 or the entire range from 0 to M. We should ask our statistical measure to give us more useful information.

3.2. Standard deviation as a measure of uncertainty

An alternative to the 95% probability interval is to use a standard deviation-based metric to reflect the level of uncertainty in the reported values. If we look at the standard deviation formula, we can rewrite it in a useful manner to aid in this computation.

Recall that the basic calculation for the standard deviation of a sample of N values $x_n$ with mean $\bar{x}$ is

\[ \sigma = \sqrt{\frac{1}{N-1}\sum_{n=1}^{N}\left(x_n - \bar{x}\right)^2}, \]

which for large N can be approximated by replacing N − 1 with N and rewritten as

\[ \sigma \approx \sqrt{\sum_{n=1}^{N}\frac{1}{N}\left(x_n - \bar{x}\right)^2}. \]

This second relation reminds us that this is basically an average where each element is given equal weight in the sum. Now if there were multiple entries with the same value, we might combine them and form a weighted sum according to their frequency or probability:

\[ \sigma = \sqrt{\sum_{k} p_k\left(x_k - \bar{x}\right)^2}, \tag{1} \]

where $p_k$ is the frequency with which the particular value $x_k$ occurs.
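A quick numerical check shows that grouping repeated values by frequency leaves the large-N standard deviation unchanged. The function names below are illustrative, and the sample of three runs at 100 and one run at 0 is made up for the demonstration.

```python
import math
from collections import Counter


def std_equal_weight(values):
    """Large-N standard deviation: every element has weight 1/N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def std_frequency_weighted(values):
    """Same quantity, summing over distinct values weighted by frequency."""
    n = len(values)
    mean = sum(values) / n
    freq = Counter(values)
    return math.sqrt(sum((count / n) * (v - mean) ** 2 for v, count in freq.items()))


sample = [100.0, 100.0, 100.0, 0.0]   # hypothetical runs: three hits, one miss
a = std_equal_weight(sample)
b = std_frequency_weighted(sample)
print(round(a, 2), round(b, 2))
```

With only two distinct values in the sample, the frequency-weighted sum has just two terms, which is what makes the form convenient for binary point-source outcomes.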

3.2.1. Uncertainty in the simulated mean

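To describe the standard deviation for each grid cell in the simulation, Equation (1) becomes

\[ \sigma_{ij} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(s_{ijn} - \bar{x}_{ij}\right)^2}, \]

where the emissions value at a grid cell for each individual run is given by $s_{ijn}$, n indexes each run, and $\bar{x}_{ij}$ is the simulated mean for the cell. For the simulation, the individual runs produce grid values that are only either M or 0, resulting in duplicate values, so we can let $p_{ij} = \bar{x}_{ij}/M$ be the proportion of the time that $s_{ijn} = M$, and $1 - p_{ij}$ be the proportion of the time that $s_{ijn} = 0$. This allows us to rewrite the formula as

\[ \sigma_{ij} = \sqrt{p_{ij}\left(M - \bar{x}_{ij}\right)^2 + \left(1 - p_{ij}\right)\left(0 - \bar{x}_{ij}\right)^2}. \]

Then, by substituting $\bar{x}_{ij} = p_{ij}M$, factoring out common terms inside the square root and simplifying, we get

\[ \sigma_{ij} = M\sqrt{p_{ij}\left(1 - p_{ij}\right)}, \]

which in our example, with M = 100, would be

\[ \sigma_{ij} = 100\sqrt{p_{ij}\left(1 - p_{ij}\right)}. \]

This then provides a measure of the uncertainty in the simulated mean calculations. However, the standard deviation for the simulated means has a maximum when $p_{ij} = 0.5$ (i.e. when there is a 50% chance that the point is actually in the grid space where it was reported) and diminishes to both sides. In the context of large point sources, the uncertainty in the reported cell should not decrease as the expected proportion of emissions falls below 0.5.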
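The shape of this standard deviation is easy to inspect numerically. For runs that return only M or 0, the standard deviation about the simulated mean reduces to M·sqrt(p(1 − p)), where p is the proportion of runs that place the source in the cell. The sketch below evaluates that closed form; the function name is illustrative.

```python
import math


def sd_of_simulated_mean(emissions, p):
    """Standard deviation of a run that equals `emissions` with probability p, else 0."""
    return emissions * math.sqrt(p * (1.0 - p))


for p in (0.1, 0.25, 0.5, 0.75, 0.7851, 0.9):
    print(p, round(sd_of_simulated_mean(100.0, p), 2))
```

The values at p = 0.25 and p = 0.75 coincide, so this quantity cannot distinguish a source that is probably in its reported cell from one that is probably elsewhere.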

Would this ever happen? There are two situations where it occurs. The first situation comes up when the simulated means are distributed over many cells as a consequence of a large spatial uncertainty. It might be argued that this should not really happen because it suggests that the grid spacing is much too small for the magnitude of the spatial error. However, this is exactly why it is important since a large number of very high uncertainties might indicate an inappropriate grid size.

The second case shows up if the reported value is near a corner with three other grid cells. No matter how small the spatial error might be, if you are very close to a corner, the proportion of emissions in each of the four grid cells will approach 0.25. That is, each of the four grid cells has roughly a 25% chance of actually containing the point source. While the first case can be avoided through the choice of grid size, the second is unavoidable and is simply a consequence of a semi-random distribution of point sources over grid cells.

If the proportion of emissions falling in each cell is 0.25, the uncertainty for the cell in which the source is reported should be larger than if the proportion is 0.5. This does not happen if we use the standard deviation of the simulated means. The reason is that this is the standard deviation of the simulated means and not a measure of the reported value and is therefore not an appropriate metric to describe the uncertainty we are interested in quantifying.

3.2.2. Point Spatial Uncertainty Measure (PSUM)

To address the deficiency of the standard deviation of the simulated means, we propose an alternate measure, which is similarly structured but looks at deviations from the reported value rather than deviations from the simulated mean. Again, we begin with the basic calculation for a standard deviation and apply this idea to the simulation, but instead of considering the simulated means, we look at the differences between the simulation values, the $s_{ij}$'s at each grid cell, and the reported value, $x_{ij}$, in that grid cell. We will also change notation somewhat for this metric so that instead of summing over the grid cells, we sum over the k different possible values in the output {0,M}. Thus, the set of possible simulated values is $s_k = \{0, M\}$, and $p_k$ is the set of proportions with which each difference $s_k - x_{ij}$ occurs. For a given grid cell (i,j), the spatial uncertainty metric can then be calculated from Equation (1) as

\[ \mathrm{PSUM}_{ij} = \sqrt{\sum_{k} p_k\left(s_k - x_{ij}\right)^2}. \]

The difference between the actual reported value in a cell and the set of possible simulated values will always be, in magnitude, either M or 0, and the proportions with which each of these differences occurs are related to the simulated mean for a given cell, $\bar{x}_{ij}$. This is because the simulated mean as a proportion of the full emissions value, $\bar{x}_{ij}/M$, gives the fraction of the time that the simulated value agrees with the reported value for the cell in which the reported value is M and gives the proportion of the time that the simulated value disagrees with the reported value for the cell in which the reported value is 0. Thus, the two proportions are $\bar{x}_{ij}/M$ for $s_k = M$ and $1 - \bar{x}_{ij}/M$ for $s_k = 0$. The PSUM is then computed separately for these two different cases.

Case 1: PSUM for $x_{ij} = 0$

For the grid cells with reported emissions values of zero, the point spatial uncertainty measure can be written as

\[ \mathrm{PSUM}_{ij} = \sqrt{\frac{\bar{x}_{ij}}{M}\left(M - 0\right)^2 + \left(1 - \frac{\bar{x}_{ij}}{M}\right)\left(0 - 0\right)^2} = M\sqrt{\frac{\bar{x}_{ij}}{M}}. \]

Case 2: PSUM for $x_{ij} = M$

In the second case, representing the grid cell where the reported emissions value is M, the PSUM becomes

\[ \mathrm{PSUM}_{ij} = \sqrt{\frac{\bar{x}_{ij}}{M}\left(M - M\right)^2 + \left(1 - \frac{\bar{x}_{ij}}{M}\right)\left(0 - M\right)^2} = M\sqrt{1 - \frac{\bar{x}_{ij}}{M}}. \]

Using these values, we can create a grid describing the reported value at each location and an associated uncertainty value. Standard practice might suggest that since the 95% probability interval is roughly twice the standard deviation, we should double these values in reflecting the level of uncertainty. However, the values computed from this method become large if doubled and outweigh the actual emissions values for frequency values below 0.75, so it is more appropriate to report the point spatial uncertainty measure alongside the reported emissions. The grid in Figure 8 shows the reported values in the sample case along with the uncertainty values.
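The measure can be sketched directly from its definition as the root of the frequency-weighted squared deviations of the two possible simulated values, 0 and M, from the reported value. The function name `psum` and the cell labels are chosen here for illustration; the simulated means are those of the sample case.

```python
import math


def psum(simulated_mean, reported, emissions):
    """Point spatial uncertainty: weighted RMS deviation from the reported value.

    A run yields `emissions` with probability simulated_mean / emissions and
    0 otherwise; deviations are taken from the reported cell value rather
    than from the simulated mean.
    """
    p_m = simulated_mean / emissions
    outcomes = ((emissions, p_m), (0.0, 1.0 - p_m))
    return math.sqrt(sum(p * (s - reported) ** 2 for s, p in outcomes))


M = 100.0
sample_case = {
    "(i, j)": (78.51, M),
    "(i+1, j)": (14.81, 0.0),
    "(i, j-1)": (5.62, 0.0),
    "(i+1, j-1)": (1.06, 0.0),
}
for name, (mean, rep) in sample_case.items():
    print(name, rep, round(psum(mean, rep, M), 2))

# Doubling the measure reaches the full emissions value at a frequency of 0.75.
print(2 * psum(75.0, M, M))
```

For the reported cell the result equals M·sqrt(1 − p) and for the other cells M·sqrt(p), with p the simulated mean divided by M, so the doubled value exceeds M exactly when p falls below 0.75 in the reported cell.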

Figure 8. The reported values in the sample case, along with the PSUMs reported underneath.

Since the calculation of the PSUM depends entirely on the probability that emissions occur in a grid cell, a graph can be created to provide an idea of what these values would be over a range of probabilities. In Figure 9, the values are calculated based on a single point source with emissions of 100 tonnes of CO2. Because the PSUM is proportional to the emissions magnitude, the uncertainty values can be scaled to other emissions totals by multiplying by M/100.

Figure 9. PSUM uncertainty values shown decreasing with increasing simulated means, $\bar{x}_{ij}$, for the grid cell in which the emissions were reported. Note that for cells with non-zero simulated means in which the emissions were not reported, the PSUM measure would increase with increasing $\bar{x}_{ij}$ values. Because these calculations are based on an emissions quantity of 100 tonnes of CO2, the PSUM values are equivalent to a percentage uncertainty of the reported value.

In summary, the PSUM provides a quantitative measure of uncertainty for emissions in each grid cell when emissions come from large point sources and there is enough spatial uncertainty to leave doubt about which grid cell the sources actually lie in. PSUM would replace traditional statistical measures of uncertainty that do not provide useful insights for point source data. Measures for point sources can be combined with those for other point sources or those from areal sources. The proposed PSUM statistic: (1) provides a clear, quantitative measure of uncertainty for gridded data, (2) could be easily implemented in a computerized geographic information system, and (3) requires input of only the latitude, longitude, and magnitude of reported emissions sources and a measure of the average locational uncertainty. It can easily be implemented with different values for the uncertainty in different geographical regions of a given data set or for different characterizing aspects of individual point sources, a capability that we envision will be required when the US data examined here are carried to a global coverage.

3.2.3. A combined metric for spatial and magnitude uncertainty

In some applications, such as maps for public use, it is relevant to have a measure of uncertainty that takes into account both spatial and magnitude uncertainty for a given point source. While magnitude uncertainty is not the focus of this analysis, it is important to note how the two measures can be combined to form a conglomerate measure of uncertainty. To combine these uncertainties, one key assumption is made, namely that there is 100% association between spatial and magnitude uncertainties. That is, error in the spatial location carries with it the associated error in magnitude. Because of this, the two uncertainty types then can be added linearly to create a combined uncertainty metric. This takes into account all the associated uncertainty in the data and presents a comprehensive value for the uncertainty in a gridded data product of emissions. Similar to the spatial uncertainty metric described in the previous section, this is different in concept from a probability interval and cannot be thought of as a plus or minus value on a grid cell. Instead, it should be envisioned as a quantitative representation of the total uncertainty on a grid cell based on the emissions in or near that cell. An alternative way to conceptualize this combined metric is by dividing it by M for each grid cell so that it can then be seen as the maximum fraction of the total emissions that could be found in that cell. However, we must be careful in discussing it this way since again the emissions are binary in nature and it is therefore not reasonable to think of a fraction of their total in any given cell.

4. The relationship between resolution and uncertainty

With the calculation of spatial uncertainty defined, we can produce a gridded map containing the point sources along with a companion map showing the accumulated uncertainty of the point sources. These maps, or data sets, can then be incorporated into larger efforts to define and characterize global carbon emissions, sequestration, and stocks.

One of the remaining tasks from the standpoint of the point sources is to explore the appropriate scale on which to report the values. A number of data products have been published recently that vary from the 1 km grid of the Open-source Data Inventory for Anthropogenic CO2 emissions data set (Oda & Maksyutov, Citation2011) to a 0.1 degree grid used in the Emission Database for Global Atmospheric Research data set (Janssens-Maenhout, Pagliari, Guizzardi, & Muntean, Citation2013), to the 4 × 5 degree resolutions of some atmospheric inversion models (for example, Jiang et al., Citation2013). Ranges for satellite imagery and ecological models vary along similar ranges, though efforts are active in pressing the grids down to even smaller scales, particularly on regional or local projects. If we can report the emission values on the same level of resolution used in other, related data products, then the integration will be less cumbersome. On the other hand, if we report the data on too fine a grid, the uncertainty quantities will be so large that the values may be undermined. Therefore, we investigate the resolution of the point source data to explore the minimal grid size on which the data are meaningful.

Since it is the relative size of the grid to the spatial uncertainty that matters, we examine the grid size as a function of the mean locational uncertainty, taken as a percentage of the total emissions. The US emissions data from eGRID we have used as our test case suggested an average uncertainty of 0.84 km. All of our recent simulations have been done on a 0.1 × 0.1 degree grid, which puts the mean uncertainty at a little less than one-tenth the size of the grid. The question is then whether this is an appropriate grid size, whether we can justify using a smaller grid, or whether we need to decrease our resolution to a larger grid. As our methods extend to the globe, the uncertainty in the spatial locations of point sources will change, and it may be necessary to adapt our grid resolution to reflect the differing uncertainty in the spatial data.

We propose using the measure developed here to help quantify the appropriate maximum resolution of the reported grid. We have calculated the average uncertainty for our calculated spatial error with a variety of grid sizes. The curves in Figure 10 then show how the average uncertainty relates to the size of the grid cells.

Figure 10. Average uncertainty as a function of grid size for several different mean spatial errors ranging from 1 to 16 km, calculated as a per cent of the reported emissions. The horizontal line suggests the effect of choosing a threshold on determining an appropriate grid size for reporting and mapping purposes. These were created using a Monte Carlo simulation with 10,000 simulated points per point source.

Larger grid sizes have smaller uncertainty because, geometrically, the proportion of a grid cell that lies near an edge decreases as the cell area increases, so the effects of high uncertainty near edges are diluted by the larger area farther from the boundaries. If we establish a threshold for an acceptable average PSUM, for example, 10% of the reported emissions, we can then calculate the appropriate grid size for different regions that have varied mean spatial error. For example, as the data become less certain, the appropriate grid size can be increased to maintain an average PSUM of 10%, or whatever threshold is most appropriate. The calculations here are averaged only for the cell in which the source is reported and do not take into account alterations of the simulation due to factors such as political borders, where we would normally exclude points in the simulation. This means that the average PSUM is slightly higher than it would be if political borders were involved. We also only consider grid cells in which sources might be present. Since the large point sources only appear in a small fraction of grid cells at high resolution, this means that the overall uncertainty would actually be lower. So, this calculation only reflects the resolution that is appropriate for the spatial uncertainty related to the placement of the large point sources. However, the vast majority of grid cells in which there are no large point sources (and those not neighbouring large point sources) will have both very small uncertainty and very small emissions, which means that reporting on a larger grid spacing does not inhibit any understanding of the spatial distribution of emissions or of its associated uncertainty.
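The trade-off between grid size and average PSUM can be explored with a small calculation. The sketch below is not the Monte Carlo procedure used for the figures: it swaps in a closed-form Gaussian probability and a midpoint average over source positions, and it assumes square cells measured in kilometres, an independent Gaussian location error with a hypothetical per-axis standard deviation `sigma_km`, sources placed uniformly within a cell, and no political borders. All names are illustrative.

```python
import math


def _phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stay_probability(offset_km, cell_km, sigma_km):
    """Probability that a Gaussian-displaced coordinate stays inside [0, cell_km]."""
    return _phi((cell_km - offset_km) / sigma_km) - _phi(-offset_km / sigma_km)


def average_psum_fraction(cell_km, sigma_km, steps=200):
    """Average PSUM / M in the reported cell over uniformly placed sources.

    For the reported cell, PSUM / M = sqrt(1 - p), where p is the probability
    that the displaced location remains in that cell.
    """
    offsets = [(k + 0.5) * cell_km / steps for k in range(steps)]
    stay = [stay_probability(o, cell_km, sigma_km) for o in offsets]
    total = 0.0
    for px in stay:
        for py in stay:
            total += math.sqrt(max(0.0, 1.0 - px * py))
    return total / steps ** 2


def smallest_acceptable_cell(sigma_km, candidates_km, threshold=0.10):
    """Smallest candidate cell size whose average PSUM fraction meets the threshold."""
    for cell_km in sorted(candidates_km):
        if average_psum_fraction(cell_km, sigma_km) <= threshold:
            return cell_km
    return None


for cell_km in (5.0, 10.0, 25.0, 50.0):
    print(cell_km, round(average_psum_fraction(cell_km, 1.0), 3))
```

A call such as `smallest_acceptable_cell(1.0, [10.0, 25.0, 50.0])` then returns the finest of the candidate grids that keeps the average below the chosen threshold, or `None` if no candidate qualifies.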

Figure 10 shows several plots of the average uncertainty for different values of mean spatial error, with a line drawn at our presumed threshold of 10% for the PSUM. With preliminary data suggesting that the eGRID data for the USA are better than average around the globe, we predict that the appropriate minimum grid size for many other countries will be larger than for the USA, and this approach provides a method for adjusting the grid spacing to reflect the changes in the uncertainty of the data. For the US eGRID data, with its 0.84 km mean spatial error, the uncertainty measure for the 0.1 × 0.1 degree grid is 20.47% of the reported emissions value, reflecting a relatively high level of uncertainty. Table 1 shows several additional representative values.

Table 1. The uncertainty in the US data reveals that in order to reduce the PSUM below 10%, the spatial resolution cannot be reported on a 0.1 × 0.1 degree grid.

Again, we point out that some large point sources will inevitably lie near grid boundaries. This average measure of the uncertainty suggests a method to determine the overall minimum grid spacing that might be used in mapping the emissions and the uncertainties in the data, but individual sources may still have very large spatial uncertainties due to the nature of large point source data.

We can use this average measure to evaluate a useful resolution at which to report the data. In any given case, we recognize that there will be a trade-off between resolution and uncertainty but that with a measure like PSUM, we can quantify this trade-off. Where there is a need for high resolution, this should be accompanied by a need for high spatial accuracy.

5. Sample simulation outputs

The Monte Carlo simulation elaborated above was applied to eGRID data for each state in the continental USA to produce expected and uncertainty values on a 0.1 × 0.1 degree grid. The expected values are shown in Figure 11 for the Southeastern USA based on reported values in eGRID for 2009. It is clear that emissions from large point sources are distributed over grid spaces adjacent to the reported locations to reflect locational uncertainty, but that relatively few of the total number of grid spaces are affected by this locational uncertainty and that this locational uncertainty will be absorbed as the spatial resolution is decreased.

Figure 11. Expected values produced by the Monte Carlo simulation for the eastern USA from 2009 eGRID data of electric power generation in the USA on a 0.1 × 0.1 degree grid.

For further illustration, expected and uncertainty values are shown for Iowa specifically in Figure 12(a and b). Iowa is bordered on the east by the Mississippi River and the concentration of large power plants along this line is evident. These large emissions values are accompanied by large spatial uncertainties, with significant uncertainties potentially falling across the state line. However, because the computed points are constrained to the boundary of the reported county, cells shown which cross over into Wisconsin and Illinois still only contain the sum of the contribution of emissions to that cell from power plants in Iowa. Thus, when summing expected values over regions or the entire nation, there is no duplicate accounting.

Figure 12. Simulation output in Iowa as an example, showing both river borders, and variation in expected values and uncertainty between power plants depending on grid placement. (a) Expected emissions values output by the Monte Carlo simulation for Iowa, computed from 2009 reported values from eGRID. Units are tonnes of carbon dioxide per year. (b) Spatial uncertainty of reported emissions values in Iowa, given in tonnes of carbon dioxide per year. The uncertainty is computed from the expected values output by the Monte Carlo simulation and reported on 0.1 × 0.1 degree grid size. Only non-zero data are displayed.

6. Discussion

Large point sources are highly influential in overall totals of CO2 emissions and therefore it is critical to understand the issues associated with them and to have a means of quantifying and reporting their uncertainty in both magnitude and location. However, the binary nature of these emissions sources precludes traditional methods of dealing with their locational uncertainty in spatially explicit data sets of emissions. Small spatial errors and uncertainty may have order-of-magnitude effects on emissions totals in a grid cell. The approach presented here allows for the computation of spatial uncertainty values associated with large point source emissions. Placing a confidence interval on emissions from a grid cell to reflect spatial uncertainty is already problematic, but the scale of the impact arising from large point sources makes traditional confidence intervals irrelevant. Instead, we have presented a measure of uncertainty that provides a clear understanding of the uncertainties involved in the point sources according to their grid placement and the spatial resolution of the data set.

The statistical measure developed here is not to be confused with a traditional two-sigma uncertainty interval, but it does provide a similar, quantitative measure of the confidence in the emissions estimates for spatially explicit data.

It should be noted that the methodology presented here does not take into account the unavoidable uncertainty for power plants with multiple generators, which arises because the single location used to represent that ‘point source’ can only be placed on at most one of these sources of emissions and may be kilometres away from the others. This would have to be addressed separately as we have focused instead on the potentially much larger uncertainty arising from errors in locating the facility at all. The data used here from electric generating plants in the USA (EPA, Citation2014 – eGRID) provide an example of applying this methodology to a data set as a test case, but the methods can now be applied to analyses of other point source data sets globally.

Carbon Monitoring for Action (CarMA, Citation2013) is an example of a global data set of carbon dioxide emissions, accounting for over 60,000 power plants and 20,000 power companies worldwide. The number of emissions sources makes a concerted effort at checking and correcting the locations impractical in the short term. However, as with continual improvements to the eGRID data, we expect that other data sets will continue to improve as well. With regular resampling and calibration using the methods outlined here, the uncertainty in emissions can be updated along with these data sets, providing a partner data set detailing the uncertainty in the spatial locations of emissions worldwide.

Even with the relatively high-accuracy data provided for the USA, we begin to show the potential consequences of even these small uncertainties. Further studies will explore the application of these methods to other data sets such as CarMA that may have regional variations in spatial uncertainty.

In addition to providing a measure of uncertainty for spatial causes, we find that combining these uncertainties with magnitude uncertainty provides a useful measure and we can calibrate the magnitudes to provide a semblance of consistency. This enables the data to be implemented for varying purposes. The utility of combined or separate uncertainties is dependent on the purpose and intended audience.

The general approach developed here extends beyond point source data and intuitively can be used to combine multiple data types and obtain uncertainty values. But our focus is on dealing with data that are inherently binary; in a given grid cell, the point source is either present or not present and traditional +/− values do not provide useful information. Further analyses are needed to continue to refine the uncertainty outputs based on characteristics of the point sources. Preliminary analysis suggests that uncertainties can be refined based on additional knowledge that constrains the uncertainty, e.g. proximity to water sources, political structures, and population centres (depending on the country). The fuel source of the emissions will also likely help to clarify levels of spatial uncertainty.

The framework and the PSUM statistic developed here provide a quantitative approach to dealing with point source data in spatially explicit data sets and for quantifying the trade-off between uncertainty and resolution. The method can be easily implemented within computerized geographic information systems and it requires as input only the latitude, longitude, the magnitude of reported emissions sources and a measure of the average locational uncertainty. This method provides an initial approach to providing a metric for characterizing spatial uncertainty for point sources. More analysis and research are needed to refine these methods to better understand the impacts of such uncertainties on applications that use spatial emissions data.

Funding

This work was supported by the Carbon Monitoring System Program (NNH11ZDA001N-CMS) of the National Aeronautics and Space Administration. Supplemental internal support from the Research Institute for Environment, Energy, and Economics at Appalachian State University permitted the inclusion of greater student participation and contribution.

References

  • CarMA. (2013). Carbon monitoring for action. Retrieved from http://carma.org
  • Ciais, P., Dolman, A. J., Bombelli, A., Duren, R., Peregon, A., Rayner, P. J., … Zehner, C. (2013). Current systematic carbon cycle observations and needs for implementing a policy-relevant carbon observing system. Biogeosciences Discussions, 10, 11447–11581. doi: 10.5194/bgd-10-11447-2013
  • EIA. (2014). Summary: State CO2 emissions, U.S. Energy Information Administration. Retrieved February 25, 2014, from http://www.eia.gov/environment/data.cfm#summary (2011 data).
  • EPA. (2013). GHGRP 2011: Reported data, greenhouse gas reporting program. U.S. Environmental Protection Agency. Retrieved January 16, 2013, from http://www.epa.gov/ghgreporting/ghgdata/reported-2011/index.html
  • EPA. (2014). Clean energy: eGRID, ninth edition with 2010 data. U.S. Environmental Protection Agency. Retrieved February 24, 2014, from http://www.epa.gov/cleanenergy/energy-resources/egrid/
  • Janssens-Maenhout, G., Pagliari, V., Guizzardi, D., & Muntean, M. (2013). Global emission inventories in the Emission Database for Global Atmospheric Research (EDGAR) – Manual (I), Technical Report, JRC Retrieved from http://publications.jrc.ec.europa.eu/repository/handle/111111111/27591
  • Jiang, Z., Jones, D. B. A., Worden, H. M., Deeter, M. N., Henze, D. K., Worden, J., … Schuck, T. J. (2013). Impact of model errors in convective transport on CO source estimates inferred from MOPITT CO retrievals. Journal of Geophysical Research, 118, doi:10.1029/jgrd.50216
  • Nassar, R., Napier-Linton, L., Gurney, K. R., Andres, R. J., Oda, T., Vogel, F. R., & Deng, F. (2013). Improving the temporal and spatial distribution of CO2 emissions from global fossil fuel emission data sets. Journal of Geophysical Research: Atmospheres, 118, 917–933.
  • Oda, T., & Maksyutov, S. (2011). A very high-resolution (1 km x 1 km) global fossil fuel CO2 emission inventory derived using a point source database and satellite observation of night lights. Atmospheric Chemistry and Physics, 11, 543–556. doi: 10.5194/acp-11-543-2011
  • Peylin, P., Houweling, S., Krol, M. C., Karstens, U., Rödenbeck, C., Geels, C., … Heimann, M. (2011). Importance of fossil fuel emission uncertainties over Europe for CO2 modeling: Model intercomparison. Atmospheric Chemistry and Physics, 11, 6607–6622. doi: 10.5194/acp-11-6607-2011
  • Sarmiento, J. L., Gloor, M., Gruber, N., Beaulieu, C., Jacobson, A. R., Mikaloff Fletcher, S. E., … Rodgers, K. (2010). Trends and regional distributions of land and ocean carbon sinks. Biogeosciences, 7, 2351–2367. doi: 10.5194/bg-7-2351-2010
  • Schneider, S., & Kuntz-Duriseti, K. (2002). Uncertainty and climate change policy, chapter 2. Washington DC: Island Press.
  • Singer, A., Branham, M., Hutchins, M., Welker, J., Woodard, D., Badurek, C., … Marland, G. (2014). The role of CO2 emissions from large point sources in emissions totals, responsibility, and policy. Environmental Science and Policy, 44, 190–200. doi: 10.1016/j.envsci.2014.08.001
