Research Article

Histogram cube: towards lightweight interactive spatiotemporal aggregation of big earth observation data

Pages 4646-4667 | Received 15 Jun 2023, Accepted 29 Oct 2023, Published online: 12 Nov 2023

ABSTRACT

In the era of Earth Observation (EO) big data, interactive spatiotemporal aggregation analysis is a critical tool for exploring geographic patterns. However, existing methods are inefficient and complex, and their interactive performance depends heavily on large-scale computing resources, especially data cube infrastructure. In this study, from a green computing perspective, we propose a lightweight data cube model based on the preaggregation concept, in which the frequency histogram of EO data is employed as a specific measure. The cube space was divided into lattice pyramids by the Google S2 grid system, and histogram statistics of the EO data were injected into in-memory cuboids. Therefore, exploratory aggregation analysis of EO datasets could be rapidly converted into multidimensional-view query processes. We implemented the prototype system on a local PC and conducted a case study of global vegetation index aggregation. The experiments showed that the proposed model is smaller, faster and consumes less energy than ArcGIS Pro and XCube, and facilitates green computing strategies involving a cube infrastructure. Owing to the standalone mode, larger datasets result in longer cube building times and additional indexing latency. The efficiency of the approach comes at the expense of accuracy, and the inherent uncertainties are examined in this paper.

This article is part of the following collections:
Discrete Global Grid Systems for Developing Digital Earth Systems

1. Introduction

Spatiotemporal aggregation is a widely used geostatistical method for aggregating and summarizing geographic datasets to integrate heterogeneous information from multiple sources into a unified spatiotemporal analysis scale (Lopez, Snodgrass, and Moon 2005). For instance, in human–land systems, human data are often integrated with continuously distributed environmental data using survey areas or administrative regions as geostatistical boundaries (Rinner and Hussain 2011). In ecosystem research, it is essential to quantify differences in species distributions across different spatiotemporal units, such as plant communities and forest patches, to analyse the impact of global climate change on crop growth (Chen, Li, and Wang 2009). With the rapid development of high-temporal/spatial/spectral resolution satellite remote sensing technology and related geo-simulation systems, large amounts of high-dimensional Earth observation data and reanalysis datasets have been generated. In EO-based big data studies, owing to the spatiotemporal heterogeneity in the distribution of geographic phenomena as well as multiscale effects, analysts must often perform various aggregation steps on geographic variables in different spatiotemporal regions in interactive mode to explore the underlying patterns step by step (Nativi, Mazzetti, and Craglia 2017). Rapid response performance is a critical metric that directly affects the continuity of human–computer interaction in this exploratory analysis process.

However, existing spatiotemporal aggregation methods still suffer from problems such as complex computations and high I/O requirements attributed to vector–grid conversion (Haynes, Manson, and Shook 2017; Mennis 2010). The massive volume and high dimensionality of Earth observation data cause a series of computational challenges in spatiotemporal aggregation, such as multidimensional group-by aggregations, complex topology operations, and spatiotemporally coupled boundaries. The aggregation analysis function is referred to as zonal statistics in geographic information system (GIS) software such as ArcGIS and QGIS, which can support simple aggregation tasks locally. Currently, constructing EO data cubes on a global or local scale has become a cutting-edge technique for efficiently exploring the potential value of Earth observation big data (Baumann et al. 2016; Cardoso dos Santos, Van Der Zande, and Youdjou 2020; Gao et al. 2022; Gorelick et al. 2017; Killough, Siqueira, and Dyke 2020). In such online platforms, spatiotemporal aggregation tasks are frequently performed by a large number of online users as a necessary or critical step in data preprocessing or exploratory analysis. However, these repetitive computational tasks rely on hundreds or thousands of servers, which significantly increases energy consumption and carbon emissions.

To address this issue, in this study we propose a localized cube model that uses the distribution frequency histogram of EO data as a specific measure. Based on a discrete global grid, the model can be employed to create multidimensional and multilevel histogram statistics of EO data, enabling exploratory aggregation analysis of massive EO datasets through a multidimensional-view query process on histogram cubes. The model was designed with lightweight storage and computation capabilities, allowing flexible deployment and interactive access on local desktops or in the cloud while minimizing resource consumption in line with the concept of green computing. Overall, the model facilitates efficient and cost-effective exploration and analysis of EO data.

The remainder of this paper is organized as follows: in Section 2, a systematic overview is provided of the different forms and characteristics of spatiotemporal aggregation, and the existing methods and challenges in this field are summarized. In Section 3, the proposed model and query approach are presented, and the specific technical implementation is outlined in Section 4. In Section 5, the case study and performance testing process are described. In Section 6, the various concerns related to our approach are described, and a discussion of future work is provided in Section 7.

2. Background

2.1. Problem description

The aggregation process can be decomposed into two steps. First, datasets are grouped based on specified dimension values (Group), and a specified aggregation function is then applied in each group (Agg). Based on the combination of aggregation dimensions, spatiotemporal aggregation can be classified into several types (Table 1): temporal aggregation, spatial aggregation, spatiotemporal independent aggregation (STIA), and spatiotemporal coupled aggregation (STCA). The spatial and temporal aggregation processes are considered specific cases of STIA.

Table 1. Types of spatiotemporal aggregation.

In Table 1, $\langle S,T\rangle$ denotes the analysis domain consisting of the space dimension $S$ and time dimension $T$ of the EO dataset, and $S_{zone}$, $T_{zone}$ and $V_{zone}$ denote the grouping domains of the space, time, and variable values. $T_{zone}=f(S_{zone})$ denotes the spatiotemporal coupling function, where the temporal grouping varies with the spatial grouping. The complexity of spatiotemporal aggregation is reflected in the boundary conditions, including complex spatial geometry (e.g. land use boundaries), unequal temporal regions (e.g. different lengths of cold and warm seasons), and mutually coupled spatiotemporal boundaries (e.g. different regional crop types with different phenologies). The difficulty of spatiotemporal aggregation increases with the number of dimensions and the complexity of the boundary conditions. In GIS software, the zonal function in map algebra describes the expression and processing rules of the spatial aggregation calculation: Zonal(AggFunc, ValueLayer, ZoneLayer), where AggFunc is the specific aggregation operator, such as sum, mean, or variance (Tomlin 1994). Spatial aggregation can be calculated in two ways. One is to rasterize the zone vectors and then overlay them with the raster to be aggregated. The other is to convert the raster into vector points, perform spatial topology analysis between the points and zones, and then count the points per zone. The traditional map algebra provided in GIS software supports spatial and temporal aggregation separately but not spatiotemporal aggregation. Mennis (2010) proposed the first multidimensional map algebra to address this issue by realizing a rule expression of spatiotemporal aggregation calculations. However, in that implementation, spatiotemporal zones must be preconverted into grids and then spatiotemporally joined with value cubes, generating enormous and redundant cubes in large-scale, high-resolution, long-time-series aggregation cases, which severely limits the computational efficiency.

2.2. Related work

To address the low performance of spatiotemporal aggregation, many researchers have attempted to parallelize map algebra/multidimensional map algebra (Li et al. 2014; Zacharatou et al. 2017; Zhang, You, and Gruenwald 2015) to accelerate the process. However, as the amount of data and the aggregation complexity increase, the scalability of parallel zonal statistics becomes significantly limited.

In recent years, the Earth Observation Data Cube (EO Data Cube) has become an emerging infrastructure that organizes EO data into multidimensional arrays of pixels, for example, the Google Earth Engine, XCube, GeoCube, and PIE Engine (Baumann et al. 2016; Cheng et al. 2022; Gao et al. 2022; Gorelick et al. 2017). The EO Data Cube provides online analysis services based on multidimensional array algebra and adopts a distributed parallel architecture for storage and computing to improve the scalability of data processing. When performing spatiotemporal aggregation, the aggregation zone is first converted into a multidimensional array, and tensor operations are then performed with the EO dataset. Such technologies can relieve the pressure of frequent aggregation of EO data through powerful computing resources in the background. Researchers have used relational spatial databases to establish spatiotemporal indices for EO data cubes and reduce the search space through a data block mechanism, thereby decreasing redundant data access (Hu et al. 2018; Xu et al. 2018). Appel et al. (2018) integrated the Geospatial Data Abstraction Library (GDAL) in a 2D GIS environment into the multidimensional database SciDB to enhance its vector–raster integration analysis capabilities. However, the lack of a description of the time dimension prevents complex spatiotemporal aggregation computations.

The preaggregation mechanism of online analytical processing (OLAP) in data warehousing is the main inspiration for solving this problem. OLAP can respond to queries quickly and at a low resource cost by aggregating data in advance, thereby avoiding online computations. In exploratory analysis or recursive query scenarios, the query efficiency can be further improved by using approximate query methods (Li and Li 2018). Currently, there are three main types of spatiotemporal data models that implement preaggregated query mechanisms. First, in the pyramid data model, an image cache is constructed using a pyramid grid to realize multiscale queries of remote sensing images (Gutierrez and Baumann 2010; Sample and Loup 2010). Second, in the aggregate index tree model, preaggregated information of spatiotemporal big data (social network data, cell phone location data, traffic track data, etc.) is stored directly in index trees for real-time query and visual analysis (Lins, Klosowski, and Scheidegger 2013; Pahins et al. 2017; Wang et al. 2017). However, this model is not suitable for EO data because of its dense array and multidimensional nature, which makes the tree large and costly to maintain or update. Third, the spatial OLAP (SOLAP) model entails the establishment of a preaggregation mechanism from fine to coarse spatial data to provide multidimensional and multilevel query analysis (Bédard, Rivest, and Proulx 2007). Gómez et al. (2009) systematically described the aggregation of vector data in SOLAP and proposed a spatial overlay preaggregation method for complex geometric queries as a direct alternative to R-tree-indexed queries. The SOLAP method is suitable for multidimensional operations and exhibits suitable scalability under large-scale data pressure but can only be applied to discrete measurements (numerical/object-based spatial data). To support continuous remote sensing data, SOLAP methods with a raster as the basic measurement unit based on multilevel spatial grids and temporal divisions have been proposed (Bimonte, Zaamoune, and Beaune 2017; Li et al. 2014). Under interactive query scenarios, user-defined zones of arbitrary geometry should be allowed. However, the members of the spatial dimension in most SOLAP cubes are predefined, and the spatial query region of the user can only be limited to the spatial members (e.g. administrative divisions). Although grid-based SOLAP can be used to approximate the query region using grid cells at the same level, this method lacks sufficient flexibility. Coarse grids can cause large errors in the aggregation results, whereas fine grids can lead to inefficient queries because of the large number of cells generated.

EO data cubes provide convenient access to EO data and scientific computation. However, existing spatiotemporal aggregation methods are computationally intensive and require high energy consumption. Based on the idea of SOLAP preaggregation, we propose a data model and computational framework to support exploratory spatiotemporal aggregation queries, in line with the concept of green computing.

3. Methodology

3.1. Cube model

To relieve the conflict between complex aggregation conditions and real-time interactive aggregation performance, we introduced the multilevel dimensional structure of the SOLAP model into the data cube concept. The frequency histogram of the EO data distribution is considered the basic measurement in this model to create a multiscale data statistical cube structure in memory, which is called a histogram cube (HCube). The conceptual model is defined as $\langle D, M, F\rangle$, where D denotes the space, time, and variable dimensions (Figure 1), M is the measurement that encompasses the frequency histogram and other derived statistics, and F denotes the aggregation function.

Figure 1. Conceptual model of the histogram cube.


Space Dimension (Ds): The layers in the space hierarchy are spatially nested and can be divided into a pyramid layer (Ds.base) and a custom layer (Ds.def), which constitute a collection of spatial scales from fine to coarse. Ds.base is a quadtree-based multilevel global discrete grid (the Google S2 grid is adopted in our implementation, as described in Section 4.1), whose granularity structure can be expressed as $\theta_0 < \theta_1 < \cdots < \theta_m$, and the dimensional members can be labeled with global discrete grid codes. Ds.def refers to the user-defined spatial zones located above the pyramid layer, such as multilevel administrative regions and climate zones. Members of Ds.def can be adaptively described by multilevel granularity cells in Ds.base (details in Section 3.2): Ds.base → Ds.def.

Time Dimension (Dt): This dimension also includes a pyramid layer (Dt.base) and a custom layer (Dt.def), similar to the spatial hierarchy. Dt.base is the structure of the natural time granularity, i.e. day < month < year, whereas Dt.def refers to user-defined time zones, such as the phenology or ENSO warm/cold periods. These can be adaptively described by the time-pyramid layer: Dt.base → Dt.def. For instance, plant phenology can be described based on days, and periods encompassing whole months can be combined into a monthly granularity (refer to Section 3.2). Due to geographical heterogeneity, the temporal statistical domains of different geographical zones may not be consistent, as in the case of phenological phenomena. To address this issue, we designed a space–time coupled layer with the granularity expressed as $\langle\theta_s,\theta_t\rangle$. In this layer, the time cell $C_t$ depends on the space cell $C_s$, which can be characterized as follows:
(1) $C_t = f(zone(C_s))$
where $zone(C_s)$ denotes the space zone, and the function $f$ is the spatiotemporal coupling relationship.

Variable Dimension (Dv): This dimension is used to define the domain and zoning of the EO variables (e.g. vegetation coverage, temperature, and rainfall), with the members representing intervals of the corresponding variable histograms. A hierarchical relationship can be established between histograms of different granularities. Based on these histogram intervals, the variables can be further categorized in terms of their attributes. For example, rainfall can be classified into three levels: light, medium, and heavy rain.

Histogram Measures: The data space is partitioned into multiscale cells based on combinations of different dimensional granularities. If the numbers of levels in the space, time, and variable dimensions are $|S_H|$, $|T_H|$, and $|V_H|$, respectively, $|S_H| \times |T_H| \times |V_H|$ cuboids with varying granularity combinations are generated, forming a pyramid structure in which the histograms are aggregated layer by layer (Figure 2). Each cuboid cell is uniquely defined by its space code (S), time code (T), and variable code (V). The frequency histogram of the EO variables within each cuboid can be represented as $[x_i, f_i]$, where $f_i$ denotes the frequency of the EO variable values within the histogram interval $x_i$.

Figure 2. Pyramid structure of histogram cuboids.


Aggregation Function: The histogram in a coarse-grained cuboid cell can be directly derived from the summation of the histograms in fine-grained cells. Commonly used aggregation functions (sum, count, mean, median, and variance) can be computed as derived measures using estimation equations (Table 2) based on the frequency histogram.

Table 2. Aggregation functions based on a frequency histogram.

According to their distributable aggregation capability, aggregation functions can generally be divided into three categories (Gray et al. 1997): (i) distributive functions, for which the result obtained by applying the function to N partial aggregates is the same as that obtained by applying the function to all the data without partitioning, such as the sum and count; (ii) algebraic functions, whose aggregated values can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which can be obtained by applying a distributive aggregate function, such as mean = sum/count; and (iii) holistic functions, whose aggregated values can only be obtained from all the data, such as the median and variance. In HCube, aggregates such as the count, sum, and mean can be calculated in a distributive manner from each cell, while the median and variance are derived from a globally aggregated histogram, expressed as follows:
(2) $H.bin(n)=\sum_{i} H_i.bin(n)$
where $H_i.bin(n)$ is the nth bin of the histogram $H_i$ in space–time cell $i$. Note that the median and variance are approximations calculated indirectly from the histogram, and their accuracy gradually increases with decreasing histogram bin width.
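
As a concrete illustration of Equation (2) and the histogram-based estimators summarized in Table 2, the following minimal numpy sketch merges the histograms of several fine cells bin-wise and then estimates the common aggregates from the bin midpoints. The function names, uniform bin edges, and midpoint-based estimators are our own assumptions for illustration; the exact formulas in Table 2 may differ.

```python
import numpy as np

def merge_histograms(freqs):
    """Equation (2): a coarse cell's histogram is the bin-wise sum of its children's."""
    return np.sum(freqs, axis=0)

def agg_from_histogram(bin_edges, freq):
    """Estimate common aggregates from a frequency histogram: count and sum are
    distributive, the mean is algebraic, and the median and variance are holistic
    and therefore approximated from the bin midpoints."""
    mids = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    count = int(freq.sum())
    total = float(np.dot(mids, freq))                 # sum estimated from bin midpoints
    mean = total / count
    var = float(np.dot(freq, (mids - mean) ** 2)) / count
    cum = np.cumsum(freq)
    median = float(mids[np.searchsorted(cum, count / 2.0)])   # first bin crossing half the count
    return {"count": count, "sum": total, "mean": mean, "variance": var, "median": median}

# Usage: two fine cells aggregated into one coarse cell (20 NDVI bins of width 0.05).
edges = np.linspace(0.0, 1.0, 21)
rng = np.random.default_rng(0)
h1, h2 = rng.integers(0, 50, 20), rng.integers(0, 50, 20)
print(agg_from_histogram(edges, merge_histograms([h1, h2])))
```

As the text notes, the finer the bins, the closer these midpoint-based estimates of the median and variance approach the exact values.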

3.2. Query method

Based on the above model, an aggregation computation on the original EO dataset is mapped to distributed aggregation queries on HCube. Query conditions can be expressed by the constraints on the dimension domains and the aggregation function as $\langle\varphi(D_s):\theta_s, \varphi(D_t):\theta_t, \varphi(D_v):\theta_v, F\rangle$, where $\varphi$ denotes the user-defined query domain, $\theta$ is the corresponding query granularity of each dimension, and F denotes the aggregation function. The spatiotemporal aggregation query process in HCube can be divided into three steps: (i) transforming the spatiotemporal query domain $\langle\varphi(D_s):\theta_s, \varphi(D_t):\theta_t\rangle$ into a multigranular adaptive representation of space–time cells; (ii) querying the frequency histograms of variable $\varphi(D_v):\theta_v$ from the corresponding spatiotemporal cuboids in HCube; and (iii) aggregating the histograms/aggregates and returning the result.

3.2.1. Adaptive spatiotemporal expression

The key to the query process is the spatiotemporal adaptive expression of the query domain, namely, the realization of D.base → D.def. For the space dimension, the pyramid structure of Ds.base is built in memory. First, spatial topology operations are performed layer by layer, starting from the highest level of the spatial pyramid. Then, the space cells contained in or intersected by the query domain are identified in each layer. The cells intersected by the domain boundary are refined until each cell is covered by or overlaps the domain, and the cells completely outside the domain are removed. Finally, the top-down search trajectory of this query process in the pyramid yields a minimum-depth aggregation tree, and the set of adaptive space cells in the query domain can be obtained by aggregating all leaf nodes. Based on the temporal pyramid, a temporal minimum-depth aggregation tree can be built by merging periods from bottom to top. This approach ensures that a continuous time period is expressed with the largest possible time cells. Thus, the time-query domain can be adaptively converted into a collection of time cells with multiple granularities (Figure 3). For example, the time domain [2020/12/31, 2022/2/2] can be expressed as the set of time cells {2020/12/31, 2021, 2022/1, 2022/2/1, 2022/2/2}. Eventually, the orthogonal set of space cells $\{C_s^m\}$ and time cells $\{C_t^n\}$ comprises the target space–time cells $\{C_{st}^{m\times n}\}=\{C_s^m\}\times\{C_t^n\}$.
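
The following sketch mimics both halves of this adaptive expression. For the spatial part it leans on the RegionCoverer of the pure-Python s2sphere package as a stand-in for the official Google S2 library used by the prototype; for the temporal part it greedily merges days into whole months and years. All names, levels, and parameters are illustrative.

```python
import calendar
import datetime as dt
import s2sphere

def adaptive_space_cells(lat_lo, lng_lo, lat_hi, lng_hi, max_cells=500):
    """Approximate a rectangular query region with S2 cells of mixed levels,
    analogous to the leaf nodes of the spatial minimum-depth aggregation tree."""
    rect = s2sphere.LatLngRect.from_point_pair(
        s2sphere.LatLng.from_degrees(lat_lo, lng_lo),
        s2sphere.LatLng.from_degrees(lat_hi, lng_hi))
    coverer = s2sphere.RegionCoverer()
    coverer.min_level, coverer.max_level = 0, 8   # spatial pyramid L0-L8
    coverer.max_cells = max_cells
    return coverer.get_covering(rect)             # list of CellId objects

def _month_end(d):
    return dt.date(d.year, d.month, calendar.monthrange(d.year, d.month)[1])

def adaptive_time_cells(start, end):
    """Greedily decompose an inclusive date range into whole years, whole months,
    and remaining days, mirroring the temporal minimum-depth aggregation tree."""
    cells, day = [], start
    while day <= end:
        if (day.month, day.day) == (1, 1) and dt.date(day.year, 12, 31) <= end:
            cells.append(str(day.year))                   # whole year
            day = dt.date(day.year + 1, 1, 1)
        elif day.day == 1 and _month_end(day) <= end:
            cells.append(f"{day.year}-{day.month:02d}")   # whole month
            day = _month_end(day) + dt.timedelta(days=1)
        else:
            cells.append(day.isoformat())                 # single day
            day += dt.timedelta(days=1)
    return cells

# The example from the text: [2020/12/31, 2022/2/2]
print(adaptive_time_cells(dt.date(2020, 12, 31), dt.date(2022, 2, 2)))
# ['2020-12-31', '2021', '2022-01', '2022-02-01', '2022-02-02']
```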

Figure 3. Adaptive spatiotemporal expression by space and time cells. (a) Adaptive spatial expression. (b) Adaptive temporal expression. (c) Target space–time cells.


In the case of spatiotemporal coupling constraints, the time domain corresponding to the space cells covered by each zone can be obtained through the spatiotemporal coupling function (Equation 1). If $f(zone(C_s))$ is a discrete function, a preestablished spatiotemporal lookup table (e.g. the distribution of the phenology in different climatic zones) can be used to quickly determine the temporal domain. In this case, the spatial join of the space cells $\{C_s^m\}$ and time cells $\{C_t^n\}$ comprises the target space–time cells $\{C_{st}^m\}$. Because the target cells are generated by traversing the leaf nodes of the minimum-depth aggregation tree, they can be automatically grouped by granularity according to the tree hierarchy, thereby forming the key–value structure $\{\langle\theta_s,\theta_t\rangle: C_{st}\}$.
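
When the coupling function $f$ is discrete, the lookup table mentioned above can be as simple as a dictionary from zone label to growing-season months, as in this sketch; the zone codes, month ranges, and function names are purely illustrative.

```python
# Hypothetical phenology lookup: climate zone -> (start month, end month) of the growing season.
PHENOLOGY = {"Cfa": (3, 11), "Dfb": (5, 9), "BWh": (2, 6)}  # illustrative values only

def coupled_time_cells(space_cells, year=2020):
    """Equation (1): the time cells attached to each space cell depend on its zone."""
    coupled = {}
    for cell_id, zone in space_cells:            # (S2 cell token, climate-zone label)
        start, end = PHENOLOGY[zone]
        coupled[cell_id] = [f"{year}-{m:02d}" for m in range(start, end + 1)]
    return coupled
```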

3.2.2. Histogram query and aggregation

As described in Section 3.1, HCube consists of cuboids with multiple granularities, and a multidimensional index is constructed in memory by combining the space code (S), time code (T), and variable code (V) in the order $\langle S, T, V\rangle$. The aggregation query process for the histograms can be decomposed into two functional phases: a map phase and a reduce phase (Figure 4(a)). In the map phase, the query process searches the cuboids at the granularities of the target space–time cells, and multidimensional keys comprising the codes of the target cells are then used to query the histograms/aggregates from the corresponding cuboids. In the reduce phase, all the histograms/aggregates obtained in the map phase are aggregated in groups by the combined keys of the temporal and spatial zones, and the final statistics are calculated according to the aggregation functions provided in Table 2.
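
The two phases can be sketched as follows, with a cuboid represented as a plain dictionary keyed by (space code, time code); this is a simplification of the indexed DataFrame actually used by the prototype (Section 4.1), and the data structures are assumptions made for illustration.

```python
from collections import defaultdict
import numpy as np

def map_phase(cuboid, target_cells):
    """cuboid: {(space_code, time_code): frequency array};
    target_cells: [(zone_id, space_code, time_code), ...] from the adaptive expression."""
    hits = []
    for zone_id, s, t in target_cells:
        h = cuboid.get((s, t))
        if h is not None:                        # empty cells are simply absent (sparse storage)
            hits.append((zone_id, h))
    return hits

def reduce_phase(hits):
    """Group the fetched histograms by zone and sum them bin-wise (Equation 2)."""
    grouped = defaultdict(list)
    for zone_id, h in hits:
        grouped[zone_id].append(h)
    return {zone: np.sum(hs, axis=0) for zone, hs in grouped.items()}
```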

Figure 4. (a) Map and reduce phases of the aggregation query. (b) Spatial relationships between the cell and query domains.


In the spatial adaptive representation, there are two spatial relationships between a cell and the query domain: completely inside the domain (called an inner cell, IC) and intersecting the boundary of the domain (called an edge cell, EC). Among ECs, there are several cases: cells located at the outer boundary of the domain (outer edge cells, OECs), cells located at the shared boundaries of adjacent zones inside the domain (shared edge cells, SECs), or both (Figure 4(b)). Therefore, this representation cannot precisely depict the spatiotemporal domains. To reduce the uncertainty in the aggregation results caused by this problem, the proportion of the edge cell area covered by the zone is used as a weight to correct the aggregates of the sum and count. In this case, the sum can be computed as follows:
(3) $Sum(ZONE)=\dfrac{S_{EC\cap ZONE}}{S_{EC}}\times Sum(EC)+Sum(IC)$
where $S_{EC}$ and $S_{IC}$ are the areas of the EC and IC, respectively, and $S_{EC\cap ZONE}$ denotes the part of the EC area intersecting the zone. The median and variance aggregates are derived directly from the target cells because they are mainly influenced by the data distribution.
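
A minimal sketch of the boundary correction in Equation (3); the argument structure is an assumption made for illustration.

```python
def corrected_sum(inner_sums, edge_cells):
    """Equation (3): weight each edge cell's sum by the fraction of its area inside the zone.
    inner_sums: iterable of Sum(IC); edge_cells: iterable of (Sum(EC), cell_area, overlap_area)."""
    total = sum(inner_sums)
    for ec_sum, cell_area, overlap_area in edge_cells:
        total += (overlap_area / cell_area) * ec_sum
    return total
```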

4. Implementation

Based on the proposed conceptual model and query methods, a prototype system was implemented in Python, including a Cube Model, Cube Builder, Cube Query Engine and Cube Viewer (Figure 5).

Figure 5. Implementation architecture of the histogram cube.


4.1. Data model layer

Google S2 (https://s2geometry.io) was used as the discrete global grid system (DGGS) to build the spatial pyramid of the cube. This grid projects the Earth onto six cube faces, and each face is divided into multilevel grids using a quadtree algorithm. Because it uses the Hilbert curve as its space-filling curve, S2 achieves excellent global spatial order preservation relative to other grids, such as Geohash with its Z-order curve. We implemented the minimum-depth aggregation tree based on the Google S2 geometry (official open-source library) to support the multigranular adaptive representation of the spatial query domain.

When constructing the Cube Model object, a configuration file in JavaScript Object Notation (JSON) format is used to describe the cube model structure and establish the mapping relationship with the cube data. As shown in Figure 6(b), this includes the dimension configuration (dimension type, dimension domain, and hierarchy), the measurement configuration (variable name, value domain, and number of histogram bins), and the persistent storage paths of the cube data and the raw EO data source. The Cube Builder imports the raw data, dissects the cube space based on the cube configuration, and builds histogram cuboids starting from the finest granularity level. The cuboids at upper levels are aggregated step by step from the finest-granularity cuboids upwards without accessing the raw EO data again.
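
A cube configuration of this kind might look like the following; the prototype's exact schema is not published, so all field names and paths here are assumptions.

```python
import json

# Illustrative cube configuration; field names and paths are assumptions.
config = {
    "dimensions": {
        "space": {"grid": "S2", "levels": [0, 8]},
        "time": {"granularity": ["month", "year"]},
        "variable": {"name": "NDVI", "domain": [0.0, 1.0], "bins": 20},
    },
    "storage": {"cuboids": "./hcube/ndvi/"},
    "source": {"path": "./raw/VIP_NDVI_0.05deg.nc"},
}
print(json.dumps(config, indent=2))
```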

Figure 6. Implementation of the cube model layer. (a) Cube object. (b) Cube configuration. (c) Cube data storage.


EO data are unevenly distributed globally. For example, terrestrial observation variables are commonly missing in marine areas, resulting in many cuboids with empty or invalid data. Using dense arrays or databases to store such sparse data can cause very high storage and query performance overheads. Therefore, we used a data table with a multidimensional index to store the sparse arrays and constructed a transparent, user-oriented multidimensional view through the cube model (Figure 6(c)). Because the histogram greatly reduces the volume of the raw dataset, HCube can load all cube data into the in-memory database at once and respond to query requests in real time on the server side. The Pandas DataFrame (https://pandas.pydata.org) was employed for in-memory storage, and the column-oriented data file format Apache Parquet (https://parquet.apache.org) was used to store all cuboids persistently on disk. For each spatiotemporal granularity, we labeled the cuboids in the form $\langle M_v,\theta_s,\theta_t\rangle$ ($M_v$ denotes the members of the variable dimension) and built a multidimensional index in the order $\langle S, T, V\rangle$. The Parquet format has a high compression ratio and can be deployed in distributed cloud environments such as Hive and S3 to support remote access and distributed computing. The proposed model can also be connected to other sparse array databases as a backend to support large-scale applications.
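
In this storage scheme, a cuboid reduces to a sparse table with one row per non-empty cell, indexed by the S, T, and V codes and persisted as Parquet, roughly as sketched below; the column names and cell tokens are illustrative rather than the prototype's exact layout.

```python
import pandas as pd

# Sparse cuboid storage: one row per non-empty <S, T, V> cell instead of a dense array.
rows = [
    # S2 cell token, time code, histogram bin index, frequency
    ("89c25", "2020-07", 12, 341),
    ("89c25", "2020-07", 13, 128),
    ("89c2c", "2020-07", 12, 57),
]
df = (pd.DataFrame(rows, columns=["S", "T", "V", "freq"])
        .set_index(["S", "T", "V"])
        .sort_index())                             # multidimensional index in <S, T, V> order
df.to_parquet("cuboid_L8_month.parquet")           # column-oriented, highly compressible

# At query time the whole cuboid is loaded into memory once and sliced by index keys.
cuboid = pd.read_parquet("cuboid_L8_month.parquet")
print(cuboid.loc[("89c25", "2020-07")])            # all histogram bins of one space-time cell
```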

4.2. Query engine layer

The query engine is responsible for generating query conditions, constructing the temporal/spatial minimum-depth aggregation trees, and processing the requests and responses of query tasks. After the engine starts, it immediately loads the cube configuration, creates the specific cube object, and then reads all the histogram cuboids to complete cube model initialization as a query server. After receiving the query conditions sent by the interaction layer, the engine executes indexing to automatically generate the adaptive space–time cells and then queries all the histograms to be aggregated in the cube. The aggregation query results are pushed to the client in a timely manner in a JSON data structure (the same data transfer format used for the query requests). The client can visualize the time-series curves of the aggregated values for the query regions. If a user's query request contains more than one query region (task), the query engine executes the query tasks in parallel in a multiprocess manner and merges the final query results at the end, rather than executing each task in turn.

In the above distributed aggregation process, several optimization mechanisms are adopted to reduce data reads and communication. First, when constructing the minimum-depth aggregation tree, we limited the maximum number of covering cells to 500 via the RegionCoverer function of the Google S2 geometry library to prevent the fine grid representation of geometric edges from generating an excessive number of cells. Second, a large number of SECs exist among the cells covered by adjacent spatial domains, which leads to redundant queries when multiple query tasks include the same adjacent spatial domains. Therefore, we merge identical space–time cells before executing a batched query to avoid duplicate cell reads. The histograms of the shared cells are sent to each task group after the query returns. Finally, when distributive or algebraic aggregation is performed, a detailed histogram distribution is not needed. Thus, the sum and count aggregates for each cell can be computed in advance during the map phase, and the mean can be calculated from the sum and count during the reduce phase without histogram aggregation.
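
The second optimization, merging shared cells across tasks before a batched query, can be sketched as follows; `fetch_histogram` is a hypothetical callable standing in for one cuboid read, and the task structure is an assumption.

```python
def dedupe_and_query(tasks, fetch_histogram):
    """tasks: {zone_id: set of (space_code, time_code)} produced by the adaptive expression.
    Cells shared by adjacent zones are read only once and then fanned out to every task
    that needs them. `fetch_histogram` is a hypothetical callable reading one cell."""
    wanted = set().union(*tasks.values())                      # merge duplicate cell keys
    cache = {cell: fetch_histogram(cell) for cell in wanted}   # one read per distinct cell
    return {zone: [cache[c] for c in cells] for zone, cells in tasks.items()}
```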

4.3. Interactive layer

A simple 3D visual interface based on Cesium (https://cesium.com) was designed for interactive querying. Users can create polygon regions through mouse interaction or upload files (shapefile/GeoJSON) to specify the spatial query domain $\varphi(D_s):\theta_s$. The temporal domain $\varphi(D_t):\theta_t$ can be selected using the time slider and drop-down boxes. The spatiotemporal coupling domain can be input by uploading polygon files with time fields (prepared by the users). The EO variables and aggregation functions can be specified using multiselection boxes. These inputs are converted into the query conditions $\langle\varphi(D_s):\theta_s, \varphi(D_t):\theta_t, \varphi(D_v):\theta_v, F\rangle$ and sent to the Cube Engine layer in JSON format to perform the query tasks. The results are returned to the interactive interface in a timely manner for chart visualization.
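
One such query condition, serialized for the engine, might look like the sketch below; the field names are assumptions rather than the prototype's exact wire format.

```python
# Illustrative query request sent from the interactive layer to the query engine.
query = {
    "space": {"geometry": "region.geojson", "granularity": "L8"},
    "time": {"from": "2000-01", "to": "2019-12", "granularity": "month"},
    "variable": {"name": "NDVI", "bins": 20},
    "agg": ["mean", "median", "variance"],
}
```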

5. Evaluation

5.1. Case study and data

The normalized difference vegetation index (NDVI) is a widely used remote sensing indicator that reflects the relative density and health of vegetation (Kogan 1995). Exploring the regional heterogeneity of vegetation and its time-series variation through interactive aggregation queries can effectively help to better understand vegetation response patterns to climate change. As a case study, we built HCube instances for global cumulative NDVI observations. Vegetation growth is affected by climatic conditions such as rainfall and temperature, with different growth seasons. Based on the vegetation phenology, we calculated the average plant growth season for each climatic zone to generate the spatiotemporal coupling domains (Figure 7). Two types of NDVI datasets were selected for testing. One was a global dataset (called Dataset A in the rest of the paper) with a spatial resolution of 0.05° provided by the Vegetation Index and Phenology Lab, University of Arizona, USA (https://vip.arizona.edu). Its data volume is 38.2 GB, and its dimensions are 7200 (lon) × 3600 (lat) × 396 (month). The other was a 30-m Landsat NDVI dataset (called Dataset B) covering the entire China region, obtained from the Chinese Ecosystem Research Network Data Center (http://www.nesdc.org.cn). Dataset B was used to test the scalability of the proposed model in high-spatial-resolution scenarios, with a data volume of 1.28 TB and dimensions of 228747 (x) × 131365 (y) × 22 (year). The phenology datasets were also obtained from the Vegetation Index and Phenology Lab, and the climate zone data used in the tests were obtained from the National Centers for Environmental Information (NCEI), US (https://www.ncei.noaa.gov).

Figure 7. Workflow of the cube analysis process in the case study.


5.2. Experimental settings

The experiments were performed in a Linux environment on a personal desktop computer with a common configuration (CPU i9-12900K 3.2 GHz (16 cores), 64 GB RAM, and a 1 TB SSD). The HCube data were built in the same environment, and the aggregation tools in ArcGIS Pro (commercial software, https://www.esri.com) and XCube (an open-source data cube project developed for the Copernicus programme, https://github.com/dcs4cop/xcube) were used for comparison. To simulate the spatiotemporal query tasks of users, we randomly generated 1000 spatial query domains (with areas of 0.5–200 × 10⁴ km²) and 360 temporal query domains (periods of 1 to 360 months) on a global scale for Dataset A. A total of 1000 spatial query domains (with areas of 1–400 km²) in China were generated for Dataset B. In addition, the spatiotemporally coupled domains were prepared in a shapefile, including two temporal fields (SOS and EOS) representing the start and end, respectively, of the plant growing season. The query response and concurrent performance of the three methods were tested under different query conditions. psutil was used to monitor memory usage, CPU utilization, and disk I/O load during the experiments (https://github.com/giampaolo/psutil), whereas pyRAPL was employed to record the power consumption of the CPU and RAM in the querying process (https://github.com/powerapi-ng/pyRAPL).
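
The measurement setup can be reproduced roughly as follows; the decorated function is a placeholder for one aggregation query, and pyRAPL requires a Linux machine that exposes Intel RAPL counters.

```python
import time

import psutil
import pyRAPL

pyRAPL.setup()                          # needs Intel RAPL counters (Linux)

@pyRAPL.measureit                       # records and prints CPU/DRAM energy per call
def run_query():
    time.sleep(0.2)                     # stand-in for one aggregation query

proc = psutil.Process()
io0 = psutil.disk_io_counters()
run_query()
print("CPU %:", psutil.cpu_percent(interval=None))
print("RSS MB:", proc.memory_info().rss / 2**20)
print("read MB:", (psutil.disk_io_counters().read_bytes - io0.read_bytes) / 2**20)
```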

5.3. Cube building

For Dataset A, the spatial pyramid of the HCube instance (called HCube-A) was set to a 9-level (L0–L8) S2 cell pyramid. The spatial coverage of a cell at the finest level, L8, is approximately 1000 km². The temporal pyramid has two levels: month and year. The number of histogram bins was set to 10, 20, and 40; thus, the bin widths were 0.1, 0.05 and 0.025, respectively. The Cube Builder first generates the cuboids at the finest granularity level and then aggregates them at the upper levels. We recorded the cube building time and data reduction ratio for different S2 cell sizes and numbers of histogram bins.
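
A minimal sketch of building one finest-level cuboid for a single time slice is shown below, again using the pure-Python s2sphere package in place of the official S2 bindings; the input arrays, bin edges, and names are assumptions. It mirrors the per-pixel geocoding described here, which is also why this step dominates the building time.

```python
import numpy as np
import pandas as pd
import s2sphere

def build_finest_cuboid(lats, lons, values, time_code, level=8, bins=20):
    """Geocode each valid pixel to an S2 cell at the finest level and histogram
    the NDVI values per cell; returns a sparse <S, T, V> frequency table."""
    records = []
    for lat, lon, v in zip(lats, lons, values):
        if np.isnan(v):
            continue                                        # skip missing (e.g. ocean) pixels
        cell = s2sphere.CellId.from_lat_lng(
            s2sphere.LatLng.from_degrees(float(lat), float(lon))).parent(level)
        records.append((cell.to_token(), v))
    df = pd.DataFrame(records, columns=["S", "value"])
    edges = np.linspace(0.0, 1.0, bins + 1)                 # NDVI histogram bin edges
    out = []
    for s_code, grp in df.groupby("S"):
        freq, _ = np.histogram(grp["value"], bins=edges)
        for v_idx, f in enumerate(freq):
            if f:                                           # keep the table sparse
                out.append((s_code, time_code, v_idx, int(f)))
    return pd.DataFrame(out, columns=["S", "T", "V", "freq"]).set_index(["S", "T", "V"])
```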

Because S2 uses quadtree division, the number of level-N cells is four times that of level-(N-1) cells. Figure 8(a) shows that the data reduction ratio decreased rapidly as the space cells became finer but still reached 125 times (∼312 MB) at the L8 level. In contrast, the time overhead gradually increased, because the construction of the finest cuboid requires traversing every pixel of the raw data, which is time-consuming. A small bin width could improve the accuracy of data queries, but the time overhead would increase exponentially while generating sparser matrices (Figure 8(b)). Therefore, an appropriate bin width should be selected according to the actual accuracy requirements. The default configuration for the performance tests was as follows: number of bins = 20 and cell size = L8.

Figure 8. Cube building time and data reduction ratio in the case of different S2 cell sizes (a) and different numbers of histogram bins (b).


Owing to the high data reduction ratio, HCube data can be quickly shared or transferred between local and remote systems. We deployed the cube data in the AliCloud Object Storage Service (https://www.aliyun.com). When HCube starts, loading the remote cube in this test is delayed by approximately 10 s relative to local loading (the result depends on the volume of the cube data and the network speed), but this delay has no impact on the subsequent query performance because all the query data are fetched from local in-memory storage after the engine starts.

For Dataset B, with its higher spatial resolution, an HCube instance (called HCube-B) was built with a spatial pyramid of 16-level (L0–L15) S2 cells. The spatial coverage of the finest cell is approximately 0.08 km². There is only one member (year) in the time dimension, and the number of histogram bins was set to 20. Similar to Dataset A, cube building first entails geocoding each pixel one by one, starting at L15, which is a relatively time-consuming process (approximately 65.4 h). A chunking strategy was adopted to avoid memory overflow during data aggregation, which also required considerable time. The data size of HCube-B is nearly 15.2 GB, and the data reduction ratio reaches 86.2 times.

5.4. Performance comparison

For comparison, ArcGIS Pro can support spatial and temporal aggregation of multidimensional data separately but does not implement spatiotemporal aggregation functions. ArcPy was used to call the zonal statistics function to complete the testing task. The other comparison solution, XCube, developed using Python and Xarray (https://xarray.dev), is a suitable benchmark with built-in time-series analysis functions. The tests described below were performed to verify the performance of the proposed model.

5.4.1. Query response time

First, the query response performance of the three methods with increasing space and time dimensions was tested on HCube-A. In the spatial aggregation test, the spatial query domain was set to the 1000 random cases, whereas the temporal query domain was a randomly selected 20-year period starting from March 1, 1982. In the temporal aggregation test, 360 random cases were set for the temporal query domain, whereas a region with an area of approximately 80 × 10⁴ km² was selected as the spatial query domain. The query tasks were executed serially, and the response times of the individual queries were recorded (Figure 9).

Figure 9. Response time of the individual aggregation queries with the increase in the space and time dimension on HCube-A. (a) ∼ (c) Performance of HCube and XCube with space growth. (d) ∼ (f) Performance of HCube and XCube with time growth. (g) ∼ (h) Performance of ArcGIS with space and time growth.


Among the three methods, HCube achieves the best performance, and its response time varies only slightly with space and time dimension growth, remaining between ∼0.2 and 0.3 s. The mean and variance functions of XCube perform better than those of HCube for small query areas and periods (below 40 × 10⁴ km² and 200 months, respectively, in this case), but the response time increases linearly with the query domain. The performance of the XCube median function is consistently lower than that of HCube and fluctuates significantly because the computational complexity of the median depends not only on the data size but also on the data distribution. In the ArcGIS test, the performance of the median and variance is consistent with that of the mean; therefore, we only show the performance of the mean in Figure 9(g) and (h). Notably, the performance fluctuates with space change, distributed between ∼8 and 14 s, but gradually increases over time. The response time of ArcGIS includes the writing time of intermediate results, in addition to the overhead of the software itself, which cannot be estimated. In summary, the stable performance of HCube shows that the preaggregation query-oriented solution is highly scalable in terms of managing changes in query domains.

Second, to verify the performance of the model under scenarios with a higher spatial resolution, we also determined the query response time of HCube-B with space and time dimension growth. In the spatial aggregation test, the 1000 spatial query domains randomly generated before were used, and the total 22-year period was adopted as the temporal query domain. In the temporal aggregation test, a region with an area of approximately 100 km2 was selected for the spatial query domain, and the time step was set to 1 year.

The results show that there is no obvious dependence of the query performance on time and space growth. The individual query response time of HCube-B remains stable at approximately 0.4 s, although the performance is slightly lower than that of HCube-A (Figure 10). The key metrics related to this performance are the cell number, geocoding speed and indexing latency. In the query process, any spatial query domain is converted into a limited number (≤500 in our case) of S2 cells, regardless of the spatial resolution of the data. For higher-spatial-resolution data, a smaller spatial query domain generates finer levels of S2 cells. Due to the high performance of Google S2, the difference in geocoding speed between fine and coarse cells is extremely small. Therefore, multidimensional indexing latency becomes the major factor behind the above query performance differences: the exponential growth in fine spatiotemporal cells in the data table of HCube-B leads to increased indexing complexity and hence latency.

Figure 10. Response time of the individual aggregation queries along with space and time dimension growth on HCube-B. (a) ∼ (c) Performance with space growth. (d) ∼ (f) Performance with time growth.


5.4.2. Resources and energy consumption

The groupby operation of STIA involves two dimensions simultaneously, which increases the pressure on memory and computation. Based on HCube-A, we assessed the system resources and energy consumption of the different methods under concurrent scenarios, including the response time, CPU utilization, memory usage, I/O load, and RAPL energy consumption. Because ArcGIS does not support STIA, we compared only XCube and HCube. Both involve the use of multiprocessing with a number of concurrent CPU cores.

Figure 11 shows the system resource overheads and their variations during the experiment. The results show that HCube can achieve twice the response performance and energy efficiency of XCube for the same concurrent tasks. XCube uses a lazy-load mode, which does not require loading the cube entirely into memory, and chunks of data can be read from the hard disk in real time, but the significant increase in I/O volume can cause computational latency problems in the case of a large number of requests. There are basically no data I/O operations in HCube queries because of its in-memory mode. The CPU and memory utilization curves show that the overall memory consumption of HCube is not high because the volume of the aggregated cube data is relatively small. The computationally complex steps of HCube, building the aggregation trees and querying the histograms, occur at the early stage. The aggregation computation does not consume many computational resources, and the average CPU utilization is always lower than that of XCube. Among the three operators, the variance exhibits the highest computational complexity, and the corresponding CPU utilization is approximately 45% with the highest energy consumption. We also recorded the response times of the individual queries in the above experiments, and the results showed that the performance of individual queries in the concurrent environment exhibits a slight loss for both options relative to the serial environment (Figure 12). However, HCube consistently outperformed XCube, by a factor of up to 4.2, in all spatial domain cases.

Figure 11. Performance of the concurrent STIA queries. (a) ∼ (f) show the resource overhead (Energy, CPU, Mem, Read and Write) and response time in the querying process. (g) ∼ (h) show the comparative curves of CPU and memory utilization in the aggregation process for the mean.


Figure 12. Variation in the response time along the space domain for the individual STIA query tasks in a concurrent environment, where (a) Agg = mean, (b) Agg = median, and (c) Agg = variance.


At the time of writing, we could not find any other Python implementation of STCA, so we evaluated the STCA function in HCube and compared the results with those of STIA. The experiment queried the NDVI averages during all plant-growing seasons in the selected climate zones in the United States from March 1982 to March 2012. The results were averaged over three replicate experiments. The observations include the response time, memory overhead, CPU utilization, I/O load, and energy consumption (Figure 13). Compared with the STIA queries, the STCA queries include the additional processes of lookup-table querying and grouped aggregation, thus increasing the computation and memory overhead, which leads to lower performance.

Figure 13. Performance comparison of STCA and STIA. (a) Response time. (b) Energy consumption. (c) CPU utilization. (d) Memory usage.


5.5. Uncertainty analysis

Owing to the histogram-based estimation algorithm and the grid-based spatial approximation, there are errors between the query results and the real values. Without accessing the actual data in a cell, we cannot obtain the correct spatiotemporal distribution of the data; therefore, the error is uncertain. The distribution of the errors over space, time and histogram bins was investigated and analysed. HCube-A was used as the test case to ensure comparability and consistency of the validation results because it covers a global spatial extent and has a longer time series. In the above experiments, we recorded the errors between the real values and the query values for several aggregation types (sum, mean, median, and variance) (Figure 14). The results showed that the errors gradually decrease with area growth, which indicates that reducing the EC proportion reduces the uncertainty effect of the approximate expression. In contrast, the errors fluctuate significantly but do not gradually decrease with time dimension growth. The obvious fluctuations during the period from 50 to 200 months might be due to the high heterogeneity of the NDVI distribution. Among the four aggregation functions, the median performed the worst, as it is susceptible to uncertainty in the data distribution.

Figure 14. Errors of the aggregates (sum, mean, median, and variance) with space and time dimension change. (a) ∼ (d) show the error variations with space growth. (e) ∼ (h) show the error variations with time growth. Relative errors are used in the case of the sum and variance.


In contrast, the histogram granularity directly affects the query accuracy and exerts a considerable uncertainty impact on holistic measurement estimation. Figure 15 shows that the finer the histogram granularity, the smaller the error in the query results. From an application point of view, the appropriate histogram granularity depends on the accuracy we can accept, which is determined during cube building. Therefore, error-based evaluation should consider the actual accuracy requirements. In addition, because the Google S2 grid is distorted at high latitudes, the areas of the grid cells differ between latitudes. The results showed that the error tended to increase with increasing latitude. Among the aggregates, the median exhibited the most pronounced effect, with a maximum of approximately 0.32 in the high-latitude region and a minimum of approximately 0.04 in the low-latitude region for 10 histogram bins. This error distribution is also strongly related to the distribution of the data variables within the grid cells.

Figure 15. Error distribution of the aggregates (sum, mean, median, and variance) with histogram granularity (number of bins = [10, 20, 40]) and geographical latitude. Among the aggregates, rsum is the result of the sum after boundary correction. Note that the error value is the absolute value of the relative error.


6. Discussion

OLAP and preaggregation are classic techniques in data warehousing, but their applications in the EO domain are mainly focused on data management and visualization, and little research has been conducted in the field of data analysis. HCube is a SOLAP model based on a generic EO data cube structure to support exploratory analysis. HCube can significantly reduce the dependence of EO data aggregation on the computing infrastructure, providing a lightweight and green solution. This is reflected in three aspects: (i) Small size. Frequency histograms with high data reduction ratios can be easily stored, shared, and analysed in local or remote storage environments. (ii) Fast response time. HCube converts spatiotemporal aggregation into multidimensional queries, significantly reducing the computation pressure and I/O latency. (iii) Low energy consumption. The CPU power consumption, memory usage, and I/O throughput during an aggregation query are significantly lower than those of real-time aggregation computation. Therefore, HCube demonstrates suitable scalability for large-scale datasets. As a SOLAP model, HCube can be integrated into EO platforms. It can also be regarded as a localized EO data cube, thus complementing existing global EO platforms such as the Open Data Cube, EO Earth Cube, and GeoCube (Sudmanns et al. 2022).

An obvious cost of this approach is that preaggregation results in additional computation and storage overheads. However, compared with scenarios in which users frequently request computational tasks, this approach involves only one computation at the beginning, which can greatly reduce the computational overhead of subsequent queries. The expected computation size must be balanced against the query performance requirements of the application. For example, a coarser cube granularity (histogram bins, grid level, and time granularity) results in a higher query speed and less storage. One possible optimization approach is to identify hot-spot regions based on the user query history and generate histogram information only for these regions while performing real-time aggregation in cold-spot regions.

Another cost of this approach is that the query results are subject to errors and uncertainties, which originate from two main sources (Figure 16): (i) The first source is the algorithmic error of the aggregation functions, especially the histogram estimation of the holistic aggregation functions. According to the equations in Table 2, a finer cube granularity reduces the impact of these errors. (ii) The other source is the spatial approximation by grid cells, including cell deformation errors, cell accuracy, and uncertainty in the variable distribution within a cell. The Google S2 grid is efficient but suffers from large area distortion in high-latitude regions, which can easily lead to spatial expression errors. The differences between various DGGSs have been explored (Kmoch et al. 2022), but the impact of the various grids on interactive query analysis must be further evaluated within the context of application examples, and the modifiable areal unit problem should be further considered. In addition, we assumed that the distribution of the variables in the edge cells (ECs) is homogeneous with respect to the actual query regions. Therefore, the area share was used to estimate the sum/count of the actual EC part, whereas all the data in the edge cells were used to estimate the median and variance instead of only the actual part. However, there is often spatial heterogeneity in the variables in the ECs, which leads to uncertainty in the estimated values, and this uncertainty varies with the proportion of ECs in the query domain. One possible optimization mechanism is to obtain the actual data in the ECs precisely and combine the preaggregation query of ICs with the real-time computation of ECs. However, this mechanism involves the collaboration of computational and query tasks, which requires additional computational overhead.

Figure 16. Factors influencing the query errors and uncertainties in HCube.


The EO data applicable to our model mainly refer to global remote sensing reanalysis datasets with long time series, including various Earth environment indicators that are extracted, fused, and calculated from raw EO data. With the rapid development of remote sensing technology, high-spatial-resolution reanalysis datasets will gradually become an important data source for fine assessment and local analysis. Thus, supporting high spatial resolutions is an important development trend for data cubes. Our Cube Builder implementation still suffers from bottlenecks for high-spatial-resolution data. For example, the Cube Builder performs S2 geocoding on each pixel, which is inefficient. This results in a relatively long cube-building process that could face enormous challenges when processing global high-resolution data. From a methodological perspective, our cube query mechanism exhibits favourable scalability, and the aforementioned indexing latency can be relieved via a distributed method. However, our current implementation is limited to local PCs and lacks a distributed in-memory architecture. Addressing these challenges is a direction we must consider in the next version.

The geographical environment is a complex system of interconnected and interacting geographical elements, and it is important to explore both global and local patterns of geographic events and phenomena. Therefore, in the process of human–computer interaction, it is necessary to perform multiple rounds of progressive mining, observation, and analysis based on different spatiotemporal regions, scales, and rule constraints while integrating domain knowledge to discover valuable trends (Kopp et al. 2019; Liu and Heer 2014). Our proposed model was designed for such exploratory analysis scenarios. Therefore, the accuracy requirements are not necessarily high, and the user must predefine the cube parameters and evaluate the uncertainty risk to meet the application requirements. Spatiotemporal aggregation is a fairly basic geographic analysis method, and the multidimensional data model and query framework proposed in this study can be incorporated into other spatial analysis and machine learning algorithms to provide similar methodologies for exploratory queries of more complex measurement data and geographic knowledge.

7. Conclusion

Frequent spatiotemporal aggregation in the analysis of massive EO data places heavy demands on computational performance and resource consumption. Existing online computing platforms for EO data rely on distributed architectures, which exhibit high energy consumption and are unsustainable. In this paper, we systematically categorized the types of spatiotemporal aggregation and proposed a histogram cube model based on SOLAP and a DGGS from a green computing perspective. In this model, spatiotemporal aggregation computations on EO data are converted into multidimensional query requests on the histogram cube, enabling interactive queries on massive EO data in a lightweight manner on a local PC.

We implemented the prototype system in Python based on Google S2, and global NDVI aggregation analysis was used as a case study. Comparison tests with XCube and ArcGIS showed that the model greatly reduces the EO data size and achieves real-time responses to interactive aggregation tasks with low resource consumption. The model can flexibly support various types of aggregation, including STCA, providing a powerful and efficient tool for exploratory geographic analysis. However, the model exhibits inherent uncertainty and incurs some loss of accuracy, which should be carefully evaluated by the user. Our method is suitable for interactive EO data query scenarios with low accuracy requirements, frequent queries, and limited resources.

In future work, we will improve and optimize the data model and its implementation in three respects: (i) explore DGGSs with smaller deformations, such as rHEALPix (Gibb Citation2016), for constructing the spatial pyramids to improve the accuracy of the aggregated results; (ii) establish a collaborative mechanism between the preaggregation query and dynamic aggregation computation processes to reduce the error contributed by ECs to the aggregation results; and (iii) migrate the model to a distributed in-memory architecture to support large-scale EO data aggregation applications.

Acknowledgements

We express our gratitude to the anonymous reviewers and editors for their valuable comments, which helped improve the quality of this paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The vegetation index and phenology datasets used in this study are openly available at https://vip.arizona.edu and http://www.nesdc.org.cn. The climate zone data used in the tests are available from the National Centers for Environmental Information of the United States at https://www.ncei.noaa.gov. Additional materials that support the findings of this study are available on request.

Additional information

Funding

This work was supported by Key Laboratory of National Geographic Census and Monitoring, Ministry of Natural Resources, China: [Grant Number 2020NGCM05]; Natural Science Foundation of Shaanxi Province, China: [Grant Number 2020JQ-413].

References

  • Appel, M., F. Lahn, W. Buytaert, and E. Pebesma. 2018. “Open and Scalable Analytics of Large Earth Observation Datasets: From Scenes to Multidimensional Arrays Using SciDB and GDAL.” ISPRS Journal of Photogrammetry and Remote Sensing 138: 47–56. https://doi.org/10.1016/j.isprsjprs.2018.01.014.
  • Baumann, P., P. Mazzetti, J. Ungar, R. Barbera, D. Barboni, A. Beccati, L. Bigagli, et al. 2016. “Big Data Analytics for Earth Sciences: The EarthServer Approach.” International Journal of Digital Earth 9 (1): 3–29. https://doi.org/10.1080/17538947.2014.1003106.
  • Bédard, Y., S. Rivest, and M. J. Proulx. 2007. “Spatial Online Analytical Processing (SOLAP): Concepts, Architectures, and Solutions from a Geomatics Engineering Perspective.” In Data Warehouses and OLAP: Concepts, Architectures and Solutions, edited by Wrembel Robert, 298–319. Hershey: IGI Global.
  • Bimonte, S., M. Zaamoune, and P. Beaune. 2017. “Conceptual Design and Implementation of Spatial Data Warehouses Integrating Regular Grids of Points.” International Journal of Digital Earth 10 (9): 901–922. https://doi.org/10.1080/17538947.2016.1266040.
  • Cardoso dos Santos, J. F., D. Van Der Zande, and N. Youdjou. 2020. “The Data Cube System to EO Datasets: The DCS4COP Project.” EGU General Assembly 2020: 21634. https://doi.org/10.5194/egusphere-egu2020-21634.
  • Chen, N., H. Li, and L. Wang. 2009. “A GIS-Based Approach for Mapping Direct Use Value of Ecosystem Services at a County Scale: Management Implications.” Ecological Economics 68 (11): 2768–2776. https://doi.org/10.1016/j.ecolecon.2008.12.001.
  • Cheng, W., X. M. Qian, S. W. Li, H. B. Ma, D. S. Liu, F. Q. Liu, J. L. Liang, and J. Hu. 2022. “Remote Sensing Cloud Computing Platform Development and Earth Science Application.” National Remote Sensing Bulletin 25 (2): 220–230. https://doi.org/10.11834/jrs.20210447.
  • Gao, F., P. Yue, Z. Cao, S. Zhao, B. Shangguan, L. Jiang, L. Hu, Z. Fang, and Z. Liang. 2022. “A Multi-Source Spatio-Temporal Data Cube for Large-Scale Geospatial Analysis.” International Journal of Geographical Information Science 36 (9): 1853–1884. https://doi.org/10.1080/13658816.2022.2087222.
  • Gibb, R. G. 2016. “The rHEALPix Discrete Global Grid System.” IOP Conference Series: Earth and Environmental Science 34 (1): 012012.
  • Gómez, L., S. Haesevoets, B. Kuijpers, and A. A. Vaisman. 2009. “Spatial Aggregation: Data Model and Implementation.” Information Systems 34 (6): 551–576. https://doi.org/10.1016/j.is.2009.03.002.
  • Gorelick, N., M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. 2017. “Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone.” Remote Sensing of Environment 202: 18–27. https://doi.org/10.1016/j.rse.2017.06.031.
  • Gray, J., S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. 1997. “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals.” Data Mining and Knowledge Discovery 1: 29–53. https://doi.org/10.1023/A:1009726021843.
  • Gutierrez, A. G., and P. Baumann. 2010. “Using Preaggregation to Speed up Scaling Operations on Massive Spatio-Temporal Data.” In Conceptual Modeling – ER 2010, edited by J. Parsons, M. Saeki, P. Shoval, C. Woo, and Y. Wand, 188–201. Berlin: Springer.
  • Haynes, D., S. Manson, and E. Shook. 2017. “Terra Populus’ Architecture for Integrated Big Geospatial Services.” Transactions in GIS 21 (3): 546–559. https://doi.org/10.1111/tgis.12286.
  • Hu, F., C. Yang, J. L. Schnase, D. Q. Duffy, M. Xu, M. K. Bowen, T. Lee, and W. Song. 2018. “ClimateSpark: An In-Memory Distributed Computing Framework for Big Climate Data Analytics.” Computers & Geosciences 115 (2018): 154–166. https://doi.org/10.1016/j.cageo.2018.03.011.
  • Killough, B., A. Siqueira, and G. Dyke. 2020. “Advancements in the Open Data Cube and Analysis Ready Data – Past, Present and Future.” IGARSS 2020 – 2020 IEEE International Geoscience and Remote Sensing Symposium 2020 (1): 3373–3375.
  • Kmoch, A., I. Vasilyev, H. Virro, and E. Uuemaa. 2022. “Area and Shape Distortions in Open-Source Discrete Global Grid Systems.” Big Earth Data 6 (3): 256–275. https://doi.org/10.1080/20964471.2022.2094926.
  • Kogan, F. N. 1995. “Application of Vegetation Index and Brightness Temperature for Drought Detection.” Advances in Space Research 15 (11): 91–100. https://doi.org/10.1016/0273-1177(95)00079-T.
  • Kopp, S., P. Becker, A. Doshi, D. J. Wright, K. Zhang, and H. Xu. 2019. “Achieving the Full Vision of Earth Observation Data Cubes.” Data 4 (3): 94. https://doi.org/10.3390/data4030094.
  • Li, K., and G. Li. 2018. “Approximate Query Processing: What is New and Where to Go?” Data Science and Engineering 3 (4): 379–397. https://doi.org/10.1007/s41019-018-0074-4.
  • Li, J., L. Meng, F. Z. Wang, W. Zhang, and Y. Cai. 2014. “A Map-Reduce-Enabled SOLAP Cube for Large-Scale Remotely Sensed Data Aggregation.” Computers & Geosciences 70 (2014): 110–119. https://doi.org/10.1016/j.cageo.2014.05.008.
  • Lins, L., J. T. Klosowski, and C. Scheidegger. 2013. “Nanocubes for Real-Time Exploration of Spatiotemporal Datasets.” IEEE Transactions on Visualization and Computer Graphics 19 (12): 2456–2465. https://doi.org/10.1109/TVCG.2013.179.
  • Liu, Z., and J. Heer. 2014. “The Effects of Interactive Latency on Exploratory Visual Analysis.” IEEE Transactions on Visualization and Computer Graphics 20 (12): 2122–2131. https://doi.org/10.1109/TVCG.2014.2346452.
  • Lopez, I. V., R. T. Snodgrass, and B. Moon. 2005. “Spatiotemporal Aggregate Computation: A Survey.” IEEE Transactions on Knowledge and Data Engineering 17 (2): 271–286. https://doi.org/10.1109/TKDE.2005.34.
  • Mennis, J. 2010. “Multidimensional Map Algebra: Design and Implementation of a Spatio-Temporal GIS Processing Language.” Transactions in GIS 14 (1): 1–21. https://doi.org/10.1111/j.1467-9671.2009.01179.x.
  • Nativi, S., P. Mazzetti, and M. Craglia. 2017. “A View-Based Model of Data-Cube to Support Big Earth Data Systems Interoperability.” Big Earth Data 1 (1-2): 75–99. https://doi.org/10.1080/20964471.2017.1404232.
  • Pahins, C. A., S. A. Stephens, C. Scheidegger, and J. L. Comba. 2017. “Hashedcubes: Simple, Low Memory, Real-Time Visual Exploration of Big Data.” IEEE Transactions on Visualization and Computer Graphics 23 (1): 671–680. https://doi.org/10.1109/TVCG.2016.2598624.
  • Rinner, C., and M. Hussain. 2011. “Toronto’s Urban Heat Island – Exploring the Relationship Between Land Use and Surface Temperature.” Remote Sensing 3 (6): 1251–1265. https://doi.org/10.3390/rs3061251.
  • Sample, J. T., and E. Loup. 2010. Tile-based Geospatial Information Systems: Principles and Practices. New York: Springer.
  • Sudmanns, M., H. Augustin, B. Killough, G. Giuliani, D. Tiede, A. Leith, F. Yuan, and A. Lewis. 2022. “Think Global, Cube Local: An Earth Observation Data Cube’s Contribution to the Digital Earth Vision.” Big Earth Data 7 (3): 831–859. https://doi.org/10.1080/20964471.2022.2099236.
  • Tomlin, C. D. 1994. “Map Algebra: One Perspective.” Landscape and Urban Planning 30 (1-2): 3–12. https://doi.org/10.1016/0169-2046(94)90063-9.
  • Wang, Z., N. Ferreira, Y. Wei, A. S. Bhaskar, and C. Scheidegger. 2017. “Gaussian Cubes: Real-Time Modeling for Visual Exploration of Large Multidimensional Datasets.” IEEE Transactions on Visualization and Computer Graphics 23 (1): 681–690. https://doi.org/10.1109/TVCG.2016.2598694.
  • Xu, D., Y. Ma, J. Yan, P. Liu, and L. Chen. 2018. “Spatial-feature Data Cube for Spatiotemporal Remote Sensing Data Processing and Analysis.” Computing 102 (2020): 1447–1461. https://doi.org/10.1007/s00607-018-0681-y.
  • Zacharatou, E. T., H. Doraiswamy, A. Ailamaki, C. T. Silva, and J. Freire. 2017. “GPU Rasterization for Real-Time Spatial Aggregation Over Arbitrary Polygons.” Proceedings of the VLDB Endowment 11 (3): 352–365. https://doi.org/10.14778/3157794.3157803.
  • Zhang, J., S. You, and L. Gruenwald. 2015. “Efficient Parallel Zonal Statistics on Large-Scale Global Biodiversity Data on GPUs.” In Proceedings of the 4th International ACM SIGSPATIAL Workshop on Analytics for Big Geospatial Data, edited by Varun Chandola, 35–44. New York, NY: ACM.