Review Article

A review of prediction models for E. coli in urban surface waters

Pages 539-548 | Received 22 Aug 2023, Accepted 25 Jan 2024, Published online: 13 Feb 2024

ABSTRACT

Urban surface water is increasingly used for contact recreation. Predicting Escherichia coli (E. coli) concentrations in these waters can support early warning of bathers and explain the dynamics of this faecal pollution indicator. This study provides the first overview of the scientific knowledge on E. coli prediction models for freshwater in cities. Modelling techniques for urban waters are comparable to those for other freshwater environments, with multiple linear regression being the most frequently used approach. While previously reviewed E. coli prediction models for freshwater beaches predominantly target lakes, urban models mainly target rivers. We found indications that model performance for urban rivers is lower than for recreational beach water in rivers in general. Reported performance metrics indicate that not all relevant sources are captured by the models. Future research should address the lack of insight into model performance for specific applications and verify the suggested directions to improve the accuracy of the models.

1. Introduction

Urban surface waters are increasingly used for recreational bathing. In Europe, not only has the number of designated bathing sites increased, but swimming outside designated sites is also increasingly popular (EEA 2020). A study of 32 popular surface water swimming sites in the Netherlands and Belgium that are not designated as bathing water shows that many of these waters are located in cities (De Jong et al. 2022). Urban waters are also increasingly used for swimming events (Hintaran et al. 2018) such as the Big Jump in Ghent, Belgium, the Amsterdam City Swim, and Swim-In Leiden, The Netherlands (Van der Meulen et al. 2023). A study in Amsterdam, The Netherlands, and Toronto, Canada, showed that demand for this type of recreational use of urban water is expected to increase over the next decades due to population growth, urban regeneration near water, climate change and the popularity of outdoor sports (Van der Meulen et al. 2020).

Faecal pollution is one of the key water quality issues that limit suitability for swimming. Faecal pollution can cause health problems including gastrointestinal diseases, skin irritation, and eye infection (Davies-Colley, Valois, and Milne 2018; Mallin et al. 2000; Soller et al. 2010). The most common indicator for faecal pollution in freshwater is the bacterium E. coli (Azevedo Lopes et al. 2016; EC 2006; Nagels, Davies-Colley, and Smith 2001; Van der Meulen et al. 2022). At designated bathing sites, the common monitoring programs for E. coli consist of weekly or biweekly sampling and culture-based analysis, in line with the European Bathing Water Directive. The limitation of this approach is that water managers have no information about the E. coli concentration between sampling moments, while E. coli concentrations can show high temporal variability (Boehm et al. 2002; Myers, Koltun, and Francy 1998). New technologies enable automated sampling and analysis, which can increase the monitoring frequency (Angelescu et al. 2018), but scholars state that implementation of these techniques is hampered by high costs and technical challenges (Dorevitch et al. 2017; Griffith and Weisberg 2011; Seifert-Dähnn et al. 2021).

As an alternative to higher monitoring frequencies, prediction models are developed to support early warning of bathers and to improve understanding of the temporal dynamics of bathing water quality. Prediction models that use parameters that can be measured rapidly and easily can improve the reliability of early warning systems at bathing sites and offer a cost-saving alternative to high-frequency monitoring of E. coli. Prediction models also have the advantage that they provide real-time information (De Brauwere, Ouattara, and Servais 2014; Naloufi et al. 2021).

There is a growing body of scientific literature on E. coli modelling studies for surface waters. Many studies target recreational beach waters in coastal areas (e.g. Bedri et al. 2014; Choi, Chan, and Lee 2022; Thoe and Lee 2014) or inland freshwater beach waters (e.g. Francy et al. 2013; Madani and Seth 2020; Nevers et al. 2009). The general performance of faecal indicator bacteria prediction models for all types of surface waters, based on the frequently used multiple linear regression (MLR), is highly variable, and scholars state that performance is not always acceptable (De Brauwere, Ouattara, and Servais 2014). Machine learning approaches such as artificial neural networks are less frequently used but seem to perform better than regression-based models (De Brauwere, Ouattara, and Servais 2014; Heasley et al. 2021; Li et al. 2022). It is important to distinguish models for different types of water systems (such as marine waters, estuaries, rivers or lakes) because pollution sources and the processes of decay and transport differ (De Brauwere, Ouattara, and Servais 2014; Dwivedi, Mohanty, and Lesikar 2013). Heasley et al. (2021) reviewed modelling approaches and model performance for microbial water quality in freshwater recreational beach waters. Most (75%) of the reviewed models target lakes. The study shows that faecal indicator bacteria prediction models provide a better indication of real-time water quality than an estimate based on the previous day’s measurements (the so-called persistence model approach).

It is valuable to review E. coli prediction models specifically for urban areas. In cities with abundant water, a significant portion of the surface water system consists of rivers or canals. Therefore, current generic insights about E. coli prediction models in recreational freshwaters, which mainly involve lakes (Heasley et al. 2021), cannot be assumed valid for urban surface water. Urban waters also differ from rural water systems in terms of sources of E. coli. While diffuse sources (mainly wild animals and livestock) are often predominant in rural areas, point sources such as combined sewer overflows (CSOs), effluents from wastewater treatment plants (WWTPs), and runoff collected by a separate sewer system dominate in urban areas (De Brauwere, Ouattara, and Servais 2014). A study in the Portland metropolitan area, USA, indicated that an urban watershed has a more complex stormwater infrastructure and more variable sources of E. coli than suburban and rural watersheds, which leads to a lower predictability of E. coli concentrations in the most urbanized watershed (Chen and Chang 2014). This larger variety of E. coli sources, which also have a high temporal variability, may influence the performance of E. coli prediction models. There is currently no overview of E. coli prediction studies for freshwaters specifically in cities and, consequently, no insight into the capabilities of such models in highly urbanized settings.

The aim of this study is to give an overview of the available scientific knowledge on E. coli prediction models for freshwater surface waters in cities. The first objective of this review is to provide an overview of the state of the art and relevant studies for future research on E. coli prediction in urban surface water. The second objective is to determine whether general insights with respect to performance and applicability of E. coli models for freshwater are valid for surface water in cities. Finally, this review aims to provide directions for future research.

2. Method

2.1. Literature search and selection

The literature search has been conducted in Scopus in March 2022 using the following search query: (TITLE-ABS-KEY ((E. coli) OR (Escherichia AND coli) OR (e. AND coli) AND NOT (drink* OR food*)) AND TITLE ((water* OR pool* OR river* OR pond* OR lake*) AND (forecast* OR predict* OR model*))).

The articles that describe at least one E. coli prediction model for freshwater surface waters have been included in our database for review. Models targeting other water types such as groundwater, water in drainage pipes or drinking water are excluded.

The first literature search yielded 305 articles about E. coli prediction models for surface water. Because this selection contained many irrelevant publications, we refined the search by discarding 217 publications that do not include ‘forecast’ or ‘predict’ in the title; 60 articles were discarded because they did not meet the inclusion criteria, were duplicates or review papers, were not in English, or did not provide sufficient information about the modelling technique, study area, and input or output variables. We performed a detailed assessment of 31 articles that passed the selection to identify publications that describe at least one E. coli prediction model for surface water in cities.

After more detailed reading, we found 10 articles that include at least one E. coli prediction model for freshwater in cities (Table 1). All articles were published recently, during the last 12 years (2010–2022). In most articles, multiple draft models for one site are tested to find the best combination of explanatory variables. Since we aim to assess the performance of prediction models in cities and compare it with the performance of prediction models for other environments, we assessed the final, best-performing model for each site. In total, we identified 25 models in this way.

Table 1. Overview of publications that describe at least one prediction model for E. coli in city surface water.

To identify study areas in cities, we compared the locations for which the models are developed with the boundaries of ‘urban centres’ from the Global Human Settlement-Urban Centres database of the Joint Research Centre (Florczyk et al. 2019; https://ghsl.jrc.ec.europa.eu/ucdb2018visual.php, accessed on 29 September 2022). Urban centres are defined as ‘the spatially-generalized high-density clusters of contiguous grid cells of 1 km² with a density of at least 1,500 inhabitants per km² of land surface or at least 50% built-up surface share per km² of land surface, and a minimum population of 50,000’. We use this definition of cities because it enables a focus on dense urban areas. It is preferred over a definition of a city based on built-up area or administrative boundaries (UNHABITAT 2020) because such a definition may include rural areas and small, low-density settlements.
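As a rough illustration of how the quoted urban-centre criteria combine, the sketch below checks a cluster of grid cells against the three thresholds. It is not the procedure used to build the database: the function names and the input layout are hypothetical, each cell is assumed to be exactly 1 km² of land, and contiguity of the cells is assumed to have been established beforehand.

```python
def is_urban_centre_cell(inhabitants_per_km2, built_up_share):
    """Cell-level criterion: density of at least 1,500 inhabitants per km2
    OR a built-up surface share of at least 50% (given as a fraction)."""
    return inhabitants_per_km2 >= 1500 or built_up_share >= 0.5


def is_urban_centre(cells, min_population=50_000):
    """Cluster-level check for a list of (inhabitants, built_up_share) tuples.

    Assumptions (not from the source): every cell covers 1 km2 of land, so
    inhabitants per cell equals density, and the cells are already known
    to be contiguous.
    """
    if not all(is_urban_centre_cell(pop, share) for pop, share in cells):
        return False
    return sum(pop for pop, _ in cells) >= min_population


# Hypothetical cluster: 30 cells with 2,000 inhabitants each.
example_cluster = [(2000, 0.3)] * 30
qualifies = is_urban_centre(example_cluster)
```

A study site would then count as located in a city when it falls inside a cluster for which this check holds.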

2.2. Characterization of the models

For each publication that includes one or more models for study sites in cities, following the urban centre definition, we have extracted information about the model’s techniques, variables, application context and performance. This information has been collected in an Excel database that includes a record for each model. The following characteristics have been collected:

  • Application context and purpose: Water body type, country, city, name of the water body, bathing water designation status, purpose of the model

  • Modelling approach: Modelling technique, input variables, output (continuous or categorical)

  • Performance: Reported performance metrics and scores

The results have been used to provide an overview of the available literature. Next, we have compared the characteristics of E. coli prediction models for urban freshwaters to models for other water body types. Finally, we have identified knowledge gaps and opportunities to improve performance of the models as input for suggested future research.

3. Results

3.1. Application context and purpose

Most models are developed for locations in the USA (Table 1), with 14 models described in 6 articles. Four models (in 1 publication) target locations in Canada, 3 models (in 2 publications) target locations in Europe and 4 models (in 1 publication) target locations in Asia. The models target 10 cities in 8 different urban conglomerations. Most models (23 out of 25) target a river or creek; 2 models target lake water; there are no models for canals or ditches. The targeted lakes are designated bathing waters (Figure 1). This designation status is derived from statements in the publications that the sites are popular beach waters, that bathing water quality parameters are regularly monitored, or that beaches have been closed due to poor water quality (Li et al. 2022; Madani and Seth 2020). Rivers and creeks are not designated as bathing water, or their status is unclear. A strong indication that the sites are not formally designated as bathing water is that the publications make no mention of a designated status, actual recreation activities or regular monitoring of bathing water quality parameters. In one publication (Chen and Chang 2014), the authors report that the site ‘serves recreation’; we labelled the designation status of this site as unclear.

Figure 1. The number of locations that are formally designated as bathing site.


The purpose of the predictive models is not always clear from the publications, but the authors at least mention why predictive models are needed or relevant. Predictive models provide insight into ‘fate and transport processes’, ‘processes’, ‘causes’ or ‘conditions’ that influence E. coli concentrations in surface water (Chen and Chang 2014; Choi et al. 2012; Desai and Rifai 2010; Herrig et al. 2019; Madani and Seth 2020). Many authors make explicit reference to bathing water management or health protection as the application context for predictive models. Some give general statements about the use of predictive models: ‘assist the stakeholders in the daily management of bathing sites’ (Naloufi et al. 2021), ‘to protect public health’ (Choi et al. 2012), ‘provide predictive decision-making support for effective public health management’ (Madani and Seth 2020). More specific potential applications of the models are to use the outcomes for ‘the development of effective management controls on pathogen contamination in surface water so that they are suitable for (...) recreation’, or ‘to regulate the monitoring (…) for the concentration of E. coli (…)’ (Chen and Chang 2014). Several authors also suggest using the prediction model to inform stakeholders about health risks: ‘the timely E. coli data may be compared with the water quality criteria for body contact’ (Chen and Chang 2014), ‘reliable early warning systems are demanded by the EBWD to reduce the risk of exposure’ (Herrig et al. 2019), ‘to inform the public of the health risk.’ (Li et al. 2022), ‘the prediction of bacteria counts and their use in informing the potential safety/hazard of that water body for recreational activities’ (Rossi et al. 2020). Several authors present predictive models as an alternative to intensive monitoring and analyses of E. coli. They refer to advantages of predictive models over traditional or innovative monitoring techniques with respect to costs (Li et al. 2022; Madani and Seth 2020; Naloufi et al. 2021; Rossi et al. 2020), efficiency (Choi et al. 2012), sophistication and technical challenges (Madani and Seth 2020) and timely or real-time information (Li et al. 2022; Naloufi et al. 2021; Rossi et al. 2020).

These statements show two main purposes of predictive models for E. coli in city water: 1) to gain insight into causes of variability in E. coli concentrations (system understanding) and 2) to obtain estimated E. coli concentrations when no measured values are available (prediction). Insight into causes of E. coli variability can be used to develop measures that influence sources of E. coli or the processes that determine its fate and transport, and to target monitoring. Predicted E. coli concentrations can be used to inform stakeholders in a timely manner about risks related to faecal contamination at bathing sites (early warning).

3.2. Modelling approaches

3.2.1. Modelling techniques and output

The modelling approaches that are described for freshwater in cities can be divided into data-driven empirical models and process-based models. Data-driven models describe the relationship between independent input variables and the dependent output variable based on statistical relations in a dataset that includes both. The input variables are selected based on statistical analysis of their importance. Process-based models explicitly include a mechanistic understanding of sources of E. coli and of the processes of decay and transport. Most (23 of 25) E. coli prediction models for freshwater in cities are data-driven models (Table 2). Within this group, most models are regression models. Two models are developed with machine learning techniques (Li et al. 2022; Naloufi et al. 2021). We found two examples, in the same publication, of process-based models (Choi et al. 2012).
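To make the idea of a data-driven regression model concrete, the following minimal sketch fits a multiple linear regression by ordinary least squares using only the Python standard library. It is an illustration under stated assumptions, not a model from any of the reviewed publications: the two predictors (log10 flow rate and rainfall), the data values and the function names `fit_mlr` and `predict` are all hypothetical.

```python
def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X: list of predictor rows (without an intercept column).
    y: list of responses, e.g. log10 E. coli concentrations.
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + list(r) for r in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(col + 1, p):
            f = A[k][col] / A[col][col]
            for j in range(col, p + 1):
                A[k][j] -= f * A[col][j]
    # Back substitution
    b = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(A[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (A[i][p] - tail) / A[i][i]
    return b


def predict(b, row):
    """Apply fitted coefficients to one row of predictor values."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], row))


# Hypothetical observations: (log10 flow rate, rainfall in mm) -> log10 E. coli
X = [(0.5, 0.0), (0.7, 2.0), (0.9, 10.0), (1.1, 0.0), (1.3, 25.0), (0.6, 5.0)]
y = [2.0, 2.3, 3.1, 2.4, 3.8, 2.5]
coefficients = fit_mlr(X, y)
estimate = predict(coefficients, (1.0, 8.0))
```

The fitted coefficients only summarize statistical relations in the training data; unlike a process-based model, nothing in this sketch represents sources, decay or transport explicitly.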

Table 2. The applied modelling approaches are divided into data-driven models and process-based models.

In one model (Rossi et al. 2020), the location (a choice between two sample locations) is used as an independent input variable. In some papers, multiple models are developed for one site; for example, Chen and Chang (2014) developed a dry weather and a wet weather model for each location.

Most models have E. coli concentrations as output (continuous output). Three models for locations in the USA produce a binary output that shows whether or not the E. coli concentration is expected to exceed a set target value (categorical output). The applied target values refer to the beach action values that were recommended by the USEPA (1986): 126 CFU/100 ml for the 30-day geometric mean (used by Li et al. 2022; Rossi et al. 2020) and 235 CFU/100 ml for single samples (used by Rossi et al. 2020).
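The two target values mentioned above can be turned into a categorical output as sketched below. The threshold numbers are those quoted in the text; the function names and the sample values are hypothetical, and the sketch assumes that all concentrations are strictly positive and that the list passed to the geometric-mean check already covers the relevant 30-day window.

```python
import math

SINGLE_SAMPLE_LIMIT = 235.0   # CFU/100 ml, single-sample value from the text
GEOMETRIC_MEAN_LIMIT = 126.0  # CFU/100 ml, 30-day geometric mean from the text


def geometric_mean(values):
    """Geometric mean of strictly positive concentrations."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def exceeds_single_sample(concentration):
    """Binary output for one measured or predicted concentration."""
    return concentration > SINGLE_SAMPLE_LIMIT


def exceeds_geometric_mean(concentrations_30d):
    """Binary output for the samples collected within a 30-day window."""
    return geometric_mean(concentrations_30d) > GEOMETRIC_MEAN_LIMIT


# Hypothetical samples in CFU/100 ml
samples = [50.0, 400.0, 200.0]
flag_single = [exceeds_single_sample(c) for c in samples]
flag_window = exceeds_geometric_mean(samples)
```

A model with continuous output can be converted to categorical output in the same way, by comparing its predicted concentration with the chosen target value.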

3.2.2. Input variables

The input variables can be divided into four categories: meteorological, water quality, hydrological and ‘other’. Most models (18 of 21) include a combination of meteorological, water quality and/or hydrological variables, in some cases with additional other variables. The most frequently used group of input variables is hydrological variables (in 76% of the models), followed by water quality (68%) and meteorological variables (56%), see Table 3. Other variables are used in a minority (32%) of the models. Examples of meteorological variables are rainfall, UV radiation and wind. Water quality refers to variables like turbidity, total suspended solids (TSS) and water temperature. Hydrological variables include, for example, flow rate, wave height and wave direction. Other variables include time of year, day number, and location (Table 4). The three most frequently used variables are flow rate (17 models, 6 articles), rainfall (13 models, 7 articles) and turbidity (11 models, 5 articles). The next most widely used variables are TSS, water temperature and conductivity (each used in 9 models and in 3, 4 and 4 publications, respectively).

Table 3. The percentage of models that include specific groups of input variables.

Table 4. Specification of the used variables per category.

The process-based models by Choi et al. (2012) for a creek in Gwangju, South Korea, are based on the assumption that E. coli concentrations are mainly influenced by weather conditions and non-point sources of E. coli such as surface runoff. Separate models are developed for wet and for dry weather conditions. The models include a hydrodynamic water balance module and an advection-dispersion-reaction module. The latter calculates E. coli transport and E. coli decay by sunlight, die-off and settling, and it includes resuspension of sediment as a source of E. coli.
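The reaction part of such a process-based model can be illustrated with a deliberately simplified sketch. It is not the model of Choi et al.: it assumes a single well-mixed water volume (so advection and dispersion are left out), first-order loss terms with constant rates, and a constant resuspension input. All parameter names and values are hypothetical.

```python
import math


def ecoli_after(c0, hours, k_dieoff, k_sunlight, k_settling, resuspension=0.0):
    """Concentration after `hours` for dC/dt = -k * C + resuspension.

    c0:            initial concentration (CFU/100 ml)
    k_*:           first-order loss rates (1/h), assumed constant
    resuspension:  constant source term (CFU/100 ml per hour), assumed
    """
    k = k_dieoff + k_sunlight + k_settling
    if k == 0:
        # No losses: the concentration grows linearly with the source term.
        return c0 + resuspension * hours
    c_eq = resuspension / k  # level where losses balance the source
    return c_eq + (c0 - c_eq) * math.exp(-k * hours)


# Hypothetical rates: total loss rate of 0.1 per hour, no resuspension.
remaining = ecoli_after(1000.0, 10.0, k_dieoff=0.04, k_sunlight=0.04, k_settling=0.02)
```

In a full advection-dispersion-reaction model, a term like this would be evaluated per river segment and time step alongside the transport terms, with sunlight-driven decay varying over the day.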

3.2.3. Data requirements

The datasets that are used to develop the models vary greatly in terms of the length of the period covered and the measurement frequency. For the development of 13 out of the 25 reviewed models, only data from the summer period are used (Madani and Seth 2020; Desai and Rifai 2010; Naloufi et al. 2021; Rossi et al. 2020; some models in Chen and Chang 2014). For the other models, full-year or winter data are used, or the period is not specified (Fisher et al. 2011; Herrig et al. 2019; Choi et al. 2012; Jagupilla et al. 2020; Li et al. 2022; some models in Chen and Chang 2014). The availability of E. coli training data is usually the most limiting factor, as E. coli has the lowest measurement frequency and the shortest monitoring period. Data coverage ranges from 12 weeks (Desai and Rifai 2010) to 8 years (Chen and Chang 2014), and measurement frequency ranges from hourly (Choi et al. 2012) to monthly (Chen and Chang 2014). For the development of most models (15/21), a training dataset is used with E. coli data from multiple years. For about half of the models (12/25), the dataset has a measurement frequency of at least 5 days per week. For 5 models, weekly data are used, and the other 8 models are developed with monthly data (or data that are otherwise less frequent than weekly).

For water quality, hydrological and meteorological variables, the data frequency is higher; in most cases, measurements are at least daily and often more frequent. The monitoring locations for these variables are not always the same as for the E. coli monitoring. Meteorological and flow data in particular are often obtained from the nearest monitoring stations.

Some authors provide recommendations with respect to data requirements. Madani and Seth (2020) recommend reviewing the validity of an MLR model every year and developing a new model if needed, based on data from the preceding 2 years. This is based on their finding that a model based on a 2-year dataset performed marginally better than a model based on 3 or 4 years of data. They also recommend that the 2-year dataset includes water quality data, including E. coli, with a frequency of 5 days per week over the summer months. Herrig et al. (2019) advise using at least 1 year of data for the development of a Bayesian MLR model. They also stress that the dataset should cover the different environmental conditions that occur at a site, such as high flow and low flow. Desai and Rifai (2010) used a training dataset of only 12 weeks and concluded that the dataset was useful to highlight variables that impact water quality but that data from a longer period are needed to explain variability in E. coli concentrations.

3.3. Performance

The reviewed publications provide a variety of model performance metrics to describe how accurately the models predict E. coli concentrations (continuous model output) or categorical output values. The most frequently reported performance metric is R², which indicates the proportion of the variation in the measured E. coli concentrations that is explained by the model. Reported R² values range from 0.13 to 0.87. About half (n = 11) of all (n = 21) reported R² values are < 0.5. For two models that target lake water, accuracy is reported, with values of 78% and 88% (Madani and Seth 2020; Li et al. 2022, respectively). This means that in most cases, the E. coli concentration is correctly predicted to be above or below a set target value. Another reported performance metric is the mean absolute percentage error (MAPE). In Naloufi et al. (2021), MAPE is 53.2%, meaning that the predicted E. coli values deviate from the measured values by 53.2% on average. Naloufi et al. (2021) consider MAPE values < 50% as reasonable, < 20% as good, and > 50% as inaccurate. The mean absolute errors of the wet weather and dry weather models by Choi et al. (2012), representing the difference between predicted and measured log values for E. coli, are 0.22 and 0.14, respectively.
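The metrics named in this section can be computed as in the sketch below, which uses their common textbook definitions. The reviewed publications may differ in details (for example, whether values are log-transformed first), so the function names, the example data and the choice of threshold are assumptions for illustration only.

```python
def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, relative to the variation in the observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mape(observed, predicted):
    """Mean absolute percentage error in percent (observations must be nonzero)."""
    errors = [abs(o - p) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(errors) / len(errors)


def mae(observed, predicted):
    """Mean absolute error, e.g. on log10-transformed concentrations."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def accuracy(observed, predicted, threshold):
    """Share of cases where prediction and observation fall on the same
    side of the target value."""
    hits = sum((o > threshold) == (p > threshold) for o, p in zip(observed, predicted))
    return hits / len(observed)


# Hypothetical concentrations in CFU/100 ml
obs = [100.0, 200.0, 300.0, 400.0]
pred = [150.0, 180.0, 260.0, 420.0]
scores = {
    "R2": r_squared(obs, pred),
    "MAPE": mape(obs, pred),
    "MAE": mae(obs, pred),
    "accuracy": accuracy(obs, pred, threshold=235.0),
}
```

Note that MAPE is an average relative deviation, whereas accuracy counts correctly classified cases; a model can therefore have a high MAPE and still classify most samples correctly with respect to a target value.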

4. Discussion

4.1. Limited available literature on E. coli prediction models for urban waters

This study provides the first overview of E. coli prediction models for freshwaters in cities. Our review shows that the number of scientific publications (10 articles) is low, also compared to the 53 publications on E. coli prediction models for freshwater recreational beaches that Heasley et al. (2021) reviewed. This may relate to the relatively recent attention for E. coli modelling in urban water. The oldest publication in a review of faecal indicator bacteria (FIB, with E. coli as indicator in 87% of cases) models for freshwater inland beach water (Heasley et al. 2021) is from 2000, while the oldest publication that we found for urban water was published in 2010.

The number of available studies is especially limited outside the USA. The overrepresentation of modelling studies for the USA is in line with the results of Heasley et al.’s (2021) review of FIB prediction models for freshwater recreational beach waters. However, the dominance of studies for the USA is less prominent in our analysis, with 60% of the publications targeting the USA versus 83% in the review by Heasley et al. (2021). Notably, the available studies all target study areas that have a temperate or continental climate according to the Köppen-Geiger Climate Classification (Peel, Finlayson, and McMahon 2007) and that are located in high-income countries according to the World Bank (https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups). Given the likely impact of differences in climate and wastewater management on water quality dynamics, the insights from the reviewed studies may not be valid for low-income countries and/or tropical regions.

Notably, most urban models target rivers or creeks, while the recreational beach waters from the review by Heasley et al. (2021) are mainly located in lakes. This difference is likely caused by the fact that most formally designated bathing sites involve lake water. This is supported by our review. The two study sites at lakes are designated bathing waters (Li et al. 2022), while all other study sites, in rivers, are not designated as bathing waters or their status is unclear (Figure 1). The absence of modelling studies for canals points to a lack of knowledge for this type of waterbody. It is relevant to address this knowledge gap since urban canals are currently used for swimming (De Jong et al. 2022; Van der Meulen et al. 2020, 2023) and this use is expected to increase (Van der Meulen et al. 2020). The lack of E. coli modelling studies for urban canals relates to a generic lack of water quality research in manmade waters (Koschorreck et al. 2020).

We expect that in the near future more knowledge will be developed about the performance of prediction models based on machine learning and for other regions in the world. However, the number of publications on E. coli prediction models for urban water remains low. After our review in March 2022, we found two relevant additional articles, reporting on data-driven machine learning models in the USA (Nafsin and Li 2023) and a process-based model in Peru (Mori-Sánchez et al. 2023). For future reviews, inclusion of grey literature may identify models for more urban areas and more water body types than those described in peer-reviewed scientific publications. This would also require assessment of grey literature in more languages.

4.2. Model performance and usefulness for different purposes

Based on the reported performance metrics, we cannot draw sound conclusions on whether the models are good enough for their intended use. This is related to two difficulties. First, the purpose of the models is usually not explicitly stated in the reviewed publications (section 3.1). The performance of the models should be considered in the context of their purpose. For example, if the model is used to identify factors that determine E. coli concentrations, a lower accuracy may be acceptable than when the model is used to warn bathers about exceedance of a target value. Notably, prediction models are often proposed as an alternative to intensive monitoring that is limited by costs and, in the case of innovative technologies, also by the sophistication of the methods (Li et al. 2022; Madani and Seth 2020; Rossi et al. 2020). However, the reviewed publications do not compare the costs of the model with monitoring costs, nor do they compare the levels of expertise required. Second, the reported performance metrics are especially useful to compare alternative prediction models for a specific study site, but these metrics cannot be used to decide whether the model accuracy is good enough for specific modelling purposes because there is no set target.

The most frequently reported performance metric is R². While higher R² values are better than lower values, we cannot compare the performance of models for different locations based on this metric because it is influenced by the degree of variation in the measured E. coli concentrations. However, R² values lower than 0.5 seem insufficient for effective use of the models in early warning or other daily management practices. Low R² values also indicate that the model does not capture all relevant sources of E. coli. This is the case for 11 of the 21 models with reported R² values (section 3.3). This share is higher than the share of R² values < 0.5 reported for freshwater recreational beaches at rivers (3/11) but lower than that for lakes (21/35) in Heasley et al. (2021). These findings indicate that model performance for urban rivers is lower than for recreational beach water in rivers in general. This may be caused by the quality of the training data and input data or by a higher number of E. coli sources and/or higher temporal variability in urban areas. Further research is needed to examine these potential causes of low performance.
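The comparison of shares in this paragraph is simple arithmetic, which the short sketch below makes explicit. The counts are those quoted in the text; the dictionary labels are chosen here for illustration.

```python
# (number of R2 values below 0.5, number of reported R2 values)
low_r2_counts = {
    "urban models (this review)": (11, 21),
    "river beaches (Heasley et al. 2021)": (3, 11),
    "lake beaches (Heasley et al. 2021)": (21, 35),
}

# Share of low R2 values per group, as a percentage
low_r2_share = {
    group: 100.0 * low / total for group, (low, total) in low_r2_counts.items()
}

for group, share in low_r2_share.items():
    print(f"{group}: {share:.0f}% of reported R2 values below 0.5")
```

The urban share (about 52%) lies between the shares for river beaches (about 27%) and lake beaches (60%). With groups this small, the differences should be read as indications rather than statistically tested results.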

For the two models that target lake water, the accuracy values of 78% and 88% (Madani and Seth 2020; Li et al. 2022, respectively) are in the same order of magnitude as the average accuracy of 81% that Heasley et al. (2021) reported for inland beach water models. Comparing reported accuracy values for lake models and river models from Heasley et al. (2021) shows no generic significant difference between the accuracy of river models (average: 81.0%, range: 66–93.8%) and lake models (average: 80.8%, range: 54–100%).

4.3. Directions for improving data-driven prediction models

With two exceptions, all reviewed models are data-driven models. The advantage of data-driven models is that they are applicable in situations where sources and processes are unknown or uncertain. The most frequently used modelling technique is MLR, which is in line with the findings by Heasley et al. (2021) for freshwater recreational waters. As several studies show that machine learning approaches perform better than regression-based models (De Brauwere, Ouattara, and Servais 2014; Heasley et al. 2021; Li et al. 2022), we recommend further assessing whether these approaches can improve the predictive capacity for E. coli in city water. The currently limited number of publications about machine learning approaches for E. coli prediction in freshwater prevents sound conclusions about their performance.

The development of accurate data-driven models requires a large amount of data, and the models are extremely site-specific (Francy et al. 2013). The training dataset of the regression-based models should have the appropriate temporal coverage (period of the measurements) and temporal resolution (frequency of the measurements) to cover the variety of circumstances in the water system. The recommendations by authors of the reviewed articles, based on the performance of their E. coli prediction models when using alternative training datasets (section 3.2.3), cannot be generalized. The required temporal resolution of the dataset depends on the variability of the input and output variables at the study site. As E. coli concentrations and/or the correlations with the independent variables show seasonal variability (Durham et al. 2016; Fluke, González-Pinzón, and Thomson 2019), we advise including seasonal variation in the models or developing a model based on data that cover the months for which the model output is used. For example, in northwestern Europe, bathing activities are concentrated in the warmer summer months. However, some of the reviewed models were developed with training datasets that include other seasons (section 3.2.3), which may lower their performance during the relevant season. The importance of representative training data is also stressed by Herrig et al. (2019), who state that the dataset should cover the different environmental conditions that occur at a site, such as high flow and low flow.
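Restricting the training data to the months in which the model output is used can be sketched as follows. The bathing season of May to September, the variable names and the sample values are assumptions made for illustration only; they are not taken from any of the reviewed studies.

```python
from datetime import date

# Assumed bathing season (May to September); adjust to the local season.
BATHING_SEASON_MONTHS = {5, 6, 7, 8, 9}

# Hypothetical samples: (sampling date, log10 E. coli concentration).
samples = [
    (date(2021, 1, 14), 2.9),
    (date(2021, 3, 3), 2.7),
    (date(2021, 5, 20), 2.1),
    (date(2021, 6, 15), 2.4),
    (date(2021, 7, 8), 3.0),
    (date(2021, 8, 22), 2.2),
    (date(2021, 10, 5), 2.8),
    (date(2021, 11, 30), 3.1),
]


def in_bathing_season(sample_date, months=BATHING_SEASON_MONTHS):
    """Return True if the sample was taken in one of the target months."""
    return sample_date.month in months


# Keep only the samples that represent the application season.
training_set = [(d, value) for d, value in samples if in_bathing_season(d)]

print(f"{len(training_set)} of {len(samples)} samples fall in the bathing season")
```

Filtering reduces the size of the training dataset, so in practice the gain in representativeness has to be weighed against the loss of data. The alternative mentioned above, including season as an explanatory variable, keeps all samples.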

The regression-based models provide insight into the influence of meteorological, water quality and hydrological conditions on E. coli concentrations in surface water. This gives an indirect indication of potentially relevant sources and processes of decay and transport. For example, the positive correlation between rainfall and E. coli in many models indicates potential input from animal faeces in surface runoff or from CSOs. None of the reviewed models include independent variables that are directly linked to specific well-known sources of E. coli in urban water such as WWTP discharge volumes, bird counts (direct input of faeces or through surface runoff), bather counts (direct input and resuspension of polluted sediment), and CSO discharge volumes or events. We recommend assessing whether model performance improves if well-known sources of E. coli in urban water are directly included in the models. This is also recommended by Herrig et al. (2019) and Li et al. (2022). Future research may also compare different models for the same site to better understand why some models perform better than others.
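The recommended assessment can be sketched as a comparison of two MLR models for the same site, one with rainfall only and one that also includes a CSO event indicator. The sketch below is a minimal illustration under stated assumptions: the data are synthetic and invented for this example, the variable names are hypothetical, and adjusted R² is used because plain R² never decreases when a predictor is added. It does not reproduce the method of any reviewed study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Ordinary least squares with intercept, via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[p] * xi[q] for xi in x) for q in range(k)] for p in range(k)]
    xty = [sum(xi[p] * yi for xi, yi in zip(x, y)) for p in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, rows):
    """Apply fitted coefficients (intercept first) to predictor rows."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]


def adjusted_r2(y, y_hat, n_predictors):
    """R2 penalized for the number of predictors in the model."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Synthetic, illustrative data (not measurements).
rain_mm = [0, 2, 5, 12, 0, 20, 8, 15, 1, 25]
cso_event = [0, 0, 0, 1, 0, 1, 0, 0, 0, 1]  # 1 = overflow on sampling day
log_ecoli = [2.05, 1.96, 2.25, 3.16, 1.95, 3.50, 2.14, 2.50, 2.03, 3.50]

rows_rain = [(r,) for r in rain_mm]
rows_both = list(zip(rain_mm, cso_event))

adj_rain_only = adjusted_r2(
    log_ecoli, predict(fit_mlr(rows_rain, log_ecoli), rows_rain), 1)
adj_with_cso = adjusted_r2(
    log_ecoli, predict(fit_mlr(rows_both, log_ecoli), rows_both), 2)

print(f"adjusted R2, rainfall only:  {adj_rain_only:.2f}")
print(f"adjusted R2, rainfall + CSO: {adj_with_cso:.2f}")
```

With real data, the comparison would preferably be made on an independent validation dataset rather than on the training data, because a better fit to the training data does not guarantee better predictions.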

4.4. Directions for improving process-based prediction models

For assessment of the impact of changes in the water system, e.g. climate change or water management measures, process-based models are required (De Brauwere, Ouattara, and Servais 2014). In our review, we found only two process-based models, which is not unexpected as De Brauwere, Ouattara, and Servais (2014) also found that process-based models are often applied at catchment scales and that they are mainly targeted at rural areas. Process-based models require a mechanistic understanding of sources of E. coli, and processes of decay and transport. Identification of point sources in an urban environment is, however, challenging. Some sources of faecal pollution are known, such as WWTPs and sewer overflows. Still, estimating the input from such sources is uncertain due to short-term fluctuations in both discharge and FIB concentration (De Brauwere, Ouattara, and Servais 2014). Other potential sources, such as sewer leakages and failures in sewage pipes, are by their nature not registered, but they may be a significant source of faecal pollution. For example, the percentage of lots where wastewater is discharged through storm water drainage pipes is estimated to be one to several percent in The Netherlands (Lieftink, Boogaard, and Langeveld 2020). A sound process-based prediction model also requires insight into sediment dynamics, as resuspension of FIB from sediments can be a significant source of E. coli in surface water (De Brauwere, Ouattara, and Servais 2014; Desai and Rifai 2010). To enhance generic insight into potentially significant sources of E. coli, we recommend emission research in different types of urban areas that vary in terms of geography, climate, water use, land use and wastewater treatment systems.

5. Conclusions

This is the first review of E. coli prediction models for urban water. With 10 articles describing at least one E. coli prediction model for freshwater in cities, the available literature on models for this specific environment is limited compared to the number of publications on recreational freshwater beaches, which are usually located outside cities. Most targeted urban waters are rivers that are not officially designated as bathing waters. Although urban canals are increasingly used for contact recreation, no models for this type of urban water body were found. Another notable observation is the lack of publications on E. coli prediction models for cities in tropical regions and low-income countries.

This study provides novel insights into the similarities and differences in modelling approaches and performance between models for urban waters and models for other water body types. The modelling techniques for urban waters are comparable to those for other freshwater environments, with MLR being the most frequently used approach. Data-driven models based on machine learning technology and process-based models are scarce. The main difference from previously reviewed E. coli prediction models for freshwater beaches is that models for urban water mainly target rivers while models for recreational beach waters mainly target lakes. Comparison of reported performance metrics indicates that model performance for urban rivers is lower than for recreational beach water in rivers in general. Given the limited number of urban models, this indication should be regarded as a hypothesis for further research rather than a sound conclusion on performance.

With this review, we identified major knowledge gaps that should be addressed in future research to strengthen the still limited knowledge base for urban water quality issues. The main knowledge gaps and research needs relate to a lack of insight into the performance of the models for specific application purposes and to opportunities for improving accuracy. Future research should explicitly review the performance of the models for specific application purposes such as early warning for bathers, identifying sources, or predicting the impact of changes in the water system on E. coli concentrations. This requires a clear definition of the required accuracy for specific applications. Also, the added value of the prediction models needs to be reviewed. While some scholars propose prediction models as a cost-efficient and easier alternative to high-frequency monitoring of E. coli, none of the reviewed publications includes a comparison of costs and feasibility between these two approaches. This is an important knowledge gap, given the frequently repeated claims about the advantages of models over monitoring.

To improve the accuracy of the models, more research is needed on modelling techniques, data requirements and sources of E. coli. Indications that machine learning technology performs better than traditional regression techniques can only be verified if more models are developed with machine learning. It is expected that the accuracy of prediction models will improve if the temporal coverage and frequency of training data and input data better reflect the conditions of the targeted application season and the variability in water quality. The accuracy of E. coli prediction models may also improve with better insight into sources that are currently not captured by the models. Better insight into sources, fate and transport can improve the selection of input variables for data-driven models and enhance process-based models. The low R² values in at least half of the models indicate that not all relevant sources are captured by the models and/or that input data or training datasets do not capture the variable environmental conditions at the study sites.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The work was supported by the Amsterdam Institute for Advanced Metropolitan Solutions and Deltares.

Notes

1. USEPA published an updated advice in 2012 with a geometric mean of 126 cfu per 100 mL and a Statistical Threshold Value (STV) of 410 cfu per 100 mL for E. coli. The STV refers to the 90th percentile of the water quality distribution.

References

  • Angelescu, D. E., V. Huynh, A. Hausot, G. Yalkin, V. Plet, J.-M. Mouchel, S. Guérin-Rechdaoui, S. Azimi, and V. Rocher. 2018. “Autonomous System for Rapid Field Quantification of Escherichia coli in Surface Waters.” Journal of Applied Microbiology 126 (1): 332–343. https://doi.org/10.1111/jam.14066.
  • Azevedo Lopes, F. W., R. J. Davies-Colley, E. Von Sperling, and A. P. Magalhaes. 2016. “A Water Quality Index for Recreation in Brazilian Freshwaters.” Journal of Water and Health 14 (2): 243–254. https://doi.org/10.2166/wh.2015.117.
  • Bedri, Z., A. Corkery, J. J. O’Sullivan, M. X. Alvarez, A. Chr. Erichsen, L. A. Deering, K. Demeter, G. M. P. O’Hare, W. G. Meijer, and B. Masterson. 2014. “An Integrated Catchment-Coastal Modelling System for Real-Time Water Quality Forecasts.” Environmental Modelling & Software 61:458–476. https://doi.org/10.1016/j.envsoft.2014.02.006.
  • Boehm, A. B., S. B. Grant, J. H. Kim, S. L. Mowbray, C. D. McGee, C. D. Clark, D. M. Foley, and D. E. Wellman. 2002. “Decadal and Shorter Period Variability of Surf Zone Water Quality at Huntington Beach, California.” Environmental Science & Technology 36 (18): 3885–3892. https://doi.org/10.1021/es020524u.
  • Chen, H. J., and H. Chang. 2014. “Response of Discharge, TSS, and E. Coli to Rainfall Events in Urban, Suburban, and Rural Watersheds.” Environment Science Processes Impacts 16:2313. https://doi.org/10.1039/c4em00327f.
  • Choi, K. W., S. N. Chan, and J. H. W. Lee. 2022. “The WATERMAN System for Daily Beach Water Quality Forecasting: A Ten-Year Retrospective.” Environment Fluid Mech 23 (2): 205–228. https://doi.org/10.1007/s10652-022-09839-4.
  • Choi, M., Y. Park, M. Cha, K. H. Cho, and J. H. Kim. 2012. “Comparison of Numerical Schemes for Improved Prediction Model of Fecal Indicator Bacteria in a Riverine System.” Desalination and Water Treatment 38 (1–3): 373–381. https://doi.org/10.1080/19443994.2012.664409.
  • Davies-Colley, R., A. Valois, and J. Milne. 2018. “Faecal Pollution and Visual Clarity in New Zealand Rivers: Correlation of Key Variables Affecting Swimming Suitability.” Journal of Water and Health 16 (3): 329–339. https://doi.org/10.2166/wh.2018.214.
  • De Brauwere, A., N. K. Ouattara, and P. Servais. 2014. “Modeling Fecal Indicator Bacteria Concentrations in Natural Surface Waters: A Review.” Critical Reviews in Environmental Science and Technology 44 (21): 2380–2453. https://doi.org/10.1080/10643389.2013.829978.
  • De Jong, A., S. Van der Meulen, R. Melman, and A. Vaarten. 2022. Zwemmen in niet-aangewezen zwemwater; Risico’s en maatregelen (in Dutch). Delft: Deltares. https://www.deltares.nl/app/uploads/2016/09/11206881-005-BGS-0001_v1.0-Zwemmenin-niet-aangewezen-zwemwater_publicatie.pdf.
  • Desai, A. M., and H. S. Rifai. 2010. “Variability of Escherichia coli Concentrations in an Urban Watershed in Texas.” Journal of Environmental Engineering 136 (12): 1347–1359. https://doi.org/10.1061/(ASCE)EE.1943-7870.0000290.
  • Dorevitch, S., A. Shrestha, S. DeFlorio-Barker, C. Breitenbach, and I. Heimler. 2017. “Monitoring Urban Beaches with qPCR vs. Culture Measures of Fecal Indicator Bacteria: Implications for Public Notification.” Environmental Health: A Global Access Science Source 16 (1): 45. https://doi.org/10.1186/s12940-017-0256-y.
  • Durham, B. W., L. Porter, A. Webb, and J. Thomas. 2016. “Seasonal Influence of Environmental Variables and Artificial Aeration on Escherichia coli in Small Urban Lakes.” Journal of Water and Health 14 (6): 929–941. https://doi.org/10.2166/wh.2016.020.
  • Dwivedi, D., B. P. Mohanty, and B. J. Lesikar. 2013. “Estimating Escherichia coli Loads in Streams Based on Various Physical, Chemical, and Biological Factors.” Water Resources Research 49 (5): 2896–2906. https://doi.org/10.1002/wrcr.20265.
  • EC (European Commission). 2006. “Directive 2006/7/EC of the European Parliament and of the Council of 15 February 2006 Concerning the Management of Bathing Water Quality and Repealing Directive 76/160/EEC.” https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32006L0007.
  • EEA (European Environment Agency). 2020. Bathing Water Management in Europe: Successes and Challenges (EEA Report Number 11/2020). Luxembourg: Publications Office of the European Union. ISBN 978-92-9480-261-3, https://www.eea.europa.eu/publications/bathing-water-quality-2020.
  • Fisher, J. R., B. I. Dvorak, D. M. Admiraal, and A. A. Hosni. 2011. “Water Quality Prediction Models for Storm Water Runoff in an Urban Watershed.” In World Environmental and Water Resources Congress 2011: Bearing Knowledge for Sustainability, edited by R. E. Beighley and M. W. Killgore, Palm Springs, California, USA: 22-26 May 2011 Vol. 1 ISBN: 978-1-61782-987-1. https://doi.org/10.1061/41173(414)82.
  • Florczyk, A., C. Corbane, M. Schiavina, M. Pesaresi, L. Maffenini, M. Melchiorri, P. Politis, et al. 2019. “GHS Urban Centre Database 2015.” Multitemporal and Multidimensional Attributes, R2019A European Commission, Joint Research Centre (JRC) PID. Accessed September 29, 2022. https://data.jrc.ec.europa.eu/dataset/53473144-b88c-44bc-b4a3-4583ed1f547e.
  • Fluke, J., R. González-Pinzón, and B. Thomson. 2019. “Riverbed Sediments Control the Spatiotemporal Variability of E. Coli in a Highly Managed, Arid River.” Frontiers in Water 1:2019. https://doi.org/10.3389/frwa.2019.00004.
  • Francy, D. S., E. A. Stelzer, J. W. Duris, A. M. Brady, J. H. Harrison, H. E. Johnson, and M. W. Ware. 2013. “Predictive Models for Escherichia coli Concentrations at Inland Lake Beaches and Relationship of Model Variables to Pathogen Detection.” Applied and Environmental Microbiology 79 (5): 1676–1688. https://doi.org/10.1128/AEM.02995-12.
  • Griffith, J. F., and S. B. Weisberg. 2011. “Challenges in Implementing New Technology for Beach Water Quality Monitoring: Lessons from a California Demonstration Project.” Marine Technology Society Journal 45 (2): 65–73. https://doi.org/10.4031/MTSJ.45.2.13.
  • Heasley, C., J. J. Sanchez, J. Tustin, and I. Young. 2021. “Systematic Review of Predictive Models of Microbial Water Quality at Freshwater Recreational Beaches.” PLoS ONE 16 (8): e0256785. https://doi.org/10.1371/journal.pone.0256785.
  • Herrig, I., W. Seis, H. Fischer, J. Regnery, W. Manz, G. Reifferscheid, and S. Böer. 2019. “Prediction of Fecal Indicator Organism Concentrations in Rivers: The Shifting Role of Environmental Factors Under Varying Flow Conditions.” Environment Science Europe 31 (1): 59. https://doi.org/10.1186/s12302-019-0250-9.
  • Hintaran, A. D., S. J. Kliffen, W. Lodder, R. Pijnacker, D. Brandwagt, A. K. van der Bij, E. Siedenburg, et al. 2018. “Infection Risks of City Canal Swimming Events in the Netherlands in 2016.” PLoS ONE 13 (7): e0200616. https://doi.org/10.1371/journal.pone.0200616.
  • Jagupilla, S. C. K., V. Shah, V. Ramaswamy, P. Gurumurthy, and D. A. Vaccari. 2020. “Prediction of Boundary and Stormwater E. Coli Concentrations Using River Flows and Baseflow Index.” Journal of Environmental Engineering 146 (4): 04020017. https://doi.org/10.1061/(ASCE)EE.1943-7870.0001681.
  • Koschorreck, M., A. S. Downing, J. Hejzlar, R. Marce, A. Laas, W. G. Arndt, P. S. Keller, et al. 2020. “Hidden Treasures: Human-Made Aquatic Ecosystems Harbour Unexplored Opportunities.” Ambio 49 (2): 531–540. https://doi.org/10.1007/s13280-019-01199-6.
  • Lieftink, H. J., F. C. Boogaard, and J. G. Langeveld. 2020. “Kwaliteit afstromend hemelwater in Nederland; Database kwaliteit afstromend hemelwater (In Dutch).” ISBN: 978.90.5773.884.5. https://www.stowa.nl/sites/default/files/assets/PUBLICATIES/Publicaties%202020/2020-05%20STOWA%202020-05%20Kwaliteit%20van%20afstromend%20hemelwater%20in%20Nederland.pdf.
  • Li, L., J. Qiao, G. Yu, L. Wang, H.-Y. Li, C. Liao, and Z. Zhu. 2022. “Interpretable Tree-Based Ensemble Model for Predicting Beach Water Quality.” Water Research 211:118078. https://doi.org/10.1016/j.watres.2022.118078.
  • Madani, M., and R. Seth. 2020. “Evaluating Multiple Predictive Models for Beach Management at a Freshwater Beach in the Great Lakes Region.” Journal of Environmental Quality 49 (4): 896–908. https://doi.org/10.1002/jeq2.20107.
  • Mallin, M. A., K. E. Williams, E. C. Esham, and R. P. Lowe. 2000. “Effect of Human Development on Bacteriological Water Quality in Coastal Watersheds.” Ecological Applications 10 (4): 1047–1056. https://doi.org/10.2307/2641016.
  • Mori-Sánchez, O. L., L. Ramos-Fernández, W. E. Lluén-Chero, E. Pino-Vargas, and L. Flores Del Pino. 2023. “Application of the Iber Two-Dimensional Model to Recover the Water Quality in the Lurín River.” Hydrology 10:84. https://doi.org/10.3390/hydrology10040084.
  • Myers, D. N., G. F. Koltun, and D. S. Francy. 1998. Effects of Hydrologic, Biological, and Environmental Processes on Sources and Concentrations of Fecal Bacteria in the Cuyahoga River, with Implications for Management of Recreational Waters in Summit and Cuyahoga Counties, Ohio. Columbus, OH: USGS. https://doi.org/10.3133/wri984089.
  • Nafsin, N., and J. Li. 2023. “Prediction of Total Organic Carbon and E. Coli in Rivers within the Milwaukee River Basin Using Machine Learning Methods.” Environmental Science Advances 2 (2): 278. https://doi.org/10.1039/D2VA00285J.
  • Nagels, J. W., R. J. Davies-Colley, and D. G. Smith. 2001. “A Water Quality Index for Contact Recreation in New Zealand.” Water Science and Technology 43 (5): 285–292. https://doi.org/10.2166/wst.2001.0307.
  • Naloufi, M., F. S. Lucas, S. Souihi, P. Servais, A. Janne, and T. Wanderley Matos De Abreu. 2021. “Evaluating the Performance of Machine Learning Approaches to Predict the Microbial Quality of Surface Waters and to Optimize the Sampling Effort.” Water 13:1–17. https://doi.org/10.3390/w13182457.
  • Nevers, M. B., D. A. Shively, G. T. Kleinheinz, C. M. McDermott, W. Schuster, V. Chomeau, and R. L. Whitman. 2009. “Geographic Relatedness and Predictability of Escherichia coli Along a Peninsular Beach Complex of Lake Michigan.” Journal of Environmental Quality 38 (6): 2357–2364. https://doi.org/10.2134/jeq2009.0008.
  • Peel, M. C., B. L. Finlayson, and T. A. McMahon. 2007. “Updated World Map of the Köppen-Geiger Climate Classification.” Hydrology and Earth System Sciences 11 (5): 1633–1644. https://hess.copernicus.org/articles/11/1633/2007/hess-11-1633-2007.pdf.
  • Rossi, A., B. T. Wolde, L. H. Lee, and M. Wu. 2020. “Prediction of Recreational Water Safety Using Escherichia coli as an Indicator: Case Study of the Passaic and Pompton Rivers, New Jersey.” The Science of the Total Environment 714:136814. https://doi.org/10.1016/j.scitotenv.2020.136814.
  • Seifert-Dähnn, I., I. Skumlien Furuseth, G. Kofi Vondolia, G. Gal, E. de Eyto, E. Jennings, and D. Pierson. 2021. “Costs and Benefits of Automated High-Frequency Environmental Monitoring – The Case of Lake Water Management.” Journal of Environmental Management 285:112108. https://doi.org/10.1016/j.jenvman.2021.112108.
  • Soller, J. A., M. E. Schoen, T. Bartrand, J. E. Ravenscroft, and N. J. Ashbolt. 2010. “Estimated Human Health Risks from Exposure to Recreational Waters Impacted by Human and Non-Human Sources of Faecal Contamination.” Water Research 44 (16): 4674–4691. https://doi.org/10.1016/j.watres.2010.06.049.
  • Thoe, W., and J. H. W. Lee. 2014. “Daily Forecasting of Hong Kong Beach Water Quality by Multiple Linear Regression Models.” Journal of Environmental Engineering (United States) 140 (2): 04013007. https://doi.org/10.1061/(ASCE)EE.1943-7870.0000800.
  • UNHABITAT. 2020. “What is a City?” https://unhabitat.org/sites/default/files/2020/06/city_definition_what_is_a_city.pdf.
  • USEPA (U.S. Environmental Protection Agency). 1986. Ambient Water Quality Criteria for Bacteria, 4405–86001. Washington, DC EPA: Office of Water Regulations and Standards Criteria and Standards Division. https://www.epa.gov/sites/default/files/2019-03/documents/ambient-wqc-bacteria-1986.pdf.
  • Van der Meulen, E. S., N. B. Sutton, F. H. M. van de Ven, P. R. van Oel, and H. H. M. Rijnaarts. 2020. “Trends in Demand of Urban Surface Water Extractions and in Situ Use Functions.” Water Resources Management 34 (15): 4943–4958. https://doi.org/10.1007/s11269-020-02700-7.
  • Van der Meulen, E. S., F. H. M. van de Ven, P. R. van Oel, H. H. M. Rijnaarts, and N. B. Sutton. 2023. “Improving Suitability of Urban Canals and Canalized Rivers for Transportation, Thermal Energy Extraction and Recreation in Two European Delta Cities.” Ambio 52 (1): 195–209. https://doi.org/10.1007/s13280-022-01759-3.
  • Van der Meulen, E. S., P. R. van Oel, H. H. M. Rijnaarts, N. B. Sutton, and F. H. M. van de Ven. 2022. “Suitability Indices for Assessing Functional Quality of Urban Surface Water.” City and Environment Interactions 13:100079. https://doi.org/10.1016/j.cacint.2022.100079.