Bioacoustics
The International Journal of Animal Sound and its Recording
Volume 33, 2024 - Issue 2

Efficient quality assurance and quality control for passive acoustic monitoring data: reducing and documenting false-positive and false-negative errors

Pages 178-202 | Received 10 Oct 2023, Accepted 02 Mar 2024, Published online: 19 Mar 2024

ABSTRACT

Autonomous Recording Units (ARUs) are widely used to survey for a variety of taxa. This survey method allows for high spatial and temporal coverage but will typically include identification errors that can bias estimates of occupancy. In some instances, verifying all individual detections is prohibitive. To direct verification effort, we developed a model to estimate the probability that transcribers would agree on an identification. Agreement probability was positively influenced by transcriber skill, identification confidence, species commonness and some song types. In contrast, agreement probability was lower when an acoustic signal was classified as a trill. We evaluated our model on independent data where all species detections were verified, and verification effort (time) was quantified. Our model performed well at predicting transcriber agreement on independent data (AUC = 0.71). We applied the model to randomised subsets of the independent data to compare the cost benefit of three approaches to verification under varying effort. We show how modelling probability of transcriber agreement can be used to more efficiently direct verification of species acoustic tags. Our approach could be adapted elsewhere to quantify and reduce species misidentifications in unverified passive acoustic monitoring data for either manual processing or detections from automated classifiers.

Introduction

Autonomous recording units (ARUs) are increasingly used to monitor wildlife in terrestrial and marine ecosystems (Gibb et al. Citation2018). ARUs are now a common tool to survey birds and bats (Shonfield and Bayne Citation2017; Roemer et al. Citation2021) but are also used to monitor amphibians, insects, mammals and even fish (Van Parijs et al. Citation2009; Brauer et al. Citation2016; Wrege et al. Citation2017; Symes et al. Citation2022). While they have some disadvantages, such as high start-up costs and data storage demands, in most contexts ARUs perform similarly to in-person surveys with the added benefit of low costs for repeat visit samples (Darras et al. Citation2019). The use of ARUs is particularly beneficial in situations when in-person surveys have logistical challenges, such as surveying for nocturnal species or in remote areas (Darras et al. Citation2019; Drake et al. Citation2021). In addition, ARU data can complement data from traditional survey methods to increase spatial and temporal coverage (Van Wilgenburg et al. Citation2017) or improve accuracy and precision of abundance estimates (Doser et al. Citation2021). One clear advantage of ARUs is that they create a permanent, acoustic record of surveys (Rempel et al. Citation2005), providing the opportunity for species identifications to be verified so that observer biases can be assessed and/or addressed in subsequent analyses.

The availability of tools to process high volumes of data collected by ARUs has increased significantly in recent years. Tools range from acoustic diversity indices measuring general characteristics of soundscapes (Buxton et al. Citation2018) to manual transcription by skilled observers that can provide estimates of species richness and abundance from a subset of recordings. Although improvements in automated species recognition software reduce the time necessary to extract target species detections and process large volumes of ARU data (Knight et al. Citation2017; Kahl et al. Citation2021), these methods are still ineffective for busy recordings with many individuals vocalising simultaneously, and performance varies widely by species (Priyadarshani et al. Citation2018; Pérez-Granados Citation2023). In avian research, manual transcription can handle busy recordings with many overlapping signals, identify individuals and partial songs that are masked by other sounds, and produce abundance estimates, which can be combined with traditional human point count data and used to calculate density (Van Wilgenburg et al. Citation2017). While labour intensive, manual transcription by skilled observers currently provides the most parsimonious approach to extracting multi-species counts from ARU recordings.

Unlike single-observer, in-person surveys, transcription of audio recordings provides opportunities to assess observer bias and to review processed data for errors. Bias from detection errors is a known issue for abundance and occupancy estimates of wildlife, and while statistical methods to deal with false negatives are well established (Buckland et al. Citation2001; Farnsworth et al. Citation2002; Guillera-Arroita Citation2016), methods to correct for false positives (species misidentifications) are less well established, particularly outside of occupancy frameworks (Rempel et al. Citation2019; Strickfaden et al. Citation2020). Furthermore, some data products rely on unverified species detections without any statistical adjustment. For example, defining critical habitat for species at risk, a legal requirement under the federal Species at Risk Act in Canada, often relies on mapping locations of known occurrences (Environment and Climate Change Canada Citation2016). It may therefore be preferable to reduce errors directly in the original data than to rely heavily on post-hoc statistical adjustments. In addition, we target false positives because they bias data for two species simultaneously: misclassifying a site as occupied by one species also results in the species that is truly present being inappropriately recorded as absent. In a multi-species context, correcting a false positive therefore also fixes a false-negative error for the true species identity. Thus, while several approaches to reduce bias from misidentifications exist (e.g. dependent double-observer sampling within multi-species N-mixture frameworks; Golding et al. Citation2017; Hoekman Citation2021), we focus here on reducing errors in the original single-observer data. Specifically, we focus on reducing species misidentifications in unverified data obtained from manually transcribing recordings from ARUs using acoustic ‘tagging’ environments that facilitate tag review, such as WildTrax (https://www.wildtrax.ca/) and ecoSound-web (Darras et al. Citation2020).

For most monitoring programmes, comprehensively reviewing all of the detections generated from even a moderate number of recordings would be cost-prohibitive. There is therefore a need for tools to facilitate efficient review of manually transcribed acoustic data. Since the probability of misidentification varies across species and transcribers (Rempel et al. Citation2019), the ability to predict which detections in a dataset are more likely to contain misidentifications would increase the efficiency of data verification. Our objective was to model agreement on species identifications between two transcribers (as a proxy for the probability of a correct identification, following Rempel et al. Citation2019) as a function of species- and transcriber-specific covariates. We hypothesised that more skilled transcribers would agree upon species identifications more frequently. We expected rare species, species with high-pitched songs, species with very similar ‘trilling’ songs, and species identified with lower confidence to have lower agreement between observers. We then applied our model to the community-level output from WildTrax to predict the proportion of disagreements expected for a given recording or species. We used randomised sub-setting of a fully verified dataset to test whether these models can be used to target recordings or species with higher predicted error rates and correct a greater proportion of false-positive errors in the dataset without increasing effort. Finally, we use these predictions to compare the costs and benefits of alternative approaches to verification.

Materials and methods

Study area and transcription

We conducted our study using recordings collected with ARUs as part of long-term monitoring in the boreal forest of Saskatchewan and Alberta. Study sites were located in Bird Conservation Regions 6 and 8, and we used recordings taken in the summers of 2012–2014 and 2017–2021. ARUs in Saskatchewan were deployed following the Boreal Optimal Sampling Strategy (Van Wilgenburg et al. Citation2020) in grids of nine stations spaced 300 m apart. In Alberta, recordings were collected during in-person point counts, and survey locations were selected using a stratified design based on either habitat or disturbance types (Mahon et al. Citation2019). We used several models of ARUs, including Frontier Labs BAR-LTs and Wildlife Acoustics Song Meter SM2+ and SM4. All models recorded in stereo between May 28 and July 7, with factory default gain settings, a sampling rate of 44.1 kHz, and 16-bit depth. We scheduled ARUs to record ≥ six 10-min recordings from 1 h before to 4.5 h after local sunrise, and three 3-min recordings in the evening from 0.5 h before to 0.5 h after local sunset. We attempted to record ≥ four mornings of good survey conditions; however, ARUs were occasionally deployed for a single morning, or in Alberta for a single point count. We selected a stratified random draw of one 10-min and five 3-min recordings from all available recordings at each station. Specifically, we stratified by time of day (early dawn chorus: 60 min prior to sunrise until 50 min after sunrise; mid-dawn chorus: 51–150 min after sunrise; and late dawn chorus: 151–300 min after sunrise) and date of deployment (before or after the median date available for the site). We generated four to six random samples of recordings (depending on availability) for each temporal stratum and selected the first sample taken in good weather, with little to no wind or rain. Recordings made during poor weather were replaced with the next acceptable recording in the random selection.
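To make the selection procedure concrete, the following is a minimal sketch in R (the language used for our analyses) of how such a stratified draw could be implemented; the data frame `recordings` and its columns are hypothetical placeholders rather than part of our actual workflow.

```r
# Sketch of the stratified random recording draw. 'recordings' is a hypothetical
# data frame with one row per recording and columns: station, minutes_after_sunrise
# (negative = before sunrise), date, and weather_ok (TRUE if little to no wind/rain).
library(dplyr)

set.seed(2021)

selected <- recordings %>%
  group_by(station) %>%
  mutate(
    # Time-of-day strata as defined in the text
    tod_stratum = cut(minutes_after_sunrise,
                      breaks = c(-60, 50, 150, 300),
                      labels = c("early_dawn", "mid_dawn", "late_dawn")),
    # Date stratum: before vs. after the median date available at the station
    date_stratum = if_else(date <= median(date), "early_date", "late_date")
  ) %>%
  group_by(station, tod_stratum, date_stratum) %>%
  slice_sample(n = 6) %>%   # up to six random candidates per stratum
  filter(weather_ok) %>%    # drop candidates recorded in poor weather
  slice_head(n = 1) %>%     # keep the first acceptable candidate
  ungroup()
```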

Recordings were transcribed by skilled observers using WildTrax (https://www.wildtrax.ca/), an online platform for managing, storing and processing data from environmental sensors, including ARUs. The WildTrax platform allows users to view spectrograms while listening to recordings and ‘tag’ vocalisations by drawing a box on the spectrogram that encloses the acoustic signal (Figure 1). Tags provide information on the time of detection, maximum and minimum frequency, and length and amplitude of the sound within the box dimensions. When a user creates a tag, the recording is paused and metadata are entered, including the species identification, abundance, whether the transcriber feels the tag needs review, and signal type (song, call or non-vocal). Each tag is drawn closely around the signal, allowing additional observers to precisely review a given species identification. Each tag generally represented a single unique species-individual; however, WildTrax allows for an abundance value > 1 within a single tag if spectral signatures are not clear enough to draw a tag around an individual due to masking. Transcribers were randomly assigned tasks across spatial and temporal replicates to avoid observer biases. We used a count-removal approach (Farnsworth et al. Citation2002) in which we tagged each individual only the first time it was detected and could be confidently identified by the transcriber. Count-removal tagging was accomplished using the ‘tag per task limit’ or ‘1SPT’ method in WildTrax. We provided guidance to expert listeners to estimate counts of individuals based on a combination of relative signal strength, stereo effect, partial overlap of signals, timing of counter-singing events, and occasionally individual variation in song spectral characteristics. Abundance estimates derived from ARUs in this fashion correlate well with abundance estimates from point counts for most species (Van Wilgenburg et al. Citation2017; Bombaci and Pejchar Citation2018; Stewart et al. Citation2020). It is important to note, however, that experimental evidence suggests that abundance tends to be underestimated when birds are very abundant (≥ ca. 7 individuals; Drake et al. Citation2016).

Figure 1. An example of the transcription tagging interface on the WildTrax platform (https://wildtrax.ca/). A tag will most often designate one species-individual, with the boundaries of the tag enclosing the signal as closely as possible. When a new tag is created, the metadata box opens (right side of the figure), where species name, individual number, abundance estimate, and vocalisation type are recorded.

Verification

We selected three of the most skilled transcribers in each year to review and verify tags created by other transcribers, hereafter referred to as verifiers (see below for details about quantifying transcriber skill). We ensured that verifiers did not review any tags they had originally created. We randomly selected tags from the transcribed data, stratified by species, such that a minimum of 14 tags per species were each reviewed by two verifiers. Depending on the species, the 14 verified tags could include both songs and calls as well as substantial individual variation in either. We treated each review as an independent sample. We verified all tags for any species that had fewer than 14 tags in the transcribed data. Verifiers navigated to the appropriate recording and species tag in WildTrax, reviewed each tag in a provided list, and recorded whether they agreed with the species identified in the tag. If the verifier disagreed with the species identification, they recorded what species they believed to be singing or calling. We converted the data to a binomial outcome in which the verifier’s species identification for a tag was categorised as agreeing (1) or disagreeing (0) with the transcriber’s original identification. While it is possible that both transcriber and verifier could be wrong on occasion, the highly skilled verifiers will capture more misidentifications than they introduce, and for our analysis we treated an agreement between transcriber and verifier as a true positive and a disagreement as a false-positive error.

Transcriber and species traits

We created a dataset of transcriber and species traits that are likely to influence the probability of a transcriber and verifier agreeing on species identifications. As part of our contracting process, we required potential transcribers to take a challenging species identification exam of 80 songs and calls, and they had to achieve a score of ≥ 80% to be able to process data. We subsequently used transcribers’ qualifying test scores as a covariate in our analyses (mean = 89.0%, range = 81.3–97.5%). We also used the test scores to select the most skilled transcribers to be verifiers. We initially considered two other measures of transcriber skill: the number of in-person (field) point counts a transcriber had previously conducted and the number of hours of ARU transcription experience. The number of in-person point counts was positively correlated with exam scores (Pearson r = 0.734, p = 0.038). The number of transcription hours was not correlated with exam scores (Pearson r = −0.018, p = 0.966) or with transcriber/verifier agreement on a species identification, as many staff with point count experience had not previously transcribed any recordings. In a preliminary analysis of our first year of tagging data, exam score was consistently in the top models for transcriber agreement, so we used it in subsequent modelling (below) and excluded the number of in-person point counts due to the aforementioned correlation. The second transcriber trait we considered was tag confidence. WildTrax allows transcribers to report the confidence of a species identification at the tag level by checking a box for ‘Needs Review’. We inferred that a transcriber was ‘Confident’ in the identification if this box was left unchecked (n = 1443 Confident, n = 136 Needs Review).

We considered four species-specific covariates that we hypothesised would influence agreement between transcribers and verifiers. The first was the maximum frequency (kHz) of each species’ song, sourced from the supplementary material in Sólymos et al. (Citation2018). For species not present in the Sólymos et al. (Citation2018) data, we calculated the mean maximum frequency from all tags present in our WildTrax projects (mean = 5.7 kHz, range = 1.2–9.8 kHz). In addition, we considered two categorical variables related to the type of sound. First, we included a factor based on whether the signal was tagged as a Song or Call in the WildTrax tag metadata. Second, we created a Song Type variable because some song types may be more difficult to identify than others; specifically, we combined four categories of basic bird sound patterns and six basic tone patterns from Pieplow (Citation2017) to create 24 pattern-tone categories. Of these 24 categories, 15 song types were assigned to species in our dataset. We created a song type for each species by selecting the most appropriate pattern and tone for the song based on definitions adapted from Pieplow (Citation2017), presented in Appendix A1. For example, we categorised Canada Warbler songs as a whistle-warble song type and Clay-coloured Sparrow as a buzzy-series following Pieplow (Citation2017). We combined polyphonic and noisy sounds into a ‘polynoise’ pattern category on the basis of acoustic similarity. We only applied the song types to tags that were identified as songs within WildTrax; otherwise, tags remained in a call category. After generating preliminary models of transcriber/verifier agreement (see below) with all song type categories, we only retained the click-trill and polynoise-phrase categories (Figure 2) based on 95% confidence intervals that did not overlap zero; all other signal types were grouped into an ‘Other’ category, resulting in a three-level factor for song type (n = 122 click-trill, n = 67 polynoise-phrase, n = 1390 other). Finally, we quantified species rarity by dividing the number of tags for a given species by the total number of tags in the transcribed dataset.
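As an illustration, a minimal sketch of deriving the rarity and song-type covariates from a tag table is shown below; the data frame `tags` and its columns (`species`, `vocalization_type`, `song_type`) are hypothetical placeholders rather than WildTrax field names.

```r
# Sketch of building the species-level covariates. 'tags' is a hypothetical data
# frame with one row per tag and columns: species, vocalization_type ("song"/"call"),
# and song_type (one of the 15 Pieplow-based pattern-tone categories, or NA for calls).
library(dplyr)

rarity <- tags %>%
  count(species, name = "n_tags") %>%
  mutate(rarity = n_tags / sum(n_tags))   # relative frequency in the dataset

tags <- tags %>%
  mutate(
    # Collapse song types to the three retained levels; calls fall into 'other'
    song_type3 = case_when(
      vocalization_type == "song" & song_type == "click-trill"      ~ "click-trill",
      vocalization_type == "song" & song_type == "polynoise-phrase" ~ "polynoise-phrase",
      TRUE                                                          ~ "other"
    )
  ) %>%
  left_join(rarity, by = "species")
```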

Figure 2. Example spectrograms of two song types (adapted from Pieplow Citation2017) used as covariates (relative to the ‘other’ song type category) in our best approximating model. (a) Click-trill song type represented by a dark-eyed Junco and (b) polynoise-phrase song type represented by a Le Conte’s sparrow. See methods and Appendix A1 for further descriptions of song types.

Data analysis

We subset our data prior to analysis to include only species with > 5 tags reviewed by a verifier. We also removed species that were not well surveyed by point counts and for which only one or two transcribers were comfortable identifying the vocalisations, including white-headed gulls (Larus spp.), quacking ducks, and Great Blue Herons (Ardea herodias). Additionally, some transcribers initially tended to identify distant signals to the species level that verifiers were unable to confirm or refute, so these tags were removed from the analysis.

We used generalised linear mixed models (GLMMs) to determine the factors influencing the probability of agreement between the transcriber and verifier. We modelled agreement as a binomial response with a logit link function and incorporated species as a random intercept. We treated maximum frequency (kHz) of the tag, transcriber test score, and species rarity as continuous covariates, and song type and transcriber tag confidence as categorical. Song type was included as a three-level factor via dummy variables with Other as the base category, whereas tag confidence had two levels (needs review and confident). We standardised continuous variables with a z-score transformation prior to modelling. We fit preliminary models to data from 930 tags of boreal recordings that were transcribed in 2020 before adding 969 verified tags from 2021 and revising the analyses. We considered a suite of 16 a priori models (Appendix A2), which included main effects as well as two models including an interaction between transcriber test score and species rarity based on previous work by Farmer et al. (Citation2012). We conducted the analysis in R, version 4.1.0 (R Core Team Citation2022), using the lme4 package (Bates et al. Citation2015). We used Akaike’s information criterion (AIC) to rank models based on parsimony (Burnham and Anderson Citation2002); see Appendix A2. We considered the model with the lowest AIC the most parsimonious and only considered other models competitive if they were ≤ 2 AIC units from the top model and did not have parameter estimates with 95% confidence intervals overlapping zero (Arnold Citation2010).
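A minimal sketch of the top model structure, as it could be fit in lme4, is shown below; the data frame `verif` and its column names are hypothetical, and the full set of 16 candidate models is not reproduced.

```r
# Sketch of the agreement GLMM: binomial response, logit link, species random
# intercept. 'verif' is a hypothetical data frame with one row per tag review
# (agree = 1/0) and the covariates described above.
library(lme4)

# z-score standardisation of continuous covariates
verif$test_score_z <- as.numeric(scale(verif$test_score))
verif$rarity_z     <- as.numeric(scale(verif$rarity))
verif$max_freq_z   <- as.numeric(scale(verif$max_freq_khz))

m_top <- glmer(agree ~ test_score_z + confidence + rarity_z + song_type3 + (1 | species),
               data = verif, family = binomial(link = "logit"))

# Example of ranking candidate models by AIC
m_freq <- update(m_top, . ~ . + max_freq_z)             # adds maximum song frequency
m_int  <- update(m_top, . ~ . + test_score_z:rarity_z)  # adds skill x rarity interaction
AIC(m_top, m_freq, m_int)
```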

We evaluated model adequacy by examining residual Q-Q plots and histograms (on data used in model creation), and by applying the model to an independent dataset not used in model parameterisation. The independent dataset came from recordings collected in boreal Saskatchewan in 2021 following the same survey design and deployment as described above. The dataset had all tags reviewed by one additional verifier (n = 4,982). Our best approximating model was used to predict the probability of agreement (P(agree)) for tags from the independent dataset. We evaluated model performance using Pearson’s correlation coefficient to compare observed and predicted agreement and the area under the receiver operating characteristic curve (AUC) using the roc function in the R package pROC (Robin et al. Citation2023).
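The evaluation step can be sketched as follows, again with hypothetical object names (`m_top` from the model sketch above, `indep` for the independent tag data).

```r
# Sketch of evaluating the fitted model on the independent dataset.
library(pROC)

# Predicted probability of agreement for each tag; allow.new.levels = TRUE lets
# species absent from the training data use the population-level intercept.
indep$p_agree <- predict(m_top, newdata = indep, type = "response",
                         allow.new.levels = TRUE)

# Discrimination: area under the ROC curve (with a 95% CI)
roc_obj <- roc(response = indep$agree, predictor = indep$p_agree)
auc(roc_obj)
ci.auc(roc_obj)

# Correlation of mean observed and predicted agreement across species
sp_means <- aggregate(cbind(agree, p_agree) ~ species, data = indep, FUN = mean)
cor.test(sp_means$agree, sp_means$p_agree)
```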

While verifying tags in the independent dataset and three additional WildTrax projects separate from the training and independent datasets, verifiers used a stopwatch to record how long it took to verify batches of tags. Verifiers began their timer when they opened a selected species in the species verification tab (https://wildtrax.ca/resources/user-guide/#acoustic-data-species-verification/) within the WildTrax project and stopped it when they had finished a fixed number of tags. Thus, the duration reflected the true amount of time it took verifiers to review the tags for a given species, including any opening of individual tags for detailed review when visual verification was insufficient or the tag metadata needed to be corrected. The verifier recorded the species, total time, number of tags verified, and the number of tags for which the species identity and/or type of acoustic cue was corrected. We used this information to estimate the mean time per tag per species and to build a linear mixed model with species as a random effect to quantify the influence of mean disagreement on the time required to verify tags, on average and for a given species.
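A sketch of the timing model is given below, assuming a hypothetical data frame `timing` of stopwatch batches with columns `species`, `total_time_sec`, `n_tags_verified`, and `n_corrected`.

```r
# Sketch of the verification-time analysis: per-tag time as a function of
# disagreement, with species as a random effect.
library(lme4)

timing$sec_per_tag  <- timing$total_time_sec / timing$n_tags_verified
timing$pct_disagree <- 100 * timing$n_corrected / timing$n_tags_verified

m_time <- lmer(sec_per_tag ~ pct_disagree + (1 | species), data = timing)
summary(m_time)

# Mean verification time per tag for each species (used to cost the scenarios below)
time_by_species <- aggregate(sec_per_tag ~ species, data = timing, FUN = mean)
```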

Application of the model for cost-benefit comparison

We used the known errors in our independent dataset as a case study to evaluate the costs and benefits of three different scenarios for verifying species identifications. We converted P(agree) to the probability of disagreement (1 − P(agree)) for each tag to use as an inclusion weight in bootstrap resampling of the independent dataset. We resampled the dataset under the three QA/QC scenarios described below and compared the percentage of errors detected and the amount of time required to verify tags among the options at a given level of verification effort. We then compared all three scenarios based on the time it took to remove the same percentage of errors from the dataset.

Scenario 1: verification of individual tags

We took 100 bootstrap samples of individual tags in our independent dataset, varying the proportion of the total sample taken from 10% to 90% (in increments of 10%), using (1) a simple random sample and (2) an unequal probability sample in which the sample inclusion probability was proportional to the predicted probability of disagreement from our model. For each sample size and strategy, we summed the total number of tags that transcribers and verifiers disagreed upon and treated them as misidentifications to calculate percent error. We then used our estimates of the mean time it takes to verify a tag for each species to calculate the time it would take to verify each sample of tags.
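A minimal sketch of the Scenario 1 resampling is shown below, assuming the hypothetical `indep` data frame carries the model-predicted `p_agree`, a known error indicator (`error` = 1 where transcriber and verifier disagreed), and a per-species mean verification time (`mean_sec_per_tag`).

```r
# Sketch of Scenario 1: bootstrap samples of individual tags, either simple random
# or with inclusion probability proportional to the predicted probability of disagreement.
set.seed(1)

indep$p_disagree <- 1 - indep$p_agree

scenario1 <- function(dat, prop, weighted = TRUE, n_boot = 100) {
  n <- round(prop * nrow(dat))
  replicate(n_boot, {
    w <- if (weighted) dat$p_disagree else NULL
    s <- dat[sample(nrow(dat), size = n, prob = w), ]
    c(pct_errors_found = 100 * sum(s$error) / sum(dat$error),
      minutes          = sum(s$mean_sec_per_tag) / 60)
  })
}

# Example: mean errors found and time required when verifying 50% of tags
rowMeans(scenario1(indep, prop = 0.5, weighted = TRUE))   # unequal probability sample
rowMeans(scenario1(indep, prop = 0.5, weighted = FALSE))  # simple random sample
```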

Scenario 2: verification of whole recordings

We calculated the mean probability of disagreement for all tags within a given recording in the project and then took 100 bootstrap samples of between 10% and 90% of the recordings (in 10% increments), again using both simple random and unequal probability random samples in which the sample inclusion probability was proportional to the mean predicted probability of disagreement across all tags in the recording. In each sample, we summed the total number of tags that transcribers and verifiers disagreed upon and treated them as misidentifications to calculate percent error. We then used our estimates of the mean time it takes to verify a tag for each species to calculate the time it would take to verify each recording. This is a conservative estimate since validating whole recordings in practice will take longer as there may be false negatives to address.
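Scenario 2 can be sketched by first aggregating the same hypothetical tag table to the recording level (here via an assumed `recording_id` column) and then sampling whole recordings with probability proportional to their mean predicted disagreement.

```r
# Sketch of Scenario 2: bootstrap samples of whole recordings.
library(dplyr)

rec_summary <- indep %>%
  group_by(recording_id) %>%
  summarise(mean_p_disagree = mean(p_disagree),
            n_errors        = sum(error),
            minutes         = sum(mean_sec_per_tag) / 60,
            .groups = "drop")

sample_recordings <- function(recs, prop, weighted = TRUE) {
  n <- round(prop * nrow(recs))
  w <- if (weighted) recs$mean_p_disagree else NULL
  s <- recs[sample(nrow(recs), size = n, prob = w), ]
  c(pct_errors_found = 100 * sum(s$n_errors) / sum(recs$n_errors),
    minutes          = sum(s$minutes))
}

# Example: mean over 100 bootstrap draws of 50% of recordings
rowMeans(replicate(100, sample_recordings(rec_summary, prop = 0.5, weighted = TRUE)))
```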

Scenario 3: verification on a species-by-species basis

Finally, we took the mean probability of disagreement across tags for each species in our independent dataset and ranked species in order from the most to the least likely to be misidentified. We then estimated the number of errors found and the cumulative time required to verify all tags within a dataset where species were added cumulatively to the verification process in descending order of mean predicted probability of disagreement.
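Scenario 3 reduces to a single ranking and cumulative sum over the same hypothetical tag table.

```r
# Sketch of Scenario 3: verify species in descending order of mean predicted
# disagreement, accumulating errors found and verification time.
library(dplyr)

scenario3 <- indep %>%
  group_by(species) %>%
  summarise(mean_p_disagree = mean(p_disagree),
            n_errors        = sum(error),
            minutes         = sum(mean_sec_per_tag) / 60,
            .groups = "drop") %>%
  arrange(desc(mean_p_disagree)) %>%
  mutate(cum_pct_errors = 100 * cumsum(n_errors) / sum(n_errors),
         cum_minutes    = cumsum(minutes))
```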

Results

Our full model calibration dataset consisted of 1,898 verified tags across 192 species (mean = 10 verified tags per species; range = 1–22 verified tags per species). After filtering very rare species and weak signals that could not be verified (see Methods), 1,579 tags of 130 species remained to train the model(s). Since each tag was reviewed separately by two verifiers, this resulted in a total sample size of 3,158. Mean agreement across species meeting the inclusion criteria was 81.1% (±19.8; range = 0–100%) and varied widely by species (Appendix A2). The three top models for probability of agreement performed similarly, with Akaike weights ranging from 0.16 to 0.38 and ΔAIC values ≤ 1.75 (Table 1). The second and third models differed from the first only by the inclusion of an interaction between test score and rarity, and of maximum song frequency, respectively (Table 1). However, parameter estimates for both additional variables were not significantly different from 0, so we excluded the second and third models from further consideration.

Table 1. AIC ranking of generalised linear mixed effects models used to test the influence of species and transcriber traits on identification agreement between a transcriber and a verifier. Bold indicates the most parsimonious of those models with AIC < 2. Each model was fit with species ID as a random intercept.

Our most parsimonious model contained a combination of transcriber and species traits, including transcriber test score, tag confidence, species rarity, and song type (Table 2). Transcriber test scores had a positive but weak linear effect on agreement: an increase in test score from 80% to 95% resulted in only a 4% increase in the probability of agreement when the transcriber was confident, compared to a 12% increase when the transcriber marked a tag as needing review (Figure 3(a)). As predicted, transcriber confidence had a positive effect on agreement and species rarity had a negative effect (Figure 3(b)). For example, a common species making up 5% of the detections in a dataset had a 2% higher agreement probability when the transcriber was confident in their identification; however, when a species was rare (0.01% of the dataset), the probability of agreement was 36% higher for a tag identified confidently compared to a tag that needed review. With respect to song types, agreement probability was 15% lower for click-trill songs and 6% higher for polynoise-phrase songs compared to all other song types when test score and rarity were held at mean values and confidence was set to confident.

Figure 3. Predicted probability of agreement on species identities between a transcriber and a verifier as a function of transcriber confidence and either A) transcriber test score or B) species rarity. Rarity was estimated as the relative frequency within our project dataset for a given species. Predictions were made by our most parsimonious model (Table 2).

Table 2. Parameter estimates from the most parsimonious model (as determined by AIC) predicting the probability that a species identification is agreed on by a transcriber and a verifier. Observer confidence was a two-factor category represented as a dummy variable with confident tags as the base. Song type was a three-factor category represented by two dummy variables with ‘other’ as the base.

Our independent dataset had 4,992 tags verified across 107 species, with each tag reviewed by one verifier. The false-positive error rate was 11.3% across all tags.

Agreement for individual species was variable, with a mean of 81.4% (±22.6; range = 0–100%) across species meeting the same inclusion criteria as our training data. Our model performed moderately well at correctly predicting tag agreement (AUC: 0.71; 95% CI: 0.69–0.74), and empirical and model-predicted agreement were correlated across species (Pearson r = 0.79, p < 0.0001). Species with higher predicted agreement relative to observations were Least Flycatcher (Empidonax minimus), Golden-crowned Kinglet (Regulus satrapa), and American Goldfinch (Spinus tristis; Figure 4). Conversely, several species, such as Ovenbird (Seiurus aurocapilla), Black-and-white Warbler (Mniotilta varia) and Blue Jay (Cyanocitta cristata), had 97–100% agreement rates in the independent dataset, but our model predicted < 85% agreement (Figure 4). Only three species had predicted and empirical agreement rates below 60%: Spotted Sandpiper (Actitis macularius), Spruce Grouse (Falcipennis canadensis), and American Three-toed Woodpecker (Picoides dorsalis).

Figure 4. Relationship between predicted and observed agreement probabilities for each species included in our analysis. Empirical agreement represents the mean observed value (0 = disagree, 1 = agree) from all tags reviewed in the independent dataset. Predicted values are the mean agreement probabilities estimated from our most parsimonious model (Table 2) across all tags for each species. Diagonal line indicates a 1:1 relationship.

Verifying an unequal probability random sample drawn in proportion to the model-predicted probability of disagreement corrected proportionately more errors for the same level of verification effort than a simple random sample (Figure 5(a)). Selecting a simple random sample detected errors in proportion to the number of tags verified (Figure 5). For example, verifying a simple random sample of 50% of tags would detect approximately 50% of errors, whereas an unequal probability random sample would detect approximately 70% of errors (Figure 5(a)). However, selecting an unequal probability random sample of whole recordings (Figure 5(b)) did not show as great an improvement over simple random sampling as individual tag verification (Figure 5(a)). For example, taking a 50% random sample of whole recordings with unequal probability would detect ~60% of errors compared to 50% using a simple random sample.

Figure 5. Percentage of errors detected in a fully verified dataset as a function of the proportion of tags/recordings sampled for verification under alternative scenarios. (a) Scenario 1: a weighted (green) and simple (purple) random bootstrap sample of tags taken in 10% increments. (b) Scenario 2: a weighted (green) and simple (purple) random bootstrap sample of entire recordings taken in 10% increments.

On average, it took 39 ± 35 s (n = 7,487) to verify a single species tag. Species with higher mean disagreement took longer to verify on a per-tag basis: mean verification time increased by 390% from 0% to 100% disagreement (0% disagreement = 14.9 s/tag; 100% disagreement = 73.0 s/tag; Figure 6). It took on average 7.6% (±0.9) longer to verify tags from the weighted sample compared to the simple random sample in Scenario 1.

Figure 6. Mean time to verify species tags (see methods) as a function of species mean percent disagreement between a transcriber and a verifier. The line and shading represent model predictions and 95% confidence intervals, respectively, from a linear mixed effects model with species as a random effect (see methods). Dots represent observations from our timed tags dataset.

When comparing alternative scenarios for verifying tags, we found that iteratively verifying species by species in descending order of mean probability of disagreement found more errors per unit effort than either individual tag or whole recording verification (Figure 7). For example, it took almost 765 min to find 50% of the errors using the species-by-species method, compared to approximately 888 and 1,032 min (16–35% longer) for individual tag and whole recording verification, respectively (Figure 7).

Figure 7. Efficiency of detecting errors (i.e. amount of time to verify a given percentage of total errors) in a dataset when taking unequal probability random samples of tags (scenario 1), unequal probability random samples of whole recordings (scenario 2) and cumulative species review by iteratively verifying species tags in descending order of modelled probability of disagreement (scenario 3). See methods for detailed description.

Discussion

There has been increased recognition of the need to address identification errors in species surveys, and there are several potential approaches for dealing with misidentifications during data collection and in modelling frameworks. Understanding the factors contributing to misidentification rates can help target resources to improve data quality and usefulness for conservation efforts and statistical modelling. In this study, we provide insights into factors influencing species misidentification and suggest several approaches for reducing errors. We found the probability of a transcriber and a verifier agreeing on an identification (i.e. the probability of an identification being correct) was influenced by species rarity, song type, transcriber skill, and tag confidence, but not by maximum song frequency (kHz). Similar to previous studies, the probability that an identification was correct increased with transcriber skill (Farmer et al. Citation2012) and decreased with species rarity (Farmer et al. Citation2012; Rempel et al. Citation2019; but see Campbell and Francis Citation2011 for a contrasting result). Two song type categories (polynoise-phrase and ‘Other’ vocalisation types) had a positive effect on agreement, whereas the click-trill category had a negative effect. In addition, when transcribers reported a tag as needing review, there was a higher probability that the identification was incorrect. Finally, we demonstrated that a model predicting the probability of false-positive errors can be used to improve the efficiency of tag verification.

The relative frequency of false positives in our independent data (11.3%) was generally consistent with previous studies on auditory (Alldredge et al. Citation2007, Citation2008; Simons et al. Citation2007; Campbell and Francis Citation2011; Farmer et al. Citation2012) and visual bird identification (Gorleri et al. Citation2023). Previous estimates of misidentification rates have ranged from as low as 0% (Alldredge et al. Citation2008) to as high as 75% for some species (Rempel et al. Citation2019). Similar to our results, previous studies have shown significant differences in misidentification rates among individual transcribers with varying skill levels (Alldredge et al. Citation2007; Farmer et al. Citation2012; Rempel et al. Citation2019). Other studies found little relationship between transcriber skill and species misidentification when transcribers were very experienced (Lotz and Allen Citation2007; McClintock et al. Citation2010). We started with a pool of relatively homogeneous expert transcribers because all had to achieve a minimum score of 80% on our species identification exam. Our results provide further evidence that using experienced transcribers can reduce, but not eliminate, species misidentifications. Transcribers’ exam scores can also be used to predict the likelihood of identification errors. We posit that an identification exam based on vocalisations of known species identity would be even more predictive of transcriber agreement when skill level is more variable, such as when projects have a larger number of transcribers or are using volunteers. Thus, we strongly encourage subjecting potential staff, contractors, or citizen scientists to identification exams to provide an objective estimate of their skill and a predictor of potential false-positive errors.

As we predicted, identification errors were more likely for rare than for common species. Previous studies have found similar results (Farmer et al. Citation2012; Rempel et al. Citation2019), although our study differed in using project-level rarity (i.e. the relative frequency of detections in our dataset) instead of rarity at the population level; we assume, however, that project-level rarity correlates with regional population rarity. We recognise that, given the spatial coverage of sampling sites within a given year, some regionally common species may have been assigned a lower value based on our data. Depending on the dataset, it could be worthwhile to calculate species rarity across multiple years and projects for a given biome. Interestingly, Farmer et al. (Citation2012) found support for an interaction between species rarity and transcriber skill, which suggested that moderately skilled transcribers tended to have higher misidentification rates for common species while expert transcribers tended to do so for rare species. Presumably, our exclusion of transcribers with exam scores < 80% explains why an interaction between transcriber skill and species rarity was not supported in our results.

We considered maximum song frequency (kHz) because high-pitched sounds tend to attenuate with distance, and Alldredge et al. (Citation2008) showed that misidentification rates increased with distance from observers in a playback experiment. Even though song frequency (kHz) has been shown to affect detectability (Sólymos et al. Citation2018), we failed to find support for song frequency influencing transcriber/verifier agreement; this may have been because the verifier was directed to the precise sound in question with the help of tags in WildTrax. While we failed to find support for song frequency affecting transcriber agreement, we did find support for song type influencing agreement. Click-trill song types had a negative effect on agreement, which is unsurprising since the group of birds known as ‘trillers’ is notoriously difficult to identify (Rempel et al. Citation2019). Polynoise-phrase song types and the Other category (including calls and remaining songs) had a positive effect on agreement. The polynoise-phrase category primarily included calls of blackbirds, a group whose loud and distinct vocalisations are generally easy to identify.

Our model is most efficient at identifying probable false-positive errors in individual tags; however, recording-level verification (i.e. reviewing the entire recording both for existing tags and for acoustic signals that may have been missed by the transcriber) allows simultaneous quantification of false negatives. While processing multiple recordings from the same location increases the chance of detecting all species present, thereby reducing the need to catch false negatives in an individual recording, more thorough quantification of false-negative rates within and between recordings would provide insights into the sources of false-negative errors. The other component of non-detection data in manually transcribed recordings that we did not address in our study is verifying tags classified with ‘unidentified’ codes. Checking a subset of these tags, especially those marked as needing review, could recover valuable data but would also require more time and resources.

The amount of verification required will vary depending on a project’s goals and the intended use of the data. In cases where there could be legal ramifications from a species’ presence, for example with species at risk, verifying all tags would be advisable, and the total number of detections is unlikely to be prohibitive. For projects using occupancy models, ignoring false positives can yield biased parameter estimates and overestimate occupancy, even when false positives make up as little as 1–14% of all detections (McClintock et al. Citation2010; Campbell and Francis Citation2011). Accounting for false positives can complicate the design of occupancy studies (Clement Citation2016), and our method of unequal probability random sampling could be used alongside false-positive models. Chambert et al. (Citation2018) found that estimates became mostly unbiased when ≥ 5% of detections were verified by a human observer. Currently, we know of no published studies on the effects of false positives on count-based species distribution models (SDMs). Since SDMs are largely driven by species presence-absence, we anticipate misidentifications would have a similar impact as seen in occupancy frameworks. However, future assessments of the amount of verification required to reduce bias in SDMs could run bootstrap exercises withholding subsets of spatially independent sites to test the predictive accuracy of SDMs fit while varying the number of confirmed vs. unconfirmed true positives.

Biases in occupancy estimates can be substantial when identification errors and false negatives are not accounted for (McClintock et al. Citation2010; Miller et al. Citation2011), but the degree of bias will depend partly on the size of the dataset and the scope of the project. Consider recordings available in WildTrax for Environment and Climate Change Canada’s Boreal Bird Monitoring Program (Van Wilgenburg et al. Citation2020) within the Prairie provinces, where dozens of transcribers have processed recordings across approximately 1,600 unique locations, each typically with ≥ 6 recordings. In this case, the impact of misidentifications for many species may be minimal, especially if estimators accounting for false positives or explanatory covariates are used (MacKenzie et al. Citation2006). There may, however, be projects or situations where far fewer transcribers, recordings, or locations are available. Often these small regional datasets are collected with no intention or capacity to apply model-based corrections, and therefore reducing error rates in the raw data is needed to guard against incorrect inference. Bias from unaddressed false-positive errors can be especially high for rare species (Miller et al. Citation2011), and rare species often carry some form of local or national conservation status requiring added attention. Therefore, in most instances, it would be beneficial to have at least one verifier review all tags for any species of conservation concern regardless of the model-predicted probability of agreement. In addition, any species commonly confused with a species of concern should also be reviewed to avoid false-negative errors induced by misidentification. Thorough vetting of records of species that are rare or of conservation concern could prevent incorrect inferences and potentially costly conservation decisions (Taylor et al. Citation2005).

Online acoustic tagging platforms now make it possible for research and monitoring programmes to measure false-positive and false-negative error rates and to reduce absolute error in passive acoustic datasets. Here, we have demonstrated several alternative approaches to directing tag verification effort by random selection of tags and/or recordings and have shown that unequal probability sampling can be particularly efficient for directing verification effort. Where verification of all tags is prohibitive, we suggest some form of unequal probability random sampling for tag verification would be useful to facilitate modelling of false-positive errors in occupancy models. Our verification method could also be used to quantify and facilitate modelling of false-negative errors, if the true species identity was not recorded at a given point or if species are added during whole recording review. Furthermore, implementing tag- or recording-level verification using weighted randomised sampling would be more statistically rigorous than ad hoc approaches (Smith et al. Citation2017) because it allows efficient targeting of tags and recordings that are more prone to misidentification. We recommend that acoustic processing environments retain key variables such as ‘Needs review’ as filters in species verification tools so that verifiers can be efficiently presented with the tags most likely to be false positives. Furthermore, the ability to import binomial selection variables derived from our weighted inclusion probability approach into species verification tools would facilitate presenting reviewers with a randomised selection as we have done here.

We suggest that projects focused on multi-species passive acoustic monitoring should employ or adapt our framework to draw representative random samples of tags to verify using unequal probability sampling. In contrast, single-species projects, or projects with a limited number of target species, could likely have all tags verified by at least one qualified verifier, given that this should be relatively easy to budget. Finally, our results suggest that sequentially verifying species in descending order of disagreement probability is likely the most cost-effective approach to removing errors. The sequential species tag verification approach would benefit from determining acceptable thresholds in false-positive rates for a given objective (e.g. occupancy modelling). We suggest that running species-specific occupancy models with vs. without accounting for false-positive errors would allow an assessment of the sensitivity of occupancy estimates to false positives (see Rempel et al. Citation2019) and help determine which species may not require verification. Our general methodology could also be applied to output from automated recognisers to help make species verification more efficient, using recogniser confidence scores in place of transcriber traits. In addition, weighted inclusion probabilities could be used within other stratified sampling schemes to avoid spatial, temporal or habitat gradient biases while maintaining efficiency by increasing sampling of tags more likely to contain errors. Ultimately, building on our method by using either automated recognisers or multiple verifiers whenever two transcribers disagree on species identity, and thereby assigning species identity via weight of evidence or consensus, might further improve our approach. We encourage others to test our method over a broader range of species and ecosystems and suggest that regular reporting of estimated false-positive rates should be part of project metadata when archiving or making data publicly available.

Acknowledgements

This work was made possible through operating grants from Environment and Climate Change Canada. We would like to thank LeeAnn Latremouille, Thea Carpenter, Enid Cumming, Mark Dorriesfield, Stan Shadick, Laura Stewart and five additional expert transcribers for their assistance in acoustic tagging and verification. We would like to thank the Editor and two anonymous referees for comments that improved the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Alldredge MW, Pacifici K, Simons TR, Pollock KH. 2008. A novel field evaluation of the effectiveness of distance and independent observer sampling to estimate aural avian detection probabilities. J Appl Ecol. 45(5):1349–1356. doi: 10.1111/j.1365-2664.2008.01517.x.
  • Alldredge MW, Simons TR, Pollock KH. 2007. Factors affecting aural detections of songbirds. Ecol Appl. 17(3):948–955. doi: 10.1890/06-0685.
  • Arnold TW. 2010. Uninformative parameters and model selection using Akaike’s information criterion. J Wildl Manage. 74(6):1175–1178. doi: 10.1111/j.1937-2817.2010.tb01236.x.
  • Bates D, Mächler M, Bolker B, Walker S. 2015. Fitting linear mixed-effects models using lme4. J Stat Softw. 67(1):1–48. doi: 10.18637/jss.v067.i01.
  • Bombaci SP, Pejchar L. 2018. Using paired acoustic sampling to enhance population monitoring of New Zealand’s forest birds. N Z J Ecol. 43(1):3356. doi: 10.20417/nzjecol.43.9.
  • Brauer CL, Donovan TM, Mickey RM, Katz J, Mitchell BR. 2016. A comparison of acoustic monitoring methods for common anurans of the northeastern United States. Wildl Soc Bull. 40(1):140–149. doi: 10.1002/wsb.619.
  • Buckland S, Anderson D, Burnham K, Laake J, Borchers D, Thomas L. 2001. Introduction to distance sampling: estimating abundance of biological populations. Vol. xv. Oxford, U.K.: Oxford University Press.
  • Burnham KP, Anderson DR, Eds. 2002. Model selection and multimodel inference. Springer. doi: 10.1007/b97636.
  • Buxton RT, McKenna MF, Clapp M, Meyer E, Stabenau E, Angeloni LM, Crooks K, Wittemyer G. 2018. Efficacy of extracting indices from large-scale acoustic recordings to monitor biodiversity. Conserv Biol. 32(5):1174–1184. doi: 10.1111/cobi.13119.
  • Campbell M, Francis CM. 2011. Using stereo-microphones to evaluate observer variation in north American breeding bird survey point counts. Auk. 128(2):303–312. doi: 10.1525/auk.2011.10005.
  • Chambert T, Waddle JH, Miller DAW, Walls SC, Nichols JD, Yoccoz N. 2018. A new framework for analysing automated acoustic species detection data: occupancy estimation and optimization of recordings post-processing. Methods Ecol Evol. 9(3):560–570. doi: 10.1111/2041-210X.12910.
  • Clement MJ. 2016. Designing occupancy studies when false-positive detections occur. Methods Ecol Evol. 7(12):1538–1547. doi: 10.1111/2041-210X.12617.
  • Darras K, Batáry P, Furnas BJ, Grass I, Mulyani YA, Tscharntke T. 2019. Autonomous sound recording outperforms human observation for sampling birds: a systematic map and user guide. Ecol Appl. 29(6):e01954. doi: 10.1002/eap.1954.
  • Darras K, Pérez N, Mauladi, Hanf-Dressler T. 2020. BioSounds: an open-source, online platform for ecoacoustics. F1000Research. 9:1224. doi: 10.12688/f1000research.26369.1.
  • Doser JW, Finley AO, Weed AS, Zipkin EF. 2021. Integrating automated acoustic vocalization data and point count surveys for estimation of bird abundance. Methods Ecol Evol. 12(6):1040–1049. doi: 10.1111/2041-210X.13578.
  • Drake A, de Zwaan DR, Altamirano TA, Wilson S, Hick K, Bravo C, Ibarra JT, Martin K. 2021. Combining point counts and autonomous recording units improves avian survey efficacy across elevational gradients on two continents. Ecol Evol. 11(13):8654–8682. doi: 10.1002/ece3.7678.
  • Drake KL, Frey MD, Hogan D, Hedley R. 2016. Using digital recordings and sonogram analysis to obtain counts of yellow rails. Wildl Soc Bull. 40(2):346–354. doi: 10.1002/wsb.658.
  • Environment and Climate Change Canada. 2016. Species at Risk Act implementation guidance for recovery practitioners: critical habitat identification toolbox (version 2.3). Ottawa (ON): Environment and Climate Change Canada.
  • Farmer R, Leonard M, Horn A. 2012. Observer effects and avian-call-count survey quality: rare-species biases and overconfidence. Auk. 129:76–86. doi: 10.1525/auk.2012.11129.
  • Farnsworth GL, Pollock KH, Nichols JD, Simons TR, Hines JE, Sauer JR. 2002. A removal Model for estimating detection probabilities from point-count surveys. Auk. 119(2):414–425. doi: 10.1093/auk/119.2.414.
  • Gibb R, Browning E, Glover-Kapfer P, Jones K. 2018. Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring. Methods Ecol Evol. doi: 10.1111/2041-210X.13101.
  • Golding JD, Nowak JJ, Dreitz VJ. 2017. A multispecies dependent double‐observer model: a new method for estimating multispecies abundance. Ecol Evol. 7(10):3425–3435. doi: 10.1002/ece3.2946.
  • Gorleri FC, Jordan EA, Roesler I, Monteleone D, Areta JI. 2023. Using photographic records to quantify accuracy of bird identifications in citizen science data. Ibis (Lond 1859). 165(2):458–471. doi: 10.1111/ibi.13137.
  • Guillera-Arroita G. 2016. Modelling of species distributions, range dynamics and communities under imperfect detection: advances, challenges and opportunities. Ecography. 40(2):281–295. doi: 10.1111/ecog.02445.
  • Hoekman ST. 2021. Multi-observer methods for estimating uncertain species identification. Ecosphere. 12(9):e03648. doi: 10.1002/ecs2.3648.
  • Kahl S, Wood CM, Eibl M, Klinck H. 2021. BirdNET: a deep learning solution for avian diversity monitoring. Ecol Inf. 61:101236. doi: 10.1016/j.ecoinf.2021.101236.
  • Knight E, Hannah K, Foley G, Scott C, Brigham R, Bayne E. 2017. Recommendations for acoustic recognizer performance assessment with application to five common automated signal recognition programs. Avian Conserv Ecol. 12(2). doi: 10.5751/ACE-01114-120214.
  • Lotz A, Allen CR. 2007. Observer bias in anuran call surveys. J Wildl Manage. 71(2):675–679. doi: 10.2193/2005-759.
  • MacKenzie DI, Nichols JD, Royle JA, Pollock KP, Bailey LL, Hines JE. 2006. Occupancy estimation and modeling: inferring patterns and dynamics of species occurrence. San Diego, California, USA: Academic Press.
  • Mahon CL, Holloway GL, Bayne EM, Toms JD. 2019. Additive and interactive cumulative effects on boreal landbirds: winners and losers in a multi-stressor landscape. Ecol Appl. 29(5):e01895. doi: 10.1002/eap.1895.
  • McClintock BT, Bailey LL, Pollock KH, Simons TR. 2010. Unmodeled observation error induces bias when inferring patterns and dynamics of species occurrence via aural detections. Ecology. 91(8):2446–2454. doi: 10.1890/09-1287.1.
  • Miller DA, Nichols JD, Mcclintock BT, Grant EHC, Bailey LLL, Weir LA. 2011. Improving occupancy estimation when two types of observational error occur: non-detection and species misidentification. Ecology. 92(7):1422–1428. doi: 10.1890/10-1396.1.
  • Pérez-Granados C. 2023. BirdNET: applications, performance, pitfalls and future opportunities. Ibis (Lond 1859). 165(3):1068–1075. doi: 10.1111/ibi.13193.
  • Pieplow N. 2017. Peterson field guide to bird sounds of Eastern North America. Boston New York: Houghton Mifflin Harcourt.
  • Priyadarshani N, Marsland S, Castro I. 2018. Automated birdsong recognition in complex acoustic environments: a review. J Avian Biol. 49(5):jav-01447. doi: 10.1111/jav.01447.
  • R Core Team. 2022. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
  • Rempel RS, Hobson KA, Holborn G, Van Wilgenburg SL, Elliott J. 2005. Bioacoustic monitoring of forest songbirds: Interpreter variability and effects of configuration and digital processing methods in the laboratory. J Field Ornithol. 76(1):1–11. doi: 10.1648/0273-8570-76.1.1.
  • Rempel R, Jackson J, Van Wilgenburg S, Rodgers J. 2019. A multiple detection state occupancy model using autonomous recordings facilitates correction of false positive and false negative observation errors. Avian Conserv Ecol. 14(2):1. doi: 10.5751/ACE-01374-140201.
  • Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M. 2023. pROC: display and analyze ROC curves (version 1.18.4) [computer software]. https://cran.r-project.org/web/packages/pROC/index.html.
  • Roemer C, Julien J-F, Bas Y. 2021. An automatic classifier of bat sonotypes around the world. Methods Ecol Evol. 12(12):2432–2444. doi: 10.1111/2041-210x.13721.
  • Shonfield J, Bayne E. 2017. Autonomous recording units in avian ecological research: Current use and future applications. Avian Conserv Ecol. 12(1). doi: 10.5751/ACE-00974-120114.
  • Simons TR, Alldredge MW, Pollock KH, Wettroth JM. 2007. Experimental analysis of the auditory detection process on avian point counts. Auk. 124(3):986–999. doi: 10.1642/0004-8038(2007)124[986:EAOTAD]2.0.CO;2.
  • Smith ANH, Anderson MJ, Pawley MDM. 2017. Could ecologists be more random? Straightforward alternatives to haphazard spatial sampling. Ecography. 40(11):1251–1255. doi: 10.1111/ecog.02821.
  • Sólymos P, Matsuoka SM, Stralberg D, Barker NKS, Bayne EM. 2018. Phylogeny and species traits predict bird detectability. Ecography. 41(10):1595–1603. doi: 10.1111/ecog.03415.
  • Stewart LN, Tozer DC, McManus JM, Berrigan LE, Drake KL. 2020. Integrating wetland bird point count data from humans and acoustic recorders. Avian Conserv Ecol. 15(2):9. doi: 10.5751/ACE-01661-150209.
  • Strickfaden K, Fagre D, Golding J, Harrington A, Reintsma K, Tack J, Dreitz V. 2020. Dependent double‐observer method reduces false positive errors in auditory avian survey data. Ecol Appl. 30(2):30. doi: 10.1002/eap.2026.
  • Symes LB, Madhusudhana S, Martinson SJ, Kernan CE, Hodge KB, Salisbury DP, Klinck H, Hofstede HT. 2022. Estimation of katydid calling activity from soundscape recordings. J Orthoptera Res. 31(2):Article 2. doi: 10.3897/jor.31.73373.
  • Taylor MFJ, Suckling KF, Rachlinski JJ. 2005. The effectiveness of the endangered species act: a quantitative analysis. BioScience. 55(4):360–367. doi: 10.1641/0006-3568(2005)055[0360:TEOTES]2.0.CO;2.
  • Van Parijs S, Clark C, Sousa-Lima R, Parks S, Rankin S, Risch D, Van Opzeeland I. 2009. Management and research applications of real-time and archival acoustic sensors over varying temporal and spatial scales. Mar Ecol Prog Ser. 395:37–53. doi: 10.3354/meps08123.
  • Van Wilgenburg SL, Mahon CL, Campbell G, McLeod L, Campbell M, Evans D, Easton W, Francis CM, Haché S, Machtans CS, et al. 2020. A cost efficient spatially balanced hierarchical sampling design for monitoring boreal birds incorporating access costs and habitat stratification. PloS One. 15(6):e0234494. doi: 10.1371/journal.pone.0234494.
  • Van Wilgenburg S, Sólymos P, Kardynal K, Frey M. 2017. Paired sampling standardizes point count data from humans and acoustic recorders. Avian Conserv Ecol. 12(1). doi: 10.5751/ACE-00975-120113.
  • Wrege PH, Rowland ED, Keen S, Shiu Y, Matthiopoulos J. 2017. Acoustic monitoring for conservation in tropical forests: examples from forest elephants. Methods Ecol Evol. 8(10):1292–1301. doi: 10.1111/2041-210X.12730.

Appendix A1

Song Type Definitions (Pieplow Citation2017)

Appendix A2.

Mean percent agreement of species identifications between a transcriber and verifier. Species excluded from model training had < 6 tags reviewed or fell under special circumstances described in the Methods.