Review Article

A systematic review of the validity of Criteria-based Content Analysis in child sexual abuse cases and other field studies

Received 05 Sep 2023, Accepted 22 Mar 2024, Published online: 18 Apr 2024

ABSTRACT

Criteria-based Content Analysis (CBCA) has been primarily employed to assess the credibility of child sexual abuse (CSA) allegations. However, several studies on the validity of CBCA have focused on autobiographical events other than CSA. Because of the differences between real cases and the laboratory, we focused specifically on CBCA field studies on both CSA and other areas of application. We formally assessed several ground-truth criteria (and other methodological aspects) in a pool of 36 field studies. Seven archival studies (six of which were on CSA) and seven quasi-experiments (none of which was on CSA) were found to be either methodologically sound (12 studies) or acceptable with reservations (two studies), and were therefore included. We describe the paradigm and methods used in each study. Across studies, most CBCA criteria significantly differed between truthful and deceptive accounts, with similar medium to large effect sizes for the methodologically sound quasi-experiments and archival CSA studies. Our review shows that CBCA criteria may discriminate in domains other than CSA. The implications for the real-world usage of CBCA are discussed.

Introduction

Deception and its detection are as old as the history of humankind, and they are of particular importance in legal settings. At the very beginning of experimental psychology, at the onset of the twentieth century, there were some isolated studies on the psychology of lying. In fact, Louis William Stern testified as early as 1903 in cases of alleged child sexual abuse (Stern, 1926; see Sporer & Antonelli, 2022). Since the 1970s, there has been an upsurge of interest by physiological, social, cognitive and legal psychologists investigating cues to deception and methods to detect deception in laboratory simulations. The cues investigated encompass physiological (e.g. heart rate), nonverbal (visual and vocal), and linguistic (e.g. the use of personal pronouns) indicators, as well as verbal content criteria (e.g. logical consistency or the number of details). Importantly, the analysis of the content of a statement is by no means new but has always been used by judges and legal scholars (e.g. Hellwig, 1951; Leonhardt, 1931; Mittermaier, 1834; see Sporer & Masip, 2023). Here, we focus on such verbal content cues.

Criteria-based Content Analysis (CBCA)

Verbal content cues to assess the credibility of witness statements have been proposed in Central Europe since the beginning of the twentieth century (see Sporer & Antonelli, 2022). After WWII, special juvenile courts were established, and a 1954 German Supreme Court decision ruled that psychiatrists or psychologists could be called upon to assess the credibility of statements in child sexual abuse cases (Bundesgerichtshof in Strafsachen [BGHSt] 7, 82, Urteil vom 3.12.1954). This allowed expert witnesses to evaluate victim and witness statements in many cases, and several such experts developed individual verbal content criteria to systematically assess credibility (e.g. Arntzen, 1970; Dettenborn et al., 1984; Szewczyk, 1973; Trankell, 1972; Undeutsch, 1967). In the 1980s, Köhnken (1982) and Steller and Köhnken (1989) integrated criteria from different sources into a set of 19 items, which they named Criteria-based Content Analysis (CBCA). The CBCA criteria are grouped into several categories (Table 1), and their presence in a (transcribed) statement suggests that the narrator experienced the reported event. However, their absence does not signal deception, as it can be due to many other reasons (such as poor cognitive or verbal skills; e.g. Raskin & Esplin, 1991b; Steller, 1989).

Table 1. List of CBCA criteria (adapted from Steller & Köhnken, 1989).

CBCA is embedded into Statement Validity Assessment (SVA), a broader and more complex assessment procedure. SVA has three major components: (a) a semi-structured interview, which is designed to increase the amount and quality of information collected from witnesses; (b) CBCA; and (c) the so-called validity checklist, which considers several aspects that are relevant to assessing witness credibility (Raskin & Esplin, 1991b; Steller, 1989; Steller & Boychuk, 1992).

Neither SVA nor CBCA is a standardized psychometric test. Instead, SVA is an overarching clinical diagnostic assessment approach applied by court experts when the credibility of a witness statement is crucial for the case disposition. CBCA is one of its components, and it is not a 'lie detection tool' (for thorough discussions of common misunderstandings of SVA and CBCA, see Köhnken et al., 2015; Volbert & Steller, 2014). Also, CBCA was originally designed to assess the credibility of children's allegations of having been sexually abused (Steller & Köhnken, 1989). This is why many of the field studies included in this review focus on CSA, though CBCA may well also work in other contexts.

Theoretical assumptions of CBCA

Because CBCA criteria were primarily developed by practitioners, there was originally not much of a theoretical basis. However, Köhnken (1990, 1996) discussed several possible theoretical underpinnings, and other authors have further broadened the theoretical basis. CBCA rests on the hypothesis that truthful and fabricated statements differ in content and quality (Undeutsch, 1967). Although this has been named 'the Undeutsch hypothesis' (e.g. Steller, 1989), stated this way it is not a testable scientific hypothesis, because it is neither derived from a theory nor does it specify the boundary conditions (e.g. the age of the interviewee, the length of a statement) that may limit its validity (Köhnken, 1990; Sporer, 2004). However, it is possible to 'retrospectively' invoke some lines of theory that could serve as a theoretical underpinning. For example, Köhnken (1990) distinguished between a cognitive component involved in constructing a lie and a motivational component related to impression management (Tedeschi & Norman, 1985).

Cognitive aspects

Lying is considered to be more cognitively difficult than truth telling (e.g. Vrij et al., 2010; Zuckerman et al., 1981), which has led researchers to design interview procedures (for suspects) in which cognitive load is artificially increased to enhance behavioral differences between liars and truth tellers. Liars, because of the cognitive difficulty involved in lying, have fewer cognitive resources left to cope with additional demands (for a recent meta-analysis, see Mac Giolla & Luke, 2021). Additionally, working memory models of lie production have been elaborated (Sporer, 2016; Walczyk et al., 2014). These models allow specific predictions to test content differences between deceptive and truthful accounts that are relevant for CBCA criteria. For example, if liars report on an unfamiliar event they have never experienced, they may draw on episodic or autobiographical memory to search for similar events (Sporer, 2016). Liars may also base the account on schemata for these types of events (Köhnken, 1990; Sporer, 2004, 2016; Volbert & Steller, 2014), which will often be rather generic and lack the 'Unusual Details' (CBCA08), 'Superfluous Details' (CBCA09) or 'Event Specific Details' (CBCA19) characteristic of experience-based accounts (see Table 1).

Impression management aspects

The impression management approach focuses more on motivational aspects and has been particularly popular in communication research (DePaulo et al., 2003; Sporer, 2004). Lying is a goal-directed behavior; that is, liars want to be perceived as honest and will therefore avoid behaviors that they believe might give them away. For example, applied to the CBCA criteria (Table 1), 'if a liar believes that "Admitting Lack of Memory" will undermine credibility, he or she will try to avoid such behavior' (Köhnken, 1996, p. 273). More generally, both liars and truth-tellers will strive to deliberately incorporate into their statements aspects they believe will make them appear more credible, but liars will be more likely to suppress certain contents that they believe may give them away (see Granhag & Hartwig, 2008; Granhag et al., 2015; Vrij et al., 2016) or that could be contradicted by independent evidence (Vrij & Nahari, 2019). Impression management aspects have been explicitly invoked as a theoretical background for the motivational CBCA criteria by Niehaus (2008), Rönspies-Heitman (2022), and Volbert and Steller (2014). The latter differentiate among three kinds of verbal content characteristics related to strategic self-presentation: (a) contents that may indicate memory-related deficits (which liars will presumably tend to avoid: Spontaneous Corrections and Admitting Lack of Memory); (b) contents that may cast doubt on sincerity (Raising Doubts about One's Own Testimony); and (c) other problematic contents (Self-deprecation and Pardoning the Perpetrator).

Past CBCA research

In the most comprehensive meta-analysis conducted to date examining the association between behavioral cues and deception, DePaulo et al. (2003) found a median d of 0.10 across all studies and the 158 cues investigated. Their review included only a handful of studies (k ≤ 6) investigating verbal content cues to deception, including CBCA criteria.

More recent reviews have focused specifically on CBCA. Vrij (2005; see also Vrij, 2008) reviewed 'the first 37 studies' of CBCA, which included both laboratory experiments and research conducted in the field. To examine the validity of the criteria, Vrij tallied the number of studies in which each individual CBCA criterion had been found significantly more often in truthful (compared to deceptive) accounts, in deceptive (vs. truthful) accounts, or with similar frequency in both conditions (i.e. no significant difference). A problem with this 'vote-counting' approach is that merely tallying the number of significant effects can be misleading: it gives small studies the same weight as large ones, and it gives no indication of the magnitude of the effects (for critical appraisals of, and potential improvements on, vote-counting see Bushman, 1994; Bushman & Wang, 2009; Hedges & Olkin, 2000).

Meta-analyses avoid most of these problems, and two large meta-analyses have been published on the validity of CBCA criteria. One focused on children’s accounts (Amado et al., 2015), while the other one considered adults’ accounts (Amado et al., 2016). Both these meta-analyses included published and unpublished reports, as well as both laboratory and field studies. In the meta-analysis of children’s accounts, significant differences between truthful and deceptive statements were found for all CBCA criteria (with k ranging between five and 17 studies depending on the specific criterion). In the meta-analysis of adults’ accounts, all criteria but Self-deprecation (d = 0.04) and Pardoning the Perpetrator (d = −0.02) significantly differed between truthful and deceptive statements (with k ranging from 5 to 35). However, effect sizes were generally small in both these meta-analyses. In Amado et al. (2015), Cohen’s d was smaller than 0.50 for 15 out of 19 criteria, and only for CBCA19 (with k = 5) was d larger than 0.80. In Amado et al. (2016), d was smaller than 0.50 for 17 out of 19 criteria, and the average d across criteria was only 0.25.

More recently, Oberlader et al. (2016, 2021) also meta-analyzed CBCA. However, rather than focusing on the validity of individual criteria, which is our focus here, they examined three dependent variables: dichotomous true/false decisions made by human raters, classifications based on multivariate statistical significance tests (e.g. discriminant analysis), and classifications based on summary scores.

Why this systematic review?

CBCA is employed in several countries to assess the credibility of child sexual abuse (CSA) allegations in forensic contexts. The reviews summarized above did not focus specifically on field studies but combined both field and laboratory studies. Therefore, their relevance for real-life settings is limited. Indeed, there are several crucial differences between laboratory CBCA experiments and real-life cases, and these differences can influence the potential of the criteria to separate self-experienced from invented accounts.

First, while most experiments focus on single events, in real criminal cases the events may be experienced repeatedly. This may influence verbal content criteria (e.g. Connolly & Lavoie, 2015; Strömwall et al., 2004). Second, compared to many laboratory experiments, where all truthful participants describe the same or a similar type of event, real-life events are often more heterogeneous. This may lead to an overestimation of the validity of criteria in the laboratory (relative to the field). Third, while in CBCA experiments participants are typically interviewed once, this is often not the case for alleged victims in forensic settings. For one thing, they are likely to have described the event to someone else before testifying to the police or a child-protection professional. Moreover, these victims are often interviewed several times after the police or the justice system take over the case (e.g. Finkelhor et al., 2005). Arguably, single versus repeated interviewing may affect verbal content criteria. Last but not least, real-life alleged victims are more culturally or ethnically diverse than experimental participants. Some authors have argued that the narrator’s culture or ethnicity may influence CBCA criteria (Cacuci et al., 2021; Ruby & Brigham, 1998). All these aspects suggest that a review focused specifically on field studies may provide a more accurate picture of the validity of CBCA criteria in real-life contexts than extant reviews, which include both field and laboratory studies.

An additional problem concerns the specific field studies included in prior reviews or meta-analyses. Conducting deception research in field settings is extremely challenging because of the uncertainty concerning ground truth. In other words, while in the laboratory the experimenter knows for sure who is lying and who is telling the truth, this crucial condition is largely unknown in real cases. Table 2 lists criteria suggested by several authors to assess the ground truth of CSA allegations (see Honts, 1994; Horowitz et al., 1995; Lamb et al., 1997; Welle et al., 2016). Additionally, in CBCA studies statements should also be considered doubtful if the event is somehow confirmed but the wrong perpetrator is accused, i.e. there is reliable evidence, such as DNA tests or a confirmed alibi, showing the offender was not the suspect.

Table 2. Main criteria used to determine ground truth of the allegations in field studies of child sexual abuse.

Although all these ground truth criteria have been suggested, studies differ considerably in the specific criteria used, the number of criteria considered, the cutoffs employed to separate likely from unlikely cases (e.g. particular combinations of inclusion and/or exclusion criteria), and the way the criteria were assessed (e.g. dichotomous decisions by the researchers vs. scalar ratings provided by independent blind raters). Also, some of the criteria listed in Table 2 are stronger ground truth indicators than others, and some may definitely be problematic. For example, in countries where the police can use confrontational interrogation tactics, or where plea bargaining exists, suspect confessions can hardly be a reliable indicator of the validity of the allegation. Moreover, because some criteria are not independent (e.g. a confession greatly increases the likelihood of a guilty verdict; Kassin, 2012; Kassin & Neumann, 1997), the co-presence of several criteria does not necessarily increase our confidence that the alleged event did happen.

In several field studies, ground truth was assessed by the same forensic experts who evaluated the credibility of the statement, often using (some of) the very same credibility criteria whose validity was to be tested. Obviously, these studies are flawed because of this circularity. Ground truth should be established based on independent evidence, that is, evidence other than the CBCA criteria, and by coders blind to the CBCA coding.

All the above considerations show that there is large variation across individual field studies in how ground truth was determined. While some studies used very stringent criteria, others are fatally flawed. We are concerned that some problematic field studies have been included in previous reviews and meta-analyses. To address this problem, we systematically assessed the way ground truth had been determined in all the retrieved field studies and selected for this review only those studies that met our criteria (see also Supplemental Materials 1: https://osf.io/5tsbz).

On the other hand, some of the field studies we retrieved and considered to be methodologically sound were not included in previous reviews or meta-analyses. These studies are included in this review.

Finally, an important goal of this review is to explore whether CBCA can be used to assess real-life events other than CSA. While many field studies examined child abuse allegations, others focused on other significant autobiographical experiences, often with adult participants. Below, in addition to describing each individual study, we also report the combined effect sizes for each CBCA criterion separately for the CSA studies and the studies focusing on other autobiographical events. Additionally, we briefly describe a field study that does not focus on validity but on the psychometric properties of CBCA criteria and the role that the criteria play in court experts’ final case assessments.

In short, there are several reasons for conducting this systematic review: (a) Prior reviews that combined field studies and laboratory experiments either included fewer studies (DePaulo et al., 2003) or focused only on summary indices (Oberlader et al., 2016, 2021). (b) Other vote-counting reviews (Vrij, 2005, 2008) and meta-analyses (Amado et al., 2015, 2016) combined studies from laboratory and field settings. But field and laboratory settings differ in important aspects that may arguably impact the validity of CBCA criteria. Therefore, conclusions and recommendations derived from good-quality field studies only are more relevant for field settings than those derived from a combination of field and laboratory studies. (c) Prior reviews included several field studies that used highly questionable ground truth criteria, or which had other methodological flaws. (d) Conversely, some methodologically sound studies were overlooked in prior reviews. (e) Finally, it is crucial to investigate whether CBCA can be used to assess not only child sexual abuse allegations but also accounts describing other kinds of significant autobiographical events.

Method

This systematic review is part of a larger project on the validity of CBCA and reality monitoring (RM; see Masip et al., 2005; Sporer, 2004) criteria in both field studies and laboratory experiments. Therefore, several literature searches were first performed (up to January 2014) on both these approaches. We searched Dissertation Abstracts, Google Scholar, PsycInfo, the Web of Science, Scopus, and WorldCat, entering authors’ names, ‘Criteria-based Content Analysis’, or ‘CBCA’ as keywords. We also searched German databases (OPAC, PSYNDEX, ZPID-Datenbank Diplomarbeiten) with the keywords ‘Glaubwürdigkeit’ (credibility; in combination with ‘-merkmale’, ‘-kriterien’, ‘-diagnostik’), ‘Realkennzeichen’ (reality criteria), ‘Merkmals-’, or ‘Kriterien-orientierte Inhaltsanalyse’ (characteristics- or criteria-oriented content analysis). In 2018 and again in 2023, we conducted additional searches in the Web of Science and Scopus. Specifically, we searched for reports published up to December 2022 that contained ‘Criteria-based Content Analysis’, ‘Criterion-based Content Analysis’, or ‘CBCA’ in the title, the abstract, or the keywords. In addition, the reference lists of other reviews and meta-analyses were examined. Finally, we wrote to numerous authors who had supervised unpublished dissertations, Master's theses, and Diploma theses, as well as to the libraries of several universities, to obtain those unpublished documents.

Over 800 sources were located in this way. They were subsequently screened against the following inclusion criteria: only primary empirical studies were to be included; these studies had to apply at least one CBCA criterion to transcripts of spoken or written truthful or deceptive accounts of past autobiographical experiences; and the language of the report had to be English or German. In addition, any report meeting one or more of the following exclusion criteria was excluded: senders had been coached on CBCA; the original statements had been manipulated or edited; or computer programs had been used for automatic coding of verbal content criteria. After applying these inclusion and exclusion criteria, the remaining records were scanned to select field studies and quasi-experiments conducted in the field. The number of retained studies was 36.

Each of these studies was then carefully examined by two coders considering the ground truth criteria listed in Table 2. Specifically, an Excel spreadsheet was created with one study in each row and each of the ground truth criteria of Table 2 in a separate column. The two coders had to indicate whether the study authors had assessed each criterion (1) or not (0). Subsequently, the coders had to specify, in an additional column, whether the study was (a) methodologically sound (coded as 2), (b) acceptable with reservations (1), or (c) to be excluded (0). Note, however, that these final global decisions were based not only on the rough number of ground truth criteria present, but also on criteria combinations and/or the overall quality of the decision-making process, which was extremely sophisticated in some studies (for instance, see below, in the Results section, the descriptions of how ground truth was determined by Akehurst et al., 2011; Lamb et al., 1997; and Roma et al., 2011). Similarly, serious methodological problems other than uncertainty about ground truth also led to classifying studies as not acceptable for inclusion. Inter-rater agreement for the allocation of studies to one of the three categories was good (weighted kappa = .86). Disagreements usually involved one rater assigning a 1 where the other assigned a 0 or a 2; there was never a 2 vs. 0 disagreement. None of the studies that were finally excluded was rated 2 by either coder, and only one of the studies eventually included (Rüth-Bemelmanns, 1984) was rated 0 by one coder (and 1 by the other). Disagreements were resolved by discussion.
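For readers who want to reproduce this kind of agreement check, the following is a minimal sketch using scikit-learn's weighted kappa on the 0/1/2 quality codes. The ratings below are hypothetical, and we assume linear weights; the review does not state which weighting scheme was used.

```python
# Minimal sketch of the study-quality agreement check (hypothetical data).
from sklearn.metrics import cohen_kappa_score

# Quality codes: 0 = exclude, 1 = acceptable with reservations,
# 2 = methodologically sound. One entry per study (invented values).
rater_a = [2, 2, 1, 0, 0, 2, 1, 0, 2, 0]
rater_b = [2, 2, 1, 0, 1, 2, 0, 0, 2, 0]

# Linear weights penalize a 0-vs-2 disagreement twice as much as 0-vs-1 or 1-vs-2.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"weighted kappa = {kappa:.2f}")
```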

In the end, 12 studies were classified as methodologically sound and two (H.-U. Bender, 1987; Rüth-Bemelmanns, 1984) as acceptable with reservations. The latter two studies used somewhat different criteria definitions and hence are not fully comparable for meta-analytic purposes. These 14 studies are listed in Table 3 and described in the Results section below. The other 22 studies were excluded because of either ground truth concerns or methodological limitations; they are described in Supplemental Materials 1 (https://osf.io/5tsbz), along with an explanation of the reasons for their exclusion. The data that support the findings of this study are openly available at https://osf.io/5tsbz (Supplemental Materials 3).

Table 3. Overview of studies included in this systematic review.

Results: description of studies

As shown in Table 3, the 14 retained studies can be classified into two broad categories. The first, with seven studies, includes archival analyses of records of criminal cases in which verbal content criteria were examined. Six of these seven studies were on CSA. These resemble the types of cases for which CBCA was originally designed, and they are described in the next section. The second category contains seven quasi-experiments conducted in the field that also tested the validity of CBCA criteria. Most participants in these studies were adults, and they reported on autobiographical experiences other than sexual abuse or criminal events. These quasi-experiments are described below, after the archival studies. We also briefly describe a field research project that, though not focused on the validity of CBCA criteria, may be of interest to CBCA scholars and practitioners.

As explained above, a goal of this research was to examine whether CBCA can be used to assess not only CSA allegations but also accounts describing other kinds of significant autobiographical events in forensic and other contexts. To this end, we first calculated the effect sizes (Hedges's unbiased g) for each CBCA criterion for the 12 methodologically sound studies. We then computed the unweighted mean effect sizes (Hedges, 2019) separately for (a) the six methodologically sound CSA studies and (b) the six methodologically sound quasi-experiments about other kinds of events. Note that the two studies rated acceptable with reservations were purposefully excluded from these calculations because of methodological concerns. The outcomes of these analyses are presented at the end of the Results section.
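To make these computations concrete, here is a minimal sketch of Hedges' unbiased g and the unweighted mean across criteria; all group statistics below are invented for illustration, and the review's own calculations may differ in detail (e.g. in how dichotomous criteria or cell counts were handled).

```python
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference with Hedges' small-sample correction."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / s_pooled
    j = 1 - 3 / (4 * df - 1)  # approximate unbiasing factor
    return j * d

# Hypothetical criterion scores: truthful group vs. deceptive group.
g_per_criterion = [
    hedges_g(2.1, 0.9, 40, 1.6, 0.8, 35),
    hedges_g(1.4, 0.7, 40, 1.1, 0.7, 35),
]
print(sum(g_per_criterion) / len(g_per_criterion))  # unweighted mean g
```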

Criminal trial or child sexual abuse studies

H.-U. Bender (1987): archival analysis of perjury cases in Germany

In line with psychological writings (Arntzen, 1983/1993; Köhnken, 1982; Undeutsch, 1967; Wegener, 1981) and legal texts by numerous authors (e.g. R. Bender & Nack, 1981; Leonhardt, 1931; Peters, 1972/1974/1976), Hans-Udo Bender (1987) defined several credibility (truth) and fantasy (lie) criteria prior to Steller and Köhnken's (1989) integrative catalogue (see Sporer & Masip, 2023). Notably, H.-U. Bender connected his theoretical analysis to Bayes' theorem, deriving likelihood ratios based on a-priori and a-posteriori probabilities. He also considered, and this was truly innovative, dependencies and covariations among specific subsets of criteria. We will return to this issue in the Discussion.

More relevant for our review, Hans-Udo Bender (1987) used these criteria to analyze archival records of court cases of intentional perjury. Perjury by witnesses or criminal justice personnel is considered among the top risk factors in archival studies of miscarriages of justice (Huff et al., 1996). H.-U. Bender selected these types of cases to avoid the issues of circular validation often present in field studies on the credibility of sexual abuse allegations.

The statements to be analyzed were literal transcripts of utterances by the accused (but often without the preceding questions asked by interviewers). Protocols came from both the earlier trials and the later interrogations and trials. In the earlier trials, a person (witness) had made a statement in favor of the defendant. However, the statement was later suspected to be an intentional lie told to help the defendant (who was a relative, friend, or work colleague of the witness) avoid conviction. In the later interrogations and trials, the same person (i.e. the former witness) made a statement as a defendant accused of perjury. In more than half of the cases, the defendant in the earlier trial had induced the witness to make the false statement.

Over 50% of the cases were validated by objective evidence (such as a log book of car travel times) or by later confessions of perjury. In a few cases, contradictory statements made by other witnesses or other aspects of the case also led to a conviction. Statements by other witnesses were classified as truthful when they contradicted the statements that had been classified as perjury. Because truthful statements frequently came from several witnesses, credible statements (n = 53) outnumbered perjuries (n = 42).

Criteria in the transcripts were coded by marking relevant passages as 'clearly present'. To demonstrate at least some effort to establish inter-coder reliability, 12 statements were each coded by three experienced practitioners and the author. Although no formal statistical indices of inter-coder reliability were calculated, percentages of agreement were highly satisfactory for these cases (see the appendix in H.-U. Bender, 1987, pp. 190–213).

However, the main problem is that the cases of the main study were apparently coded only by the author, who may not have been blind to the ultimate case outcomes (although the outcomes were presumably recorded in parts of the case files other than the statements). Therefore, we report the following results with this caveat in mind, only to demonstrate an entirely different application area for statement validity analysis.

The set of criteria coded by Hans-Udo Bender (1987) contained several indicators that differed from those later included in CBCA (e.g. 'homogeneity' as a truth criterion derived from Trankell, 1972, or 'inconstancy/stereotypicality' as a fantasy/lie criterion). Hence, we present here (Figure A.01 in Supplemental Materials 2, https://osf.io/5tsbz) effect sizes only for those criteria that most closely resembled the typical CBCA criteria (for an analysis of the other criteria, see Sporer & Masip, 2023). We also report four lie criteria also examined by other authors (Figure A.02 in Supplemental Materials 2, https://osf.io/5tsbz). All truth criteria had positive effect sizes, ranging from 0.233 to 1.652. Fantasy signals (lie criteria) had negative effect sizes, ranging from −0.411 to −1.520. Interestingly, additional truth and lie criteria were coded for repeated statements, which could guide future research (see Sporer & Masip, 2023).

H.-U. Bender also reported prevalence rates of truth and lie criteria in truthful and deceptive accounts as indicators of their practical relevance (Praxisrelevanz). This emphasizes the point that a specific criterion may well be diagnostic of truth status in a given case, but may nonetheless not be very useful if it is only rarely encountered in this type of case (we will return to this topic in the Discussion section). Furthermore, Hans-Udo Bender suggested an index of the evidentiary value of each criterion (Indizstärke) in the form of a diagnosticity ratio (= the percentage of true accounts in which the criterion is present divided by the percentage of lies in which it is present).
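In our notation (not H.-U. Bender's original), this diagnosticity ratio can be written as:

```latex
\[
\text{Indizst\"arke}
  \;=\;
  \frac{P(\text{criterion present} \mid \text{truthful account})}
       {P(\text{criterion present} \mid \text{deceptive account})}
\]
```

Values well above 1 mark a useful truth criterion and values below 1 a lie criterion, although the ratio by itself says nothing about prevalence, which is why the prevalence rates were reported separately.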

Boychuk (1991): unpublished doctoral dissertation on child sexual abuse cases in Phoenix, Arizona

Boychuk, who had been involved in a study by Raskin and Esplin (1991a) that we excluded (see Supplemental Materials 1 at https://osf.io/5tsbz), also conducted a separate, related study for her dissertation, testing CBCA with a larger sample of CSA allegations. Although we originally found problematic issues in Boychuk's dissertation similar to those in Raskin and Esplin's study, we believe that Boychuk tried to overcome some of them.

Nonetheless, awareness of case outcomes from the media could also have played a role in Boychuk's dissertation. To control for this problem, two highly experienced raters who were blind to conditions coded all cases. Only afterwards did they discuss the cases with an expert. The final sample of N = 75 comprised three groups of 25 cases each. Group A ('highly confirmed') included only cases with a confession, 'grossly abnormal medical evidence', and a criminal sanction of the accused. Group B included cases with a confession and criminal sanctions. Finally, Group C included 'highly doubtful' cases with no confession, a 'truthful' outcome of a polygraph test of the accused, no medical evidence, and an expert opinion and judicial dismissal of the case (both concluding that abuse was unlikely). Unfortunately, the author only reported comparisons of Groups A + B vs. C; that is, she counted the middle group, which is more ambivalent with respect to ground truth, as part of the 'confirmed' cases (cf. Lamb et al., 1997, who used more stringent validation procedures).

Figure A.03 (Supplemental Materials 2, https://osf.io/5tsbz) displays the effect sizes for this study. Note, however, that we detected a typing error in the author's calculation of χ² for CBCA19, which was much higher than the χ² values reported for all other criteria and would consequently yield an incorrect d = 1.94. When we recalculated χ² and the respective effect size on the basis of the proportions reported, the corrected g = 0.413 was much smaller. The overall unweighted mean for this study was g = 0.861, 95% CI [0.362, 1.356].
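For readers who wish to check such figures themselves, one standard route converts a 2×2 chi-square statistic to d via the phi coefficient. This is only a sketch with invented numbers; we cannot be certain it is the exact conversion used in the original report.

```python
import math

def chi2_to_d(chi2, n):
    """Convert a 2x2 chi-square to Cohen's d via phi (phi equals r for a 2x2 table)."""
    phi = math.sqrt(chi2 / n)
    return 2 * phi / math.sqrt(1 - phi**2)

# Hypothetical values: chi-square of 7.5 in a sample of N = 75 cases.
print(chi2_to_d(7.5, 75))  # about 0.67
```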

Because of Boychuk's (1991) typing error for CBCA19, Amado et al.'s (2015) meta-analysis with children may have overestimated the mean effect size for this criterion, given the small number of field studies included (k = 5, reported g = 1.25, 95% CI [1.10, 1.40]). Similarly, Amado et al.'s (2015) overall estimate for the CBCA summary score for eight field studies, g = 2.40, 95% CI [0.82, 4.60], appears all too optimistic.

Lamb et al. (1997): field study with child sexual abuse allegations in Israel

Lamb et al. (1997) conducted a field study in Israel examining the validity of CBCA in CSA cases. Out of a much larger pool, the authors selected 98 cases (28 males and 70 females; M age = 8.72 years, SD = 2.35, range: 4–13) for which (a) transcribed interviews existed, and (b) independent information from law enforcement authorities was available that made it possible to determine the degree of plausibility of the allegation. All interviews had been conducted in Hebrew by social workers who were specifically trained to interview children.

The authors developed Independent Case Fact Scales (ICFS) that raters would later employ to establish the plausibility of the allegations. More than 20 practitioners (prosecutors, defense attorneys, physicians, psychologists …) were consulted in developing the scales. The scales covered five types of evidence: medical evidence, physical or material evidence, witness statements, suspect statements, and miscellaneous information. At least two raters, who were blind to the quality of the allegations and to the CBCA scores, rated each account on these five dimensions after examining independent case information. On each of these dimensions, the raters had to determine whether the available relevant information made the allegation seem very likely, quite likely, questionable, quite unlikely, or very unlikely. They could also indicate that no judgment was possible, that the information was not relevant, or that the relevant information had not been obtained. Certain response options were purposefully excluded for certain dimensions. For example, because the absence of medical findings does not prove that no abuse occurred, the medical findings dimension was never rated as quite unlikely or very unlikely. The coding rules were explicit and extensive, and consistency was therefore very high (< 5% disagreements).

Not all five kinds of evidence were available for all cases. Specifically, plausibility judgments were based on one kind of evidence in 44 cases, two kinds of evidence in 41 cases, three kinds of evidence in 11 cases, and four or more kinds of evidence in the remaining two cases. After evaluating all possible dimensions for a case, the raters made an overall plausibility judgment, allocating the allegation to one of these categories: Very likely (n = 53 cases), quite likely (n = 23), questionable (n = 9), quite unlikely (n = 10) or very unlikely (n = 3). (Fourteen additional cases had been initially considered, but they were allocated to a separate no judgment possible category and were hence excluded from the study). The raters disagreed for only two cases, and these disagreements were resolved by taking the more conservative option. Unlikely or very unlikely allegations were supported by contrary evidence rather than being merely unsupported by confirmatory evidence.

Independent coders were extensively trained in using CBCA. Each criterion was coded as absent or present. The training involved practice coding, which continued until the raters (who coded statements independently) reached at least 90% agreement in determining whether each criterion was present or absent. After the training, at least two of these coders rated each account included in the study using CBCA Criteria 1 through 14. Inter-rater agreement was around 90%, and disagreements were resolved by consensus between the two original raters and two additional experienced coders.

Lamb et al. (1997) reported the prevalence (percentage present) of each CBCA criterion in plausible (very likely and quite likely groups combined; n = 76) and implausible accounts (quite unlikely and very unlikely groups combined; n = 13). Fisher's exact tests revealed that CBCA Criteria 2 through 6 were present significantly more often in truthful than in deceptive accounts. CBCA01 was present in all cases regardless of plausibility, and Criteria 9 through 11 occurred only rarely, which prevented the authors from properly assessing their discriminative value. Based on the percentages reported by Lamb et al., we calculated effect sizes g for each criterion; these are displayed in Figure A.04 in Supplemental Materials 2 at https://osf.io/5tsbz. Effect sizes varied widely, from −0.41 to 1.32, with an unweighted mean of 0.401, 95% CI [−0.185, 0.987]. The effect size for the CBCA summary score (= the sum of CBCA01 to CBCA14) was larger, g = 0.823, 95% CI [0.228, 1.419].
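As an illustration of this kind of prevalence comparison, the following sketch runs Fisher's exact test on a 2×2 presence-by-plausibility table; the cell counts are invented for illustration, not Lamb et al.'s data.

```python
from scipy.stats import fisher_exact

# Hypothetical prevalence: criterion present in 60 of 76 plausible accounts
# and in 5 of 13 implausible accounts.
table = [[60, 16],   # plausible: present, absent
         [5, 8]]     # implausible: present, absent
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```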

Lamb et al. (1997) reported several additional analyses with the total CBCA score. These analyses showed that as judgments of the plausibility of abuse decreased, so did the CBCA summary score. Finally, the older the children were, the higher their total CBCA scores, r = .40, p < .001, but plausibility and age were not correlated. Gender was correlated with neither the CBCA summary score nor plausibility. Note that the proportion of confirmed cases (0.85) was higher than in any other study we reviewed (Table 3).

In a subsequent study, Hershkowitz (1999) compared a subset of Lamb et al.'s (1997) 'plausible' (n = 12) and 'implausible' (n = 12) accounts in terms of both the total number of criteria present (vs. absent) in each statement (for CBCA01 to CBCA14) and the 'strength' (frequencies) of the criteria (for CBCA04 to CBCA14) in response to different types of interviewer utterances (as well as across all utterance types). Because Hershkowitz used only a subset of the accounts, her results are less comprehensive than those of Lamb et al. (1997). Therefore, we do not describe her results in detail. Nevertheless, Hershkowitz's study makes a meaningful contribution by exploring the impact of the interviewer's utterance type on the presence and strength of CBCA criteria in truthful and deceptive accounts.

Craig et al. (1999): field study with child sexual abuse allegations in Salt Lake City, Utah

Craig et al. (1999) contacted law-enforcement agencies to conduct a 'limited search of past cases that had been closed and could be categorized into one of two confirmation categories' (Craig et al., 1999, p. 79): 'confirmed' and 'highly doubtful' cases of CSA allegations. Information from 48 cases was obtained independently of the CBCA evaluations. Cases were considered confirmed (n = 35) if the accused confessed prior to plea bargaining (34 cases), failed a polygraph test (1 case), or medical evidence supported the allegation.

Of the 13 doubtful cases, seven involved children who recanted and nine involved accused individuals who passed a polygraph test. Although seven of the doubtful cases contained medical evidence, four of these cases were classified as highly doubtful based on other aspects (e.g. the child had engaged in consensual sex with a same-age person who was not the suspect). Some interviews were conducted by interviewers who had been trained in SVA, others by interviewers without that training. Unfortunately, the interviews contained many direct questions and multiple questions (i.e. the interviewer asked several questions in the same turn) and may therefore be considered suboptimal (Roberts & Lamb, 2010). Furthermore, ground truth criteria such as confessions and recantations are not foolproof.

Four CBCA raters, after about eight hours of training, rated all transcripts for the presence of CBCA04 to CBCA13, as well as a combination of CBCA14 and CBCA15 ('Spontaneous Corrections, Additions, or Lack of Memory'; see Craig et al., 1999). A criterion was considered present when three of the four raters considered it present; ties (2-2 splits) were resolved by the first author. Subsequently, the coders rated the three General Characteristics (CBCA01 to CBCA03). Of the 6963 'utterances' made by the alleged victims, 6363 contained no criteria. In other words, there was a low prevalence of occurrence, which affects both inter-coder reliability and validity. Inter-coder reliabilities were high in terms of percent agreement but not as high for Cohen's kappa and weighted kappa.

Effect sizes for individual criteria ranged between −0.080 and 0.642 (see Figure A.05 in Supplemental Materials 2, https://osf.io/5tsbz), with an unweighted mean g = 0.311, 95% CI [−0.318, 0.940]. The summary score of all included CBCA criteria (CBCA01 to CBCA14) yielded 'significantly' (p < .05) larger values for confirmed than for doubtful cases (Craig et al., 1999). However, the effect size we calculated contained 0 in the 95% confidence interval, g = 0.590, 95% CI [−0.047, 1.227], p = .071. Accounts of 3–9 year-olds contained significantly fewer CBCA criteria than reports of 10–16 year-olds, g = 1.107, 95% CI [0.508, 1.706], irrespective of truth status.

Akehurst et al. (2011): field study with child sexual abuse allegations in central England

In this study, 31 sexual abuse statements made by children to the police were examined. The statements were selected from an original pool of 176 cases retrieved via computer searches of police records. Only cases for which ground truth could be established with reasonable certainty were retained. To determine whether a statement was truthful, the authors considered the following criteria: (a) medical evidence (e.g. DNA evidence), (b) a videotape of the event, (c) corroboration by another victim or witness (not known to the alleged victim), (d) a guilty verdict by a court of law, and (e) the suspect's confession prior to plea bargaining. Criteria a, b, and c were considered strong indicators of truthfulness and were independent of the quality of the account. For a case to be included in the truthful condition, it had to satisfy at least three of the five criteria, and also at least one of the three strong criteria.

To determine whether an account was fabricated, the authors considered the following criteria: (f) evidence that the offense could not have been committed the way described by the child (e.g. a description contrary to the laws of nature), (g) evidence showing that the event could not have happened at the time it was alleged to have occurred (e.g. CCTV evidence showing that the offender was somewhere else at the time of the crime), (h) a comprehensive and plausible retraction, and (i) a persistent not-guilty plea by the alleged offender. Criteria f and g were considered strong indicators of fabrication and were independent of statement quality. For inclusion in the fabricated condition, a case had to meet at least three of the four criteria considered, which necessarily included at least one of the two strong criteria. Additionally, none of the truthful cases could include any of the criteria used to determine fabrication, and none of the fabricated cases could include any of the criteria used to determine truthfulness (see Akehurst et al., 2011, p. 238, for more detail).

In the end, 21 truthful accounts (delivered by three males and 18 females, M age = 10.5 years, SD = 3.07, range: 6–16) and 10 fabricated accounts (from two males and eight females, M age = 12.60 years, SD = 1.95, range: 8–15) were selected. Truthful and deceptive statements did not differ significantly in length, g = 0.611, 95% CI [−0.138, 1.360]. All accounts had been collected by UK officers who were specifically trained in appropriate child interviewing techniques.

Two professional psychologists were trained in rating the CBCA criteria. The training involved reading descriptions of the criteria, followed by two in-person sessions with discussions, practice exercises, and feedback. Subsequently, the two coders independently rated CBCA01 through CBCA18 for all 31 transcripts, using 1 (absent) to 5 (strongly present) scales. The raters were aware of the age of each child, but they were blind to both the truthfulness of each statement and the base rate of truthful vs. deceptive accounts. After assessing all 18 criteria, each rater made a dichotomous true/false judgment. Inter-rater reliabilities were Pearson's r ≥ .75 for eight criteria and < .50 for CBCA11 and CBCA17. For the sum score, r = .91. Regarding the final true/lie judgment, the raters disagreed on 7 of the 31 statements (22%). In any case, Akehurst et al. (2011) reported the results separately for each rater.

Although Akehurst et al. (2011) reported descriptive statistics for only a few individual criteria, the first author kindly emailed us the data for the remaining items. Effect sizes for the individual criteria, the unweighted mean, g = 0.343, 95% CI [−0.396, 1.081], and the summary score, g = 0.846, 95% CI [0.083, 1.609], are displayed in Figure A.06 in Supplemental Materials 2 (https://osf.io/5tsbz). The correlation between age and the CBCA summary score was close to zero, r = .03, ns.

The authors conducted two discriminant analyses (one for each rater) with the CBCA summary score as the only predictor variable and the truthfulness of the accounts as the classification variable. For each of the two raters, the analysis yielded an accuracy rate of 68% (67% for truthful accounts and 70% for fabricated accounts). However, no cross-validation was used in these analyses; thus, these values overestimate discrimination (see Kleinberg et al., 2019; Sporer et al., 2021).
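The cross-validation point can be illustrated with a short sketch: with leave-one-out cross-validation, each account is classified by a discriminant function fitted without that account, avoiding the optimism of resubstitution accuracy. The data below are random placeholders, not the study's data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 1))                        # placeholder CBCA summary scores
y = np.r_[np.ones(21), np.zeros(10)].astype(int)    # 21 truthful, 10 fabricated accounts

# Leave-one-out: each case is predicted by a model that never saw it.
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut()).mean()
print(f"cross-validated accuracy = {acc:.2f}")
```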

Regarding the raters' overall true/lie judgments, Rater 1 was accurate 84% of the time (95% for truthful accounts and 60% for fabricated accounts; only for truthful accounts was the accuracy rate significantly above chance). Rater 2 was accurate 81% of the time (81% for truths and 80% for lies, both significantly above chance). Our recalculations of the response bias B″ (the values reported in Footnotes 3 and 4 of the article appear to be incorrect) showed that Rater 1 had a strong truth bias, whereas Rater 2 had adopted a neutral decision criterion.
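Our recalculation can be reproduced from the published accuracy rates. Below is a minimal sketch of Grier's B″, treating 'truth' as the signal response, so that hits are truthful accounts judged truthful and false alarms are fabricated accounts judged truthful; negative values indicate a liberal (truth-biased) criterion. The formula shown assumes the hit rate is at least as large as the false-alarm rate, which holds here.

```python
def grier_b(hit_rate, fa_rate):
    """Grier's B'' response bias (valid for hit_rate >= fa_rate)."""
    num = hit_rate * (1 - hit_rate) - fa_rate * (1 - fa_rate)
    den = hit_rate * (1 - hit_rate) + fa_rate * (1 - fa_rate)
    return num / den

# Hit and false-alarm rates implied by the reported accuracies.
print(grier_b(0.95, 0.40))  # Rater 1: about -0.67, a strong truth bias
print(grier_b(0.81, 0.20))  # Rater 2: about -0.02, essentially neutral
```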

Roma et al. (2011): field study of child sexual abuse statements given in court in Italy

In a field study of trials of alleged CSA, the free narratives of 109 children were obtained from a large pool of 487 cases. Cases came from different courts throughout Italy. All accounts had been given by the alleged child victims in court. A series of successive steps was undertaken to establish ground truth.

First, from these 487 cases the authors selected only those in which the child was between 3 and 15 years old, had sufficient language proficiency, had an IQ (assessed with the WISC-R) over 70, had no neurological or neuropsychiatric disorder, and had not been interviewed about the event more than twice before her/his court testimony. In addition, the abuse had to involve physical contact with a known perpetrator, and the statement had to have been collected without suggestive questions and had to be long enough for meaningful analysis (≥ 1,000 words excluding articles, prepositions, and conjunctions). After screening all cases against these criteria, 239 cases remained (158 with a guilty verdict and 81 with an acquittal).

Second, the authors considered the following kinds of independent evidence: Telephone tapping, independent witnesses, pornographic material showing the alleged victim, biological (DNA) evidence of abuse, and unretracted confession by the purported perpetrator. Ninety-two of the guilty-verdict cases were supported by one or more of these kinds of evidence and were retained, while none of the 81 acquittal cases had any corroborating piece of information. Third, additional cases were eliminated because some interview questions had been suggestive; this left 81 guilty-verdict cases and 66 acquittal cases. Fourth, these were then assessed by three experienced forensic experts (two psychologists and one neuropsychologist) who classified each case as either likely or unlikely based on witness, medical, biological, and objective evidence. These three experts had no access to the children's testimonies or to the legal outcome of the cases. They worked independently and reached a consensus during a subsequent meeting. Only cases in which the experts’ conclusion was the same as the court decision (guilty verdict vs. acquittal) were retained (74 ‘confirmed’ and 59 ‘unconfirmed’ cases). Finally, to eliminate possible distorting variables, the authors also excluded testimonies collected directly by the judge without the help of a child expert.

At the end of this laborious selection process, 109 cases involving children between four and 14 years were retained, 60 of which had been classified as confirmed (48 females, 12 males; M age = 9.21, SD = 3.57) and 49 as unconfirmed (38 females, 11 males; M age = 7.81, SD = 3.30). The difference in mean age between confirmed and unconfirmed cases was significant (p = .037), g = 0.403, 95% CI [0.024, 0.781], thus posing a potential threat to the validity of the CBCA criteria analyzed.

All criteria but Unstructured Production, g = −0.050, 95% CI [−0.513, 0.414], showed positive effects up to g = 1.878, 95% CI [0.319, 3.437], and all differences but three (CBCA02, CBCA08, CBCA11) were significant (see Figure A.07 in Supplemental Materials 2, https://osf.io/5tsbz). The nonsignificant exceptions may have been due to floor effects or, for CBCA11, to the adoption of the somewhat different definition proposed by Raskin and Esplin (1991b) for this criterion (External Associations: References to Other Sexually Toned Events).

The unweighted mean effect size was quite high, g = 0.907, 95% CI [0.214, 1.601], and the CBCA summary score (for criteria CBCA01 to CBCA14) showed a highly significant effect, g = 2.664, 95% CI [2.149, 3.180].

Welle et al. (2016): field study with child sexual abuse allegations in Geneva, Switzerland

Welle et al. (2016) had access to an original pool of 225 children’s statements collected by the Geneva police. After excluding cases other than sexual abuse, 93 accounts were retained. However, because of technical and judicial obstacles, only 60 statements were eventually available for analysis. The sample comprised 46 females and 14 males, M age = 13 years, SD = 3.70, range: 3–17 years.

The criteria Welle et al. (2016) considered to allocate a case to the confirmed (n = 40) or the unconfirmed group (n = 20) were medical evidence, the suspect's confession (only if it corroborated the contents of the child's account), witness statements, scientific evidence, physical evidence (either corroborating or falsifying the suspect's guilt), recantations by the child, and evidence of coaching. Children in the confirmed group (M age = 13.6 years, SD = 3.12) were nonsignificantly older than those in the unconfirmed group (M age = 11.9 years, SD = 4.50), g = 0.462, 95% CI [−0.074, 0.998]. This is problematic because group differences may reflect either veracity or age. Although no main effect of gender was observed, possible interactions of age and gender may have played a role, but the sample sizes were too small to test this.

Three trained raters (two psychologists and a psychiatrist) with substantial experience in using CBCA in CSA cases coded all statements. Each of the 19 CBCA criteria was coded as absent (0), present (1), or strongly present (2). An overall summary score was also calculated by summing all individual criteria scores (CBCA01 to CBCA19); thus, the summary score for each of the three raters could range from 0 to 38. Finally, each rater made a binary credible vs. not credible judgment for each statement. Unfortunately, reliabilities were limited for some individual criteria, although they were acceptable for others (see Hauch et al., 2017). Specifically, the authors computed ICCs(3,1), which were virtually zero for CBCA08 and CBCA19, .20 to .39 for six criteria, .40 to .49 for four criteria, .50 to .59 for four criteria, and .60 to .75 for the remaining three criteria. Welle et al. also calculated Krippendorff's alpha (Kalpha), which was negative for CBCA08 and CBCA19, .20 to .39 for four criteria, .40 to .49 for four criteria, .50 to .59 for six criteria, and .60 to .75 for the remaining three criteria. However, reliability for the total CBCA scores was high, with ICC(3,1) = .76 and Kalpha = .74.
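For reference, the consistency index the authors used, ICC(3,1) in Shrout and Fleiss's taxonomy (two-way mixed model, single rater), is computed from the mean squares of a targets-by-raters ANOVA, with k = 3 raters here:

```latex
\[
\mathrm{ICC}(3,1)
  \;=\;
  \frac{MS_{\text{between targets}} - MS_{\text{error}}}
       {MS_{\text{between targets}} + (k-1)\,MS_{\text{error}}}
\]
```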

To examine the discriminative value of each individual criterion, Welle et al. (2016) calculated, separately for each rater, the odds ratio for each criterion. Fortunately, the authors sent us the original data sets, which allowed us to calculate effect sizes g both for individual raters and across raters. Here we focus on the means across raters.

All but three effect sizes were positive. While the effect sizes for six criteria (CBCA02, CBCA03, CBCA04, CBCA06, CBCA12 and CBCA19) were medium to large, most other effects were rather small and nonsignificant, and three criteria were nonsignificant and negative (see Figure A.08 in Supplemental Materials 2, https://osf.io/5tsbz). The unweighted mean CBCA score did not discriminate significantly, g = 0.289, 95% CI [−0.243, 0.821], while the summary score was significantly higher for confirmed than for unconfirmed cases, g = 0.665, 95% CI [0.122, 1.208].

In addition, Welle et al. (2016) calculated the area under the ROC curve (AUC) for each individual rater based on the summary score. AUCs were .68 for Rater 1, .63 for Rater 2, and .67 for Rater 3. The AUCs did not differ significantly across raters.

The expert raters' binary classifications of the accounts as either credible or not credible were significantly associated with truth status: χ² = 11.93, p < .001, for Rater 1; χ² = 4.66, p = .031, for Rater 2; and χ² = 7.55, p = .006, for Rater 3. However, while confirmed accounts were judged correctly most of the time (Rater 1: 90%; Rater 2: 85%; Rater 3: 80%), unconfirmed accounts were judged as credible more often than as not credible (accuracy rates: Rater 1: 50%; Rater 2: 40%; Rater 3: 55%). These results suggest a truth bias: while the base rate of confirmed accounts was 67%, the percentage of credible judgments ranged between 68% (Rater 3) and 77% (Raters 1 and 2). Finally, for each of the three raters, the cases judged credible were more frequently those of older than of younger children.

In summary, the authors of this study made a strong effort to establish ground truth, train several CBCA raters, and analyze a variety of aspects that could guide future researchers. Unfortunately, because of the small sample size and noncontrollable aspects of the design in a naturalistic setting, the power of many of the analyses was probably too limited to draw firm conclusions. Perhaps most importantly, the study highlights the importance of base rates, not only for the calculation of effect sizes but also for applying CBCA to populations with different base rates.

Niveau (Citation2020), who was a coauthor of Welle et al.’s (Citation2016) paper, published a related article that included some of the same cases (plus additional ones). Unfortunately, this overlap, as well as a problem in the way ground truth was established, prevented us from including Niveau's study in this systematic review. The interested reader can find a brief summary of Niveau's (Citation2020) study in the Supplemental Materials 1 (https://osf.io/5tsbz) of excluded studies.

Quasi-experiments in the field

In the following, we describe a series of studies that attempted to test some of the underlying assumptions and the applicability of CBCA criteria in quasi-experiments in field settings in different areas of application.

Rüth-Bemelmanns (Citation1984): quasi-experimental study about owning a cat conducted in Cologne, Germany

One of the first studies to empirically investigate credibility and lie criteria in a quasi-experiment was conducted by Rüth-Bemelmanns (Citation1984) under Undeutsch's supervision. Participants were 50 children (22 females and 28 males) aged between 13 and 16 years, half of them in each truth status condition. Owners of a cat (or children knowing enough about cats from relatives, friends, or acquaintances) told a story about the experience of owning a cat to a blind interviewer (the author) and were compared to a ‘lie’ group who were to pretend they owned a cat. Transcripts of audiotaped reports were coded as frequencies by the author, using credibility criteria described by Undeutsch (Citation1967) as well as some ‘lie’ criteria.

A problem with the study is that the children were asked to ‘tell something about your cat’, followed, after the free report, by a series of questions about different types of encounters with the cat. Thus, the report was not about a singular event, as in most other studies. As there was only one rater, no inter-coder reliability could be established. At the end of coding, the author classified 100% of true accounts and 96% of lies correctly, which we consider an unrealistic ceiling effect that should not be included in a meta-analysis.

No formal statistical tests were conducted, but Rüth-Bemelmanns's (Citation1984) Appendix contained the raw data from which we calculated comparisons between the two groups. Effect sizes were all significantly positive and rather large (between g = 0.64 and 2.47). However, the definitions used by Undeutsch do not correspond sufficiently with Steller and Köhnken's (Citation1989) catalogue. We present the effect sizes in Figure A.09 in Supplemental Materials 2 (https://osf.io/5tsbz).

Wolf and Steller (Citation1997; Scheinberger, Citation1993): field study about giving birth conducted in Berlin, Germany

This study was published as a book chapter by Wolf and Steller (Citation1997), but a fuller description of the methodology was provided in a diploma thesis by Scheinberger (Citation1993). The study was included in Oberlader et al.’s (Citation2021) meta-analysis – but not as a field study – where it had the largest effect size of all (g = 3.66).Footnote9 In our view, this is a field study (though not on sexual abuse).

More specifically, the study is a quasi-experiment comparing real (n = 15) and fictitious (n = 15) giving-birth experiences. The authors selected this paradigm as a psychological analogue to sexual abuse (Steller et al., Citation1992), assuming that it entails ‘personal involvement’, a certain degree of ‘loss of control’ and a ‘negative emotional experience’ (pain). Participants first gave a tape-recorded spontaneous report, followed by a semi-structured interview. Women in the non-birth control group were provided with information brochures typically available in gynecological practices, and they were asked to prepare for one week to be able to provide a convincing account in the interview. The two groups did not differ in verbal ability as assessed with a standardized verbal intelligence test.

Our calculations below are based on the average ratings of two raters (advanced diploma psychology students) who had been trained for this study and were blind to truth status. They coded the transcripts using criteria CBCA01 to CBCA18. Criteria referring to interaction partners (CBCA13, CBCA18) were reformulated to be meaningful for this context (referring to ‘doctor’ or ‘midwife’). CBCA01 to CBCA03 were coded on a 0–2 scale. All other criteria were rated as weighted frequencies: a passage containing a criterion was assigned a ‘1’ (‘present’) or a ‘2’ (‘strongly present’), and these codes were then added up for each criterion into a weighted sum = f1·1 + f2·2, where f1 and f2 are the frequencies of passages coded ‘1’ and ‘2’, respectively. For example, if a specific criterion was ‘present’ in three passages of a statement and ‘strongly present’ in two additional passages, then the weighted sum for this criterion in that statement was (3·1) + (2·2) = 7. After rating the individual criteria, the raters also had to make a final truth/lie decision.
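In code, this weighted coding simply amounts to summing the per-passage codes. A minimal sketch (ours, for illustration only):

```python
def weighted_sum(passage_codes: list[int]) -> int:
    """Weighted presence of one criterion in one statement: each relevant
    passage is coded 1 ('present') or 2 ('strongly present'), and the
    weighted sum f1*1 + f2*2 equals the sum of these codes."""
    return sum(passage_codes)

# The example from the text: 'present' in three passages,
# 'strongly present' in two more passages -> (3*1) + (2*2) = 7.
print(weighted_sum([1, 1, 1, 2, 2]))  # 7
```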

The authors reported that 70 of the 153 (46%) pairwise correlations of the 18 criteria were statistically significant, suggesting that the criteria set measures a common underlying construct. (However, criteria CBCA08 and CBCA15 showed a series of negative correlations.)

Several sets of analyses were conducted to test CBCA validity, namely: (a) Mann–Whitney U-tests, (b) a simultaneous discriminant analysis, (c) means and standard deviations of the raters, (d) a relative weighted presence score, and (e) the raters’ binary classifications. We briefly describe each of these analyses below.

Mann–Whitney U-tests

Although results were reported separately for the frequencies of ‘1’ and ‘2’ codes (see above) in Wolf and Steller (Citation1997; see also Scheinberger, Citation1993), inferential analyses were conducted on the weighted sums via Mann–Whitney U-tests because of the positive skewness of the data (see Wolf & Steller, Citation1997). The effect sizes for the weighted frequencies are displayed in Figure A.10 in Supplemental Materials 2 (https://osf.io/5tsbz). We calculated the effect size r from the reported Z values as recommended by Field (Citation2009). We then transformed r into Cohen’s d and Hedges’s unbiased g (Borenstein, Citation2009). As an indicator of validity, the mean ranks of all criteria were higher in the real compared to the fictitious birth experiences, and most of these differences were highly significant. Our calculations resulted in gs ranging from 0.05 to 3.34, with an unweighted mean of g = 1.22, 95% CI [0.46, 1.98].
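A minimal sketch of this conversion chain, from a Mann–Whitney Z to Hedges's g (the function name and example values are ours):

```python
import math

def z_to_g(z: float, n1: int, n2: int) -> float:
    """Convert a Mann-Whitney Z to Hedges's g: r = Z / sqrt(N)
    (Field, 2009), d = 2 * r / sqrt(1 - r**2) (Borenstein, 2009),
    then the small-sample correction J = 1 - 3 / (4 * df - 1)."""
    r = z / math.sqrt(n1 + n2)
    d = 2 * r / math.sqrt(1 - r ** 2)
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

# Hypothetical criterion: Z = 3.2 with 15 real and 15 fictitious accounts
print(round(z_to_g(3.2, 15, 15), 3))  # about 1.40
```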

Table 4. Means, SDs, intercorrelations of CBCA criteria, mean rs and corrected item-total correlations (CITCs) in the field quasi-experiment by Sporer (Citation1998a).

In the Supplemental Materials 2 (https://osf.io/5tsbz), we also include a figure for effect sizes from the weighted presence scores (Figure A.11), which parallel the above results, while effect sizes based on t tests yielded much smaller, mostly nonsignificant effect sizes (see Figure A.12).

Discriminant analysis

Wolf and Steller (Citation1997) also reported a simultaneous discriminant analysis using all 18 variables, which yielded 100% correct classifications of lies and truths. This result is meaningless given the large number of predictors (18) relative to the small total sample size (N = 30) and the fact that the analysis was conducted without cross-validation (see Kleinberg et al., Citation2019; Sporer et al., Citation2021). Hence, the effect size g = 3.66, 95% CI [2.49, 4.83] reported by Oberlader et al. (Citation2016, Citation2021, Figure 2), which is the highest (outlying) effect size in their meta-analysis, is inappropriate. Our unweighted mean of all criteria reported above (g = 1.22) appears more realistic.

Means and standard deviations of the three raters

These descriptive statistics were also reported, based on the weighted coding described above. From these data, we calculated effect sizes for each individual criterion, as well as an unweighted mean g across all 18 criteria, g = 0.853, 95% CI [0.124, 1.582]. This value, while still large and significant, is smaller than the effect size for the nonparametric analyses reported above (g = 1.22), and much smaller than the g = 3.66 value used by Oberlader et al. (Citation2016, Citation2021).
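As a worked illustration of how such effect sizes are obtained from reported descriptives, here is a minimal sketch of Hedges's g computed from group means and standard deviations (the toy values are ours):

```python
import math

def hedges_g(m1: float, sd1: float, n1: int, m2: float, sd2: float, n2: int) -> float:
    """Hedges's unbiased g: Cohen's d with the pooled SD, multiplied by
    the small-sample correction J = 1 - 3 / (4 * df - 1)."""
    df = n1 + n2 - 2
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / sd_pooled
    return d * (1 - 3 / (4 * df - 1))

# Hypothetical criterion: truths M = 4.2 (SD = 1.5), lies M = 3.0 (SD = 1.4),
# with 15 accounts per group
print(round(hedges_g(4.2, 1.5, 15, 3.0, 1.4, 15), 3))
```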

Relative weighted presence score

Scheinberger (Citation1993) also reported the lengths of the audiotapes and the number of words. True accounts (M = 39.8 min) were almost twice as long as fictitious accounts (M = 21.5 min), g = 1.300, 95% CI [0.530, 2.070]. They were also significantly longer in number of words (true accounts: M = 2388, Min = 1221, Max = 3542; invented accounts: M = 1292, Min = 871, Max = 1292), g = 1.305, 95% CI [0.535, 2.076]. Using these word counts, Scheinberger calculated a relative weighted presence score (100 × weighted presence divided by the number of words) for each criterion as rated by each of the three raters. This index provides an estimate of the density of criteria presence in an account (see Schubert, Citation1999; Sporer, Citation2004).
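A minimal sketch of this density index (the example values are ours, loosely based on the mean word count above):

```python
def relative_presence(weighted_presence: float, n_words: int) -> float:
    """Relative weighted presence: criteria density per 100 words,
    i.e. 100 * weighted presence / account length in words."""
    return 100 * weighted_presence / n_words

# A weighted presence of 7 in a 2388-word account
print(round(relative_presence(7, 2388), 3))  # about 0.293
```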

Rater judgments

Rater A classified 100.0% of true accounts and 60.0% of lies correctly (overall: 80.0%). Rater C obtained 93.3% accuracy for both true and invented accounts.


In summary, this study demonstrated that CBCA criteria can be successfully applied to discriminate true from fictitious accounts in a field setting different from sexual abuse. It also offers statistically more sensitive ways to code and analyze these types of studies.

Greuel et al. (Citation1999): field study about giving birth conducted in Bremen, Germany

A follow-up study by Greuel et al. (Citation1999) also investigated 20 true and 20 false accounts of adult females giving birth. The summary, made available to us by the first author, did not contain many details about the methodology. However, the theoretical embedding of the event in research on autobiographical memory is noteworthy (see also Greuel, Citation2001; Sporer, Citation2004; Volbert & Steller, Citation2014). Importantly, the study was conducted as a double-blind quasi-experiment, thus overcoming a potential criticism of the Scheinberger (Citation1993) and Wolf and Steller (Citation1997) study.

Using only the first 14 CBCA criteria (except CBCA11, which they omitted for unknown reasons), Greuel et al. (Citation1999) found results similarly strong to those of Scheinberger (Citation1993) and Wolf and Steller (Citation1997), supporting the claim that this paradigm does reveal differences between real and fictitious significant autobiographical experiences. The effect sizes for Greuel et al. (Citation1999) are displayed in Figure A.13 in Supplemental Materials 2 (https://osf.io/5tsbz). Effect sizes for CBCA01 to CBCA08 were extremely high, with CBCA05 (Descriptions of Interactions) having to be considered an outlier, g = 3.812, 95% CI [2.779, 4.845]. The unweighted mean was also extremely high, g = 1.931, 95% CI [1.191, 2.672].

The authors emphasize that the criteria ratings must not, by themselves, be considered as evidence of credibility in an individual case. Instead, the credibility decision has to be based on an individualized case-specific SVA assessment.

Sporer (Citation1998a): overnight military exercise in the Scottish highlands

A field quasi-experiment of an overnight military exercise in the highlands of Scotland tested the utility of content criteria in a novel setting (Sporer, Citation1998a; see Sporer, Citation2004). The accounts obtained were rated with CBCA criteria and judged by two ‘naive raters’.

Participants were 72 trainees (36 female; Mdn age = 20 years) of the Officer Training Corps (OTC) in Scotland. Half of them had been on a weekend training unit (truth condition), while the other half had not (lie condition). The training included an overnight exercise in the fall or winter, involving several activities. Participants were recruited as part of a study on communication.

For this purpose, they were asked to relate the overnight exercise to another person. As this person would not know anything about the event, nor whether the trainee had already participated, it would be necessary to relate the experience as clearly as possible, including the following elements: if and how they camouflaged themselves, defended or attacked an area, went patrolling, and how they slept. The participants were instructed to render their account as clear and convincing as possible, and being convincing was described as an important social skill. As an additional incentive, all participants were promised (and later paid) £5 (approximately $8 at that time) if their account was among the five most convincing reports. Half of the participants received two to three minutes to prepare their accounts; the other half read the instructions and returned the next day for the interview. Interviewees first gave a free report, followed by standardized questions regarding the four activities mentioned above, which were also listed on a poster.

Two ‘naive’ raters (A and B) independently rated the credibility of all accounts on a scale from 1 to 10 which was later transformed to a binary lie-truth judgment (0 for ratings 1–5; 1 for 6–10). Rater C read a series of CBCA articles available at the time (1995) and received a short training on the use of CBCA criteria, which included rating some practice accounts. Then she rated the presence of each of the 19 criteria in each account on a scale from 0 to 2. Criterion 19 was rephrased as ‘Characteristics of the Event’.

Length of accounts

The accounts varied widely in length (M = 974 words, SD = 429, Mdn = 903). True accounts (M = 1055, SD = 454) were not significantly longer than invented accounts (M = 894, SD = 393), g = 0.375, 95% CI [−0.086, 0.836].

CBCA criteria

Descriptive statistics (Ms, SDs) of all criteria, as well as their inter-correlations and corrected item-total correlations (CITCs), are presented in Table 4. Of the 136 correlations, 32 (|r| > 0.232) were significant. There were also 29 small, negative correlations (most of these involving CBCA13 and CBCA14). The average inter-correlation was r = .13. Of the 17 CITCs, 13 were larger than .20. For CBCA13 and CBCA14 the CITCs were −.06 and −.02, suggesting that these two criteria should not be included in a CBCA summary score (see Anastasi, Citation1990; Sporer et al., Citation2021). Cronbach’s alpha was .74 for 17 criteria (two criteria had ‘0’ ratings on all accounts; see below), and .76 for 15 criteria (after also excluding CBCA13 and CBCA14).
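For transparency, here is a minimal sketch of how Cronbach's alpha and CITCs can be computed from an accounts × criteria matrix; the random toy data are our assumption (real input would be the coded transcripts):

```python
import numpy as np

def alpha_and_citc(scores: np.ndarray):
    """Cronbach's alpha and corrected item-total correlations (CITCs)
    for an accounts x criteria matrix of ratings."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
    # CITC: each item correlated with the total score excluding that item
    citcs = [np.corrcoef(scores[:, j], scores.sum(axis=1) - scores[:, j])[0, 1]
             for j in range(k)]
    return alpha, citcs

rng = np.random.default_rng(1)
toy = rng.integers(0, 3, size=(72, 17)).astype(float)  # random 0-2 ratings
alpha, citcs = alpha_and_citc(toy)  # alpha will be near zero for random data
print(round(alpha, 2), [round(c, 2) for c in citcs[:3]])
```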

Prevalence rates of CBCA criteria varied widely, with a ceiling effect for CBCA01 and floor effects for CBCA11, CBCA13, CBCA14, CBCA16 and CBCA17. Because CBCA10 and CBCA18 received ‘0’ ratings on all accounts, we conservatively assigned an effect size g = 0 to them, as these ratings represent a failure to support the underlying hypotheses. We also calculated two CBCA summary scores to compare with other studies and meta-analyses: One with 17 criteria (omitting the two criteria with zero prevalence), and one with 15 criteria (omitting also CBCA13 and CBCA14 because of their negative CITCs).

Validity of CBCA criteria

The effect sizes of the differences between lies and truths are presented in Figure A.14 in Supplemental Materials 2 (https://osf.io/5tsbz). Differences were significant only for CBCA02, CBCA04, CBCA15, and CBCA19. CBCA01 and CBCA06 showed nonsignificant negative effect sizes. The unweighted mean for 17 criteria for truths (M = 0.615) was not significantly higher than for lies (M = 0.490), g = 0.278, 95% CI [−0.181, 0.737].

Summary scores with 17 criteria yielded significantly higher values for true accounts (M = 11.67, SD = 3.23) than for lies (M = 9.31, SD = 2.72), g = 0.783, 95% CI [0.308, 1.257]. Differences for sum scores with 15 criteria were slightly smaller (truths: M = 11.53, SD = 2.74; lies: M = 9.25, SD = 3.24), g = 0.751, 95% CI [0.278, 1.224].

Multiple discriminant analysis

A multiple discriminant analysis with 17 criteria resulted in 81.9% correct classifications, 80.6% for lies and 83.3% for true accounts. Cross-validation with the leave-one-out method yielded 72.2%, 75.0%, and 69.4%, respectively.

A multiple discriminant analysis with 15 criteria (omitting CBCA13 and CBCA14) resulted in the same 81.9% correct classifications, 80.6% for lies and 83.3% for true accounts. Cross-validation with the leave-one-out method yielded 73.6%, 77.8%, and 69.4%, respectively.
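The contrast between in-sample and cross-validated accuracy noted here (and criticized above for Wolf & Steller, Citation1997) can be illustrated with a minimal sketch using scikit-learn and random toy data (our assumptions throughout):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(7)
X = rng.normal(size=(72, 17))   # toy ratings: 72 accounts x 17 criteria
y = np.repeat([0, 1], 36)       # 0 = lie, 1 = true account

lda = LinearDiscriminantAnalysis()
in_sample = lda.fit(X, y).score(X, y)   # optimistic: trained and tested on the same data
loo_pred = cross_val_predict(lda, X, y, cv=LeaveOneOut())
loo_acc = (loo_pred == y).mean()        # honest leave-one-out estimate
print(f"in-sample: {in_sample:.3f}, leave-one-out: {loo_acc:.3f}")
```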

Rater accuracy as a function of guidance

The accuracy of the CBCA-guided Rater C (M = 59.7, SD = 49.4) was not better than that of the naive raters (mean of Raters A and B: M = 58.3, SD = 40.3), g = 0.030, 95% CI [−0.184, 0.244]. However, there was a stronger truth bias, and hence a stronger veracity effect, for the CBCA rater than for the two naive raters: the guided rater had 91.7% (SD = 28.0) accuracy for true accounts but only 27.8% (SD = 45.4) for lies. The naive raters judged 75.0% (SD = 34.8) of true accounts and 41.7% (SD = 38.7) of lies correctly.


In summary, although this study demonstrated that CBCA criteria could be applied in a novel setting in a quasi-experiment, the discriminative value of the CBCA criteria was weak. Also, the rater training appeared insufficient.

Schubert (Citation1999; re-analysis by Sporer): reports of a driving exam in Giessen, Germany

Data from an unpublished diploma thesis by Schubert (Citation1999) were re-analyzed by the first author. The re-analysis was necessary because Schubert had reported only results regarding the relative frequency per 1500 words. Consequently, his results would not have been comparable to the other studies analyzed here.

Schubert (Citation1999) conducted a repeated-measures study comparing reports of 25 fictitious and 25 self-experienced driving exams in a small German city (11 females and 14 males; age: 20–25 years).Footnote10 In a post-experimental questionnaire, only 32% of liars reported that they had freely invented the exam; the rest conceded that they had ‘borrowed’ from experiences of friends, acquaintances, family members and the media.

Lies and truths were told in counter-balanced order to two different interviewers who were blind to condition. Preparation time was 15 min. Transcripts of audio tapes were coded by a single rater who had participated in a seminar on eyewitness testimony and had read relevant CBCA articles available at that time. Only seven CBCA criteria (CBCA05/06 combined, CBCA07 to CBCA09, CBCA12, CBCA14 and CBCA15) were rated.

True accounts (M = 584, SD = 261) were significantly longer than invented accounts (M = 451, SD = 216), g = 0.541, 95% CI [0.225, 0.823]. Differences between lies and truths in the seven CBCA criteria were all in the expected direction and significant, with the exception of CBCA14 (see Figure A.15 in Supplemental Materials 2, https://osf.io/5tsbz). The unweighted mean of these differences yielded g = 0.646, 95% CI [0.183, 1.108]. All CITCs were above 0.28, except for item CBCA14 (0.09), justifying the calculation of a summary score. Summary scores differed significantly in the expected direction (lies: M = 8.88, SD = 4.45; truths: M = 16.3, SD = 6.48), g = 1.211, 95% CI [0.809, 1.613].

Niehaus (Citation2001): quasi-experiment with child victims of traffic accidents in Germany

Niehaus (Citation2001) compared transcripts of audiotaped reports of child victims (between four and 12 years) of a traffic accident (Group 1) with several groups of children matched with respect to gender, age and scores on a verbal ability test. The first comparison group consisted of children who had to freely invent an account as if they had experienced such an accident (Group 2). After a two-week delay, the same children (re-labelled as Group 3) had to re-tell another child's account of an accident that had been read to them twice before. Another, independent re-telling group, whose members had themselves experienced an accident before (Group 4), also had to re-tell another child's account. Here we only compare data from the first two groups (n = 40 each), resulting in 80 accounts.

In contrast to the author, who treated the data from Group 1 vs. the other three groups as repeated measures due to matching, we calculated effect sizes from the Ms and SDs as if they had come from independent groups to make results comparable to other studies. Fortunately, Appendix E in Niehaus (Citation2001, p. 447 ff.), as well as other parts of that monograph, provided all the data needed to calculate the necessary comparisons. Using only Groups 1 and 2, we avoided dependencies in the data, as only one ‘experience’ group was involved. While comparisons of Group 1 with Groups 3 and 4 are worthwhile additions to the literature, they simulate situations where witnesses are coached with somebody else's account to tell a convincing story.

The truth status of true accounts was reliably established via hospital records and interviews with the parents. Medical consequences ranged from very mild injuries to severe long-term damage. For non-experienced accidents, truth status was similarly established by interviews with the parents, who corroborated either that the children were not recounting a true experience or that the experience was different from the one they were re-telling (Group 4). If a child made an utterance during the interview revealing that the child was telling somebody else's story, thus violating the instructions, the passage was removed from the transcript. This occurred in about 5% of the sample. The delay (retention interval) between the accident and the interview was, on average, 2.9 months. Although there was substantial variation in length between accounts, the average account across all four groups was quite long (M = 688 words). True accounts of accidents (M = 1003, SD = 581 words) were significantly longer than those of invented experiences (M = 435, SD = 269), g = 1.244, 95% CI [0.769, 1.719].

Niehaus tested a series of theoretically derived hypotheses regarding CBCA criteria. She also examined a selection of RM and lie criteria (e.g. clichés), with a total of 28 verbal characteristics rated. Some of the items used were reformulated by the author or split into subcategories. CBCA01 to CBCA03 were coded on 0–4 rating scales, while the other items were assessed via frequency counts. Here, we focus only on traditional CBCA criteria, but we also mention some additional results. We did not use the overall credibility index reported by Niehaus because it is not comparable to the CBCA summary scores reported by other authors in the CBCA literature.Footnote11

Two coders underwent highly intensive training through literature reviews and practice ratings, including several knowledge tests of the material learned. Inter-coder reliabilities, both for individual criteria and for summary scores, were among the highest in the CBCA reliability literature (Hauch et al., Citation2017).

Figure A.16 in Supplemental Materials 2 at https://osf.io/5tsbz displays the effect sizes g and the corresponding 95% confidence intervals for 18 CBCA criteria, as well as the unweighted mean g = 0.675, 95% CI [0.226, 1.124]. Figure A.17 in Supplemental Materials 2 displays four additional criteria hypothesized to indicate deception by the author (in addition to their unweighted mean and summary score). Only two of these lie criteria showed significant negative differences.

A multiple discriminant analysis was reported only for the classification of the self-experience group (n = 40) versus the three deception groups combined (n = 120). The analysis resulted in 90% correct classifications (75% of true and 95% of false statements). Using the jackknife method for cross-validation, these values dropped to 81.9% overall correct classifications (62.5% of true and 88.3% of false statements).

Trained raters classified 67.5% of true and 77.5% of freely invented accounts correctly, with both values being significantly above chance (50%). Two additional naïve (untrained) raters achieved accuracies of 67.5% (significantly above chance) for true accounts, and 50% (chance level) for freely invented accounts.

Kirchler-Bölderl et al. (Citation2013): quasi-experiment with young child victims of traffic accidents in Austria

A small-scale, double-blind follow-up to Niehaus's (Citation2001) study was conducted by Kirchler-Bölderl et al. (Citation2013) in Austria.Footnote12 We only used data from the ten children who had recently experienced an accident and from the 29 children in the fabrication group. The authors used the same revised coding scheme as Niehaus (Citation2001), but we considered only CBCA criteria that were close in definition to the original CBCA studies. Information on rater training and inter-coder reliability was not available. Effect sizes varied widely, with only three being significant: a large negative effect size for Logical Consistency, and two large positive effects for Unexpected Complications and Admitting Lack of Memory (see Figure A.18 in Supplemental Materials 2 at https://osf.io/5tsbz). We calculated an unweighted mean for CBCA01 to CBCA19 that was not significant, g = 0.317, 95% CI [−0.411, 1.044].

Kirchler-Bölderl et al. (Citation2013) also reported the same four lie criteria as Niehaus (Citation2001). Two of these criteria had zero values both for lies and truths (that is, gs = 0). For the other two, the differences were not significant but were in the expected negative direction (Alleged Motive of the Perpetrator and Repetition of Content).

Four raters classified the accounts, displaying large inter-individual differences in accuracy for true and invented reports and a strong truth bias. However, the ns are too small to calculate reliable effect sizes.

Compared to Niehaus's (Citation2001) study, these results were rather disappointing. Possible explanations may be the much smaller sample size in the two subgroups analyzed here, and that the raters were not as well trained as those in the Niehaus study.

The use of CBCA in expert evaluations

In the following, we describe a research project conducted in the field that did not investigate the validity of CBCA criteria per se but that we nevertheless think may be of interest to CBCA scholars and practitioners. The project focused on psychometric properties of CBCA criteria in expert evaluations in court proceedings and their role in these experts’ final case assessments.

Steck et al. (Citation2010, Study 2): CBCA credibility criteria in experts’ case evaluations in Germany

In an online repository, Steck et al. (Citation2010) summarized and integrated the results from five diploma theses. One of these was published as a monograph in a book series (Maier, Citation2006). The others are unpublished (Geiger, Citation2005; Hettler, Citation2005; Lafrenz, Citation2006), although one is available in an online repository (Schwind, Citation2006).

These theses were conducted in the context of a research program to improve police interrogations with respect to detecting potential deception, and did not examine the validity of verbal content criteria. Instead, they addressed the experts’ use of the criteria to arrive at a recommendation about the credibility of a statement. Although such a recommendation is determined by many factors considered in the context of SVA, this project explored, via different correlational techniques, to what extent the credibility recommendation by an expert can be post-dicted by the individual CBCA criteria. In addition, psychometric properties of CBCA criteria, as well as cut-off scores for CBCA summary scores, were explored.

The whole program of research investigated different CBCA and RM criteria issues in a laboratory study of police reports (Study 1) and a field study (Study 2) of allegations of (child) sexual abuse. We draw on all of these theses and the integrative report by Steck et al. (Citation2010), but focus here only on CBCA criteria in the field study (Study 2).Footnote13

The field study analyzed 138 forensic expert evaluations of sexual abuse conducted by the Gesellschaft für Wissenschaftliche Gerichtspsychologie in Munich (a large organization of forensic experts). Witnesses were German citizens aged between five and 62 years (121 of whom were female). Cronbach’s alpha was very high (.84), and all but two CITCs were larger than .20. Ground truth was not established in this study, as it focused not on CBCA validity but on the experts’ assessments. The credibility evaluations by the experts (79 credible, 59 not credible) were reproduced by a binary logistic regression analysis with the 19 CBCA criteria from the experts’ evaluations as predictor variables. Using an ‘optimal’ cut-off score of ≥ 6 on the raters’ summary score for a classification as ‘credible’, all but four cases were classified within the logistic regression model (Nagelkerke R² = 0.94) in line with the experts’ credibility ratings. Testimonies classified as ‘credible’ contained M = 10.0 (SD = 2.4) criteria, while those classified as non-credible had M = 2.5 (SD = 1.8) criteria, g = 3.45, 95% CI [2.92, 3.97].
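To make the reported statistics concrete, the sketch below shows how such a post-diction analysis and Nagelkerke's R² can be computed; the simulated codings and the use of scikit-learn are our assumptions, not Steck et al.'s actual data or software:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nagelkerke_r2(y: np.ndarray, p_hat: np.ndarray) -> float:
    """Nagelkerke's R^2: the Cox-Snell R^2 rescaled by its maximum,
    computed from the log-likelihoods of the fitted and the
    intercept-only models."""
    n = len(y)
    p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)
    ll_model = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    p0 = y.mean()
    ll_null = n * (p0 * np.log(p0) + (1 - p0) * np.log(1 - p0))
    cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
    return cox_snell / (1 - np.exp(2 * ll_null / n))

# Simulated data: 138 evaluations x 19 presence/absence codings, with
# 'credible' verdicts loosely driven by the number of criteria present.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(138, 19)).astype(float)
y = (X.sum(axis=1) + rng.normal(0, 2, size=138) > 9.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
p = model.predict_proba(X)[:, 1]
print(f"Nagelkerke R^2 = {nagelkerke_r2(y, p):.2f}")
```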

Maier (Citation2006) focused on separating statements ultimately considered credible by the experts from statements deemed unconfirmed via discriminant analyses. She also explored the use of ‘optimal’ cut-off scores in summary scores to classify these statements.

As noted above, none of these analyses provide evidence for the validity of the criteria but only on experts’ use of criteria in their assessments. Therefore, these results must not be mistaken as an indicator of the validity of CBCA criteria. Rather, the data simply indicate that the experts’ SVA/CBCA evaluations were related to their CBCA ratings (for a similar approach, see Littmann & Szewczyk, Citation1983). In other words, the data show that the criteria were indeed used in the final evaluation by the experts. But in terms of validity, the study is ‘circular’.

However, this study does provide valuable prevalence data of the rated presence of individual CBCA criteria in cases of (child) sexual abuse, irrespective of truth status. It also demonstrated that the 19 CBCA criteria fulfill some of the psychometric requirements of an assessment instrument.

Summary of the results

Taken together, both the studies dealing with materials from criminal cases and the quasi-experiments in the field show relatively large effect sizes for most CBCA criteria. However, effect sizes vary widely from study to study, indicating that not all criteria are equally valid in different settings or domains of application. The reasons for these differences are manifold, and the small number of studies does not allow for formal comparisons. Any such comparison would confound multiple differences between studies–for example, the age of participants, gender, the type and emotionality of the event, or the consequences for the witness or suspects.

Keeping this caveat in mind, in Figure 1 we summarize and contrast the effect sizes for all 19 CBCA criteria, the unweighted means, and the summary scores, separately for the six archival studies and six quasi-experiments that were coded as methodologically sound (the two studies acceptable with reservations were excluded from these calculations). These summary effect sizes show that in both domains most criteria had medium to large effect sizes (except for CBCA16 to CBCA18) in the expected directions, most of them being significant. Importantly, this is true both for CSA studies and for quasi-experiments about other kinds of autobiographical events. For all the statistical information relating to Figure 1, we refer the reader to Supplemental Materials 3 at https://osf.io/5tsbz.

Figure 1. Mean unbiased effect sizes g for quasi-experiments (light bars) and archival analyses of court cases (dark bars).

Note: Error bars are 95% confidence limits for the respective criteria. When they include zero, the effect size was not statistically significant (two-sided testing).


Discussion

CBCA is employed in forensic contexts to assess the credibility of CSA allegations. Although qualitative reviews and meta-analyses of CBCA exist, they combine both laboratory and field studies. A review focused specifically on field studies may provide a more accurate picture of the validity of CBCA criteria in real-life settings.

Indeed, compared to laboratory experiments, real-life events are often more heterogeneous and story-tellers are more diverse (e.g. in terms of culture or ethnicity). Also, in the field, the events may be experienced repeatedly, and the story-teller may be interviewed several times. Finally, previous reviews overlooked some field studies, and in some of the field studies they did include, the criteria used to determine ground truth seem to have been insufficient.

To address all these limitations, we conducted this systematic review. Both published and unpublished English- and German-language field studies testing the validity of CBCA criteria were carefully searched and assessed in terms of how ground truth had been determined (as well as other, more general, methodological aspects). This led to the selection of seven archival studies and seven quasi-experiments about significant autobiographical experiences. Twelve of these 14 studies were considered as methodologically sound, and two as acceptable with reservations.

We described the paradigm and methods of each study and examined whether the CBCA criteria differed between truthful and deceptive accounts. Across studies, most criteria did differ significantly. Also, an analysis of the 12 methodologically sound studies revealed that effect sizes were similarly large for quasi-experiments and archival CSA studies. This is encouraging, as it suggests that the CBCA criteria may discriminate in domains other than CSA.

However, these findings do not imply that any CBCA criteria or summary score will necessarily lead to accurate classifications of individual statements in forensic cases. Indeed, little is known about the way individual criteria are to be integrated or weighted to make an overall credibility judgment, and there are no established cut-off scores (as with standardized tests). Importantly, CBCA cannot be used in isolation from SVA.

Intercorrelations and combinations of criteria

Regarding the intercorrelations of CBCA criteria, we found both dependencies among content cue ratings within studies (e.g. Scheinberger, Citation1993; Sporer, Citation1998a; see Table 4) and covariations in the effect sizes. Intercorrelations indicate that separate criteria measure a common underlying construct, but if these correlations are too high, then the criteria are redundant with each other.

A possible danger of finding many criteria in an account is that this may induce a truth bias (Dukala et al., Citation2019; Niehaus, Citation2001), as all CBCA criteria are considered to indicate truth (see Masip et al., Citation2009). Mittermaier (Citation1834), Hellwig (Citation1951), Rolf Bender and Nack (Citation1981, Citation1995), Hans-Udo Bender (Citation1987) and other legal scholars proposed specific lie indicators judges should pay attention to and investigate further, including verbal content qualities. Researchers should test lie criteria to compensate for the risk of truth bias that the exclusive use of truth criteria entails (e.g. Köhnken et al., Citation1995; Nahari et al., Citation2019; Niehaus, Citation2001).

There is a danger that coding Quantity of Details at the beginning may affect the way subsequent criteria are coded (see our discussion of the interdependence of ground truth criteria below). Alternatively, if specific details (e.g. CBCA08, CBCA09, …) are coded first, this may bias the subsequent coding of Quantity of Details (e.g. Sporer et al., Citation2014). Hence, establishing the validity of individual criteria in a given domain would require that their presence be established independently. A stringent test would require that each criterion be coded by different raters (at least by two raters per criterion to establish inter-rater reliability). This would also be desirable in real-life contexts where CBCA is employed, but we are aware that this requirement is unrealistic. However, at the very least, different raters should code different sets of criteria whenever there is a risk for bias. For instance, those individuals coding the General Characteristics of the Statement (CBCA01, CBCA02, and CBCA03) should not code more specific criteria, and vice versa.

While it appears desirable that the coding of criteria be independent of each other, this does not imply that some of the criteria will not covary in a set of statements. In fact, in the archival study of perjury cases described above, H.-U. Bender (Citation1987) argued that specific combinations of criteria might indicate truthfulness (for subsequent tests of this idea, see Hommers, Citation1997; Hommers & Hennenlotter, Citation2006). We find this idea appealing, but such combinations should be based on theory and their validity should be empirically tested. Alternatively, we encourage the construction of new criteria, which should be theoretically derived and defined by the combined presence of specific criteria subsets–but this is beyond the scope of this paper.

H.-U. Bender also emphasized that the ‘practical relevance’ (or diagnosticity) of certain cues or cue combinations depends on their prevalence rate. However, criteria with small effect sizes because of a floor effect may still be very important in a specific case. For instance, floor effects may be found for Details Misunderstood, as this criterion can only be present if the person has a poor understanding of the event. But if the criterion is found, then it may be highly indicative that the narrator did experience the reported event. In individual studies, this has led authors to drop low prevalence variables from analyses, sometimes even beforehand (e.g. in studies with adults).

In meta-analyses, the low prevalence of rare but diagnostic criteria may result in small effect sizes, or in underestimating their potential diagnostic value in a given case. In addition, if summary scores are calculated uncritically, without considering the particularly high diagnostic value of certain individual criteria (like Details Misunderstood), wrong decisions about the credibility of an account may follow.

Children’s age and the prevalence of CBCA criteria

Several field studies have investigated the prevalence of CBCA criteria in real-life settings as a function of children's ages. We have summarized some of these studies above, as they examine the validity of CBCA criteria (Craig et al., Citation1999; Lamb et al., Citation1997; Welle et al., Citation2016). But several additional field studies not focusing on validity also examined the covariation between CBCA criteria and age (Anson et al., Citation1993; Davies et al., Citation2000; Horowitz et al., Citation1997; Lamers-Winkelman, Citation1999; Lamers-Winkelman & Buffing, Citation1996). This literature concurs in showing that the older the children, the more criteria they report (though such covariations are not uniform for all criteria). This also results in higher overall CBCA scores for older than for younger children (see Sporer, Citation2004).Footnote14

In the study by Roma et al. (Citation2011), children in the confirmed group were significantly older than those in the unconfirmed group. Older children may provide more detailed testimony, ultimately leading to higher conviction rates. Considering the large age range in all the archival studies summarized above (from 3 to 18 years), the confounding of truth status and age may also have occurred in other studies.

Also, in line with the development of cognitive abilities, specific criteria may be less likely to be found in young children's accounts than others (e.g. temporal details; see Orbach & Lamb, Citation2007). Thus, developmental trends in cognitive and linguistic abilities must be considered not only in research studies but even more so in credibility assessments in individual cases (see the case study by Orbach & Lamb, Citation1999).

The correlation between age and CBCA scores is routinely considered by court experts, as age is typically considered as an integral part of SVA. However, some recent findings are troublesome. For instance, Welle et al. (Citation2016) found that the correlation between age and the criteria summary score was stronger in unconfirmed compared to confirmed cases.

Applying CBCA criteria to real cases

In addition to the alleged victim’s age, many other variables must be considered when using CBCA in forensic settings. The originators of SVA and CBCA have repeatedly emphasized that simple frequency counts of CBCA criteria do not allow an assessment of the ultimate truthfulness of a statement. To arrive at such a conclusion, about a dozen rival hypotheses have to be examined before the null hypothesis (that is, that the statement is false) can be rejected (see the extensive literature on the use of SVA and CBCA in central Europe; e.g. Fiedler & Schmid, Citation1999; Oberlader et al., Citation2021; Steller & Volbert, Citation1999; Volbert & Steller, Citation2014). If these rival hypotheses are not considered, miscarriages of justice are likely to occur (e.g. the Montessori cases in Münster, the Worms trials and other cases; see Köhnken & Gallwitz, Citation2021; Sporer, Citation2021; Volbert & Steller, Citation2014). To falsify these rival hypotheses, information other than the CBCA criteria needs to be considered.

The validity of CBCA criteria may be limited with specific types of witnesses, for example, those with borderline personality disorders (Böhm & Lau, Citation2007), depression, or cognitive limitations or disabilities (Manzanero et al., Citation2019). When there are signs of psychoses, psychiatric evaluations may be necessary (Fegert et al., Citation2018).Footnote15 Certain groups of witnesses may also engage in deception strategies that may invalidate specific CBCA criteria, in particular the motivational ones (Niehaus, Citation2001; Niehaus et al., Citation2005).

Also, special care must be taken with witnesses in family courts and in tort law, where the standard of proof is not as strict as in criminal trials. New witness compensation laws may also make it necessary for credibility experts to be involved in civil compensation trials.

With the widespread availability of CBCA content criteria on the Internet, (self-)coaching by witnesses or their parents, friends or attorneys becomes more likely. There is empirical evidence that CBCA coaching allows participants to artificially increase the presence of CBCA criteria in their accounts, which results in higher overall CBCA scores and false positive evaluations of witness statements (see Vrij et al., Citation2000, Citation2002).

A related interesting question is to what extent certain witnesses or suspects (e.g. inmates) may have specific knowledge about certain types of events or crimes that will allow them to produce more credible accounts. Similarly, some groups of witnesses may have knowledge about these approaches as a function of their professional training or experience (e.g. police officers, social workers, psychologists). Their statements may be affected or distorted by such prior knowledge.

The idiographic approach of CBCA within SVA allows the expert to assess the CBCA criteria against the background of personal and situational variables surrounding the case. However, some degree of discriminatory power of the CBCA criteria is needed for the approach to be useful.

Age, personality, psychopathology, and situational variables such as the occurrence of coaching and the type of case are relevant not only when using CBCA in real cases in the field, but also in laboratory settings. However, while in the field CBCA criteria are often assessed against the background of personal and situational variables (see Köhnken, Citation2004; Volbert & Steller, Citation2014), these variables are typically ignored in laboratory experiments (as well as in many quasi-experiments). This neglect restricts the general validity of the outcomes of experimental CBCA research. However, experimental research allows causal conclusions and the examination of moderator variables, which are not possible in field research.

Potential new areas of application of CBCA

This systematic review shows that, in field settings, the effect sizes for the differences in CBCA criteria between truthful and deceptive accounts were similar for quasi-experiments about significant autobiographical events, often with adult participants (for the unweighted mean, g = 0.57, 95% CI [0.31, 0.83]; for the summary score, g = 1.03, 95% CI [0.21, 1.85]), and for archival CSA studies (for the unweighted mean, g = 0.53, 95% CI [0.27, 0.79]; for the summary score, g = 1.08, 95% CI [0.42, 1.75]). Overall, the criteria with higher effect sizes in archival CSA studies also had higher effect sizes in quasi-experiments; an analysis across all 19 criteria revealed a strong and significant correlation between the effect sizes for the two study types, r = .73, p < .001. Across studies, the least discriminative criteria appeared to be CBCA16 and CBCA18 (which did not discriminate significantly for either kind of study), as well as CBCA17 (which discriminated for quasi-experiments only). All other CBCA criteria discriminated for at least one kind of study, and eight criteria discriminated for both study types.

These outcomes may suggest that CBCA can be used in areas other than CSA. However, at this point, doing so would be premature and dangerous: First, in this review, we only demonstrated the validities of criteria in the small number of quasi-experiments involving true and false witness accounts that we were able to retrieve. The target events in these quasi-experiments do not represent all real-life circumstances where CBCA could potentially be employed. Second, before using CBCA in a given domain, the respective validities of individual criteria in this specific domain ought to be considered. Unfortunately, the number of extant CBCA studies focusing on any given domain other than CSA is still too small to draw any strong conclusion. Thus, although the current results are indeed encouraging, widespread use of CBCA in all kinds of applied areas is not warranted.

A potential future area of application of CBCA could be the examination of suspects’ (rather than alleged victims’) statements. However, methodologically sound studies first need to be conducted with suspects. These studies may yield different validities depending on the type of crime–for example, perjury, insurance fraud, and property and violent crimes. Another potential new area of application may be the credibility assessment of claims made by refugees seeking political asylum. However, sound studies testing the validity of CBCA criteria in such circumstances must be conducted first. In any case, criteria definitions need to be adapted to the specific crime or event (in particular, the definition of Characteristics of the Offense).

The Aberdeen Report Judgment Scales were designed to integrate CBCA, RM and other verbal content criteria and to be used across domains (Sporer, Citation1998b, Citation2004). However, they should first be tested in a variety of such domains. Furthermore, it also remains an open question if CBCA or other criteria can be applied to (false) confessions (see Kassin, Citation2022) or (false) accusations (see Sporer et al., Citation2014). Future research should address all these questions.

Although we have enumerated several potential new areas of application of CBCA, a crucial aspect to consider is the extent to which self-experienced and fabricated accounts are likely to differ in those specific domains. Some real-life events, like giving birth, are much more complex than others, like losing one’s wallet, thus lending themselves to higher CBCA ratings. Fabricating a description of a complex event such as giving birth requires relying on general scripts (e.g. Sporer, Citation2016), which fall far short, in terms of vividness and richness of detail, of what actually happens in reality (for a discussion of the influence of event characteristics on CBCA, see Schemmel et al., Citation2020). This has implications for both research and applied settings. First, the topic of a research study (e.g. giving birth vs. losing one’s wallet) may have a huge impact on the effect sizes, that is, on the CBCA score differences between true and fabricated accounts. Second, in applied settings, CBCA works best for those kinds of events that are likely to produce rich memories with a wealth of detail.Footnote16

Limitations

This systematic review is not free from limitations. First, the number of retained, methodologically sound studies was limited. One may wonder to what extent these few studies represent all relevant real-life cases. This issue is particularly relevant for the quasi-experiments. Note that, in this review, no more than two quasi-experiments (and typically only one) focused on any specific kind of event (e.g. giving birth). Furthermore, as pointed out above, the few specific event types in the included quasi-experiments are only a limited sample of all the kinds of real-life circumstances where CBCA and other content criteria could potentially be employed. This limitation can only be addressed by conducting additional high-quality field studies and quasi-experiments.

Second, one of the problems with CBCA is that there is neither a standardized coding manual nor standardized training available (see the outline of a training program in Köhnken, Citation2004). Consequently, training quality and inter-coder reliability vary widely across studies (see Hauch et al., Citation2016). This is also true for the studies included in this systematic review. Results in a field study may be improved by using several well-trained raters with demonstrated inter-coder reliability for all statements coded in that study, not just a few example accounts.

Third, although we were very strict in selecting studies based on the way ground truth was determined, there is still some degree of uncertainty concerning ground truth that may have affected the reported effect sizes. Specifically, in many factually true CSA cases, DNA or other ‘objective’ evidence is not available. These cases may consequently be classified as ‘unconfirmed’ by CBCA field researchers, even though they are not based on fabrication. Therefore, a sizable proportion of unconfirmed cases may, in reality, be true cases. If many CBCA criteria are rated as present in unconfirmed cases, this does not necessarily imply that these cases are false accusations; they may be actual cases for which no corroborating evidence was available. Thus, for instance, the ‘truth bias’ we noted for Welle et al.’s (Citation2016) study may not be a bias at all if many children in the unconfirmed group were actually real victims of abuse. This problem may decrease the observed effect sizes, which may not be as high as practitioners might hope for. Indeed, an examination of Figure 1 reveals that, for most criteria, effect sizes were slightly lower for archival CSA studies (in which some unconfirmed cases may be true) than for quasi-experiments (where ground truth is certain for both true and false cases).

Conclusion

Despite the above limitations, we believe this systematic review contributes to the literature. Our focus was specifically on field studies to increase the external validity of our conclusions. We made an effort to address several major limitations of prior CBCA reviews and meta-analyses. We found that, across studies, most CBCA criteria significantly differed between truthful and deceptive accounts. Medium to large effect sizes were found for both methodologically sound quasi-experiments and archival CSA studies.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 These are some examples of such methodological problems: Having used CBCA raters not trained on CBCA, having used CBCA raters who were not blind with respect to the truth value of the allegations, having established ground truth based on CBCA criteria, existence of clear confounds between veracity and other variables that may have influenced the CBCA criteria, having examined police notes rather than verbatim transcripts of the victim or witness statements, having examined alleged molesters' explanations of the charges against them rather than the description of events, the study being a single case study, obvious statistical errors that we were unable to address, etc. See Supplemental Materials 1 (https://osf.io/5tsbz) for detail.

2 Unfortunately, from Amado et al.'s (Citation2015) codings it is not clear which studies were classified as field studies, and we were unable to clarify this issue after contacting the authors.

3 We do not know why only these ‘cognitive’ criteria were reported. Data for inter-rater reliability for all 19 criteria had just been provided by the same research group (Horowitz et al., Citation1997), but the exclusion of criteria does not seem to have been based on poor reliability.

4 The means of summary scores of the five groups were as follows: Very Likely: 7.13; Quite Likely: 5.83; Questionable: 5.33; Quite Unlikely: 4.70; Very Unlikely: 5.33.

5 Although the listing of cases and the justifications for inclusion/exclusion are somewhat vague in Craig et al. (Citation1999), these classifications are described more clearly by Roberts and Lamb (Citation2010, Study 2), who reanalyzed the same cases with RM criteria.

6 The discrepancy may be explained by the ANOVA analysis the authors conducted, which also included age of children (dichotomized) and interviewer training (untrained/trained) as independent variables that reduced the mean square error.

7 She also informed us that some relevant data reported in the article contained errors. Our calculations here are based on the correct numbers that she emailed us.

8 It seems that the effect size for Unstructured Production (CBCA02) in Table 4 in Roma et al. (Citation2011) should be positive. We also detected a few other minor mistakes in that table, which we corrected in our analyses.

9 Note that Scheinberger's (Citation1993) diploma thesis was included in Oberlader et al. (Citation2016) as a separate study, but was later excluded from Oberlader et al. (Citation2021) because it was based on the same data set as Wolf and Steller's (Citation1997) study. We integrated data from both the diploma thesis, which provides more complete information about the methodology and results, and the book chapter for our description, coding, and calculations.

10 Obtaining a driving license is very expensive in Germany. Thus, passing the exam may be considered a significant personal life event.

11 Niehaus's (Citation2001) overall credibility index is defined as a difference score: the sum of several "lie" criteria is subtracted from the sum of the truth criteria.

12 We are grateful to the first author for sending us Ms and SDs for the individual criteria.

13 In the laboratory study (Study 1), the content criteria used were based on Niehaus's (Citation2001) modified catalogue. Three groups were compared, made up of participants (1) truthfully reporting an event they themselves had experienced during police activity, (2) reporting, as an experience of their own, an event somebody else had experienced, and (3) reporting a fabricated event. Interviewers and raters were blind to truth status.

14 For an exception, see Hershkowitz et al. (Citation1997), who failed to find a significant correlation between age and the total CBCA score. However, note that Pearson r was .37, which is a sizable effect size but failed to yield significance (p < .10) due to the small subsample analyzed (N = 20).

15 There is also a century-old controversy over whether psychiatrists or psychologists are the proper experts in cases of traumatic experiences (see Fegert et al., Citation2018; Sporer & Antonelli, Citation2022; Volbert, Citation2018).

16 We would like to thank an anonymous reviewer for suggesting the considerations expressed in this paragraph, as well as for several additional valuable suggestions to add aspects from the perspective of an experienced court expert which we had not thought of (including the third limitation discussed below).

References

  • References preceded by an asterisk are included in the systematic review.
  • *Akehurst, L., Manton, S., & Quandte, S. (2011). Careful calculation or a leap of faith? A field study of the translation of CBCA ratings to final credibility judgements. Applied Cognitive Psychology, 25(2), 236–243. https://doi.org/10.1002/acp.1669
  • Amado, B. G., Arce, R., & Fariña, F. (2015). Undeutsch hypothesis and Criteria-based Content Analysis: A meta-analytic review. The European Journal of Psychology Applied to Legal Context, 7(1), 3–12. https://doi.org/10.1016/j.ejpal.2014.11.002
  • Amado, B. G., Arce, R., Fariña, F., & Vilariño, M. (2016). Criteria-based Content Analysis (CBCA) reality criteria in adults: A meta-analytic review. International Journal of Clinical and Health Psychology, 16(2), 201–210. https://doi.org/10.1016/j.ijchp.2016.01.002
  • Anastasi, A. (1990). Psychological testing. Macmillan.
  • Anson, D. A., Golding, S. L., & Gully, K. J. (1993). Child sexual abuse allegations: Reliability of Criteria-based Content Analysis. Law and Human Behavior, 17(3), 331–341. https://doi.org/10.1007/BF01044512
  • Arntzen, F. (1970). Psychologie der Zeugenaussage. Einführung in die forensische Aussagepsychologie [Psychology of eyewitness testimony: Introduction to the forensic psychology of eyewitness testimony]. Hogrefe.
  • Arntzen, F. (1983/1993). Psychologie der Zeugenaussage: Systematik der Glaubwürdigkeitsmerkmale [Psychology of eyewitness testimony: A system of credibility criteria] (2nd/3rd ed.). C. H. Beck.
  • *Bender, H.-U. (1987). Merkmalskombinationen in Aussagen. Theorie und Empirie zum Beweiswert beim Zusammentreffen von Glaubwürdigkeitskriterien [Criteria combinations in statements. Theory and empirical data on the probative value of covariations of credibility criteria]. J. C. B. Mohr (Paul Siebeck).
  • Bender, R., & Nack, A. (1981). Tatsachenfeststellung vor Gericht. Band 1: Glaubwürdigkeits- und Beweislehre [Establishing facts in courts of law: Vol. 1: Doctrine of credibility and proof] (1st ed.). C. H. Beck.
  • Bender, R., & Nack, A. (1995). Tatsachenfeststellung vor Gericht. Band 1: Glaubwürdigkeits- und Beweislehre [Establishing facts in courts of law: Vol. 1: Doctrine of credibility and proof] (2nd ed.). C. H. Beck.
  • Böhm, C., & Lau, S. (2007). Borderline-Persönlichkeitsstörung und Aussagetüchtigkeit [Borderline personality disorder and the assessment of witness competence]. Forensische Psychiatrie und Psychologische Kriminologie, 1(1), 50–58. https://doi.org/10.1007/s11757-006-0007-3
  • Borenstein, M. (2009). Effect sizes for continuous data. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 221–235). Russell Sage Foundation.
  • *Boychuk, T. D. (1991). Criteria-based Content Analysis of children's statements about sexual abuse: A field-based validation study [Unpublished doctoral dissertation]. Arizona State University.
  • Bundesgerichtshof in Strafsachen [BGHSt]. (1954). 7, 82, Urteil vom 3. 12. 1954.
  • Bushman, B. J. (1994). Vote-counting procedures in meta-analysis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 193–213). Russell Sage Foundation.
  • Bushman, B. J., & Wang, M. C. (2009). Vote-counting procedures in meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 207–220). Russell Sage Foundation.
  • Cacuci, S.-A., Bull, R., Huang, C.-Y., & Visu-Petra, L. (2021). Criteria-based Content Analysis in child sexual abuse cases: A cross-cultural perspective. Child Abuse Review, 30(6), 520–535. https://doi.org/10.1002/car.2733
  • Connolly, D. A., & Lavoie, J. A. (2015). Discriminating veracity between children's reports of single, repeated, and fabricated events: A critical analysis of Criteria-based Content Analysis. American Journal of Forensic Psychology, 33, 25–48.
  • *Craig, R. A., Scheibe, R., Raskin, D. C., Kircher, J. C., & Dodd, D. H. (1999). Interviewer questions and content analysis of children's statements of sexual abuse. Applied Developmental Science, 3(2), 77–85. https://doi.org/10.1207/s1532480xads0302_2
  • Davies, G. M., Westcott, H. L., & Horan, N. (2000). The impact of questioning style on the content of investigative interviews with suspected child sexual abuse victims. Psychology, Crime and Law, 6(2), 81–97. https://doi.org/10.1080/10683160008410834
  • DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129(1), 74–118. https://doi.org/10.1037/0033-2909.129.1.74
  • Dettenborn, H., Fröhlich, H. H., & Szewczyk, H. (1984). Forensische Psychologie: Lehrbuch der gerichtlichen Psychologie für Juristen, Kriminalisten, Psychologen, Pädagogen und Mediziner [Forensic psychology: Textbook of psychology and law for legal professionals, criminologists, psychologists, educators, and physicians]. VEB Deutscher Verlag der Wissenschaften.
  • Dukala, K., Sporer, S. L., & Polczyk, R. (2019). Detecting deception: Does the cognitive interview impair discrimination with CBCA criteria in elderly witnesses? Psychology, Crime & Law, 25(2), 195–217. https://doi.org/10.1080/1068316X.2018.1511789
  • Fegert, J. M., Gerke, J., & Rassenhofer, M. (2018). Enormes professionelles Unverständnis gegenüber Traumatisierten. Ist die Glaubhaftigkeitsbegutachtung und ihre undifferenzierte Anwendung in unterschiedlichen Rechtsbereichen eine Zumutung für von sexueller Gewalt Betroffene? [Enormous professional misunderstanding regarding traumatized persons. Is credibility assessment and its undifferentiated application in different areas of law unreasonable for persons affected by sexual violence?]. Nervenheilkunde, 36, 525–534.
  • Fiedler, K., & Schmid, J. (1999). Gutachten über die Methodik für Psychologische Glaubwürdigkeitsgutachten [Expert evaluation of the methodology of psychological credibility assessment by forensic experts]. Praxis der Rechtspsychologie, 9(2), 5–45.
  • Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Sage.
  • Finkelhor, D., Cross, T. P., & Cantor, E. N. (2005). The justice system for juvenile victims. A comprehensive model of case flow. Trauma, Violence & Abuse, 6(2), 83–102. https://doi.org/10.1177/1524838005275090
  • *Geiger, S. (2005). Aussagepsychologische Untersuchung von Änderungen im Berichtsstil bei falschen Zeugenaussagen [An examination of changes in reporting style in false eyewitness statements] [Unpublished diploma thesis]. University of Konstanz.
  • Granhag, P. A., & Hartwig, M. (2008). A new theoretical perspective on deception detection: On the psychology of instrumental mind-reading. Psychology, Crime & Law, 14(3), 189–200. https://doi.org/10.1080/10683160701645181
  • Granhag, P.-A., Hartwig, M., Mac Giolla, E., & Clemens, F. (2015). Suspects’ verbal counter-interrogation strategies: Towards an integrative model. In P.-A. Granhag, A. Vrij, & B. Verschuere (Eds.), Detecting deception. Current challenges and cognitive approaches (pp. 293–313). Wiley.
  • Greuel, L. (2001). Wirklichkeit–Erinnerung–Aussage [Reality–recollection–testimony]. Psychologie Verlags Union.
  • *Greuel, L., Brietzke, S., & Stadler, M. A. (1999, July 6–9). Credibility assessment. New research perspectives [Paper presentation]. AP-LS/EAPL Psychology and Law—International Conference, Dublin, Ireland.
  • Hauch, V., Sporer, S. L., Masip, J., & Blandón-Gitlin, I. (2017). Can credibility criteria be assessed reliably? A meta-analysis of Criteria-based Content Analysis. Psychological Assessment, 29(6), 819–834. https://doi.org/10.1037/pas0000426
  • Hauch, V., Sporer, S. L., Michael, S. W., & Meissner, C. A. (2016). Does training improve the detection of deception? A meta-analysis. Communication Research, 43(3), 283–343. https://doi.org/10.1177/0093650214534974
  • Hedges, L. V. (2019). Stochastically dependent effect sizes. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (3rd ed., pp. 281–297). Russell Sage Foundation.
  • Hedges, L. V., & Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological Bulletin, 88(2), 359–369. https://doi.org/10.1037/0033-2909.88.2.359
  • Hellwig, A. (1951). Psychologie und Vernehmungstechnik bei Tatbestandsermittlungen [Psychology and technique of interrogation in the establishment of facts in courts of law] (2nd ed.; 1st ed., 1927). Ferdinand Enke.
  • Hershkowitz, I. (1999). The dynamics of interviews involving plausible and implausible allegations of child sexual abuse. Applied Developmental Science, 3(2), 86–91. https://doi.org/10.1207/s1532480xads0302_3
  • Hershkowitz, I., Lamb, M. E., Sternberg, K. J., & Esplin, P. W. (1997). The relationships among interviewer utterance type, CBCA scores and the richness of children's responses. Legal and Criminological Psychology, 2(2), 169–176. https://doi.org/10.1111/j.2044-8333.1997.tb00341.x
  • *Hettler, S. (2005). Evaluation eines erweiterten Kanons inhaltlicher Kennzeichen wahrer und falscher Zeugenaussagen [Evaluation of an extended set of credibility criteria for true and false eyewitness statements]. [Unpublished diploma thesis]. University of Konstanz.
  • Hommers, W. (1997). Die aussagepsychologische Kriteriologie unter kovarianzstatistischer und psychometrischer Perspektive [Credibility criteria from a covariance and psychometric perspective]. In L. Greuel, T. Fabian, & M. Stadler (Eds.), Psychologie der Zeugenaussage (pp. 87–100). Psychologie Verlags Union.
  • Hommers, W., & Hennenlotter, A. (2006). Zur Konfiguralität von aussagepsychologischen Realitätskriterien: Eine kreuzvalidierte TYPAG-Anwendung [Configural properties of credibility criteria: A cross-validated TYPAG application]. In T. Fabian & S. Nowara (Eds.), Neue Wege und Konzepte in der Rechtspsychologie (pp. 63–88). LITVerlag.
  • Honts, C. R. (1994). Assessing children’s credibility: Scientific and legal issues in 1994. North Dakota Law Review, 70, 879–903.
  • Horowitz, S. W., Lamb, M. E., Esplin, P. W., Boychuk, T. D., Krispin, O., & Reiter-Lavery, L. (1997). Reliability of Criteria-based Content Analysis of child witness statements. Legal and Criminological Psychology, 2(1), 11–21. https://doi.org/10.1111/j.2044-8333.1997.tb00329.x
  • Horowitz, S. W., Lamb, M. E., Esplin, P. W., Boychuk, T. D., Reiter-Lavery, L., & Krispin, O. (1995). Establishing ground truth in studies of child sexual abuse. Expert Evidence, 4, 42–51.
  • Huff, C. R., Rattner, A., & Sagarin, E. (1996). Convicted but innocent. Sage.
  • Kassin, S. M. (2012). Why confessions trump innocence. American Psychologist, 67(6), 431–445. https://doi.org/10.1037/a0028212
  • Kassin, S. M. (2022). Duped: Why innocent people confess–and why we believe their confessions. Prometheus Books.
  • Kassin, S. M., & Neumann, K. (1997). On the power of confession evidence: An experimental test of the fundamental difference hypothesis. Law and Human Behavior, 21(5), 469–484. https://doi.org/10.1023/A:1024871622490
  • *Kirchler-Bölderl, C., Bölderl, A., Ertl, M., & Giacomuzzi, S. (2013, September). Doppelblindstudie zum integrativen Merkmalssystem nach Niehaus, evaluiert bei Kindern im Alter von 3 bis 7 Jahren [Double-blind study on Niehaus’s integrative criteria system, applied to children aged between 3 and 7 years old] [Poster presentation]. 15th meeting of the Division of Legal Psychology of the German Psychological Association, Nuremberg, Germany.
  • Kleinberg, B., Arntz, A., & Verschuere, B. (2019). Being accurate about accuracy in verbal deception detection. PLoS One, 14(8), e0220228. https://doi.org/10.1371/journal.pone.0220228
  • Köhnken, G. (1982). Sprechverhalten und Glaubwürdigkeit: Eine experimentelle Studie zur extralinguistischen und textstilistischen Aussageanalyse [Speech behavior and credibility: An experimental study on extra-linguistic and text-stylistic statement analysis] [Unpublished doctoral dissertation]. University of Kiel, Germany.
  • Köhnken, G. (1990). Glaubwürdigkeit [Credibility]. Psychologie Verlags Union.
  • Köhnken, G. (1996). Social psychology and the law. In G. R. Semin & K. Fiedler (Eds.), Applied social psychology (pp. 257–282). Sage.
  • Köhnken, G. (2004). Statement Validity Analysis and the “detection of the truth”. In P.-A. Granhag & L. A. Strömwall (Eds.), The detection of deception in forensic contexts (pp. 41–63). Cambridge University Press.
  • Köhnken, G., & Gallwitz, S. (2021). Fehlerquellen in aussagepsychologischen Gutachten [Sources of error in psychological expert evaluations of eyewitness statements]. In R. Deckers & G. Köhnken (Eds.), Die Erhebung und Bewertung von Zeugenaussagen im Strafprozess (pp. 17–58). Berliner Wissenschaftsverlag.
  • Köhnken, G., Manzanero, A. L., & Scott, M. T. (2015). Análisis de la validez de las declaraciones: Mitos y limitaciones [Statement Validity Assessment: Myths and limitations]. Anuario de Psicología Jurídica, 25(1), 13–19. https://doi.org/10.1016/j.apj.2015.01.004
  • Köhnken, G., Schimossek, E., Aschermann, E., & Höfer, E. (1995). The Cognitive Interview and the assessment of the credibility of adults’ statements. Journal of Applied Psychology, 80(6), 671–684. https://doi.org/10.1037/0021-9010.80.6.671
  • Lafrenz, B. (2006). Trennschärfeanalyse der so genannten Realkennzeichen der aussagepsychologischen Diagnostik [The analysis of corrected item-total correlations of so-called reality criteria in credibility assessment]. [Unpublished diploma thesis]. University of Konstanz.
  • *Lamb, M. E., Sternberg, K. J., Esplin, P. W., Hershkowitz, I., Orbach, Y., & Hovav, M. (1997). Criterion-based content analysis: A field validation study. Child Abuse & Neglect, 21(3), 255–264. https://doi.org/10.1016/S0145-2134(96)00170-6
  • Lamers-Winkelman, F. (1999). Statement Validity Analysis: Its application to a sample of Dutch children who may have been sexually abused. In K. Coulborn Faller & R. VanderLaan (Eds.), Maltreatment in early childhood. Tools for research-based intervention (pp. 59–81). Haworth Press.
  • Lamers-Winkelman, F., & Buffing, F. (1996). Children’s testimony in the Netherlands: A study of Statement Validity Analysis. In B. L. Bottoms & G. S. Goodman (Eds.), International perspectives on child abuse and children’s testimony (pp. 45–61). Sage.
  • Leonhardt, C. (1931). Psychologische Beweisführung [Psychological method of proof]. Archiv für Kriminologie, 89, 203–206.
  • Littmann, E., & Szewczyk, H. (1983). Zu einigen Kriterien und Ergebnissen forensisch-psychologischer Glaubwürdigkeitsbegutachtung von sexuell mißbrauchten Kindern und Jugendlichen [Criteria to be examined in forensic-psychological credibility assessment in child sexual abuse cases]. Forensia, 4, 55–72.
  • Mac Giolla, E., & Luke, T. J. (2021). Does the cognitive approach to lie detection improve the accuracy of human observers? Applied Cognitive Psychology, 35(2), 385–392. https://doi.org/10.1002/acp.3777
  • *Maier, B. (2006). Glaubhaftigkeitsdiagnostik von Zeugenaussagen: Eine diskriminanzanalytische Untersuchung [Credibility assessment of witness statements: A discriminant analysis]. VDM Verlag Dr. Müller.
  • Manzanero, A. L., Scott, M. T., Vallet, R., Aróztegui, J., & Bull, R. (2019). Criteria-based Content Analysis in true and simulated victims with intellectual disability. Anuario de Psicología Jurídica, 29(1), 55–60. https://doi.org/10.5093/apj2019a1
  • Masip, J., Alonso, H., Garrido, E., & Herrero, C. (2009). Training to detect what? The biasing effects of training on veracity judgments. Applied Cognitive Psychology, 23(9), 1282–1296. https://doi.org/10.1002/acp.1535
  • Masip, J., Sporer, S. L., Garrido, E., & Herrero, C. (2005). The detection of deception with the reality monitoring approach: A review of the empirical evidence. Psychology, Crime & Law, 11(1), 99–122. https://doi.org/10.1080/10683160410001726356
  • Mittermaier, C. J. A. (1834). Die Lehre vom Beweise im deutschen Strafprozesse: Nach der Fortbildung durch Gerichtsgebrauch und deutsche Gesetzbücher in Vergleichung mit den Ansichten des englischen und französischen Strafverfahrens [The doctrine of evidence in German criminal procedure: With consideration of court usage and German law codes in comparison with the views of English and French criminal procedures]. Heyer.
  • Nahari, G., Ashkenazi, T., Fisher, R. P., Granhag, P.-A., Hershkowitz, I., Masip, J., Meijer, E. H., Nisin, Z., Sarid, N., Taylor, P. J., Verschuere, B., & Vrij, A. (2019). “Language of Lies”: Urgent issues and prospects in verbal lie detection research. Legal and Criminological Psychology, 24(1), 1–23. https://doi.org/10.1111/lcrp.12148
  • *Niehaus, S. (2001). Zur Anwendbarkeit inhaltlicher Glaubhaftigkeitsmerkmale bei Zeugenaussagen unterschiedlichen Wahrheitsgehaltes [Applicability of content criteria to testimonies with different truth status]. Europäische Hochschulschriften.
  • Niehaus, S. (2008). Merkmalsorientierte Inhaltsanalyse [Criteria-based Content Analysis]. In R. Volbert & M. Steller (Eds.), Handbuch der Rechtspsychologie (pp. 311–321). Hogrefe.
  • Niehaus, S., Krause, A., & Schmidke, J. (2005). Täuschungsstrategien bei der Schilderung von Sexualstraftaten [Deception strategies in descriptions of sexual offenses]. Zeitschrift für Sozialpsychologie, 36(4), 175–187. https://doi.org/10.1024/0044-3514.36.4.175
  • Niveau, G. (2020). Sensory information in children’s statements of sexual abuse. Forensic Sciences Research, 6(2), 97–102. https://doi.org/10.1080/20961790.2020.1814000
  • Oberlader, V. A., Naefgen, C., Koppehele-Gossel, J., Quinten, L., Banse, R., & Schmidt, A. F. (2016). Validity of content-based techniques to distinguish true and fabricated statements: A meta-analysis. Law and Human Behavior, 40(4), 440–457. https://doi.org/10.1037/lhb0000193
  • Oberlader, V. A., Quinten, L., Banse, R., Volbert, R., Schmidt, A. F., & Schönbrodt, F. D. (2021). Validity of content-based techniques for credibility assessment. How telling is an extended meta-analysis taking research bias into account? Applied Cognitive Psychology, 35(2), 393–410. https://doi.org/10.1002/acp.3776
  • Orbach, Y., & Lamb, M. E. (1999). Assessing the accuracy of a child's account of sexual abuse: A case study. Child Abuse & Neglect, 23(1), 91–98. https://doi.org/10.1016/S0145-2134(98)00114-8
  • Orbach, Y., & Lamb, M. E. (2007). Young children’s references to temporal attributes of allegedly experienced events in the course of forensic interviews. Child Development, 78(4), 1100–1120. https://doi.org/10.1111/j.1467-8624.2007.01055.x
  • Peters, K. (1972/1974/1976). Fehlerquellen im Strafprozess [Sources of error in criminal proceedings] (3 vols.). C. F. Müller.
  • Raskin, D. C., & Esplin, P. W. (1991a). Assessment of children's statements of sexual abuse. In J. Doris (Ed.), The suggestibility of children's recollections. Implications for eyewitness testimony (pp. 153–164). American Psychological Association.
  • Raskin, D. C., & Esplin, P. W. (1991b). Statement Validity Assessment: Interview procedures and content analysis of children's statements of sexual abuse. Behavioral Assessment, 13, 265–291.
  • Rönspies-Heitman, J. (2022). Kriterienorientierte Inhaltsanalyse von Zeugenaussagen: Eine empirische Untersuchung zur Validität ausgewählter Glaubhaftigkeitsmerkmale [Criteria-based Content Analysis of witnesses’ testimonies: An empirical investigation of the validity of selected credibility criteria]. Springer Nature.
  • Roberts, K. P., & Lamb, M. E. (2010). Reality monitoring characteristics in confirmed and doubtful allegations of child sexual abuse. Applied Cognitive Psychology, 24(8), 1049–1079. https://doi.org/10.1002/acp.1600
  • *Roma, P., San Martini, P., Sabatello, U., Tatarelli, R., & Ferracuti, S. (2011). Validity of Criteria-based Content Analysis (CBCA) at trial in free-narrative interviews. Child Abuse & Neglect, 35(8), 613–620. https://doi.org/10.1016/j.chiabu.2011.04.004
  • Ruby, C. L., & Brigham, J. C. (1998). Can Criteria-based Content Analysis distinguish between true and false statements of African-American speakers? Law and Human Behavior, 22(4), 369–388. https://doi.org/10.1023/A:1025766825429
  • *Rüth-Bemelmanns, E. (1984). Experimentelle Erprobung der Kriterien der Aussagenanalyse [Experimental pilot evaluation of criteria of statement analysis]. [Unpublished diploma thesis]. University of Cologne.
  • *Scheinberger, R. (1993). Inhaltliche Realkennzeichen in Aussagen von Erwachsenen [Content credibility criteria in testimonies of adults]. [Unpublished diploma thesis]. Freie Universität Berlin.
  • Schemmel, J., Maier, B. G., & Volbert, R. (2020). Verbal baselining: Within-subject consistency of CBCA scores across different truthful and fabricated accounts. European Journal of Psychology Applied to Legal Context, 12(1), 35–42. https://doi.org/10.5093/ejpalc2020a4
  • *Schubert, J. (1999). Experimentelle Untersuchung zur Validierung extralinguistischer und verbaler Glaubwürdigkeitskriterien in Berichten über erlebte und phantasierte Ereignisse [Experimental study to test the validity of extralinguistic and verbal content criteria in accounts of true and fabricated events]. [Unpublished diploma thesis]. Justus-Liebig-University Giessen.
  • *Schwind, D. (2006). Testkritische Analyse der Realkennzeichen nach Steller und Köhnken anhand von Daten aus Glaubhaftigkeitsgutachten [Psychometric analysis of credibility criteria of Steller and Köhnken on the basis of data from forensic credibility assessments]. [Diploma thesis]. University of Konstanz. http://www.ub.uni-konstanz.de/kops/volltexte/2006/1997/
  • *Sporer, S. L. (1998a, March). CBCA criteria ratings of a quasi-experiment on an overnight military exercise in the Scottish Highlands [Unpublished raw data]. University of Aberdeen, Scotland/University of Giessen, Germany.
  • Sporer, S. L. (1998b, March). Detecting deception with the Aberdeen Report Judgment Scales (ARJS): Theoretical development, reliability and validity [Paper presentation]. Biennial Meeting of the American Psychology-Law Society, Redondo Beach, CA, United States.
  • Sporer, S. L. (2004). Reality monitoring and the detection of deception. In P.-A. Granhag & L. A. Strömwall (Eds.), The detection of deception in forensic contexts (pp. 64–102). Cambridge University Press. https://doi.org/10.1017/CBO9780511490071.004
  • Sporer, S. L. (2016). Deception and cognitive load: Expanding our horizon with a working memory model. Frontiers in Psychology, 7, 420. https://doi.org/10.3389/fpsyg.2016.00420
  • Sporer, S. L. (2021). Verfahrensfehler und Justizirrtümer: Kognitive und soziale Erklärungsansätze [Procedural errors and miscarriages of justice: Cognitive and social explanations]. In R. Deckers & G. Köhnken (Eds.), Erhebung und Bewertung von Zeugenaussagen (pp. 163–207). Berliner Wissenschaftsverlag.
  • Sporer, S. L., & Antonelli, M. (2022). Psychology of eyewitness testimony in Germany in the 20th century. History of Psychology, 25(2), 143–169. https://doi.org/10.1037/hop0000199
  • Sporer, S. L., Manzanero, A. L., & Masip, J. (2021). Optimizing CBCA and RM research: Recommendations for analyzing and reporting data on content cues to deception. Psychology, Crime & Law, 27(1), 1–39. https://doi.org/10.1080/1068316X.2020.1757097
  • Sporer, S. L., & Masip, J. (2023). Millennia of legal content criteria of lies and truths: Wisdom or common-sense folly? Frontiers in Psychology, 14, 1219995. https://doi.org/10.3389/fpsyg.2023.1219995
  • Sporer, S. L., Masip, J., & Cramer, M. (2014). Guidance to detect deception with the Aberdeen Report Judgment Scales: Are verbal content cues useful to detect false accusations? American Journal of Psychology, 127(1), 43–61. https://doi.org/10.5406/amerjpsyc.127.1.0043
  • *Steck, P., Hermanutz, M., Lafrenz, B., Schwind, D., Hettler, S., Maier, B., & Geiger, S. (2010). Die psychometrische Qualität von Realkennzeichen [The psychometric quality of reality criteria]. https://doi.org/10.25968/opus-263
  • Steller, M. (1989). Recent developments in statement analysis. In J. C. Yuille (Ed.), Credibility assessment (pp. 135–154). Kluwer.
  • Steller, M., & Boychuk, T. (1992). Children as witnesses in sexual abuse cases: Investigative interview and assessment techniques. In H. Dent & R. Flin (Eds.), Children as witnesses (pp. 47–71). Wiley.
  • Steller, M., & Köhnken, G. (1989). Criteria-based Statement Analysis. In D. C. Raskin (Ed.), Psychological methods in criminal investigation and evidence (pp. 217–245). Springer.
  • Steller, M., & Volbert, R. (1999). Wissenschaftliches Gutachten. Forensisch-aussagepsychologische Begutachtung (Glaubwürdigkeitsbegutachtung) [Scientific expert evaluation. Forensic psychological statement evaluation (credibility assessment)]. Praxis der Rechtspsychologie, 9(2), 46–112.
  • Steller, M., Wellershaus, P., & Wolf, T. (1992). Realkennzeichen in Kinderaussagen: Empirische Grundlagen der Kriterienorientierten Aussagenanalyse [Reality criteria in the statements of children: Empirical foundations of Criteria-based Content Analysis]. Zeitschrift für Experimentelle und Angewandte Psychologie, 34, 151–170.
  • Stern, L. W. (1926). Jugendliche Zeugen in Sittlichkeitsprozessen [Juvenile witnesses in criminal trials of sexual abuse]. Quelle & Meyer.
  • Strömwall, L. A., Bengtsson, L., Leander, L., & Granhag, P.-A. (2004). Assessing children's statements: The impact of a repeated experience on CBCA and RM ratings. Applied Cognitive Psychology, 18(6), 653–668. https://doi.org/10.1002/acp.1021
  • Szewczyk, H. (1973). Kriterien der Beurteilung kindlicher Zeugenaussagen [Content criteria for credibility assessment of children’s testimonies]. Probleme und Ergebnisse der Psychologie, 46, 47–66.
  • Tedeschi, J. T., & Norman, N. (1985). Social power, self-presentation, and the self. In B. R. Schlenker (Ed.), The self and social life (pp. 293–322). McGraw-Hill.
  • Trankell, A. (1972). Reliability of evidence. Rotobeckmann.
  • Undeutsch, U. (1967). Beurteilung der Glaubhaftigkeit von Zeugenaussagen [Evaluation of the credibility of eyewitness statements]. In U. Undeutsch (Ed.), Handbuch der Psychologie (Vol. 11, pp. 26–181). Hogrefe.
  • Volbert, R. (2018). Scheinerinnerungen von Erwachsenen an traumatische Erlebnisse und deren Prüfung im Rahmen der Glaubhaftigkeitsbegutachtung: Eine rein traumatologische Perspektive ist irreführend [Pseudomemories of adults of traumatic experiences and their examination in the context of credibility assessment: A purely traumatological perspective is misleading]. Praxis der Rechtspsychologie, 28, 61–95.
  • Volbert, R., & Steller, M. (2014). Is this testimony truthful, fabricated, or based on false memory? European Psychologist, 19(3), 207–220. https://doi.org/10.1027/1016-9040/a000200
  • Vrij, A. (2005). Criteria-based Content Analysis: A qualitative review of the first 37 studies. Psychology, Public Policy, and Law, 11(1), 3–41. https://doi.org/10.1037/1076-8971.11.1.3
  • Vrij, A. (2008). Detecting lies and deceit: Pitfalls and opportunities. Wiley.
  • Vrij, A., Akehurst, L., Soukara, S., & Bull, R. (2002). Will the truth come out? The effect of deception, age, status, coaching, and social skills on CBCA scores. Law and Human Behavior, 26(3), 261–283. https://doi.org/10.1023/A:1015313120905
  • Vrij, A., Granhag, P.-A., & Porter, S. (2010). Pitfalls and opportunities in nonverbal and verbal lie detection. Psychological Science in the Public Interest, 11(3), 89–121. https://doi.org/10.1177/1529100610390861
  • Vrij, A., Kneller, W., & Mann, S. (2000). The effect of informing liars about Criteria-based Content Analysis on their ability to deceive CBCA-raters. Legal and Criminological Psychology, 5(1), 57–70. https://doi.org/10.1348/135532500167976
  • Vrij, A., & Nahari, G. (2019). The verifiability approach. In J. J. Dickinson, N. Schreiber Compo, R. N. Carol, B. L. Schwartz, & M. R. McCauley (Eds.), Evidence-based investigative interviewing. Applying cognitive principles (pp. 116–133). Routledge.
  • Vrij, A., Nahari, G., Isitt, R., & Leal, S. (2016). Using the verifiability lie detection approach in an insurance claim setting. Journal of Investigative Psychology and Offender Profiling, 13(3), 183–197. https://doi.org/10.1002/jip.1458
  • Walczyk, J. J., Harris, L. L., Duck, T. K., & Mulay, D. (2014). A social-cognitive framework for understanding serious lies: Activation-Decision-Construction-Action Theory. New Ideas in Psychology, 34, 22–36. https://doi.org/10.1016/j.newideapsych.2014.03.001
  • Wegener, H. (1981). Einführung in die Forensische Psychologie [Introduction to forensic psychology]. Wissenschaftliche Buchgesellschaft Steinkopff.
  • *Welle, I., Berclaz, M., Lacasa, M.-J., & Niveau, G. (2016). A call to improve the validity of criterion-based content analysis (CBCA): Results from a field-based study including 60 children's statements of sexual abuse. Journal of Forensic and Legal Medicine, 43, 111–119. https://doi.org/10.1016/j.jflm.2016.08.001
  • *Wolf, P., & Steller, M. (1997). Realkennzeichen in Aussagen von Frauen. Zur Validierung der Kriterienorientierten Aussageanalyse für Zeugenaussagen von Vergewaltigungsopfern [Reality criteria in statements of women. Validating Criteria-based Content Analysis for statements of rape victims]. In L. Greuel, T. Fabian, & M. Stadler (Eds.), Psychologie der Zeugenaussage (pp. 122–130). Psychologie Verlags Union.
  • Zuckerman, M., DePaulo, B. M., & Rosenthal, R. (1981). Verbal and nonverbal communication of deception. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 14, pp. 1–59). Academic Press. https://doi.org/10.1016/S0065-2601(08)60369-X