The problem of false positives in automated census linking: Nineteenth-century New York’s Irish immigrants as a case study

Abstract

Automated census linkage algorithms have become popular for generating longitudinal data on social mobility, especially for immigrants and their children. But what if these algorithms are particularly bad at tracking immigrants? This study utilizes a database on nineteenth-century Irish immigrants, generated from the most widely used algorithms, created by Abramitzky, Boustan, and Eriksson (ABE). Our objective is to assess the extent to which different individuals are erroneously linked together across census years and the consequences of these “false positives” for calculating social mobility. Our findings raise serious questions about the quality of the matches generated by the “first generation” of automated census linkage algorithms. False positives range from about one-third to one-half of all links. These bad links lead to sizeable estimation errors when measuring Irish immigrant social and geographic mobility.

For well over half a century, scholars have been trying to trace individuals’ occupations and locations over the course of their lifetimes to measure and compare rates of socio-economic mobility. Thernstrom’s pioneering work in the 1960s has inspired many imitators. Yet methodological issues and technological constraints—in particular, selection issues—stymied research in this subfield until computers made it easier to track people as they moved around the United States. In the 1990s, economists and other social scientists began to reexamine Thernstrom’s questions, now that they could do so with microprocessors rather than microfilm. Over the past decade or so, mobility studies have gained added impetus due to interest in the origins and history of inequality, easy access to online census databases, and the development of algorithms producing data that trace millions of individuals over many decades of census returns.

This article considers the performance of these algorithms in tracking individuals across census returns along several key dimensions. Specifically, we are concerned with the problem of “false positives,” cases of mistaken identity where different individuals with similar or near-identical characteristics are presumed to be the same person by the linkage algorithm. While several groups of scholars have created such algorithms, the algorithms of Abramitzky, Boustan, and Eriksson (hereafter “ABE”) and their collaborators are now widely used, in part because the authors have made both the datasets and the code that generated them easily available on their project website. Data generated by the first generation of ABE algorithms have been used recently for several ambitious studies of intra- and inter-generational mobility, focusing in particular on immigrant groups in the United States (e.g. Abramitzky, Boustan, and Eriksson 2012, 2014; Alexander and Ward 2018; Connor 2019; Pérez 2019; Beck Knudsen 2022). There is a clear need to assess the performance and accuracy of these algorithms in tracing immigrants across censuses, as well as the potential implications of poor algorithmic performance for social mobility research.

It was our own interest in American immigrants that led us to the world of automated census linking. Three of us have been working for a decade compiling and analyzing, with the help of a professional genealogist, a database tracking the lives of the 15,000 New Yorkers (mostly Irish-born refugees of the Great Famine) who opened accounts at the Emigrant Industrial Savings Bank (hereafter “ESB”) [footnote 1] from 1850 to 1858. We found that (1) the New York Irish saved much more money than we had imagined given the prevailing view that Irish immigrants in this era were mired in poverty; and (2) New York’s Famine immigrants enjoyed more upward occupational mobility than expected (Anbinder 2012; Anbinder, Ó Gráda, and Wegge 2019, 2022). Naturally, we realized that these results could be due in part to selection bias. A comparison (on which more below) of male Irish-born ESB account holders living in New York when they opened their accounts with a one-in-ten sample of New York City Irish males taken from the 1855 state census suggested that the former were not very different from the latter. But the possibility remained that those we managed to track over time may have been the more successful. Our search for a way of determining the socio-economic mobility of a true cross-section of New York’s Irish immigrant population led us to the ABE databases. In the combined 1860–1870 “crosswalk” from ABE (the one most relevant to our work) and the IPUMS complete-count census data for these years (Ruggles et al. 2020), there are 9623 Irish-born men aged 18–59 years linked using the ABE Standard algorithm and 3747 men using the Conservative algorithm.

Comparing the ESB and ABE-generated databases of New York’s Irish immigrants yielded two striking contrasts. First, the occupational mobility of the Irish generated by the ABE method, both upward and downward, was far greater than among the ESB’s customers. Second, the former were far more likely to change the state or county they lived in than the latter. We were prepared for some differences. However, the occupational and geographic mobility of the Irish immigrants in the ABE-generated database did not seem credible. These discrepancies raised concerns about the potential prevalence of such errors across entire research fields that are characterized by newly emerging record linkage algorithms and substantive findings derived from linked samples.

That was what led us to Archbishop John B. Purcell. Purcell, one of the best-known Irish immigrants in mid-nineteenth-century America, was a native of County Cork who emigrated to the United States in 1820. He graduated in 1823 from Mount St. Mary’s Seminary in Maryland, where he became college president seven years later. In 1833, he became the bishop of Cincinnati, and in 1850, archbishop, with jurisdiction over the entire American Midwest. In that role, Purcell became a strident defender of American Catholics against the attacks of nativist zealots, such as caricaturist Thomas Nast. Comparing occupations and locations in 1860 and 1870 of Irish immigrants linked in the database generated by the original ABE algorithm, we noticed Purcell. There he was, living in Cincinnati in 1860, his occupation listed as “RC [Roman Catholic] Archbishop,” albeit with his name spelled incorrectly as “Pursell.” Yet although Purcell remained in his post in southern Ohio until his death in 1883, the algorithm would have us believe that he had retired by 1870 and had moved to Philadelphia. In fact, he is easy to find in the 1870 census, still in Cincinnati, still “archbishop,” although with his surname now spelled correctly, which explains why the algorithm mistakes him for a “John Pursell” of the same age in Pennsylvania.

Our examination of all 95,000 or so of the adult male Irish-born Americans whom the original ABE algorithm traces from 1860 to 1870 suggests that about half of the links generated by the ABE “Standard” method are false positives like Purcell. Even if one uses more conservative variants of algorithmic linking (e.g., ABE “Conservative”), designed to limit false positives (or Type I errors), about a third of the links of Irish immigrants are false. We shall show that these errors do not result from any specific errors in the technical construction of the algorithm, but rather from the challenges that these algorithms face when they encounter imperfect data. In our case, these errors result primarily from rampant age misreporting and surprisingly wide variations in the spelling of names by the original census takers (who wrote each name out by hand) and contemporary census transcribers, who must decipher these sometimes barely legible census returns so that they can be easily searched by scholars, genealogical enthusiasts, and algorithms. The high incidence of false positives that we found raises serious questions about the reliability of algorithmic census linkage for studies of immigrant social mobility through historical census data.

While our analysis focuses on one algorithm that has become popular for the study of immigrant social mobility, our findings have broader implications for historical social science, where false positives in linked samples are likely pervasive and still underappreciated. Despite promising advancements in reducing false positives in historical census linkage through data and algorithmic enhancements (e.g., Helgertz et al. 2022; Abramitzky, Mill, and Pérez 2020; Abramitzky, Boustan, and Rashid 2020; Price et al. 2021), these newer approaches will continue to meet obstacles on the path to credibility [footnote 2]. This is due to deeper challenges in record linkage that have long been recognized in the computing sciences (e.g., Fellegi and Sunter 196; Christen 2012). Specifically, the task of reducing measurement errors associated with false links requires tradeoffs in terms of the size and representativeness of the linked sample, raising new concerns about selection bias (Doidge and Harron 2019; Antonie et al. 2020). Evaluating these tradeoffs, however, proves to be extremely difficult in historical census linkage, particularly given the great diversity of linkage algorithms and data sources in play (Ruggles, Fitch, and Roberts 2018; Ruggles and Magnuson 2020). Recognition of these problems calls for the continued validation of linked samples using comparative data sources for which we have high degrees of confidence and which approximate ground truth.

Brief literature survey

Systematic attempts at creating longitudinal datasets from census returns began with Thernstrom’s Poverty and Progress (1964), an analysis of working-class males in the town of Newburyport in northeast Massachusetts. Most of the “hundreds of obscure men” studied by Thernstrom were Irish; he reckoned that neither they nor their compatriots in Boston, nor their children in either place, progressed very far in terms of social mobility (1964, 223; 1973, 89, 142–143, 247; 1986, 42). Thernstrom’s method constrained him to tracking “persisters,” i.e., those who remained in Newburyport. Thanks to the creation in the 1980s of searchable state-level census indexes, finding those who moved became feasible, albeit extraordinarily laborious, allowing subsequent work in this area to include “non-persisters” as well. It emerged that those who moved were more upwardly mobile than “persisters.” This corroborated the common presumption that the “best and the brightest” of the poor are the most likely to relocate in search of better employment opportunities. At that stage, however, most of this research was still local and focused on a particular town or county (e.g. Kessner 1977; Griffen and Griffen 1978; Galenson and Pope 1989).

By developing an iterative linking strategy that matched males of all ages across the entire US with the aid of national census indexes, Ferrie (1996, 1999; see too Herscovici 1998) set the research agenda for a new generation [footnote 3]. From the 1850 U.S. census to 1860 he traced 580 individuals, a number which seems small today but was considered quite impressive a quarter century ago. In the early 2010s Abramitzky, Boustan, and Eriksson (2012, 2014) developed a more powerful variant of Ferrie’s approach, scaled up to involve the automated linkage of digitized complete-count census data. Over the past decade or so, variants of the methods used by ABE have generated research on topics as varied as the economic impact of public policies, the socio-economic progress of African Americans, and the return to investment in schooling. The most popular use of automated census links, however, has been for research on the occupational and geographical mobility of Americans, both native and immigrant (e.g. Abramitzky, Boustan, and Eriksson 2012, 2014; Alexander and Ward 2018; Connor 2019; Pérez 2019; Beck Knudsen 2022; Connor and Storper 2020).

While automated linkage algorithms mark a huge step forward in efforts to measure socio-economic mobility, they have not been immune to criticism. The primary concern has been the allegedly high rate of false positives (Massey 2017; Ruggles, Fitch, and Roberts 2018; Bailey et al. 2020; Sylvester and Hacker 2020). Establishing the prevalence of false positives is not straightforward, however, given the lack of what Bailey et al. (2020, 998–999) term “ground truth data.” This paper uses a case study that pits high-quality hand-linked data approximating “ground truth” against automatically linked data to gain insight into the prevalence of false positives.

The Emigrant Savings Bank and the search for “ground truth”

It is very unlikely that a lawyer named John Scanlon, age fifty-four, who owned his own home in Brooklyn in 1860 was (as the ABE algorithm implies) a property-less day laborer of the same name in the 1870 census who gave his age as sixty-three and lived in Scranton, Pennsylvania. But how can we be sure? Sometimes, it is not actually very hard for a professional genealogist to prove that algorithm-generated links are false. Our genealogist found that John Scanlon the 1870 day laborer was also in Scranton and also a day laborer in 1860 (albeit with his age in the census rounded to fifty), thereby proving that the ABE link of Scanlon the lawyer to Scanlon the day laborer is a false one. It was her analysis of 355 ABE links (those of every Irish immigrant listed as a doctor, lawyer, or clergyman in 1860) that led us to our estimation of the rate of false positives for Irish immigrants.

The ESB records facilitate such linkages over time because the bank went to extraordinary lengths, in an age before government-issued photo-identification, to protect its customers’ money. To that end, bank officials created “test books,” ledgers in which they compiled a wealth of personal information about all depositors, including their address; occupation; townland, parish, and Irish county of birth for Irish-born depositors; the name of the ship that carried them to America and the date of its arrival; their parents’ names (including mother’s maiden name) and whereabouts; their siblings’ names and whereabouts; their spouse’s name (including wife’s maiden name); and their children’s names. Then, when people visited the bank to deposit or withdraw money, a bank employee would “test” their identity by asking them their mother’s maiden name or which of their sisters still lived in Ireland. The bank would periodically update this information.

These records can be used to track both immigrants who remained in New York and those who left. In most cases, for example, it would be impossible to trace New Yorker Peter Lynch from 1850 to 1860, given that in the 1860 census there are 123 Irish-born Peter Lynches, dozens of whom are about the right age. But the test books list the names and birth order of Lynch’s five brothers and sisters, and the 1885 Minnesota state census lists a Peter Lynch living in the town of Faxon with five siblings whose names and birth order exactly match those of the bank customer. That information leads to still more evidence which proves that the Peter Lynch found in Faxon in the 1860 census is the New York Peter Lynch from 1850. Michael Egan, a bank customer who emigrated from County Clare to New York in 1850, was traced in a similar but even more circuitous manner. When one enters Michael’s name into a genealogical search engine along with his wife’s maiden name, Ellen Carey (found in the bank records), up pop two death records from the mid-twentieth century of Minnesotans whose parents had those exact names. This eventually allows us to determine that the Michael Egan found in the 1860 census in, of all places, Faxon, Minnesota is the New York Michael Egan who arrived in New York from Clare in 1850. We managed to track Egan to Minnesota even though he had been a customer of the bank for only five months. Our ESB longitudinal database is thus composed of people who were bank depositors at some point, typically for not more than a couple of years, rather than of people who remained customers of a bank in New York for a long time.

The reason why the ESB is such a rich resource for research on Irish immigrants [footnote 4] is that the institution was created by Irish-American philanthropists in 1850 specifically to provide a safe haven for the savings of Irish immigrants (Casey 2006, 2013; Ó Gráda 2003; Anbinder 2012). Anyone could open an account at the bank, and there were significant numbers of German, British, and native-born account holders. But while Irish immigrants made up a quarter of the city’s population when the bank opened, they comprised 71% of the bank’s customers in the 1850s. By the end of the decade, more than fifteen thousand people had opened accounts in the bank’s offices in lower Manhattan.

Scrutiny of the bank’s records reveals that its Irish depositors spanned the spectrum from destitute assisted immigrants to the cream of New York Irish society. Fourteen of those first 15,000 opened their accounts with the minimum deposit of $1, equivalent to an unskilled worker’s daily wage; and 659 of the 11,147 Irish-born depositors in our database made an initial deposit of $10 or less. In terms of occupations, the account holders mirrored New York’s Irish population pretty well. We compare male Irish-born ESB account holders who were living in New York when they opened their accounts with a one-in-ten sample of New York City Irish males taken from the 1855 state census, divided into six broad occupational categories. The professionals were mainly physicians, lawyers, and the like. Most of those characterized as “business owners” were people of modest means: grocers, saloonkeepers, druggists, and so forth. Most of the “lower-status white collar” workers were salesmen, clerks, overseers, teachers, and civil servants. The “skilled” category is composed mainly of craftsmen. They are a heterogeneous group; some, such as shoemakers and tailors, were under severe pressure from automation at this time, while others like jewelers and carriage makers could command fairly high wages. Those classified as “petty entrepreneurs” (pedlars, hucksters, fruit-stand operators, and so on) might also live in precarious circumstances. While petty entrepreneurs and business owners form somewhat higher shares of ESB customers than of the labor force as a whole, it is the preponderance of workers, skilled and unskilled, in both sets of data that is most striking. On the whole, the occupational distributions are similar, but with that of account holders skewed somewhat toward business and white-collar workers and away from those in the lowest-paying jobs the city had to offer. Three-quarters of the immigrant savers had arrived in America in 1846 or later.

Since savings banks in Ireland catered disproportionately to the lower-middle and middle classes in the bigger towns and cities, few of the ESB customers, who lived overwhelmingly in rural Ireland before coming to America, are likely to have been institutional savers before they emigrated (Ó Gráda 2003). Yet although vast swathes of the US west and south still contained no savings banks, the savings habit was widespread in the east and particularly in New York in the 1850s, and the New York Irish were also enthusiastic savers. Over the course of the 1850s, the number of Irish-born residents of New York City who opened accounts at the ESB was equal to nearly 8% of the city’s adult Irish-born population in 1855 (Anbinder, Ó Gráda, and Wegge 2019, 1598). Allowing for marriages, perhaps one in nine Irish immigrants were ESB depositors or married to one.

Given that all the depositors in our ESB database had arrived in America before the end of 1858, for the purpose of this paper we have created, with the help of a professional genealogist, a separate database of ESB customers found in the censuses of 1860 and 1870. We have managed to identify 744 of 1281 Irish-born male account holders aged between 18 and 59 years who lived in New York City or Brooklyn in the 1860 census and some place in the United States in the 1870 enumeration [footnote 5].

Many of the immigrants found in the 1860 census but missing from the 1870 tally had died during the 1860s and some had returned to Ireland; still more were either skipped by the census taker or had their names so badly recorded that they could not be located. Thomas Boran, for example, returned to County Kilkenny and got married there in 1865. Armagh-born Thomas Abbott, a blacksmith living in New York’s Ward Five in 1860, could not be located in 1870 but was located, still shoeing horses in Ward Five, in the 1880 census. Abbott probably still lived in Ward Five in 1870 but for some reason was missed by the 1870 enumerator. One of the clear advantages of the manual method of matching historical records for individuals is that it yields very few false positives. Moreover, as will be clear from the few examples given below, it detects many links that would be beyond the reach of the automatic linkage algorithms currently in use, links that those algorithms would miss as “false negatives” in machine-learning terms:

  1. William Singleton, account number 3288: He was recorded as aged 37 in 1860, 25 in 1870, and 31 in 1880. William was a harness maker in 1860 and 1870, but a laborer in 1880. His wife Anne’s ages were recorded as 34 in 1860, 46 in 1870, and 50 in 1880.

  2. Thomas Kiernan, account number 15,335: Censuses list him as 25 in 1855, 35 in 1860, and 35 in 1870. His wife’s recorded ages were 28, 32, and 40; they had no children. In reality, Kiernan was probably 45 in 1870. Opening his ESB account with $70, he held nearly $700 (equal to more than $20,000 in 2023) in the bank at one time.

  3. Matthew Quirk, account number 11,102: listed as 30 in 1855, 40 in 1860, and 35 in 1870. Quirk’s wife Margaret was recorded in 1855 as aged 25, with children Mary, 3, and Thomas, 10 months. Five years later she was recorded as aged 30, with Mary, 7; Thomas, 5; Ellen, 3; and Margaret, 1. In 1870 she was still listed as 30, with Mary, 17; Thomas, 15; Ellen, 13; Margaret, 10; and Jennie, 8.

  4. Peter Duggan, account number 3466: Censuses record his age as 43 in 1855, 25 in 1860, 65 in 1870, and 60 in 1880. In 1855 Peter and his wife Mary lived in the Sixth Ward with their children Michael 14, Charles 12, Mary 9, Dennis 6. Those children allowed the Duggans to be traced despite the erratic recording of Peter’s age. Despite his menial occupations—laborer to 1870, junkman by 1880—Peter at one point held over $1100 in the ESB.

Comparing the ABE and ESB databases

To make a like-to-like comparison with ABE links, we chose to consider here only male ESB customers living in New York City or Brooklyn in 1860. We limited our focus to men since women are much harder for an algorithm to trace because their surnames change when they marry. We compare our 744 ESB links to those formed by applying the two “first generation” versions of the ABE algorithm, on which a considerable body of research already rests, to a selection of Irish-born men living in New York City or Brooklyn in the 1860 full-count census database. In the standard version (ABE Standard), if the algorithm finds only one person with a certain name and birth year in one census, and then finds only one person with that name and birth year in the second census, this is considered a match. If there is no exact age match, the algorithm looks for someone in the second census who is either nine or eleven years older than the person it is attempting to match. If there is only one such person, this is considered a match. If there is still no match, it tries one more time with people of that name either eight or twelve years older than the person being searched. If there is no unique match, then this person from the 1860 census is eliminated from consideration. The names from census to census do not need to match exactly if the variations are deemed insignificant or are standard abbreviations. In the Conservative variant of the ABE algorithm, a surname-and-given-name combination must be unique within a 5-year age window. In what follows, we focus mainly on the Conservative variant but also report some results using the Standard variant, on which earlier studies rely (Abramitzky, Boustan, and Eriksson 2012, 2014; Ager, Boustan, and Eriksson 2021; Abramitzky 2019; Abramitzky, Boustan, Eriksson, et al. 2021) [footnote 6]. Of 97,573 Irish-born men aged 18–59 in 1860 who were recorded in the census of that year as living in Manhattan (then the entirety of New York City) and Brooklyn, the ABE Standard method matches 8281 of them (8.5%) to males with the same name living somewhere in the United States in the 1870 census. The ABE Conservative method matches 3370, or 3.4%, of the same group.
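The matching rule described above can be sketched in a few lines of Python. This is a minimal illustration of our reading of that description, not the ABE project's own code: the function names, the record layout (`id`, `name`, `birth_year`), the simple `norm` name cleaner, the treatment of ties (drop the record), and the reading of the "5-year age window" as plus or minus two birth years are all assumptions. The toy records are invented, loosely modeled on the "Pursell" example.

```python
from collections import Counter, defaultdict

def norm(name):
    # Stand-in for name standardization (assumption: lowercase, collapse spaces).
    return " ".join(name.lower().split())

def link_standard(census_a, census_b, max_band=2):
    """Link records (dicts with 'id', 'name', 'birth_year') from census_a to census_b."""
    counts_a = Counter((norm(r["name"]), r["birth_year"]) for r in census_a)
    by_name_b = defaultdict(list)
    for r in census_b:
        by_name_b[norm(r["name"])].append(r)

    links = {}
    for r in census_a:
        name, year = norm(r["name"]), r["birth_year"]
        if counts_a[(name, year)] != 1:
            continue  # name and birth year not unique in the first census
        for band in range(max_band + 1):  # exact year, then +/-1, then +/-2
            candidates = [c for c in by_name_b.get(name, [])
                          if abs(c["birth_year"] - year) == band]
            if len(candidates) == 1:
                links[r["id"]] = candidates[0]["id"]
                break
            if len(candidates) > 1:
                break  # ambiguous at this band: eliminate the record
    return links

def unique_in_window(rec, records, half_width=2):
    """True if rec's name occurs exactly once in records within +/- half_width years."""
    name, year = norm(rec["name"]), rec["birth_year"]
    hits = [r for r in records
            if norm(r["name"]) == name and abs(r["birth_year"] - year) <= half_width]
    return len(hits) == 1

def link_conservative(census_a, census_b):
    """Keep only standard links whose names are unique in a 5-year window on both sides."""
    a_by_id = {r["id"]: r for r in census_a}
    b_by_id = {r["id"]: r for r in census_b}
    return {a: b for a, b in link_standard(census_a, census_b).items()
            if unique_in_window(a_by_id[a], census_a)
            and unique_in_window(b_by_id[b], census_b)}

# Invented toy records (birth years are illustrative, not historical data).
census_1860 = [
    {"id": "a1", "name": "John Pursell", "birth_year": 1800},
    {"id": "a2", "name": "Pat Kelly", "birth_year": 1825},
    {"id": "a3", "name": "Tim Ryan", "birth_year": 1830},
    {"id": "a4", "name": "Tim Ryan", "birth_year": 1832},
]
census_1870 = [
    {"id": "b1", "name": "John Purcell", "birth_year": 1800},  # the same man, spelled correctly
    {"id": "b2", "name": "John Pursell", "birth_year": 1800},  # a different man
    {"id": "b3", "name": "Pat Kelly", "birth_year": 1824},
    {"id": "b4", "name": "Pat Kelly", "birth_year": 1826},
    {"id": "b5", "name": "Tim Ryan", "birth_year": 1830},
    {"id": "b6", "name": "Tim Ryan", "birth_year": 1833},
]

standard_links = link_standard(census_1860, census_1870)
conservative_links = link_conservative(census_1860, census_1870)
```

In this toy setting both variants link the misspelled 1860 record `a1` to the unrelated `b2`, because the linker sees only the name and the birth year. The Conservative filter removes the two "Tim Ryan" links, at the cost of a smaller linked sample.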

Contrasting results

The contrast between the geographic and occupational mobility rates for New York’s Irish immigrants as measured by the ABE algorithms and our hand links could not be more stark. To measure class mobility, we invoke the six-category occupational classification scheme already used above. We compared the two ABE linkage variants to our genealogist’s “hand links” of ESB customers from the same decade. The results provide a like-for-like comparison between ABE-linked data and the hand-linked ESB customers [footnote 7].

We expected that positive selection might lead the occupation transition patterns within our ESB data to differ from ABE’s, but not nearly to the extent described. Hand linking shows significant persistence in all occupational categories between 1860 and 1870. This is reflected in the percentages in the diagonal of Panel A. In all but one category, the small one created for clerks and civil servants (LSWC), three-quarters or more of the workers remained in the same occupational category a decade later. The outcomes using either version of ABE are in stark contrast, with only the “Unskilled” category showing a persistence rate of >50%, and the rest indicating persistence percentages in the thirties, twenties, or even teens. The ABE links suggest that the smaller the occupational category, the smaller the rate of persistence, which is highly suspect. In contrast, we found that the very smallest category—that of “professionals,” such as doctors and lawyers—actually had the highest rate of persistence, as one would expect.

The algorithm-generated data for overall upward and downward occupational mobility seem far-fetched as well: The ABE Standard (Panel B) would have us believe that 31.0% of Irish-born doctors and lawyers in 1860 were working in unskilled occupations just a decade later, while even the ABE Conservative (Panel C) suggests that 38.4% of business owners in 1860 held unskilled occupations just a decade later. If we were to calculate these transitions in reverse, looking retrospectively at occupational persistence from 1870 to 1860, the patterns would be similarly alarming. We would observe that ∼18% of business owners in 1870 were unskilled just a decade earlier, a rate of upward mobility that even Horatio Alger would find implausibly high. Whereas fewer than one-in-four of the hand-linked savers changed occupational status over the decade (with 14.9% moving up and 8.4% moving down), even ABE Conservative indicates that more than half of the Irish had changed occupational categories, with 26% moving up and 25% moving down. Given some likely selection bias, that the hand-linked savers performed better, on average, did not surprise us; what made us uneasy was the huge movements across occupational categories in the automated matches.

We next compare what the ABE and hand-link approaches predict for the rates at which the New York Irish changed location over the decade. Hand linking suggests that 6% of the ESB’s New York City and Brooklyn customers changed their state of residence in the 1860s, and that 14% changed counties. This may seem somewhat lower than what one would expect for the general population, but the rate of geographic movement of those traced by the ABE links seems much more out of line, with half to three quarters changing county or state over the decade.

As a further “reality check” of sorts, we augmented the ABE-generated data to see if the linked individuals had spouses and, if so, whether or not that spouse’s name was the same in the 1860 and 1870 censuses. When we discovered that requiring precise first name matches for the spouse excluded many good links, we modified the rule so that only the first four letters of the first name had to match, and we made allowances for abbreviations like Maggie, Lizzie, Kate, etc. For the vast majority of Irish-born people then and much later, divorce was not an option. Irish immigrants may have died at a slightly higher rate than other Americans in this era, yet it is plain from municipal death records that two-thirds to three-quarters of New York’s Irish-born men cannot have lost spouses in a ten-year timespan. The 2% rate of remarriage implicit in our ESB customer database (that is, cases where the wives’ names unambiguously differ) is a tiny fraction of that generated by both the ABE Conservative and Standard variants. Given that our genealogist uses wives’ names to help confirm links, there are undoubtedly widowers whom she cannot identify with certainty (though if the first marriage produced several children, then the married man of 1860 can often be found with his children in 1870 even if he has remarried). Nonetheless, this inconceivable rate of remarriage in ABE links is another indication of large numbers of false positives.
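The relaxed spouse-name rule can be illustrated with a short Python sketch. It assumes a small, purely illustrative nickname table (the three entries are our guesses at how Maggie, Lizzie, and Kate would be expanded) and hypothetical function names; the actual list of abbreviations used for the check is not given in the text.

```python
# Illustrative nickname table only; a real check would need a much fuller list.
NICKNAMES = {
    "maggie": "margaret",
    "lizzie": "elizabeth",
    "kate": "catherine",
}

def canonical(first_name):
    # Lowercase, trim, and expand a known nickname to its formal form.
    name = first_name.strip().lower().rstrip(".")
    return NICKNAMES.get(name, name)

def spouse_names_agree(name_1860, name_1870, prefix_len=4):
    """True if the two first names share their first prefix_len letters
    after nickname expansion; blank names never count as agreement."""
    a, b = canonical(name_1860), canonical(name_1870)
    if not a or not b:
        return False
    return a[:prefix_len] == b[:prefix_len]

# A link whose wife is "Margaret" in 1860 and "Maggie" in 1870 passes the check,
# while one whose wife is "Mary" in 1860 and "Ellen" in 1870 is flagged.
same_wife = spouse_names_agree("Margaret", "Maggie")
different_wife = spouse_names_agree("Mary", "Ellen")
```

A four-letter prefix tolerates transcription variants such as "Margt" for "Margaret", but it would still miss spelling changes in the first letters (for example "Catherine" versus "Katherine"), which is one reason a flagged link is a warning sign rather than proof of a false positive.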

The relationship between spouse names and geographic persistence provides further evidence that the ABE Irish links must contain many false positives. The immigrants who are married to the same spouse in both 1860 and 1870 (and are thus most likely to be valid links) remain primarily in the same state from census to census, while those listed as married to a different spouse are mostly found in different states. Even if we can imagine that a widower might leave behind his support network and move with his children to a new state to start afresh, the rates for such behavior implied by the ABE links are not credible. This again implies a massive misidentification of immigrant males in the matched sample. An added indication that there are many false positives in the ABE links is that the correlation between the occupational classes in 1860 and 1870 is high for those who were found in the same location in 1870 as in 1860 but negligible for those who supposedly migrated. To contrast the outcomes for stayers and movers, we invoke an occupational classification scheme, HISCO/HISCLASS, which derived its inspiration from the ILO’s International Standard Classification of Occupations (ISCO). The scheme (van Leeuwen, Maas, and Miles 2002; van Leeuwen and Maas 2011) has been widely used by economic historians (e.g. Breschi et al. 2014; Dribe, Hacker, and Scalone 2014; Vickers and Ziebarth 2016; Connor 2019). It ranks occupations from professional, managerial, and white-collar occupations (with HISCO values of up to 30,000), through farming, skilled, commercial, and artisanal occupations (with values between 30,001 and 89,999), to unskilled occupations (with values of 90,000 and above). These loosely represent the upper, mid, and lower steps on the occupational ladder. Correlations between HISCO_1860 and HISCO_1870 are consistently high for those who stay put whereas they are negligible for those who migrate, an added indication that there is something amiss with the location of 1870 matches [footnote 8].

What of the possibility that those not linked by hand differed systematically from those who were? The tables in Appendix A show that while such biases were present, they were not very powerful. Those linked by hand were more likely to be professionals or businessmen, while the less skilled were more likely to be lost. Our hand links also show that those who left New York did have a bit more upward mobility (attributable primarily to the propensity of these men to become farmers) than those who remained in New York, but that overall there was still a strong relationship between one’s occupation in 1860 and that followed in their new location. When the ABE linkage procedure links an Irish-born male in two different locations, the resulting data indicate virtually no relationship between that man’s 1860 and 1870 occupation, a highly implausible finding. Put another way, the ABE-linked Irish immigrants who supposedly move hardly ever stay in the same occupation when they do so. This is even the case for workers in hard-to-learn, highly sought-after, well-paid trades like baking, butchering, masonry, plumbing, printing, and stonecutting. It is not plausible that 90% to 100% of the workers in all these trades who left New York would have abandoned them, as the ABE links would have us believe.

Irishmen of an uncertain ageFootnote9

Why do the ABE links of Irish immigrants apparently contain so many false positives? The erratic recording of ages noted above pointed us to one answer: the prevalence of age heaping in census entries for the foreign born. Measures of age heaping, such as the Whipple Index, have long been used by social scientists and historians as a proxy for numeracy or cognitive ability.Footnote10 The Whipple value “is obtained by summing the age returns between 23 and 62 years inclusive and finding what percentage is borne by the sum of the returns of years ending with 5 and 0 to one-fifth of the total sum.”Footnote11 The value of the index can range from 100 (no age heaping) to 500 (complete age heaping). In poor economies, Whipple values are typically high, particularly for females.
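The definition quoted above translates directly into a short calculation. The following Python sketch is illustrative only: the function name `whipple_index` and the sample inputs are assumptions, not part of the study's own code.

```python
from typing import Iterable


def whipple_index(ages: Iterable[int]) -> float:
    """Whipple index for recorded ages, using ages 23 to 62 inclusive.

    The index is the count of ages ending in 0 or 5, taken as a percentage
    of one-fifth of all ages in the range: 100 * heaped / (total / 5),
    which is the same as 500 * heaped / total.
    """
    in_range = [age for age in ages if 23 <= age <= 62]
    if not in_range:
        raise ValueError("no ages between 23 and 62")
    heaped = sum(1 for age in in_range if age % 5 == 0)
    return 500.0 * heaped / len(in_range)


# One person at every age from 23 to 62: no heaping at all.
print(whipple_index(range(23, 63)))   # 100.0
# Everyone reported at a round age: complete heaping.
print(whipple_index([30, 40, 50]))    # 500.0
```

The 23 to 62 window contains exactly eight ages ending in 0 or 5 out of forty, which is why an evenly spread population yields 100.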

In a pioneering comparative study of age heaping in many countries over several centuries, A’Hearn, Baten, and Crayen (Citation2009, 792) found that census data from mid-nineteenth century Ireland produce very high Whipple scores.Footnote12 Irish Americans in the same period had even higher Whipple values, however, prompting the inference that the Irish who came to America must have had “extremely low levels of age numeracy” (italics in the original). Irish-born ESB depositors and Irish-born New Yorkers generally also yield extraordinarily high Whipple values by any comparative standard, matching or exceeding those inferred from the Irish population census of 1841 (BPP Citation1843). Why should age heaping among the New York Irish be higher than among the Irish at home? It is not because they were less likely to be literate. Beginning in the late 1850s, the ESB asked depositors to sign their names in the test books when they opened an account, and these data are far more reliable than self-reported reading and writing ability in measuring literacy rates. They indicate that only 20% of Irish-born male New Yorkers could not write their names, far fewer than in Ireland. Moreover, 63% of adult women could not sign, yet while innumeracy ought to correlate with illiteracy, Irish-born men in New York were marginally more likely to be age-heaped than women in 1850. Such outcomes add to the evidence that there is more at play in age heaping than relative numeracy ().

The other source of age-heaping (as well as other errors, such as misspellings and under-enumeration), which the linking literature tends to ignore or downplay, is shoddy enumeration on the part of census marshals and assistant marshals.Footnote13 Indeed, the instructions given to census enumerators allowed for age “approximation.” Why would census enumerators have taken advantage of this time-saving shortcut with Irish New Yorkers more than others? Perhaps fearing crime or disease, census marshals might have been afraid of spending too much time in Irish tenements and more likely to guess at ages to complete their visits as quickly as possible. Prejudice may also have been involved. Perhaps the Irish—stereotyped as degraded, drunken brutes—were not perceived to deserve the careful consideration given to other groups. Eventually, the state would demand a more accurate accounting of its citizens. A’Hearn, Delfino, and Nuvolari (Citation2021) ascribe the sharp drop in age heaping in southern Italy between 1881 and 1901 to “the state’s increased allocation of resources to census operations, its enhanced technical competence, its increasing success in overcoming the suspicions and enlisting the cooperation of its citizens, and its growing ability to monitor and control the actions of local government.” Age heaping declined in the American censuses in the same period, probably for the very same reasons.Footnote14

Researchers are aware of the problem age heaping causes for census linking (Ferrie Citation1999, 22; Cirenza Citation2011, 55, 68; Cirenza Citation2016; Bailey et al. Citation2020, 1001), but perhaps not sufficiently so. shows the spread in age differences in more detail for these two censuses as well as 1850–1860. The “All” columns report percentages with age differences outside a five-year band centered on 10 (italicized in bold). Note that 46.5% of all potential links would be under the radar of an algorithm using a 5-year age band in 1860–70, and 52.2% of links if age is heaped in 1860.

This age heaping would not be a problem if it merely led to Type II errors, in which potential links were left unidentified. What happens instead, however, is that if a Patrick Connor in New York is asked his exact age in 1860 by a conscientious census taker, but his lazy successor in 1870 estimates Connor’s age, while an enumerator encountering a Patrick Connor in Texas is lazy in 1860 but diligent in 1870, then the algorithm can be led to believe that the New York Connor moved to Texas and the Texas Connor relocated to New York, when in fact neither man moved at all. This may be less of a problem in the twentieth century when census takers had to record each resident’s year of birth. But in the nineteenth century, when enumerators only had to record an age and estimation was permitted, the problem was pervasive and it leads to thousands of Type I errors for Irish immigrants in the 1860–1870 ABE crosswalk.
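The Patrick Connor scenario can be reproduced with a toy linker. The sketch below is a deliberately simplified stand-in, not the ABE code: it links on exact name plus an age gap of 8 to 12 years and accepts a link only when exactly one candidate survives. The ages are invented for illustration; both men are assumed to be truly 37 in 1860 and 47 in 1870.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    first: str
    last: str
    state: str
    age: int  # age as written down by the enumerator


def link(person: Entry, later_census: List[Entry],
         min_gap: int = 8, max_gap: int = 12) -> Optional[Entry]:
    """Link on exact name and age gap; accept only a unique candidate."""
    candidates = [
        entry for entry in later_census
        if (entry.first, entry.last) == (person.first, person.last)
        and min_gap <= entry.age - person.age <= max_gap
    ]
    return candidates[0] if len(candidates) == 1 else None


census_1860 = [
    Entry("Patrick", "Connor", "New York", 37),  # exact age recorded
    Entry("Patrick", "Connor", "Texas", 40),     # true age 37, rounded to 40
]
census_1870 = [
    Entry("Patrick", "Connor", "New York", 50),  # true age 47, rounded to 50
    Entry("Patrick", "Connor", "Texas", 47),     # exact age recorded
]

for person in census_1860:
    match = link(person, census_1870)
    print(person.state, "->", match.state if match else "no link")
```

With these inputs each true pairing falls outside the age band (gaps of 13 and 7 years), while each crossed pairing shows a gap of exactly 10 years, so the toy linker sends the New York man to Texas and the Texas man to New York.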

Name variations

Variations in name spelling, like that of Archbishop Purcell, are the other main cause of false positives generated by the ABE algorithm. This fact became clear when we subjected our ESB links to scrutiny by the ABE algorithm. We identified 102 Emigrant Bank account holders in the 1860 census who were matched to people in 1870 by both manual and ABE methods. Only in 56 cases did the matches concur. In about half of the instances in which our genealogist determined that the ABE algorithm made a false match, the age difference in the true match was outside the range of 8-to-12 years permitted by ABE. In most of the other cases, the problem that had tripped up the algorithm was a variation in surname spelling.

Some of these name-spelling variations are so extreme that no automated system will likely ever be able to identify them. For example, when reading the entry of a census enumerator from 1870 with messy handwriting, a modern transcriber recorded Robert Baxter’s name as “Robert Bartoo,” leading the ABE algorithm to link Baxter, a Michigan brass finisher and former ESB customer, to a book binder named Robert Baxter in New York rather than the correct Robert Baxter, who was still a brass finisher and still living in the same town in Michigan in 1870. In a more typical case, the ABE algorithm incorrectly concludes that the New York chairmaker Hugh Donohoe from the 1860 census is Hugh Donohoe the Minnesota farmer in the 1870 census because in that latter year, the enumerator in New York spelled Hugh’s surname as “Donahue” ().

In many cases, both a name spelling variation and age heaping prompt the ABE algorithm to make an erroneous match. The algorithm believes, for example, that ESB customer Lawrence Fleming, a New York day laborer living with wife Jane and son Edward in 1860, had moved to Pennsylvania by 1870, become a carter, remarried a Catherine, and was now childless. In fact, Fleming was still in New York, still a day laborer, still married to Jane, and still had a son named Edward, but the ABE method cannot make this link because the sloppy New York census enumerator in 1870 recorded Fleming’s age as thirty when in fact he was 9 years older. But even if that census taker had correctly documented Fleming’s age, the ABE algorithm would have still made the same bad link because in 1870 that careless New York census marshal spelled the immigrant’s name as “Flemming,” and the ABE method only links Flemings to Flemings and Flemmings to Flemmings. Between the spelling and age variations, most of the 117 Irish-born Flemings and Flemmings the ABE algorithm links from 1860 to 1870 are errors (). The same result is found with the many other Irish surnames commonly spelled in more than one way, such as Burns/Byrnes, Conner/Connor, Eagan/Egan, Maher/Meagher, O’Neil/O’Neal/O’Neill, Quin/Quinn, and Riley/Reilly, just to name a few. In all the cases in which the ABE algorithm makes an erroneous link of an ESB customer, both versions of the algorithm should have eliminated these people from consideration because in each case there were two people with the same name of a very similar age. However, because of inconsistency in the spelling of their names and the recording of their ages, the ABE method made erroneous matches instead.
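The Fleming case shows how a spelling variant and a misrecorded age together defeat a uniqueness check. The sketch below is a hypothetical illustration rather than the ABE implementation: the variant table is an assumption built from surname pairs named in the text, Fleming's 1860 age of 29 is inferred from his true 1870 age of 39, and the Pennsylvania man's age is invented.

```python
# Hypothetical variant table built from surname pairs named in the text.
SURNAME_VARIANTS = {
    "flemming": "fleming",
    "conner": "connor",
    "quin": "quinn",
    "eagan": "egan",
}


def surname_key(surname, merge_variants):
    """Lower-case a surname and, if requested, fold known variants together."""
    key = surname.lower()
    return SURNAME_VARIANTS.get(key, key) if merge_variants else key


def candidates(person, later_census, merge_variants):
    """Return later entries with the same name and an age gap of 8 to 12 years."""
    first, last, _state, age = person
    return [
        entry for entry in later_census
        if entry[0] == first
        and surname_key(entry[1], merge_variants) == surname_key(last, merge_variants)
        and 8 <= entry[3] - age <= 12
    ]


fleming_1860 = ("Lawrence", "Fleming", "New York", 29)

# 1870 entries as recorded; the Pennsylvania man's age is invented.
recorded_1870 = [
    ("Lawrence", "Flemming", "New York", 30),      # misspelled, age misrecorded
    ("Lawrence", "Fleming", "Pennsylvania", 39),
]
# The same entries had the New York enumerator written the true age of 39.
corrected_age_1870 = [
    ("Lawrence", "Flemming", "New York", 39),
    ("Lawrence", "Fleming", "Pennsylvania", 39),
]

print(len(candidates(fleming_1860, recorded_1870, merge_variants=False)))
print(len(candidates(fleming_1860, corrected_age_1870, merge_variants=False)))
print(len(candidates(fleming_1860, corrected_age_1870, merge_variants=True)))
```

Under exact spelling the Pennsylvania man is the only candidate whether or not the New York age is corrected, so a unique but false link results. Only when the age is right and the two spellings are treated as one name do two candidates appear, which is the situation in which a uniqueness rule would decline to link.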

A case study: clergymen, doctors, and lawyers

Abramitzky et al. claim that their automated matching algorithms typically “generate very low (<5%) false positive rates” (2021, 865), yet our analysis thus far of the links created by the ABE method for Irish immigrant males living in New York and Brooklyn in 1860 suggests a far higher error rate. To test our suspicion that the ABE method creates more false positives than has been acknowledged, we conducted another case study using the Irish-born Catholic priests found in the 1860 census whom the ABE algorithm matched in the 1870 census. Catholic clergymen are a particularly interesting group to consider since their calling was almost invariably a life-long one. We do not discount the occasional possibility of a priest being defrocked but in the mid-nineteenth century the mantra of “once a priest, always a priest” rang true. Furthermore, priests create a much larger paper trail—in parish histories and news accounts of their lives (and deaths)—than the average citizen, making it fairly easy to determine with certainty whether or not the ABE links for these priests are accurate.

The ABE Standard algorithm matches 70 Irish-born Catholic priests (once some transcription errorsFootnote15 are corrected) found in the 1860 US census to individuals in the 1870 tally. Of those 70, according to the ABE Standard algorithm, only 26 (37%) were still priests in 1870. Even if one chooses the ABE Conservative algorithm, the percentage of priests who supposedly remained in that line of work ten years later increases only to 48% (14 out of 29). In collaboration with our genealogist, we closely investigated the 44 who had supposedly left the priesthood and found that 43 of the 44 matches were demonstrably erroneous.Footnote16 Of these 43, seven could be shown to have died before 1870. Twelve of the supposed former priests had children in 1870 who had been born at times and in places indicating their father could not possibly be the same person as the priest of the same name from 1860. According to the algorithm, for example, Wisconsin priest George Brennan in 1870 had become a leather currier and moved to Massachusetts. However, the 1870 census lists the Massachusetts man as having four children born in the Bay State from 1855 to 1861, meaning the leather currier of 1870 was not working as a priest 1000 miles away in 1860. Furthermore, we can show that twenty of the alleged ex-priests were still clergymen in 1870, in every case in the very state where they had lived in 1860 (). The remaining links can be proven false because the person linked in 1870 can be found in the 1860 census in an entry different than that for the priest of the same name. Note that nearly all the false matches were to immigrants living in different states than those where the priests were documented to have originally lived.

Still, a sample of 70 is not conclusive, so we expanded our case study to also include Irish-born clergymen who were not Roman Catholics as well as doctors and lawyers. Again, these groups were chosen because they were more likely than the average immigrant to leave a paper trail that would allow us to definitively evaluate the ABE links. The result is a more robust sample of 355 doctors, lawyers, and clergymen. The ABE accuracy rate for this larger group is slightly better than for the priests alone. Still, half of the links for these immigrants formed by the ABE Standard algorithm (177 of 355) are clearly false positives, while even the conservative ABE method produces a false positive rate of one-third (58 out of 178). The cases of the Irish clergymen, doctors, and lawyers, to whom we return below, add direct evidence to the already strong circumstantial case that the ABE method produces many more erroneous links than is generally understood.
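The percentages quoted for the priests and for the larger professional sample follow from the counts given in the text, as this short Python check shows. The helper name `share` and the dictionary labels are illustrative only.

```python
def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Counts taken from the text above.
counts = {
    "priests still priests, ABE Standard": (26, 70),
    "priests still priests, ABE Conservative": (14, 29),
    "false positives, ABE Standard": (177, 355),
    "false positives, ABE Conservative": (58, 178),
}

for label, (part, whole) in counts.items():
    print(f"{label}: {part}/{whole} = {share(part, whole):.1f}%")
```

Rounded to whole numbers, the four shares are 37%, 48%, 50%, and 33%, matching the figures reported above.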

Data quality, hand linking, and match rates

Abramitzky, Boustan, Eriksson, et al. (Citation2021) have subjected their automated linking algorithm to a variety of tests against hand links. In one exercise comparing the results obtained from automated linking of the entire 1910 and 1920 US censuses to links from Familysearch.org’s “Family Tree” data, they report that their method produces very similar results in terms of false positives: about a 5% error rate for their Standard method and 3% for their Conservative method (Abramitzky, Boustan, Eriksson, et al. Citation2021, 865, 868). Why is it, then, that the false-positive rates we find differ so markedly? The answer probably lies in part in the relative quality of the underlying data, which is likely to have differed not only over time but across countries. We have highlighted how age misreporting and the inconsistent spelling of surnames led to both false positives and missed links. We suspect that these issues matter more for data from the second half of the nineteenth century than those from the first half of the twentieth. If that is so, then the older census records are likely to generate more Type I and Type II errors than the newer ones. Algorithms such as ABE rely on identifiers that should not change over time, such as place of birth, birth year, surname, and gender. However, as the quality of the underlying data improved over time, the variation in the gap in ages between censuses and in spelling errors decreased, leading to better matching. This probably helps to account for the very low match rates found by Ferrie and by ourselves relative to those claimed by ABE and others in different historical contexts.Footnote17

As noted earlier, the original ABE algorithm matched 8.5% of Irish-born aged 18–59 living in New York and Kings counties in 1860. That is in the same ballpark as the 10.6% achieved by Ferrie (Citation1999, 22) for immigrants matched in the 1850 and 1860 US censuses, but much lower than the 20% matched by Ager, Boustan, and Eriksson (Citation2021), the 19% by Collins and Wanamaker (Citation2014), the 21% by Connor and Storper (Citation2020) using later US census data, and the 17% matched by Long and Ferrie (Citation2018, F426–F427), using British data. For 1880–1910 cohorts of U.S. immigrants from 17 countries, Abramitzky, Boustan, Jacome, et al. (2021) find match rates ranging from 15.9 to 27.7%, and for 1910–1940 from 20.9 to 34.3%. Such differences, and any biases they might cause, are also of interest, particularly if the prevalence of false positives is a function of the match rate. Of specific interest is how match rates improved over time and how they may have differed across names of different ethnic and linguistic origins.

Our genealogist, in contrast, has produced a much higher match rate—she has found 54% of Irish ESB customers located in the 1860 census in the 1870 count.Footnote18 The ABE Standard algorithm purports to find 11% of those same immigrants from the 1860 census in 1870. The genealogist’s higher linkage rate results primarily from the fact that she can use information other than censuses—such as directory listings, naturalization, birth, and death records—as well as census information on the depositors’ family members to confirm or eliminate potential matches. The poorer quality of the earlier data enhances the value of hand linking.

Perhaps another reason why ABE underestimate their rate of false positives is that there are deficiencies in the design of the tests they use to compare their results to those attainable by humans. In most of their tests, ABE limit their human testers to the same data points that the algorithm has at its disposal—name, age, and birthplace. Those tests merely show that humans are no better than machines at doing precisely what ABE ask a machine to do. Yet professional genealogists—surely the appropriate yardstick—would not go about the problem of linking people across different censuses in that manner. They use all the data at their disposal and consequently make much better links and produce many fewer false positives than any algorithm can. An under-appreciated factor is that genealogists know much better than an algorithm when not to make a link at all. They can determine that someone found in one census has died before the next one was conducted or that what seems like a very unusual name is really an enumerator’s misspelling of a very common one.

Do ABE false positives affect resulting outcomes?

As Abramitzky, Boustan, Eriksson, et al. (Citation2021) rightly point out, false positives are inevitable in any automated linking project and do not really matter unless they bias the outcome of the analysis for which the data are used. ABE argue that the damage done is small, but our case study of Irish immigrants linked by the ABE algorithm provides evidence that the foreign-born may produce many more false positives than previously understood. This matters a great deal given that studying the social mobility of immigrants is one of the most popular uses of linked-census databases (see Ferrie Citation1999; Abramitzky, Boustan, and Eriksson Citation2012, Citation2014; Pérez Citation2017, Citation2019; Collins and Zimran Citation2019, Citation2023; Connor Citation2019; Kosack and Ward Citation2020; Aaronson, Davis, and Schulze Citation2020; Abramitzky, Boustan, Jacome, et al. 2021).

To test the extent to which outcomes can be affected by ABE false positives, we return to the occupational and geographic mobility of Irish immigrants in the US from 1860 to 1870. First, in we compare outcomes using the Standard and Conservative algorithms for Irish males aged 18–59 years living in New York or Kings counties in 1860 with 744 hand-linked Emigrant Bank account holders, also aged 18–59 and living in the same counties in 1860. The contrasts in outcomes are staggering. Not only does the algorithm return proportions moving out of New York state that are 8–9 times as high as in the ESB sample; it also produces even more improbable contrasts for changing wives over the decade. And, whereas the bank data suggest about a quarter changed occupational category (as defined above), the algorithm suggests that 50–55% did so. Now, undoubtedly, some of these differences are explained by selection in the Emigrant Bank data emanating from both bank depositors being positively selected relative to the New York Irish in general (see ) and from selection in those we have successfully linked (see Appendix A). Because linked account holders were on average a few years older than those in the ABE databases,Footnote19 they were less likely to move states; and because linked bank account holders were somewhat better off than those not linked, they were likely to be more upwardly mobile. But, as explained earlier, the biases emanating from these differences are second order compared to the huge differences between hand-linking and the algorithm, which are primarily due to false positives generated by both versions of the algorithm.

We can do better, however. We can compare the social mobility of the 355 Irish-born clergymen, lawyers, and physicians in 1860 who were tracked to 1870 by the ABE algorithm with genealogist hand linking of the same 355 individuals. This is a genuine “like-to-like” comparison. shows that the ABE method generates results that drastically misrepresent the propensity of these immigrants to relocate or change vocations. The ABE Standard links overstate the geographic and occupational mobility of this group almost 5-fold, while even the ABE Conservative variant overstates their propensity to move or change occupational categories by 350%. Furthermore, there is no question about whose links are correct. In each of the 298 cases out of 355 (84%) in which the genealogist has made a link, there is proof positive that her link is correct; this represents the gold standard that Bailey et al. (Citation2020, 998) define as “data obtained by direct observation of the true link.”Footnote20

The ABE algorithm’s results are similarly distorted when tracking the Irish immigrants who opened accounts at the Emigrant Savings Bank. When comparing the ABE links of the bank’s Irish-born customers to genealogist hand links of only those depositors whom ABE links, we find that the automated technique overstates the percentage who changed their states of residence from 1860 to 1870 by a factor of over 6 using the Conservative version of the algorithm and by a factor of roughly 8 with the Standard version. The ABE methods overstate the bank customers’ propensity to move up or down among our six occupational categories by about 100% (). These results suggest that false positives generated by the first-generation ABE method of linking significantly distort social mobility analysis.

Conclusions

Our examination of the reliability of the original ABE method for linking Americans between censuses, using mid-nineteenth-century Irish immigrants as a test case, yields several important findings. Immigrants from Ireland accounted for one-in-four of all European immigrants to the US during the 1850s; in New York City they made up 28% of the population in 1855. Not only do their sheer numbers make the New York Irish important from an historical perspective, but certain of their characteristics, notably their propensity to age-heap in the US census and their relatively narrow range of names, provide a challenging trial for evaluating linkage methods. Our analysis of first-generation ABE-generated databases of the Irish-born in New York in 1860 and 1870 shows that even very slight variations in name transcriptions, which are quite common, can play havoc with the efforts of automated census linking algorithms to make accurate matches. Age heaping and age misreporting generally pose another major challenge to automated linking efforts. The propensity to age heap was widespread, at least in the nineteenth century, and significantly impacts the quality of matching results. We find, consequently, that there are more false positives produced using the ABE method than has been recognized. Our case study of Irish doctors, lawyers, and clergymen found that half of the links created by the ABE Standard method were false positives and that even the ABE Conservative variant produced one false positive in three. We also found that the false positives generated by automated linking significantly affect the socio-economic outcomes implied by those links. The large number of false positives generated by the ABE method for Irish immigrants linked from 1860 to 1870 produced far too much occupational mobility, both upward and downward. Geographic mobility was even more drastically distorted by false positives.

To what extent such distortions apply to other populations subjected to first-generation linkage exercises (i.e., the standard and conservative ABE algorithms) remains to be seen. We recognize that linking mid-nineteenth-century immigrants may produce more false positives than matching later immigrants or the native born in the ABE system and that the results of using the algorithm on Irish immigrants may yield more biases than using it on other groups. But we are currently at a juncture where we lack a full understanding of the factors that generate false positives, an issue that will continue to bias our findings from economic and demographic research.

Second-generation census-matching algorithms are now becoming available; how they fare relative to their predecessors deserves separate scrutiny (Helgertz et al. Citation2022; Abramitzky, Boustan, Eriksson, et al. Citation2021; Ward Citation2020; Anbinder, Connor, and Ó Gráda Citation2023). Meanwhile, one simple tweak of the original ABE method, which requires men linked from one census to the next to have a spouse with the same name in both censuses, offers one indication of the possibilities. As shows, this variation cuts down the likely false-positive rate dramatically; the proportions predicted to have changed states fall from 64 and 55 to 30 and 21%, respectively, and the proportions changing occupational category also fall significantly. But even this variation still implies more movement than our genealogist’s hand links.Footnote21 The striking improvement generated by this tweak to the algorithm prompts the following ecumenical summary. The gap between the strikingly high false-positive rate that we found and the 5% reported by Abramitzky et al. stems from three factors: (1) our matched database refers to an earlier, more error-prone crosswalk than theirs; (2) we are testing a more error-prone group, the Irish, who are more prone to age heaping and to name variations; and (3) imperfections in the algorithm itself.

Acknowledgments

Earlier versions of this paper were presented in seminars at Northwestern University, Carleton College, South Denmark University, and Queen’s University Belfast. We are grateful to participants, Peter Solar, Marianne Wanamaker, the editor Lisa Dillon, and three anonymous referees for very useful comments. The usual disclaimer applies. We would also like to thank Leah Boustan, Ran Abramitzky, Myera Rashid, and Jonas Helgertz for generously providing access to their data. Finally, this undertaking would not have been possible without the expertise of project genealogist Janet Wilkinson Schwartz.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Funding for this project was provided by the National Endowment for the Humanities, George Washington University, and the City University of New York.

Notes

1 The bank, which still operates in New York, dropped “Industrial” from its name more than a century ago, and we will refer to it by its current and more recognizable name. Olmstead (Citation1976) describes the ESB and other New York savings banks in the 1850s.

2 In related work, we evaluate these more recent algorithms using the ESB database (Anbinder, Connor, and Ó Gráda Citation2023).

3 Steckel (Citation1989), however, may claim to have been the first to harness systematic linking for Ferrie-style research. Note too that the computing science literature had developed theory and method for linking long before social and economic historians (going back at least to Fellegi and Sunter Citation1969).

4 See e.g. Ó Gráda (Citation2000), Wegge, Anbinder, and Ó Gráda (Citation2017), Anbinder, Ó Gráda, and Wegge (2019), and Ó Gráda, Anbinder, and Wegge (Citation2020).

5 Tracing customers involved using not only federal censuses, but also state population tallies, newspapers, city directories, military enlistment and pension records, death registries, and probate records. Another seventeen hundred were found in a census from 1860 or earlier but not 1870, while several hundred more were located in the 1870 census or later but not earlier.

6 Note that according to the 1860 U.S. census, there were 64 Bridget Lynchs living in New York City, 87 Bridget Ryans, 134 Bridget Murphy/Murpheys, and 146 Bridget Kelleys/Kellys. And there were three to four times as many Marys with each of these surnames.

7 The ESB mobility figures presented here differ substantially from those we have published previously (Anbinder, Ó Gráda, and Wegge Citation2022) because in our study devoted to immigrant mobility, we looked at the very first and last occupations held by the immigrants over their lifetimes, not just those from 1860 and 1870. Looking over that longer timespan, we found that the Irish experienced more upward and less downward mobility than in the shorter period examined here.

8 Similarly, the correlation between the declared real property of New York or Kings County residents who remained in those counties over the decade is significant (0.335, N = 3392), as is that of residents of those counties who stayed within New York state (0.281, N = 5431), while the correlation for those who changed state is close to zero (0.029, N = 11,637).

9 With a bow to Blum et al. (Citation2018).

10 See, e.g. A’Hearn, Baten, and Crayen (Citation2009), Crayen and Baten (Citation2010), De Moor and van Zanden (Citation2010), and Blum et al. (Citation2018).

11 J. T. Marten, Census of India, 1921, vol. 1, part 1 (Calcutta, 1924), 126–127 as cited in United Nations (Citation2017).

12 The Irish in Canada, particularly the Catholics among them, were also prone to age heap. In 1871, 21% of the latter gave an age ending in zero on the census return (Dillon Citation2008, 107).

13 Berry-Cahn (Citation2022) provides an interesting comparison of copying errors made by five assistant marshals in 1860. Quality depended greatly on the individual enumerator. Berry-Cahn focuses on spelling errors and spaces on forms left blank, while Steckel (Citation1991) describes the scale of under-enumeration in 1850–1870.

14 A highlight is the stark contrast between the age heaping of the Italian-born in the US in 1910 as recorded in the census (150 for males, 149 for females) and of Ellis Island arrivals from Italy in 1898–1912 (99 and 105). In the latter case the data were assembled on board and “sloppiness was extraordinarily rare” (A’Hearn, Delfino, and Nuvolari Citation2021, ). For an earlier example of disregard for administrative sloppiness see Mokyr and Ó Gráda (Citation1982).

15 For example, one linked immigrant whose occupation in 1860 was transcribed as “RC paster” and coded as a menial worker was clearly a Roman Catholic pastor and has been included with those whose occupations as Catholic clergymen were accurately transcribed. In other cases, an immigrant is listed only as a “clergyman” and it takes some research to determine if they were Catholic clergymen.

16 In the forty-fourth case, even though we could not prove that Father John Cassey of California in 1860 was not John Cassey the Philadelphia domestic servant in 1870, the link is certainly erroneous.

17 A referee suggests that, quite apart from data quality issues, the family trees were created in part by researchers looking for census links that may or may not be correct and that, besides, the family tree data are “bedevilled by survivor bias.” Because the Familysearch.org data are not freely available, their reliability has not been subjected to critical assessment.

18 Kosack and Ward (2020, 969–970), in their study of inter-generational Mexican-American mobility, used both automated and hand-linking methods. Because they wanted a sufficiently large group of third-generation Mexican-Americans, they hand-linked the entire sample, stating that “The main benefit of hand-linking is that the match rates increase.”

19 For those aged 18 and over in 1860 the averages are: ESB 37.8 years, ABE-standard 34.3 years, ABE-conservative 35.7 years.

20 Furthermore, in all but a handful of the remaining 57 cases, our genealogist could prove that the ABE link is incorrect even though she could not come up with a verifiable alternative link.

21 We recognize that this spouse-match variation has its limitations—it will not count single men, who constituted 20% of Irish-born American men age 30 or older in 1860 and a much higher proportion of those under 30, and its first name recognition rules need sharpening. But we feel that through sample weighting or other means, these deficiencies will be remedied to some extent, and that until some better solutions come along, the benefits of using a system that generates only a fraction of the false positives currently being created far outweigh the costs.

References

  • Aaronson, D., J. Davis, and K. Schulze. 2020. Internal immigrant mobility in the early 20th century: Evidence from Galveston, Texas. Explorations in Economic History 76:101317. doi:10.1016/j.eeh.2019.101317.
  • Abramitzky, R. 2019. Historical record linking. https://ranabr.people.stanford.edu/matching-codes.
  • Abramitzky, R., L. Boustan, and K. Eriksson. 2012. Europe’s tired, poor, huddled masses: Self-selection and economic outcomes in the age of mass migration. The American Economic Review 102 (5):1832–56. doi:10.1257/aer.102.5.1832.
  • Abramitzky, R., L. Boustan, and K. Eriksson. 2014. A nation of immigrants: Assimilation and economic outcomes in the age of mass migration. The Journal of Political Economy 122 (3):467–506. doi:10.1086/675805.
  • Abramitzky, R., L. Boustan, K. Eriksson, J. J. Feigenbaum, and S. Pérez. 2021. Automated linking of historical data. Journal of Economic Literature 59 (3):865–918. doi:10.1257/jel.20201599.
  • Abramitzky, R., L. Boustan, E. Jacome, and S. Pérez. 2021. Intergenerational mobility of immigrants in the United States over two centuries. American Economic Review 111 (2):580–608.
  • Abramitzky, R., R. Mill, and S. Pérez. 2020. Linking individuals across historical sources: A fully automated approach. Historical Methods: A Journal of Quantitative and Interdisciplinary History 53 (2):94–111. doi:10.1080/01615440.2018.1543034.
  • Abramitzky, R., L. Boustan, and M. Rashid. 2020. Census linking project: Version 2.0 [dataset]. https://censuslinkingproject.org.
  • Ager, P., L. Boustan, and K. Eriksson. 2021. The intergenerational effects of a large wealth shock: White southerners after the civil war. American Economic Review 111 (11):3767–94. doi:10.1257/aer.20191422.
  • A’Hearn, B., J. Baten, and D. Crayen. 2009. Quantifying quantitative literacy: Age heaping and the history of human capital. The Journal of Economic History 69 (3):783–808. doi:10.1017/S0022050709001120.
  • A’Hearn, B., A. Delfino, and A. Nuvolari. 2021. Rethinking age heaping: A cautionary tale from nineteenth-century Italy. The Economic History Review 75 (1):111–37.
  • Anbinder, T. 2002. From famine to five points: Lord Lansdowne’s Irish tenants encounter North America’s most notorious slum. American Historical Review 107 (4):350–87.
  • Alexander, R., and Z. Ward. 2018. Age at arrival and assimilation during the age of mass migration. The Journal of Economic History 78 (3):904–37.
  • Anbinder, T. 2012. Moving beyond “rags to riches”: New York’s Irish famine immigrants and their surprising savings accounts. Journal of American History 99 (3):741–70. doi:10.1093/jahist/jas435.
  • Anbinder, T., D. Connor, and C. Ó Gráda. 2023. Advances in automated census linkage algorithms. Typescript.
  • Anbinder, T., C. Ó Gráda, and S. Wegge. 2019. Networks and opportunities: A digital history of Ireland’s great famine refugees in New York. The American Historical Review 124 (5):1591–629. doi:10.1093/ahr/rhz1023.
  • Anbinder, T., C. Ó Gráda, and S. Wegge. 2022. ‘The best country in the world’: The surprising social mobility of New York’s Irish famine immigrants. The Journal of Interdisciplinary History 53 (3):407–38. doi:10.1162/jinh_a_01869.
  • Antonie, L., K. Inwood, C. Minns, and F. Summerfield. 2020. Selection bias encountered in the systematic linking of historical census records. Social Science History 44 (3):555–70. doi:10.1017/ssh.2020.15.
  • Bailey, M. J., C. Cole, M. Henderson, and C. Massey. 2020. How well do automated linking methods perform? Lessons from US historical data. Journal of Economic Literature 58 (4):997–1044. doi:10.1257/jel.20191526.
  • Beck Knudsen, A. S. 2022. Those who stayed: Selection and cultural change during the age of mass migration. Working Paper, Department of Economics, University of Copenhagen, August 7.
  • Berry-Cahn, J. 2022. Paid by the line: Copying mistakes made by enumerators in 1850, 1860, and 1870 federal census returns. Clio 32:135–60.
  • Blum, M., C. L. Colvin, L. McAtackney, and E. McLaughlin. 2018. Women of an uncertain age: Quantifying human capital accumulation in rural Ireland in the nineteenth century. The Economic History Review 70 (1):187–223. doi:10.1111/ehr.12333.
  • BPP (British Parliamentary Papers). 1843. Report of the commissioners appointed to take the census of Ireland for the year 1841. Vol. XXIV, 1. 504.
  • Breschi, M., S. Mazzoni, M. Esposito, and L. Pozzi. 2014. Fertility transition and social stratification in the Town of Alghero, Sardinia (1866–1935). Demographic Research 30:823–52. doi:10.4054/DemRes.2014.30.28.
  • Casey, M. R. 2006. Refractive history: Memory and the founders of the Emigrant Savings Bank. In Making the Irish American: History and heritage of the Irish in the United States, ed. J. J. Lee and M. R. Casey, 302–31. New York, NY: NYU Press.
  • Casey, M. R. 2013. Emigrant as historian: Records, banking and Irish American scholarship. American Journal of Irish Studies 10:145–63.
  • Christen, P. 2012. Data matching: Concepts and techniques for record linkage, entity resolution, and duplicate detection. Berlin: Springer Science & Business Media.
  • Cirenza, P. 2016. Geography and assimilation: A case study of Irish immigrants in late nineteenth century America. In Migration and integration: New models for mobility and coexistence, ed. R. Hsu and C. Reinprecht, 173–200. Vienna: Vienna University Press.
  • Cirenza, P. 2011. Melting pot or salad bowl? Assessing Irish immigrant assimilation in late nineteenth century America. PhD diss., London School of Economics. http://etheses.lse.ac.uk/90/1/Melting_pot_or_salad_bowl_assessing_Irish_immigrant_assimilation_in_late_nineteenth_century_America_%28Author%29_with_tables.pdf.
  • Collins, W. J., and M. H. Wanamaker. 2014. Selection and economic gains in the great migration of African Americans: New evidence from linked census data. American Economic Journal: Applied Economics 6 (1):220–52. doi:10.1257/app.6.1.220.
  • Collins, W. J., and A. Zimran. 2019. The economic assimilation of Irish famine migrants to the United States. Explorations in Economic History 74 (1):101302. doi:10.1016/j.eeh.2019.101302.
  • Collins, W. J., and A. Zimran. 2023. Working their way up? US immigrants’ changing labor market assimilation in the age of mass migration. American Economic Journal: Applied Economics 15 (3):238–69. doi:10.1257/app.20210008.
  • Connor, D. S. 2019. The cream of the crop? Geography, networks, and Irish migrant selection in the age of mass migration. The Journal of Economic History 79 (1):139–75. doi:10.1017/S0022050718000682.
  • Connor, D. S., and M. Storper. 2020. The changing geography of social mobility in the United States. Proceedings of the National Academy of Sciences of the United States of America 117 (48):30309–17. doi:10.1073/pnas.2010222117.
  • Crayen, D., and J. Baten. 2010. Global trends in numeracy 1820–1949 and its implications for long-term growth. Explorations in Economic History 47 (1):82–99. doi:10.1016/j.eeh.2009.05.004.
  • De Moor, T., and J. L. van Zanden. 2010. ‘Every woman counts’: A gender-analysis of numeracy in the low countries during the early modern period. The Journal of Interdisciplinary History 41 (2):179–208. doi:10.1162/jinh_a_00049.
  • Doidge, J. C., and K. L. Harron. 2019. Reflections on modern methods: Linkage error bias. International Journal of Epidemiology 48 (6):2050–60. doi:10.1093/ije/dyz203.
  • Dribe, M., J. D. Hacker, and F. Scalone. 2014. The impact of socio-economic status on net fertility during the historical fertility decline: A comparative analysis of Canada, Iceland, Sweden, Norway, and the USA. Population Studies 68 (2):135–49. doi:10.1080/00324728.2014.889741.
  • Dillon, L. 2008. The shady side of fifty: Age and old age in Late Victorian Canada and the United States. Montreal; Kingston: McGill-Queen’s University Press.
  • Fellegi, I. P., and A. B. Sunter. 1969. A theory for record linkage. Journal of the American Statistical Association 64 (328):1183–210. doi:10.1080/01621459.1969.10501049.
  • Ferrie, J. P. 1996. A new sample of males linked from the public use microdata sample of the 1850 US Federal Census of Population to the 1860 US Federal Census manuscript schedules. Historical Methods 4:141–56.
  • Ferrie, J. P. 1999. Yankeys now: Immigrants in the Antebellum U.S., 1840–1860. New York, NY: Oxford University Press.
  • Galenson, D. W., and C. L. Pope. 1989. Economic and geographic mobility on the farming frontier: Evidence from Appanoose County, Iowa, 1850–1870. The Journal of Economic History 49 (3):635–55.
  • Griffen, C., and S. Griffen. 1978. Natives and newcomers: The ordering of opportunity in mid-nineteenth-century Poughkeepsie. Cambridge, MA: Harvard University Press.
  • Helgertz, J., J. Price, J. Wellington, K. J. Thompson, S. Ruggles, and C. A. Fitch. 2022. A new strategy for linking U.S. historical censuses: A case study for the IPUMS multigenerational longitudinal panel. Historical Methods 55 (1):12–29. doi:10.1080/01615440.2021.1985027.
  • Herscovici, S. 1998. Migration and economic mobility: Wealth accumulation and occupational change among antebellum migrants and persisters. The Journal of Economic History 58 (4):927–56. doi:10.1017/S0022050700021677.
  • Hough, F. B. 1857. Census of the State of New-York for 1855. Albany, NY: Charles van Benthuysen.
  • Kessner, T. 1977. The golden door: Italian and Jewish immigrant mobility in New York City, 1880–1915. New York, NY: Oxford University Press.
  • Kosack, E., and Z. Ward. 2020. El Sueño Americano? The generational progress of Mexican Americans prior to World War II. The Journal of Economic History 80 (4):961–95. doi:10.1017/S0022050720000480.
  • Long, J., and J. P. Ferrie. 2018. Grandfathers matter(ed): Occupational mobility across three generations in the US and Britain, 1850–1911. The Economic Journal 128 (612):F422–F445. doi:10.1111/ecoj.12590.
  • Massey, C. G. 2017. Playing with matches: An assessment of accuracy in linked historical data. Historical Methods: A Journal of Quantitative and Interdisciplinary History 50 (3):129–43. doi:10.1080/01615440.2017.1288598.
  • Mokyr, J., and C. Ó Gráda. 1982. Emigration and poverty in pre-famine Ireland. Explorations in Economic History 19 (4):360–84. doi:10.1016/0014-4983(82)90008-0.
  • Ó Gráda, C. 2000. The famine, the New York Irish, and their bank. In Contributions to the history of economic thought – Essays in honour of RDC Black, ed. Antoin E. Murphy and Renée Prendergast, 227–48. London: Routledge.
  • Ó Gráda, C. 2003. Savings banks as an institutional import: The case of nineteenth-century Ireland. Financial History Review 10 (1):31–55. doi:10.1017/S0968565003000027.
  • Ó Gráda, C., T. Anbinder, and S. Wegge. 2020. Assisted emigration as famine relief: Lessons from the Lansdowne Estate. In Kerry: History and society, ed. Maurice Bric, 367–90. Dublin: Geography Publications.
  • Olmstead, A. L. 1976. New York mutual savings banks, 1819–1861. Chapel Hill, NC: UNC Press.
  • Pérez, S. 2017. The (South) American dream: Mobility and economic outcomes of first and second-generation immigrants in nineteenth-century Argentina. The Journal of Economic History 77 (4):971–1006. doi:10.1017/S0022050717000808.
  • Pérez, S. 2019. Intergenerational occupational mobility across three continents. The Journal of Economic History 79 (2):383–416. doi:10.1017/S0022050719000032.
  • Price, J., K. Buckles, J. Van Leeuwen, and I. Riley. 2021. Combining family history and machine learning to link historical records: The census tree data set. Explorations in Economic History 80:101391. doi:10.1016/j.eeh.2021.101391.
  • Ruggles, S., C. A. Fitch, and E. Roberts. 2018. Historical census record linkage. Annual Review of Sociology 44 (1):19–37. doi:10.1146/annurev-soc-073117-041447.
  • Ruggles, S., C. A. Fitch, R. Goeken, J. Grover, J. D. Hacker, M. Nelson, J. Pacas, E. Roberts, and M. Sobek. 2020. IPUMS restricted complete count data: Version 2.0 [dataset]. Minneapolis, MN: University of Minnesota.
  • Ruggles, S., and D. L. Magnuson. 2020. Census technology, politics, and institutional change, 1790–2020. Journal of American History 107 (1):19–51. doi:10.1093/jahist/jaaa007.
  • Steckel, R. H. 1989. Household migration and rural settlement in the United States, 1850–1860. Explorations in Economic History 25:190–218.
  • Steckel, R. H. 1991. The quality of census data for historical inquiry: A research agenda. Social Science History 15 (4):579–99. doi:10.2307/1171470.
  • Sylvester, K. M., and J. D. Hacker. 2020. Introduction to special issue on historical record linking. Historical Methods 53 (2):77–9. doi:10.1080/01615440.2020.1707445.
  • United Nations. 2017. Demographic yearbook special census topics, volume 1 – Basic population characteristics. “Table 1c – Special Topic Volume 1”. New York, NY: United Nations.
  • Thernstrom, S. 1973. The other Bostonians: Poverty and progress in the American metropolis 1880–1970. Cambridge, MA: Harvard University Press.
  • Thernstrom, S. 1986. Poverty and Progress Revisited: A response to Riess, Frisch, and Pessen. Social Science History 10 (1):33–44. doi:10.2307/1171116.
  • Thernstrom, S. 1964. Poverty and progress: Social mobility in a nineteenth century city. Cambridge, MA: Harvard University Press.
  • van Leeuwen, M. H. D., I. Maas, and A. Miles. 2002. HISCO: Historical international standard classification of occupations. Louvain: Leuven University Press.
  • van Leeuwen, M. H. D., and I. Maas. 2011. HISCLASS: A historical international social class scheme. Leuven: Leuven University Press.
  • Vickers, C., and N. L. Ziebarth. 2016. Economic development and the demographics of criminals in Victorian England. The Journal of Law and Economics 59 (1):191–223. doi:10.1086/684303.
  • Ward, Z. 2020. Intergenerational mobility in American history: Accounting for race and measurement error. Working Paper, Baylor University.
  • Wegge, S., T. Anbinder, and C. Ó Gráda. 2017. Immigrants and savers: A rich new database on the Irish in 1850s New York. Historical Methods: A Journal of Quantitative and Interdisciplinary History 50 (3):144–55. doi:10.1080/01615440.2017.1319773.

Appendix A.

Hand linking and selection

Selection is the original sin of much economic history. Although the occupational profiles of EISB account holders rather closely replicated those of all Irish-born New Yorkers as reflected in the 1855 New York census, there remains the concern that those successfully linked might be atypical of the bank customers in general. And, sure enough, comparing matches and non-matches for 1860 and 1870 for all the bank’s Irish-born customers reveals some differences between them (Table A1). Among males with New York addresses when they opened an account for whom we have an occupational category at the outset, the “business owners” and “professionals” categories were significantly overrepresented among those matched. Knowing the direction of the biases guards against undue generalization.

Table A2, which compares some of the saving patterns of those linked and those not linked, offers some further evidence of selection. The opening and peak deposits of linked account holders were likely to be higher; their accounts were more likely to be joint accounts, were held for longer, and produced more transactions. These features are consistent with some positive selection, but the differences are not large.

A final linking anomaly was that the unlinked were much more likely to live in Wards 1–4 at the southern tip of Manhattan, though, interestingly, not in Ward 6, the city’s most impoverished district. A significant proportion of the bank’s customers in Ward 6 came from a single estate in County Kerry and tended to stay in Ward 6 for many years, which may explain why they were easier to link than residents of other wards who had weaker ties to their neighborhoods (on this enclave, see Anbinder 2002).

Table A1. Male ESB account holders with NY addresses in 1860.

Table A2. Some characteristics of linked and non-linked depositors: median values.

Appendix B.

Further comparison of ABE links to hand linking

The geographic and occupational mobility of Emigrant Savings Bank customers linked by the ABE algorithm from 1860 to 1870, overlapping observations only.

Figure 1. Archbishop John B. Purcell of Cincinnati, as portrayed by Thomas Nast on the cover of Harper’s Weekly, August 28, 1875.

Table 1. Irish-born males by occupational category: ESB’s New York City customers (1850–1858) and all New Yorkers in 1855.

Table 2. Percentages of men staying in same occupational category, 1860–1870: hand linked vs. algorithm linked.

Table 3. Geographical mobility from 1860 to 1870 of male Irish immigrants living in New York or Kings counties in 1860, by linkage method.

Table 4. Relationship between remaining married to the same woman and geographic persistence for Irish immigrants, 1860–1870, in linked ABE conservative database.

Table 5. Correlations between HISCO 1860 and 1870 values for males in New York and Kings counties, by age.

Table 6. Whipple values for Irish-born ESB customers, 1850–1880.

Table 7. Age gaps and age heaping among account-holders (% of total).

Table 8. ESB customers incorrectly linked by the ABE conservative method.

Table 9. Priests in 1860 census incorrectly linked to 1870 census.

Table 10. Mobility from 1860 to 1870 of male Irish immigrants living in New York or Kings counties in 1860, by linkage method.

Table 11. The geographic and occupational mobility of Irish-born doctors, lawyers, and clergymen linked by the ABE algorithm from 1860 to 1870.

Table 12. The geographic and occupational mobility of Emigrant Savings Bank customers linked by the ABE algorithm from 1860 to 1870.

Table 13. Mobility from 1860 to 1870 of male Irish immigrants living in New York or Kings counties in 1860, by linkage method.