
Digital Trace Data Collection for Social Media Effects Research: APIs, Data Donation, and (Screen) Tracking


ABSTRACT

In social media effects research, the role of specific social media content is understudied, in part because communication science has long lacked methods to access social media content directly. Digital trace data (DTD) can shed light on textual and audio-visual content of social media use and enable the analysis of content usage at a granular individual level that has previously been unavailable. However, because digital trace data are not specifically designed for research purposes, collection and analysis present several uncertainties. This article is a collaborative effort by scholars to provide an overview of how three methods of digital trace data collection - APIs, data donations, and tracking - can be used in studying the effects of social media content in three important topic areas of communication research: misinformation, algorithmic bias, and well-being. We address the question of how to collect raw social media content data and arrive at meaningful measures with multiple state-of-the-art data collection techniques that can be used to study the effects of social media use on different levels of detail. We conclude with a discussion of best practices for the implementation of each technique, and a comparison of their advantages and disadvantages.

Conducting media effects research in a communication environment increasingly dominated by social media is challenging because people are exposed to information across an ever-greater number of platforms, channels, devices, and contexts. Hence, failing to consider the content of exposure in digital media environments is increasingly recognized as complicating theory building (P. Valkenburg et al., Citation2021; Wagner et al., Citation2021). Key prerequisites to understanding media content effects in a digital age are, as we specify below, knowing what news articles people read, what pictures teenagers view, and how social media platforms algorithmically personalize content shown to users.

A decline in response rates to questionnaires over the last decades (Luiten et al., Citation2020) and the difficulty of self-reporting increasingly granular use of social media platforms (Stadel & Stulp, Citation2022) have led social science researchers to seek alternative methods for assessing behavioral patterns that limit the response burden, especially in terms of effort for participants, while capturing the unprecedented granularity of the social media landscape. Moreover, agreement between self-reported measures and tracking data for frequency of exposure is low (Araujo et al., Citation2017; Parry et al., Citation2021; Scharkow, Citation2016), pointing to an even larger challenge when measuring content exposure through self-reports. Therefore, many recent initiatives center around the idea of collecting the digital traces that users already leave on the social media platforms they use.

Most users leave traces in digital spaces (e.g., cookies in a browser, log data on social media platforms), and, for many, daily usage routines form a sequence of continuous interactions with digital platforms (Reeves et al., Citation2021; Stier et al., Citation2020). Hence, digital trace data can be defined “as records of activity (trace data) undertaken through an online information system (thus, digital).” (Howison et al., Citation2011, p. 769). In short, digital trace data from users are unobtrusively and continuously collected by digital platforms and are thus non-reactive (but potentially reflexive, e.g., being impacted by algorithmic processes, observer effects, or amplification via social indicators; see Lazer et al., Citation2021).

Specifically, DTD can shed light on textual and audio-visual content that is (a) produced by users (e.g., posts, stories, comments, reactions, photos/videos, private messages), (b) selected by users (e.g., searches; selective exposure), (c) selected for users (e.g., algorithmic recommendations/filtering), and (d) received by users (e.g., scrolling through posts of others, articles, videos, private messages). DTD enable the analysis of content usage on a granular individual level, which so far has only been partially possible with so-called linkage analysis combining survey and media content data (see De Vreese et al., Citation2017) or via media effects experiments (Allcott et al., Citation2020).

However, DTD-related methods come with inherent limitations and major theoretical, methodological, and ethical challenges, which we discuss in the following sections. Their collection and analysis differ from other methods in the social sciences toolbox: Data are often captured for purposes other than research; thus, the concepts to be measured are almost never pre-determined by the data structure (Lazer et al., Citation2021; Wagner et al., Citation2021). Such DTD methods require that researchers deal with data access points of commercial platforms (e.g., APIs or data donation packages), introduce new data types (e.g., screenshots), link the data to known concepts (Lazer et al., Citation2021), and develop new data analytical procedures. Determining content engagement and production in digital trace data is an especially difficult task because many social media platforms, including Facebook and Twitter, do not provide this information through their data access endpoints.

Yet, methods and tools exist to make content data accessible and analyzable to social science scholars. The goal of this article is therefore to provide an overview of how DTD collection methods can help answer pressing social scientific research questions. To do so, we engage in a thought experiment on how to answer the following research question with the help of DTD: Does the exposure to misinformation on social media affect human well-being, and is this effect conditional upon algorithmic platform biases? This question, which combines three pressing topics of social media research, is used to exemplify how three collection methods - API, data donation, and tracking - can be used to capture and analyze digital trace data regarding social media content. In addition, we provide a decision framework for communication science scholars that outlines the advantages and limitations of each method and how to choose among them.

User- and platform-centric research methods for the collection of digital trace data

Social media effects research is mainly concerned with the activities of two actors: the social media platform and the user, where a user can be considered any individual participant who generates digital trace data on a platform by making use of its functionalities. Media effects emerge from the interaction between platforms and users. Compared to mass media effects theory, social media effects theory has a stronger focus on the user, as their selectivity and transactionality are likely to shape effects more specifically for an individual than for an aggregate group of users (P. M. Valkenburg, Citation2022). Platforms are a second important actor, as their selection mechanisms and potential message biases still shape the social contexts in which these effects occur (ibid.).

This is important to consider, as the collection of digital traces for research purposes differs in the way DTD are accessed: either via the individual user (often the study participant) or via general platform interfaces. This differentiation has important implications for the types of study designs and also serves to answer different types of research questions. Accordingly, platform-centric research methods allow for the study of aggregated social media effects, while user-centric methods allow for studying person-specific effects (P. Valkenburg et al., Citation2021).

Platform-centric approach

With a platform-centric approach, digital traces left behind by users on platforms are collected by researchers obtaining the data directly from the platform of interest. Thus, the users themselves are not directly involved in the data collection process – rather, data is collected directly from the platform using either authorized (e.g., APIs) or unauthorized (e.g., scraping) methods. Such studies represent the dominant paradigm in trace-based social media research (see e.g., Araujo et al., Citation2017; Freelon et al., Citation2022; K. -C. Yang et al., Citation2021; among many others). A major advantage is that individual users do not have to be burdened. However, a collaborative spirit on the part of the platform of interest is required for API-based research, which has often been a challenge (Halavais, Citation2019). One reason is that platforms are required to prevent disclosure of data to third parties (European Union, Citation2016), but some jurisdictions such as the EU have been discussing legislation such as the Digital Services Act that includes provisions mandating platforms to provide data access to academic researchers (Ausloos & Veale, Citation2021). The implementation of such directives – and how they will be enforced – is not yet fully clear and for now remains highly dependent on platforms’ own initiatives. The recent announcement of the monetization of Twitter’s API access exemplifies the dependency of academic research on platform collaboration and the need for legislative approaches to reduce this dependency.Footnote1

With scraping, no platform collaboration is required, but such approaches are highly sensitive to unexpected changes to how a platform structures its data. Both scraping and API-based approaches have been criticized for, among other things, their lack of reproducibility, completeness, and representativeness as well as external validity issues linked to ambiguous metadata (see Freelon, Citation2014; Lazer et al., Citation2021; Tufekci, Citation2014).

User-centric approach

In contrast, with a user-centric approach, digital traces left behind by users on platforms are collected by researchers in partnership with users. This procedure allows for the collection of personal or sensitive data from particular users. Examples of user-centric approaches include data donation packages (DDPs), which do not require platform participation (other than the provision of individual data export functionality), and APIs that allow third parties to collect personalized social feed data from users with consent (e.g., Freelon et al., Citation2022). The user-centric approach offers substantial opportunities to obtain useful data, as it allows researchers to supplement the platform data with additional information (e.g., self-reports) that enhance the interpretability of the trace data. But unlike platform-centric methods, it relies upon individual research participants’ consent, labor, and skill, for which monetary incentives are often required. This means that, first, samples may become selective because participation requires a certain level of digital literacy (e.g., Boeschoten et al., Citation2020; Breuer et al., Citation2022). Second, digital traces can contain very private and sensitive data that participants might not be willing to share, that may include information about third parties (such as chat conversations or photos), and that is not needed to answer a given research question. Generally, participant attrition is an issue for user-centric approaches (e.g., Ohme et al., Citation2021) because the process of obtaining consent permits it, whereas there is no way for participants to drop out of platform-centric studies because they are never asked to participate in the first place.

Three methods of digital trace data collection: APIs, Data Donation, and Tracking

Several methods that were not available in a pre-platform age have been developed to collect digital trace data. In the following, we introduce and discuss the advantages and limitations of three methods that have strong potential to provide individual-level content data and might be accessible to a broad range of researchers. It is important to note that other methods exist, ranging from scraping – i.e., automatically extracting the content of web pages (for an overview, see Freelon, Citation2018) – to universities entering larger institutional (data sharing) partnerships with platforms or platforms offering premium levels of access to selected researchers (such as the Social Science One initiative). Given their broader focus (in the case of scraping) or limited availability (in the case of direct initiatives), they are not covered here.

APIs

or application programming interfaces, are essentially platform-specific pipelines for obtaining machine-readable digital data and are an example of platform-centric methods. Whereas most users simply engage social media content via a platform’s website or app, researchers are better served by connecting to an API endpoint to extract retrospective user data in bulk in formats suitable for rigorous analysis. Most social media APIs were originally created to grant third-party commercial services access to user data, but as researchers began to realize their empirical value, platforms such as Twitter and Facebook began creating APIs specifically for them. The type, volume, and timespan of the data provided by APIs are determined by the platform, often with different levels of access (e.g., in the case of Twitter, the Academic API provides all tweets since the creation of the platform, whereas standard accounts only cover the last seven days). The data objects generated by APIs are typically enriched by a wide range of metadata: for example, the tweet objects generated by Twitter’s APIs include, among other details, when the tweet was created, the name of the app that was used to post it, whether it was a retweet, the creator’s screen name, and whether that user is verified. All these data can be efficiently obtained in mass quantities using open-source API interfaces written in programming languages like Python and R. The most commonly used APIs offer data that are otherwise publicly available (e.g., through a web interface), but some offer private data access if the end-user approves it (e.g., Twitter’s Reverse Chronological Timeline endpoint as used in Freelon et al., Citation2022).
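To make the workflow concrete, the snippet below is a minimal, illustrative sketch of platform-centric collection via an API endpoint in Python. It assumes a valid bearer token and the Twitter v2 recent-search endpoint; field names, query operators, and access tiers change frequently, so the exact parameters should be checked against current platform documentation.

```python
# Illustrative sketch of API-based collection; endpoint, fields, and access
# tiers are assumptions that must be verified against current documentation.
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder credential issued by the platform
ENDPOINT = "https://api.twitter.com/2/tweets/search/recent"


def fetch_tweets(query, max_results=100):
    """Request recent public posts matching a query, with selected metadata."""
    params = {
        "query": query,
        "max_results": max_results,
        "tweet.fields": "created_at,public_metrics,entities,referenced_tweets",
        "expansions": "author_id",
        "user.fields": "username,verified",
    }
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    response = requests.get(ENDPOINT, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()


# Example: public posts linking to a (hypothetical) low-credibility domain
payload = fetch_tweets('url:"example-misinfo-site.com" -is:retweet')
for tweet in payload.get("data", []):
    print(tweet["created_at"], tweet["text"][:80])
```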

API methods offer three main advantages to researchers. First, the APIs tend to be relatively easy to use. Many social media services offer documentation on the API and the data structure and often provide tutorials for developers (and researchers) on how to obtain the data. Second, social media platforms often offer some degree of free API access, although more expansive access tiers sometimes cost money. Third, because the actual data collection is done by and via the platform, it tends to be convenient and unobtrusive for users, as they are not directly involved in the process. For these reasons, APIs have long been a popular method for researchers collecting social media data, even as access has been restricted in recent years (Freelon, Citation2018). However, APIs also have limitations. First, social media services not only limit the volume of data that can be collected, but often have opaque restrictions on who is granted access, and may revoke such access at any time, for any reason.

Second, for privacy reasons, APIs only allow researchers to access public content (e.g., public posts of users that have their profiles set to public) and only a subset of public measures that regular users may also see when using the service (e.g., the number of retweets). Third, connecting these data with individual-level behavior can be challenging, as APIs tend to provide information aggregated at the level of the content rather than the individual. Informed consent is also challenging with this method: data are collected in bulk via the platform under provisions that may be covered by the platforms’ Terms & Conditions, yet it is not feasible for researchers to contact specific individuals to obtain their informed consent for participation – both because of the sheer size of the data and because researchers may not be able to identify and/or contact these individuals in the first place.

Data donation

is a user-centric approach in which research participants donate their existing DTD to researchers. Data collected via data donation tends to be retrospective in nature, with the timespan and type of data dependent on the way in which the data is donated (as outlined below). As users are donating their own data, this method allows for the collection of nonpublic data (e.g., private messages or individual profiling). In practice, researchers make use of a data donation approach in various ways. For example, users can extract DTD from their browsing history using plugins (as done, for example, by Web historian (Menchen-Trevino, Citation2016)) or by manually recording specific reports that appear on their mobile phone (e.g., Baumgartner et al., Citation2022; Ohme et al., Citation2021). An approach that has increased in popularity in recent years leverages individuals’ right to access and port a copy of their personal data in a machine-readable format, as mandated by multiple regulatory authorities across the globe (e.g., the European Union, California, and Brazil, among others). Examples are Facebook data downloads or Google Takeout. Data controllers, such as social media platforms, typically comply with this regulation by providing their users with a zip file, often referred to as a “Data Download Package” (DDP). With the data donation approach, researchers ask study participants to request their DDPs from social media platforms and share these DDPs with the research study.
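As a simple illustration, the sketch below reads one file from a donated DDP zip in Python. The file names and internal structure shown here are hypothetical; real packages differ per platform and change over time, so paths need to be verified for each study.

```python
# Sketch of reading a donated Data Download Package (DDP); the inner file
# path "posts/your_posts_1.json" is a hypothetical example, not a fixed format.
import json
import zipfile


def load_ddp_json(ddp_path, inner_file):
    """Extract one JSON file from a donated zip without unpacking it to disk."""
    with zipfile.ZipFile(ddp_path) as ddp:
        with ddp.open(inner_file) as f:
            return json.load(f)


posts = load_ddp_json("participant_001_facebook.zip", "posts/your_posts_1.json")
print(f"Donated file contains {len(posts)} entries")
```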

Various research initiatives are tackling challenges that accompany the user-centric nature of data donation. By inviting participants to a research facility, Kmetty and Németh (Citation2022) could assist participants with low digital literacy. In addition, they immediately ran a de-identification script under the supervision of the participant. Araujo et al. (Citation2022) tackle the issue that participants might not be willing to share private and sensitive data by enabling participants to decide for each data point whether they want to share it. Alternatively, Boeschoten, Ausloos, et al. (Citation2022) tackle this issue by allowing a data minimization step to take place locally on the participant’s device through their web browser, so that only aggregated data are shared with researchers after consent is provided. Using an approach that combines the API and data donation logic, Freelon et al.’s PIEGraph software (2022) tracks a panel of users who have granted consent to access their personalized Twitter timelines.

Data donation methods offer three major advantages to researchers. First, the user-centric nature of the approach allows researchers to include self-reports (e.g., perceptions, attitudes) in their designs as well as work with the participants to enrich and contextualize the DTD, by asking questions about the data. Second, researchers can ensure that participants are able to provide meaningful informed consent for, and exert agency over, how their data are used. Third, for most data donation methods (e.g., using DDPs), researchers can collect information that is of a more private or sensitive nature (e.g., one’s private messages, or how a platform profiles the participant) than what can usually be collected via other methods. Data donation, however, also has limitations. First, the method usually requires participants to perform some actions to collect the data (e.g., requesting a DDP from a platform), which may create challenges for participants with less technical comfort. Second, as the data collection relies on direct engagement with and recruitment of participants, researchers may not be able to obtain diverse and large samples (as might be available via APIs). Third, the data collected in this manner is often less structured or documented (when compared to APIs). A recent study by van Driel et al. (Citation2022) discusses both advantages and challenges, particularly in the case of Instagram data donations. In sum, the data donation approach often requires substantial effort from the research team (and the participants).

Tracking

is a user-centric approach in which behavioral information is automatically captured, most often from individuals and in close temporal succession to the actual behavior, with the help of client-side tracking software. As with data donation, data collected via tracking allows for the collection of nonpublic data, yet its timespan is prospective, as it tracks digital data as they are produced. Tracking data are available for several different units of media experience, from URL logs of web browsing to data collected within commercial products (e.g., Twitter or Facebook posts) to logs of applications used on devices. All of these data sources share the same sensibility about media experiences that now range across the full breadth of real life; namely, that more granular assessments of specific media exposure will benefit the discovery of how media are processed and what effects those media have. For example, browser plugins (e.g., Wojcieszak et al., Citation2022) can be used to track the specific content that media users engage with and the timing and sequence of those engagements. A key benefit of these data is the possibility to discover details that were not available in the aggregate measures (e.g., hours of social media use per day or week) used in past research (Reeves et al., Citation2021). While each type of tracking data focuses on different units of experience and informs different research questions, practical applications, and interventions, tracking social media content can be done especially efficiently with screen data, one of the most recent developments in tracking research. Hence, we focus on the Screenomics method, which captures screenshots as well as metadata, to illustrate the advantages and disadvantages of the tracking method.

The existing Screenomics software concentrates on tracking screens. Every five seconds while a digital device is activated, the software application on the device records, encrypts, compresses, and transmits a screenshot of everything that appeared on the screen at that moment to a research data server. Then, the Screenomics analysis assay discriminates between foreground and surrounding background, segments screen text and image blocks, and uses text, face, logo, and object recognition engines to extract text documents and image identifiers (e.g., the number of faces in a screenshot). These features can be integrated with metadata to facilitate storage, retrieval, and visualization of individual screenomes (Chiatti et al., Citation2018) and can be indexed and searched with respect to specific temporal, textual, graphical, and topical features (e.g., Reeves et al., Citation2021; X. Yang et al., Citation2019).
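A much-simplified sketch of this kind of screenshot processing is shown below; it is not the Screenomics assay itself, but illustrates how captured screenshots can be turned into analyzable features (OCR text and a coarse face count), assuming Tesseract and OpenCV are installed locally.

```python
# Simplified stand-in for a screenshot-processing pipeline: OCR text plus a
# coarse face count per screenshot. Not the actual Screenomics assay.
import cv2
import pytesseract

FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def screenshot_features(image_path):
    """Return extracted text and the number of detected faces for one screenshot."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return {"text": text, "n_faces": len(faces)}


features = screenshot_features("screenshot_2023-01-01_12-00-05.png")
print(features["n_faces"], features["text"][:100])
```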

Tracking screen content has several advantages. First, screens display a broad swath of digital life, agnostic to platform, software application, or technology (i.e., screenshots are available from smartphones, computers, cable systems, and potentially cars, appliances, and other connected devices). Whatever people are viewing and doing on the screen becomes part of each user’s unique individual record of experiences that constitute digital life – the screenome. Second, screen recordings are multimodal, in that the recordings include all the words and images that users were exposed to, in the precise location and sequences in which they were viewed. Third, the capture of screens is passive; once the recording software is installed on a device, users do not need to do any additional data collection-related tasks. Fourth, the screens constitute a detailed time series record of threads of media experience, regardless of the quick switching that we know occurs between applications and platforms (e.g., Yeykelis et al., Citation2014). As such, these data allow researchers to examine highly detailed sequences and patterns of exposure, and how those exposures influence and shape subsequent exposures and influences.

The time-series tracking data are particularly useful because they facilitate the examination of intraindividual change and within-person effects rather than interindividual and group differences, e.g., the change in mood in an individual over time as a reaction to specific media content this person encountered. Collecting screens comprehensively, as is available with Screenomics, also allows examination of the broader media context. This enables researchers to invoke a person-specific approach to research that is directly matched to theoretical assumptions and hypotheses about how social media effects operate – at the individual level and with considerable idiosyncrasy – and thus allows for better inference about those effects. Examples of recent uses of data sequences include the examination of emotion management via switching between media segments that balance arousal (Yeykelis et al., Citation2014), technology interactions between parents and their children (Sun et al., Citation2022), and the prediction of frequency and timing of switching between content segments over short time periods (X. Yang et al., Citation2019).

Tracking data also have limitations, of course. First, obtaining the data often requires substantial expertise and effort. Researchers must develop or obtain the tracking software, recruit and consent the participants, and manage what are often very large data sets in accordance with strict privacy-preserving protocols. Second, because tracking data are usually purposefully designed to be comprehensive, using them to answer specific research questions often requires the development of new methods for extracting meaningful variables from the data streams. For example, researchers using screenshot data to study the emotional propensity of screen content must develop replicable techniques to identify and/or rate the emotional valence of each screenshot. Third, the different tracking methods will have unique limitations related to their comprehensiveness (e.g., no measure of sound, or frequency of sampling). Fourth, while the in-situ nature of the data collection provides ecological validity and the possibility to describe individuals’ real-life media exposure and engagement, the prospective nature of data capturing limits the possibility of causal inferences in observational studies. However, tracking data may be particularly useful in quasi-experimental and experimental trials to identify desired participants, time experimental manipulations, monitor intervention participation and dose, and/or as outcomes.

Applying digital trace data methods for the study of specific research problems

We exemplify the potentials and drawbacks of different data collection methods by using a hypothetical research question:

Does the exposure to misinformation on social media affect human well-being, and is this effect conditional upon algorithmic platform biases?

The research question relates back to a recent report by the World Health Organization, which finds that “Infodemics and misinformation negatively affect people’s health behaviors” (World Health Organization, Citation2022; see Borges Do Nascimento et al., Citation2022).

Based on 31 systematic reviews on the topic, the report concludes that misinformation can “increase social fear, panic, stress, and mental disorders” (Borges Do Nascimento et al., Citation2022, p. 549). It is beyond the scope of this article to review the methods used in the 1,034 primary studies included in the reviews, nor will we operationalize the concepts used here. Rather, we use this report as the starting point for a thought experiment on how digital trace data collection methods can be used to answer the above-stated research question. First, we give a quick overview of the three topic areas addressed by the research question before we describe potential study designs that rely on the three described methods of digital trace data collection.

Misinformation

is usually defined as false claims distributed credulously, as distinct from disinformation (false claims shared with knowledge of their falsehood; Freelon & Wells, Citation2020). While political lies are nothing new, misinformation research prior to 2016 was mostly concentrated in health-related fields (Freelon & Wells, Citation2020). However, given factors such as the ubiquity of social media and widespread mistrust in traditional journalism, misinformation has grown more salient in the minds of both the public and the research community, including Communication. One central problem in the study of misinformation is that it is difficult to determine who creates and shares misinformation in social networks and who is actually exposed to misinforming content. Most studies rely on rough estimations or self-reports by users, which cannot reveal the source and spread of misinformation in digital public discourse; these can only be determined with a focus on message content. Hence, we lack information about (a) the amount of misinformation distributed on these platforms (with some studies showing very low numbers, e.g., Allen et al., Citation2020; Grinberg et al., Citation2019) and (b) whether people are indeed exposed to such information. We need to know the content of this information to be able to connect it directly to potential outcomes, such as misbeliefs about specific topics.

Well-being

is people’s subjective experience of positive affect, negative affect, and satisfaction with their lives (Diener, Citation1984; see also P. M. Valkenburg et al., Citation2022). This balance can be disturbed not only by the sheer frequency of digital media usage (e.g., Vanden Abeele, Citation2020) but also by exposure to specific content on social media. A granular assessment of moment-by-moment changes in digital experiences has great potential for untangling complicated issues in the study of technology, psychological well-being, and mental health. To date, there is substantially greater public, policy, parental, and medical alarm about the potential role of technology in well-being than there is empirical research to support the concerns (Hancock et al., Citation2022; Orben, Citation2020; Vuorre & Przybylski, Citation2022). It is possible that the ambiguity in research matches a reality that technology does not cause substantial changes in well-being, or at least not uniformly across individuals (P. Valkenburg et al., Citation2021). But it is equally possible that the assessment of well-being and mental health has not yet been matched well with descriptions of technology experiences that are related to well-being.

Algorithmic bias

occurs when algorithms produce results that favor certain social groups over others, e.g., Whites over Blacks, men over women, the rich over the poor, and people who speak in “standard” dialects or accents over those who speak in less common ones (Cramer et al., Citation2018). According to Gillespie (Citation2014), algorithms are “encoded procedures for transforming input data into the desired output, based on specified calculations” (p. 1). This expansive definition encompasses everything from x + 2 (with x being the input and the sum of x and 2 being the output) to the opaque amalgamations of code that determine which social media posts people see in their personalized feeds. Once (incorrectly) perceived as offering more objective alternatives to the well-known flaws of human judgment (Cohn, Citation2019), algorithms are now increasingly recognized as introducing their own distinct social harms. Accordingly, research on algorithmic bias often takes a normative stance, orienting itself toward harm reduction and eventual elimination (e.g., Hooker, Citation2021; Noble, Citation2018). Empirical studies of bias and fairness in social media share an interest in identifying three categories of trace data: the types of people who may experience such bias, the different categories under which they may be profiled by social media platforms, and the different types of content they may see.

API research design to study the RQ

One potential way to study the effect of misinformation content spread via social media on well-being is with the help of API-based trace data collection. API access can be used to estimate the extent of misinformation being spread on a social media platform such as Twitter. Social media trace data can be used to study research questions about misinformation in three categories: production, distribution, and consumption. APIs can be used for the study of misinformation production by using lists of website domains or hyperlinks known to contain substantial amounts of misinformation. Such lists can be cross-referenced with large social media datasets to determine the prevalence of major misinformation sources (e.g., Grinberg et al., Citation2019; Guess et al., Citation2019; Vosoughi et al., Citation2018). API methods can also be used to identify misinformation “superspreaders” – prominent users that expose large audiences to misinformation – including such well-known names as Breitbart News, Fox News, and Donald Trump (K. -C. Yang et al., Citation2021; Mackey et al., Citation2021). Knowing about the producers and the distribution of misinformation on social networks is important to evaluate the extent of the problem for specific platforms. It can be connected to aggregate outcomes, for example by triangulating the spread of misinformation on a given platform with the well-being of users of this platform derived from other, potentially secondary data sources, such as nationally representative panel or trend studies (e.g., the American Trends Panel or the German ALLBUS). However, a potential link could only be established at the most aggregate, macro-data level, and such a link can be weak, given the many confounding variables that such a design cannot account for.
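A minimal sketch of such cross-referencing is shown below: URLs contained in API-collected post objects are matched against a list of domains known to publish misinformation. The domain list and the post structure are illustrative placeholders, not an established resource or a fixed API format.

```python
# Sketch of cross-referencing shared URLs with a (placeholder) list of
# low-credibility domains to estimate the prevalence of misinformation links.
from urllib.parse import urlparse

LOW_CREDIBILITY_DOMAINS = {"example-misinfo-site.com", "another-fake-news.net"}


def shares_low_credibility_link(post):
    """Flag a post object if any expanded URL points to a listed domain."""
    for url in post.get("entities", {}).get("urls", []):
        domain = urlparse(url.get("expanded_url", "")).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if domain in LOW_CREDIBILITY_DOMAINS:
            return True
    return False


posts = [{"id": "1", "entities": {"urls": [
    {"expanded_url": "https://www.example-misinfo-site.com/story"}]}}]
prevalence = sum(shares_low_credibility_link(p) for p in posts) / len(posts)
print(f"Share of posts linking to listed domains: {prevalence:.2%}")
```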

Another way to study the link between misinformation and well-being would be to analyze reply networks of posts that have been identified via lists of misinformation super-spreaders. Sentiment analysis of the posts occurring in the network could capture the emotional tone of replies left in response to misinforming content, potentially comparing it with reply networks of non-misinforming news posts on similar topics. The quality of such a study strongly relies on the detection of networks and the quality of the sentiment analysis. Recent approaches in the analysis of reply networks (e.g., Gaisbauer et al., Citation2021) and in the automated detection of human emotion in text corpora (e.g., Guo, Citation2022) make it possible to study expressed sentiment as a response to misinformation and can allow for cautious inferences about the effect on users’ well-being.
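The sketch below illustrates the sentiment step with the off-the-shelf VADER analyzer as a stand-in for a validated emotion model; the reply texts are invented examples, and in practice the replies would come from the collected networks.

```python
# Sketch: compare the mean emotional tone of replies to posts from known
# misinformation spreaders with replies to other news posts (VADER compound
# scores range from -1, most negative, to +1, most positive).
from statistics import mean

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()


def mean_compound_sentiment(replies):
    """Average VADER compound score over a list of reply texts."""
    return mean(analyzer.polarity_scores(text)["compound"] for text in replies)


replies_to_misinfo = ["This is terrifying, I can't sleep anymore.",
                      "Stop spreading this nonsense!"]
replies_to_news = ["Thanks for the clear reporting.",
                   "Useful context, appreciated."]

print("Misinformation replies:", mean_compound_sentiment(replies_to_misinfo))
print("Comparison replies:", mean_compound_sentiment(replies_to_news))
```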

The consumption of misinformation on an individual data level is difficult to study with API data, mainly because many social media platforms do not make post-view data available for individual users through their public APIs.Footnote2

APIs can also be used to study algorithmic bias by showing systematic differences in the exposure to content on social media platforms for specific sub-groups, for example, those based on gender, ethnicity, or geographic location. The detection of bias at a user level using APIs can, in some cases, be done indirectly through algorithmic auditing (e.g., Robertson et al., Citation2018). For example, one study created bots on Twitter – and extracted their data using APIs – to reveal algorithmic biases toward popularity and engagement (Bartley et al., Citation2021). Other studies have focused on content-level measures and used APIs to show, for instance, homophily on YouTube with pro- and anti-vax videos based on video recommendations (Abul-Fottouh et al., Citation2020), or bias toward men in the case of STEM career ads on Facebook by running advertising campaigns directly and using reports provided by the platform (Lambrecht & Tucker, Citation2019). Hence, the above-described combination of reply network and sentiment analysis of social media posts could be probed further by testing whether the relationship between misinformation spread on platforms and the emotional language of comments below posts differs for certain sub-groups of users. Some API metadata include information on the region, gender, or age of users and thus can be used to enrich such analyses. In sum, API access to DTD as a platform-centric access method has the potential to address the posed research question mostly on an aggregate level, while it can give less indication about person-specific social media effects.

Data donation research design to study the RQ

Data donation can be used to understand the activities that a user performs on social media – e.g., what one posts, comments on, or likes – what accounts users follow on social media and, depending on the specific platform, the websites visited by a user at the URL level (e.g., Thorson et al., Citation2021). In addition, the data donation packages that a user downloads can also contain information on ad targeting on social media and hence can provide information on whether outlets that are known to spread misinformation have paid to appear in a user’s newsfeed (see Burgess et al., Citation2021 for a combined approach of data donations and tracking). Importantly, donated data often contain the access point to content users have been exposed to (e.g., URLs) and less often the content itself. Researchers therefore need to process the data further, matching posts and URLs with known misinformation accounts, scraping the content behind the URLs, and analyzing it for actual misinformation. Hence, data donation could give us a proximity score for misinformation exposure on an individual user basis but will lack information about the specific misinforming posts a user has been exposed to on social media platforms.
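The sketch below turns this idea into a per-participant exposure proximity score: the share of logged URLs in the donated data whose domain appears on a misinformation list. The record format and the domain list are simplified, hypothetical stand-ins for platform-specific DDP structures.

```python
# Sketch of a per-participant misinformation "exposure proximity" score
# computed from donated, time-stamped URL records (hypothetical format).
from urllib.parse import urlparse

LOW_CREDIBILITY_DOMAINS = {"example-misinfo-site.com"}


def exposure_proximity(records):
    """Fraction of logged URLs whose domain appears on the misinformation list."""
    if not records:
        return 0.0
    hits = 0
    for record in records:
        domain = urlparse(record["url"]).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if domain in LOW_CREDIBILITY_DOMAINS:
            hits += 1
    return hits / len(records)


donated_records = [
    {"timestamp": "2023-03-01T09:12:00", "url": "https://example-misinfo-site.com/a"},
    {"timestamp": "2023-03-01T09:15:00", "url": "https://quality-news.org/b"},
]
print(f"Exposure proximity: {exposure_proximity(donated_records):.2f}")
```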

An advantage of data donations is that they include a variety of information about a user’s engagement on digital platforms, for example, the information a user has posted or the conversations they had with others through messaging applications (see Breuer et al., Citation2022 for a list of possible content). This information can be used to infer the emotional state of a user, based on the language and images used in original posts, the comments to other content on social media, or other expressions that can be understood as an approximation of a user’s well-being. Importantly, all these approaches require additional (often automated) processing of the content in the data donation packages and are thereby resource-intensive (e.g., van Driel et al., Citation2022). Yet, because events in data donation packages are often time-stamped, it is possible to apply longitudinal data analysis procedures and test for a temporal relationship between the likelihood of being exposed to misinformation and the well-being of a user.
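As a minimal illustration of such a temporal analysis, the sketch below pairs a participant's daily exposure proximity with the sentiment of their own posts on the following day and computes a within-person correlation. The data frames contain invented values only.

```python
# Sketch of a within-person lagged analysis: daily exposure proximity from
# donated traces paired with the next day's sentiment of the participant's
# own posts (a rough well-being proxy). Values are invented for illustration.
import pandas as pd

exposure = pd.DataFrame({
    "date": pd.to_datetime(["2023-03-01", "2023-03-02", "2023-03-03"]),
    "exposure_proximity": [0.40, 0.10, 0.30],
})
own_post_sentiment = pd.DataFrame({
    "date": pd.to_datetime(["2023-03-02", "2023-03-03", "2023-03-04"]),
    "mean_sentiment": [-0.20, 0.35, -0.05],
})

# Shift exposure forward by one day so each sentiment value is paired with
# the previous day's exposure.
exposure["date"] = exposure["date"] + pd.Timedelta(days=1)
merged = exposure.merge(own_post_sentiment, on="date")
print(merged[["exposure_proximity", "mean_sentiment"]].corr())
```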

The collection of DTD with the help of data donations often relies on the recruitment of individuals via auxiliary methods, such as online surveys. This makes it possible to connect information from data donation with self-reported well-being (e.g., Cronin et al., Citation2022, relying on WebHistorian donation software). Because the measures in a recruitment survey usually occur before data donations are collected, self-reported well-being and the proximity of misinformation exposure can be related in retrospective temporal order, establishing a quasi-prospective design: the donated data contain information about social media use from before the states of well-being were assessed.

In addition, studies that have used data donations have revealed how social media platforms profile users, for example how Facebook detects specific interests in politics or news as part of its user profiling, ultimately influencing content exposure (Thorson et al., Citation2021). Given the availability of both profiling and user activity data in data donation packages, and the ability to triangulate this information with self-reports by users, this method may help us understand which meta-information (e.g., ad targeting categories; see Burgess et al., Citation2021) determines the likelihood of being exposed to misinformation and, by extension, the conditionality – and potential algorithmic biases – of the relationship between misinformation exposure and well-being.

In sum, the analysis of data donation packages provides the possibility to infer the likelihood of misinformation exposure, rather than the exposure to actual content. The richness of the data and the connection with self-reports make it easier to draw a connection between information exposure and outcome variables, such as well-being. Metadata gathered by platforms help to study the algorithmic dependency of potential relationships.

Tracking research design to study the RQ

Tracking screens can be used to study misinformation by exploring, for example, the apps a user has opened, the messages they have received and sent on messaging platforms such as Telegram, and the social media pages they have visited. With a method such as Screenomics that captures all screens, this extends to content not included in available APIs or in data donation packages. Analyzing tracking data with a combination of automated and hand-coded classification of misinformation (e.g., Christner et al., Citation2022) can help to more comprehensively identify exposure to misinformation on an individual user level. It is also possible to filter the information based on predefined lists of known spreaders of misinformation (e.g., Allen et al., Citation2020). In addition, logo detection on screens can identify outlets that have frequently been associated with the spread of misinformation.

Tracking can also be used to better understand the well-being of users by analyzing the valence of the information users select. For screen tracking, this can be done via optical character recognition (OCR), followed by automated content analysis or automated image recognition procedures (e.g., J. H. S. Lee et al., Citation2021; Singh & Sharma, Citation2022). Such a procedure allows for a direct linkage between exposure to misinformation and the selection of other content or activities. For example, comments or chats with friends allow inferences about individual well-being indicators by assessing the affect expressed in the messages, and can thereby give insights into short-term relationships with social media use. An additional approach can be to link tracking data with in-situ measures, for example by using the experience sampling method (see Otto et al., Citation2022; P. Valkenburg et al., Citation2021). Here, momentary self-assessments of well-being can be connected to prior exposure to misinformation as a short-term temporal association.
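The sketch below illustrates such a linkage: each experience-sampling prompt is paired with a flag indicating whether misinformation exposure was detected in the screenshots captured during the preceding hour. Timestamps, classifications, and the one-hour window are illustrative assumptions.

```python
# Sketch of linking experience-sampling (ESM) well-being prompts to
# misinformation exposure detected in screenshots during the prior hour.
from datetime import datetime, timedelta

screen_events = [  # (screenshot timestamp, classified as misinformation exposure)
    (datetime(2023, 3, 1, 10, 5), True),
    (datetime(2023, 3, 1, 10, 40), False),
    (datetime(2023, 3, 1, 13, 15), True),
]
esm_prompts = [  # (prompt timestamp, self-reported momentary well-being, 1-7)
    (datetime(2023, 3, 1, 11, 0), 4),
    (datetime(2023, 3, 1, 14, 0), 3),
]

WINDOW = timedelta(hours=1)
for prompt_time, wellbeing in esm_prompts:
    exposed = any(flag and prompt_time - WINDOW <= ts <= prompt_time
                  for ts, flag in screen_events)
    print(f"{prompt_time:%H:%M} well-being={wellbeing} "
          f"misinformation in prior hour={exposed}")
```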

Tracking data can also be used for natural, laboratory, and field experiments at an individual data level. Tracking data collected in situ rather than in a fabricated laboratory situation facilitate the identification and use of naturally occurring within-person changes, and natural and field experiments provide potentially stronger evidence of causality. Here, a specific event can be determined from tracking data and be treated as an intervention with a potential effect on users. For example, it could be tested whether exposure to a specific misinforming post about vaccination is associated with changes in attitudes toward vaccines at the individual level. J. Lee et al. (Citation2022) identified specific short ads by payday loan companies (usually available on the screen for only seconds), and the effects of those ads on the type of information that low-income individuals consumed immediately after seeing the ads (i.e., people were more likely to avoid future-oriented information, they switched more between different applications, and they avoided negative information).

To investigate long-term effects, tracking data can be included in larger data-collection efforts that include pre-tracking and post-tracking surveys, allowing one to connect the exposure to misinformation based on tracking data and self-reported levels of well-being after the exposure, thereby controlling for levels of well-being prior to exposure. In addition, because tracking data may include screens experienced in the periods before and after misinformation is experienced, they also enable the examination of contextual screen experiences (e.g., verified news exposure) as possible moderators or mediators of observed changes in measures of well-being.

In sum, tracking approaches come with the highest granularity and insights into misinformation exposure but require extensive pre-processing of the data. The content itself is often included; hence, drawing inferences between content exposure and outcome variables is possible, especially for establishing short-term effects. Auxiliary methods, such as momentary assessments or online surveys, are necessary to contextualize tracking data, making these approaches the most resource-intensive.

Deciding on a method for digital trace data collection

Deciding on methods for collecting digital trace data for social media content studies will likely be guided by research questions, resources, and data handling possibilities – both technical and privacy-related. The presentation and application of the methods described in this paper can nevertheless provide guidance in deciding on the method of choice.

For platform-centric approaches, we conclude that the costs and accessibility differ per platform and data type, but in certain cases, affordable and highly standardized solutions are readily available. The fact that the platform is in control during the data processing and can decide what data (not) to share and which specific users (not) to include prior to sharing also poses threats to the data quality on multiple levels (Amaya et al., Citation2020). This adds to the issue that the population of platform users is selective with respect to the general population (Mellon & Prosser, Citation2017), known as coverage error, and that supplementing platform data with person-level survey data is difficult. The unit of measurement is an account, which is not equivalent to an individual user, as a person can have multiple accounts on a platform or multiple persons can share a single account, both resulting in unit error from a total error perspective (Zhang, Citation2012). However, platform-centric approaches do not rely on user engagement and have a low risk of reactivity biases, because data have been published before research data are collected.

When aiming to generalize toward certain populations, a user-centric approach may be more appropriate, but it will always require effort in the development of a sampling and study design – and may be challenged by the relatively high levels of effort required from users. Both data donation and tracking require informed consent from the user and provide higher transparency to the user about what data they share with a researcher. This can be problematic for tracking methods, where the prospective nature of the data shared can result in reactivity biases, i.e., the user changes their behavior because they know they are being tracked (Toth & Trifonova, Citation2021). Both methods require user involvement and are subject to sample biases during the tracking and donation process (Boeschoten et al., Citation2020; Breuer et al., Citation2022; Ohme & Araujo, Citation2022). User-centric approaches yield individual-level data that are linked either to a user account or to a device, neither of which can be uniquely associated with an individual. However, they get closest to actual user behavior and – depending on the granularity of the data – can allow researchers to recreate trajectories of user behavior and thereby also the content users have been exposed to. Especially for data donations, it must be kept in mind that the structure of the data download packages (DDPs) is often not stable and poorly documented. Relatedly, the predictability of content resulting from applying user-centric methods is medium to low, as they usually do not operate with search queries. In this sense, it is possible that a study interested in misinformation exposure does not find traces of it in a DDP or tracked logs at all, limiting the variance in data necessary for further analysis.

To help guide researchers’ decision about which digital trace data collection method to use, Table 1 provides an overview of the considerations necessary before deciding on a specific method. Roughly, researchers need to consider the platform, user-dependency and engagement, timeframe for data collection, data and content types required, data quality necessary, and the privacy risks involved with the collection and its consequences for storage and analysis.

Table 1. Comparing API, donation, and tracking methods.

Holding back and moving forward – an agenda for digital trace data research

This article sheds light on the possibilities of collecting and using digital trace data for social media effects research. We present an overview of the advantages of platform- and user-centric data collection methods and the disadvantages that come with their application. By focusing on three popular topics in social media research – misinformation, algorithmic biases, and digital wellbeing – we further present solutions for how digital trace data collection can be utilized in specific research areas. Especially for the latter, it becomes apparent that, despite notable exceptions, the uptake of digital trace data in social media effects research is limited. Two questions arise: What holds the field back, and what can be done to leverage the advantages of digital trace data while attenuating the problems arising with such methods?

Holding back

One likely factor that should not be concealed is the resources necessary to collect digital trace data. The technical infrastructure needed to obtain access (e.g., API costs for firehose access), the development of the necessary research software, the lawful storage of digital trace data (e.g., GDPR compliance), and processing and analysis (e.g., server space, computation costs) are all notable cost factors. In addition to the researcher’s time and devotion to the project, a substantial budget is necessary to realize digital trace data collection. Exemplar projects, such as Screenomics (Reeves et al., Citation2021), ScreenLife (Yee et al., Citation2022), PORT (Boeschoten, Ausloos, et al., Citation2022), OSD2F (Araujo et al., Citation2022), WebHistorian (Menchen-Trevino, Citation2016), and PIEGraph (Freelon, Citation2018), all took several years, significant research budgets, and great team efforts to build. Although these efforts are meant to scale and be redeployed in new settings in accordance with open-science paradigms, the development costs make progress difficult.

Second, specific skills are required not only for developing but also for employing the software needed for digital trace data collection. Fluency in specific programming languages, server architecture, privacy law, and so on is necessary but has not traditionally been part of the social science curriculum, although a certain “computational turn” and a tendency toward multi-disciplinary collaboration have shifted the field (e.g., Fan et al., Citation2022). Nevertheless, researchers may be hesitant to employ digital trace data methods because they lack familiarity with the necessary skills or the possibility to obtain them.

Even if resources and skills are less of a problem, the quality of data already available often complicates the employment of digital trace data. It is often challenging to verify the extent to which the data collected are complete or a full reflection of the behavior of interest. DTD collection takes place in a highly dynamic environment, meaning that researchers have to constantly adjust to changes done by platforms – be it in what they provide, allow to be collected, or even in how content is presented to users. Moreover, while DTD collection methods have the potential to shed light on some of the workings of algorithms that shape an increasingly larger share of the content exposure on social media, the influence of these algorithms (and their changes) presents a challenge in itself when interpreting the data.

Lastly, it has to be mentioned that digital trace data collection comes with a great number of ethical questions and challenges, with regard to meaningful informed consent, researchers’ responsibility when discovering illegal content in digital traces, and the formation of cooperation with commercial platforms like Twitter (e.g., Breuer et al., Citation2022; Ohme & Araujo, Citation2022). While it is beyond the scope of this paper to discuss the ethical challenges in detail, we need to acknowledge that this is a factor potentially holding researchers back from applying DTD in social media effects research.

Moving forward

Several of the problems mentioned can be addressed to pave the way for a stronger engagement with content in social media effects research based on digital trace data. To arrive at more and better research in this field, we suggest an agenda for future work with digital trace data.

First, it is important to educate a greater number of scholars and students from communication science and adjacent fields in the languages, methods, and analyses used in digital trace research. This will help to establish a critical mass of scholars who have the skills and expertise needed to employ digital trace data in their research projects and introduce innovations in what can be discovered from these data.

Second, existing social media effects research using DTD shows – a bit oversimplified – that the effects of social media usage are not as clear as once thought. The findings obtained when more data and more granular data are available indicate that social media effects are complicated. More often, null effects are found once DTD are used to assess digital user behavior (e.g., Cronin et al., Citation2022). But rather than assuming that this is due to a minimal media effects paradigm in a social media era (Bennett & Iyengar, Citation2008), first results using digital trace data show that media content indeed has an effect – being deceived by misinformation, being nudged by algorithmic recommendations, showing low psychological health from problematic content exposure – but that these effects are more nuanced (P. Valkenburg et al., Citation2021). To be able to find the “needle in the haystack,” we must – and with DTD increasingly can – rely on sequential analyses of users’ media trajectories that include cross-platform assessment, erratic content switches, and the context of exposure. It may still be challenging to extract the one opinion-changing article or the one comment that makes a user feel bad, but assessing systematic patterns in sequential analyses and associating those with outcome variables of interest may bring us closer to understanding the intra-individual changes invoked by the media than was possible when only aggregate-level effects could be examined.

Third, digital trace data collection methods open new opportunities to leverage the wide range of perspectives and possibilities that emerge when working in interdisciplinary teams. Interfacing with scholars with different backgrounds and expertise will certainly expand how the research is done, and how communication science can, in collaboration with other fields, contribute to scientific and societal progress. For example, a digital trace data consortium would facilitate the pooling of resources, allow for greater and higher-quality data sources, and help to make digital trace data available to researchers who lack resources for their own data collection efforts. Other benefits include the possibility to attract funding for DTD research endeavors and increased leverage in negotiations with digital platforms and policymakers when developing and demanding research access. To date, many similar approaches to DTD collection are being developed and carried out in parallel, which is normal in the early days of new research method development. Nevertheless, we should not forget that most media effects researchers are interested in similar questions, want to use similar data, and focus on similar populations. While a data consortium does not mean giving up individual research projects, establishing a network of researchers will contribute to synchronization in terms of standards used in method development, data quality criteria, and ethical requirements necessary in this type of research. Together, open-source tracking software developed in Asia, language and image processing models developed in America, sequence analytical approaches from Africa or Australia, and data visualization software from Europe can produce and accelerate the deployment of seamless pipelines that facilitate a multitude of research projects.

Through these efforts, studies of communication behavior that work with digital trace data will allow a greater number of researchers to advance public knowledge of media effects more quickly and forcefully. Although our coverage of the approaches, benefits, and pitfalls surrounding the use of digital trace data is necessarily incomplete, this overview of some popular methods in current use is meant to spur new thinking and innovation in how these data can be collected and used en route to the discovery and understanding of social media content effects.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

Jakob Ohme

Jakob Ohme is Research Group Lead at the Weizenbaum Institute and Fellow at the Digital Communication Methods Lab at the University of Amsterdam. His research interests center around the impact of digital and mobile communication processes on political behavior and news flows, generational differences in media use and political socialization, and the development of new methodological approaches in political communication and journalism research.

Theo Araujo

Theo Araujo is Scientific Director of the Amsterdam School of Communication Research (ASCoR) and Associate Professor in the Department of Communication Science at University of Amsterdam. His research focuses on the increasing adoption of artificial intelligence and related technologies within our communication environment, including conversational agents and automated decision-making.

Laura Boeschoten

Laura Boeschoten is an Assistant Professor at the Department of Methodology and Statistics at Utrecht University. Her PhD was a combined project at Tilburg University and Statistics Netherlands on combining latent class modelling and multiple imputation to estimate and correct for measurement error in combined survey-register datasets. Currently, her main research area is data donation. She is involved in multiple projects that focus on developing research infrastructure for conducting data donation studies, investigating its methodological challenges and applying it in practice.

Deen Freelon

Deen Freelon is an associate professor at the Hussman School of Journalism and Media and a principal researcher at the Center for Information, Technology, and Public Life at the University of North Carolina at Chapel Hill. Among other topics, he is interested in digital politics, computational social science, and open source software development.

Nilam Ram

Nilam Ram is a Professor of Psychology and Communication at Stanford University in the USA, studying how psychological and media processes (e.g., learning, information processing, and emotion regulation) develop across the life span, and how longitudinal study designs contribute to the generation of new knowledge.

Byron B. Reeves

Byron Reeves, PhD, is the Paul C. Edwards Professor of Communication at Stanford and Professor (by courtesy) in the Stanford School of Education. Byron has studied how media influence attention, memory and emotional responses and has applied the research in the areas of speech dialogue systems, interactive games, advanced displays, social robots, and autonomous cars. Byron has launched (with Stanford colleagues Nilam Ram and Thomas Robinson) the Human Screenome Project, designed to collect moment-by-moment changes in technology use across applications, platforms and screens. Byron’s PhD in Communication is from Michigan State University.

Thomas N. Robinson

Thomas N. Robinson, MD, MPH designs and tests solutions to help children and families improve their health and reduce health and social disparities. He is the Irving Schulman, MD Endowed Professor in Child Health and Professor of Pediatrics, of Medicine, and, by courtesy, of Epidemiology and Population Health, at Stanford University, USA. He also directs the Stanford Solutions Science Lab and the Center for Healthy Weight and co-directs the Stanford Screenomics Lab and Human Screenome Project with Professors Byron Reeves and Nilam Ram.

Notes

1 At time of writing, the extent and details of changes in Twitter API access were not fully disclosed.

2 This is true for Twitter and Facebook, but some video-based sites like YouTube and TikTok display viewership stats for individual posts.

References

  • Abul-Fottouh, D., Song, M. Y., & Gruzd, A. (2020). Examining algorithmic biases in YouTube’s recommendations of vaccine videos. International Journal of Medical Informatics, 140, 104175. https://doi.org/10.1016/j.ijmedinf.2020.104175
  • Allcott, H., Braghieri, L., Eichmeyer, S., & Gentzkow, M. (2020). The welfare effects of social media. The American Economic Review, 110(3), 629–676. https://doi.org/10.1257/aer.20190658
  • Allen, J., Howland, B., Mobius, M., Rothschild, D., & Watts, D. J. (2020). Evaluating the fake news problem at the scale of the information ecosystem. Science Advances, 6(14), eaay3539. https://doi.org/10.1126/sciadv.aay3539
  • Amaya, A., Biemer, P. P., & Kinyon, D. (2020). Total error in a big data world: Adapting the TSE framework to big data. Journal of Survey Statistics and Methodology, 8(1), 89–119. https://doi.org/10.1093/jssam/smz056
  • Araujo, T., Ausloos, J., van Atteveldt, W., Loecherbach, F., Moeller, J., Ohme, J., Trilling, D., van de Velde, B., de Vreese, C., & Welbers, K. (2022). OSD2F: An open-source data donation framework. Computational Communication Research, 4(2), 372–387. https://doi.org/10.5117/CCR2022.2.001.ARAU
  • Araujo, T., Wonneberger, A., Neijens, P., & Vreese, C. D. (2017). How much time do you spend online? Understanding and improving the accuracy of self-reported measures of internet use. Communication Methods and Measures, 11(3), 173–190. https://doi.org/10.1080/19312458.2017.1317337
  • Ausloos, J., & Veale, M. (2021). Researching with data rights. Technology and Regulation, 136–157. http://dx.doi.org/10.2139/ssrn.3465680
  • Bartley, N., Abeliuk, A., Ferrara, E., & Lerman, K. (2021). Auditing algorithmic bias on Twitter. 13th ACM Web Science Conference 2021, 65–73. https://doi.org/10.1145/3447535.3462491
  • Baumgartner, S. E., Sumter, S. R., Petkevič, V., & Wiradhany, W. (2022). A novel iOS data donation approach: Automatic processing, compliance, and reactivity in a longitudinal study. Social Science Computer Review. https://doi.org/10.1177/08944393211071068
  • Bennett, W. L., & Iyengar, S. (2008). A new era of minimal effects? The changing foundations of political communication. Journal of Communication, 58(4), 707–731.
  • Boeschoten, L., Ausloos, J., Moeller, J., Araujo, T., & Oberski, D. L. (2020). Digital trace data collection through data donation. ArXiv:2011.09851 [Cs, Stat]. http://arxiv.org/abs/2011.09851
  • Boeschoten, L., Ausloos, J., Möller, J. E., Araujo, T., & Oberski, D. L. (2022). A framework for privacy preserving digital trace data collection through data donation. Computational Communication Research, 4(2), 388–423. https://doi.org/10.5117/CCR2022.2.002.BOES
  • Boeschoten, L., Mendrik, A., van der Veen, E., Vloothuis, J., Hu, H., Voorvaart, R., & Oberski, D. L. (2022). Privacy-preserving local analysis of digital trace data: A proof-of-concept. Patterns, 3(3), 100444. https://doi.org/10.1016/j.patter.2022.100444
  • Borges Do Nascimento, I. J., Beatriz Pizarro, A., Almeida, J., Azzopardi-Muscat, N., André Gonçalves, M., Björklund, M., & Novillo-Ortiz, D. (2022). Infodemics and health misinformation: A systematic review of reviews. Bulletin of the World Health Organization, 100(9), 544–561. https://doi.org/10.2471/BLT.21.287654
  • Breuer, J., Kmetty, Z., Haim, M., & Stier, S. (2022). User-centric approaches for collecting Facebook data in the ‘post-API age’: Experiences from two studies and recommendations for future research. Information, Communication & Society. https://doi.org/10.1080/1369118X.2022.2097015
  • Burgess, J., Angus, D., Carah, N., Andrejevic, M., Hawker, K., Lewis, K., Obeid, A. K., Smith, A., Tan, J., Fordyce, R., Trott, V., & Li, L. (2021). Critical simulation as hybrid digital method for exploring the data operations and vernacular cultures of visual social media platforms. Preprint SocArXiv. https://doi.org/10.31235/osf.io/2cwsu
  • Chiatti, A., Cho, M. J., Gagneja, A., Yang, X., Brinberg, M., Roehrick, K., Choudhury, S. R., Ram, N., Reeves, B., & Giles, C. L. (2018). Text extraction and retrieval from smartphone screenshots: Building a repository for life in media. Proceedings of the 33rd Annual ACM Symposium on Applied Computing, 948–955. https://doi.org/10.1145/3167132.3167236
  • Christner, C., Urman, A., Adam, S., & Maier, M. (2022). Automated tracking approaches for studying online media use: A critical review and recommendations. Communication Methods and Measures, 16(2), 79–95. https://doi.org/10.1080/19312458.2021.1907841
  • Cohn, J. (2019). The burden of choice: Recommendations, subversion, and algorithmic culture. Rutgers University Press.
  • Cramer, H., Garcia-Gathright, J., Springer, A., & Reddy, S. (2018). Assessing and addressing algorithmic bias in practice. Interactions, 25(6), 58–63. https://doi.org/10.1145/3278156
  • Cronin, J., von Hohenberg, B. C., Gonçalves, J. F. F., Menchen-Trevino, E., & Wojcieszak, M. (2022). The (null) over-time effects of exposure to local news websites: Evidence from trace data. Journal of Information Technology & Politics, 1–15. https://doi.org/10.1080/19331681.2022.2123878
  • De Vreese, C. H., Boukes, M., Schuck, A., Vliegenthart, R., Bos, L., & Lelkes, Y. (2017). Linking survey and media content data: Opportunities, considerations, and pitfalls. Communication Methods and Measures, 11(4), 221–244. https://doi.org/10.1080/19312458.2017.1380175
  • Diener, E. (1984). Subjective well-being. Psychological Bulletin, 95(3), 542–575. https://doi.org/10.1037/0033-2909.95.3.542
  • European Union. (2016). Regulation (EU) 2016/679 of the European parliament and of the council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/EC (general data protection regulation). OJ, 59(L 119), 1–89.
  • Fan, Y., Lehmann, S., & Blok, A. (2022). Extracting the interdisciplinary specialty structures in social media data-based research: A clustering-based network approach. Journal of Informetrics, 16(3), 101310. https://doi.org/10.1016/j.joi.2022.101310
  • Freelon, D. (2014). On the interpretation of digital trace data in communication and social computing research. Journal of Broadcasting & Electronic Media, 58(1), 59–75. https://doi.org/10.1080/08838151.2013.875018
  • Freelon, D. (2018). Computational research in the post-API age. Political Communication, 35(4), 665–668. https://doi.org/10.1080/10584609.2018.1477506
  • Freelon, D., Pruden, M. L., Malmer, D., & Crist, A. (2022). Piegraph. [Computer software]. Retrieved from http://pcad.ils.unc.edu
  • Freelon, D., & Wells, C. (2020). Disinformation as political communication. Political Communication.
  • Gaisbauer, F., Pournaki, A., Banisch, S., Olbrich, E., & Guidi, B. (2021). Ideological differences in engagement in public debate on Twitter. Plos One, 16(3), e0249241. https://doi.org/10.1371/journal.pone.0249241
  • Gillespie, T. (2014). The relevance of algorithms. In T. Gillespie, P. J. Boczkowski, & K. A. Foot (Eds.), Media technologies: Essays on communication, materiality, and society (pp. 167–194). MIT Press.
  • Grinberg, N., Joseph, K., Friedland, L., Swire-Thompson, B., & Lazer, D. (2019). Fake news on Twitter during the 2016 U.S. presidential election. Science, 363(6425), 374–378. https://doi.org/10.1126/science.aau2706
  • Guess, A., Nagler, J., & Tucker, J. (2019). Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Science Advances, 5(1), eaau4586. https://doi.org/10.1126/sciadv.aau4586
  • Guo, J. (2022). Deep learning approach to text analysis for human emotion detection from big data. Journal of Intelligent Systems, 31(1), 113–126. https://doi.org/10.1515/jisys-2022-0001
  • Halavais, A. (2019). Overcoming terms of service: A proposal for ethical distributed research. Information, Communication & Society, 22(11), 1567–1581. https://doi.org/10.1080/1369118X.2019.1627386
  • Hancock, J. T., Liu, S. X., Luo, M., & Mieczkowski, H. (2022). Social media and psychological well-being. In S. C. Matz (Ed.), The psychology of technology: Social science research in the age of big data (pp. 195–238). American Psychological Association. https://doi.org/10.1037/0000290-007
  • Hooker, S. (2021). Moving beyond “algorithmic bias is a data problem”. Patterns, 2(4), 100241. https://doi.org/10.1016/j.patter.2021.100241
  • Howison, J., Wiggins, A., & Crowston, K. (2011). Validity issues in the use of social network analysis with digital trace data. Journal of the Association for Information Systems, 12(12), 767–797. https://doi.org/10.17705/1jais.00282
  • Kmetty, Z., & Németh, R. (2022). Which is your favorite music genre? A validity comparison of Facebook data and survey data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 154(1), 82–104. https://doi.org/10.1177/07591063211061754
  • Lambrecht, A., & Tucker, C. (2019). Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads. Management Science, 65(7), 2966–2981. https://doi.org/10.1287/mnsc.2018.3093
  • Lazer, D., Hargittai, E., Freelon, D., Gonzalez-Bailon, S., Munger, K., Ognyanova, K., & Radford, J. (2021). Meaningful measures of human society in the twenty-first century. Nature, 595(7866), 189–196. https://doi.org/10.1038/s41586-021-03660-7
  • Lee, J. H. S., Li, T., Hsu, W., & Lee, M. L. (2021). Repurpose image identification for fake news detection. In C. Strauss, G. Kotsis, A. M. Tjoa, & I. Khalil (Eds.), Database and Expert Systems Applications: 32nd International Conference, DEXA 2021, Virtual Event, September 27–30, 2021, Proceedings, Part II (Vol. 12924). Springer International Publishing. https://doi.org/10.1007/978-3-030-86475-0
  • Lee, J., Reeves, B., Ram, N., & Hamilton, J. (2022). The psychology of poverty and life online: Natural experiments on the effects of smartphone payday loan ads on psychological stress. Information, Communication & Society, 1–22. https://doi.org/10.1080/1369118X.2022.2109982
  • Luiten, A., Hox, J. J. C. M., & De Leeuw, E. D. (2020). Survey nonresponse trends and fieldwork effort in the 21st century: Results of an international study across countries and surveys. Journal of Official Statistics, 36(3), 469–487. https://doi.org/10.2478/jos-2020-0025
  • Mackey, T. K., Purushothaman, V., Haupt, M., Nali, M. C., & Li, J. (2021). Application of unsupervised machine learning to identify and characterise hydroxychloroquine misinformation on Twitter. Lancet Digital Health, 3(2), e72–75. https://doi.org/10.1016/S2589-7500(20)30318-6
  • Mellon, J., & Prosser, C. (2017). Twitter and Facebook are not representative of the general population: Political attitudes and demographics of British social media users. Research & Politics, 4(3), 2053168017720008. https://doi.org/10.1177/2053168017720008
  • Menchen-Trevino, E. (2016). Web historian: Enabling multi-method and independent research with real-world web browsing history data. IConference 2016 Proceedings (iSchools). https://doi.org/10.9776/16611.
  • Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. NYU Press.
  • Ohme, J., & Araujo, T. (2022). Digital data donations: A quest for best practices. Patterns, 3(4), 100467. https://doi.org/10.1016/j.patter.2022.100467
  • Ohme, J., Araujo, T., de Vreese, C. H., & Piotrowski, J. T. (2021). Mobile data donations: Assessing self-report accuracy and sample biases with the iOS screen time function. Mobile Media & Communication, 9(2), 293–313. https://doi.org/10.1177/2050157920959106
  • Orben, A. (2020). Teenagers, screens and social media: A narrative review of reviews and key studies. Social Psychiatry and Psychiatric Epidemiology, 55(4), 407–414. https://doi.org/10.1007/s00127-019-01825-4
  • Otto, L. P., Thomas, F., Glogger, I., & De Vreese, C. H. (2022). Linking media content and survey data in a dynamic and digital media environment – mobile longitudinal linkage analysis. Digital Journalism, 10(1), 200–215. https://doi.org/10.1080/21670811.2021.1890169
  • Parry, D. A., Davidson, B. I., Sewall, C. J. R., Fisher, J. T., Mieczkowski, H., & Quintana, D. S. (2021). A systematic review and meta-analysis of discrepancies between logged and self-reported digital media use. Nature Human Behaviour, 5(11), 1535–1547. https://doi.org/10.1038/s41562-021-01117-5
  • Reeves, B., Ram, N., Robinson, T. N., Cummings, J. J., Giles, C. L., Pan, J., Chiatti, A., Cho, M., Roehrick, K., Yang, X., Gagneja, A., Brinberg, M., Muise, D., Lu, Y., Luo, M., Fitzgerald, A., & Yeykelis, L. (2021). Screenomics: A framework to capture and analyze personal life experiences and the ways that technology shapes them. Human–Computer Interaction, 36(2), 150–201. https://doi.org/10.1080/07370024.2019.1578652
  • Robertson, R. E., Jiang, S., Joseph, K., Friedland, L., Lazer, D., & Wilson, C. (2018). Auditing partisan audience bias within google search. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 1–22. https://doi.org/10.1145/3274417
  • Scharkow, M. (2016). The accuracy of self-reported internet use—a validation study using client log data. Communication Methods and Measures, 10(1), 13–27. https://doi.org/10.1080/19312458.2015.1118446
  • Singh, B., & Sharma, D. K. (2022). Predicting image credibility in fake news over social media using multi-modal approach. Neural Computing & Applications, 34(24), 21503–21517. https://doi.org/10.1007/s00521-021-06086-4
  • Stadel, M., & Stulp, G. (2022). Balancing bias and burden in personal network studies. Social Networks, 70, 16–24. https://doi.org/10.1016/j.socnet.2021.10.007
  • Stier, S., Breuer, J., Siegers, P., & Thorson, K. (2020). Integrating survey data and digital trace data: Key issues in developing an emerging field. Social Science Computer Review, 38(5), 503–516. https://doi.org/10.1177/0894439319843669
  • Sun, X., Ram, N., Reeves, B., Cho, M. -J., Fitzgerald, A., & Robinson, T. N. (2022). Connectedness and independence of young adults and parents in the digital world: Observing smartphone interactions at multiple timescales using screenomics. Journal of Social and Personal Relationships. https://doi.org/10.1177/02654075221104268
  • Thorson, K., Cotter, K., Medeiros, M., & Pak, C. (2021). Algorithmic inference, political interest, and exposure to news and politics on Facebook. Information, Communication & Society, 24(2), 183–200. https://doi.org/10.1080/1369118X.2019.1642934
  • Toth, R., & Trifonova, T. (2021). Somebody’s watching me: Smartphone use tracking and reactivity. Computers in Human Behavior Reports, 4, 100142. https://doi.org/10.1016/j.chbr.2021.100142
  • Tufekci, Z. (2014). Engineering the public: Big data, surveillance and computational politics. First Monday, 19(7). https://doi.org/10.5210/fm.v19i7.4901
  • Valkenburg, P. M. (2022). Theoretical foundations of social media uses and effects. In J. Nesi, E. H. Telzer, & M. J. Prinstein (Eds.), Handbook of adolescent digital media use and mental health (1st ed, pp. 39–60). Cambridge University Press. https://doi.org/10.1017/9781108976237.004
  • Valkenburg, P. M., Beyens, I., Meier, A., & Vanden Abeele, M. M. P. (2022). Advancing our understanding of the associations between social media use and well-being. Current Opinion in Psychology, 47, 101357. https://doi.org/10.1016/j.copsyc.2022.101357
  • Valkenburg, P., Beyens, I., Pouwels, J. L., van Driel, I. I., & Keijsers, L. (2021). Social media use and adolescents’ self-esteem: heading for a person-specific media effects paradigm. The Journal of Communication, 71(1), 56–78. https://doi.org/10.1093/joc/jqaa039
  • Vanden Abeele, M. M. P. (2020). Digital wellbeing as a dynamic construct. Communication Theory. https://doi.org/10.1093/ct/qtaa024
  • van Driel, I. I., Giachanou, A., Pouwels, J. L., Boeschoten, L., Beyens, I., & Valkenburg, P. M. (2022). Promises and pitfalls of social media data donations. Communication Methods and Measures, 16(4), 266–282. https://doi.org/10.1080/19312458.2022.2109608
  • Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359(6380), 1146–1151. https://doi.org/10.1126/science.aap9559
  • Vuorre, M., & Przybylski, A. K. (2022). Global well-being and mental health in the internet age. Preprint PsyArXiv. https://doi.org/10.31234/osf.io/9tbjy
  • Wagner, C., Strohmaier, M., Olteanu, A., Kıcıman, E., Contractor, N., & Eliassi-Rad, T. (2021). Measuring algorithmically infused societies. Nature, 595(7866), 197–204. https://doi.org/10.1038/s41586-021-03666-1
  • Wojcieszak, M., Menchen-Trevino, E., Goncalves, J. F. F., & Weeks, B. (2022). Avenues to news and diverse news exposure online: Comparing direct navigation, social media, news aggregators, search queries, and article hyperlinks. The International Journal of Press/Politics, 27(4), 194016122110091. https://doi.org/10.1177/19401612211009160
  • World Health Organization. (2022, September 1). Infodemics and misinformation negatively affect people’s health behaviours, new WHO review finds. https://www.who.int/europe/news/item/01-09-2022-infodemics-and-misinformation-negatively-affect-people-s-health-behaviours–new-who-review-finds
  • Yang, K. -C., Pierri, F., Hui, P. -M., Axelrod, D., Torres-Lugo, C., Bryden, J., & Menczer, F. (2021). The COVID-19 infodemic: Twitter versus Facebook. Big Data & Society, 8(1), 20539517211013860. https://doi.org/10.1177/20539517211013861
  • Yang, X., Ram, N., Robinson, T., & Reeves, B. (2019). Using screenshots to predict task switching on smartphones. Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 1–6. https://doi.org/10.1145/3290607.3313089
  • Yee, A. Z. H., Yu, R., Lim, S. S., Lim, K. H., Dinh, T. T. A., Loh, L., Hadianto, A., & Quizon, M. (2022). ScreenLife capture: An open-source and user-friendly framework for collecting screenome data from Android smartphones. Behavior Research Methods, 1–18. https://doi.org/10.3758/s13428-022-02006-z
  • Yeykelis, L., Cummings, J. J., & Reeves, B. (2014). Multitasking on a single device: arousal and the frequency, anticipation, and prediction of switching between media content on a computer: Multitasking and Arousal. The Journal of Communication, 64(1), 167–192. https://doi.org/10.1111/jcom.12070
  • Zhang, L. C. (2012). Topics of statistical theory for register‐based statistics and data integration. Statistica Neerlandica, 66(1), 41–63. https://doi.org/10.1111/j.1467-9574.2011.00508.x