
Advancing Automated Content Analysis for a New Era of Media Effects Research: The Key Role of Transfer Learning


ABSTRACT

The availability of individual-level digital trace data offers exciting new ways to study media uses and effects based on the actual content that people encountered. In this article, we argue that to fully reap the benefits of these data, we need to update our methodology for automated text analysis. We review challenges for the automatic identification of theoretically relevant concepts in texts along three dimensions: format/style, language, and modality. These dimensions reveal a significantly higher level of diversity and complexity in individual-level digital trace data compared with the content traditionally examined through automated text analysis in our field. Consequently, they provide a valuable perspective for exploring the limitations of traditional approaches. We argue that recent developments within the field of Natural Language Processing, in particular transfer learning using transformer-based models, have the potential to aid the development, application, and performance of various computational tools that can contribute to the meaningful categorization of the content of social (and other) media.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 For instance, consider the sentence Zoë, who had always dreamt of becoming a computer scientist, achieved her goals after years of dedication. Here, the clause who had always dreamt of becoming a computer scientist forms a long-distance dependency with the noun Zoë, providing important contextual information about Zoë’s ambitions.

2 One may argue that rule-based approaches like “find A within n words distance of B” are not based on a strict BoW representation. However, like pure BoW approaches, they exploit the order and structure of language only to an extremely limited extent.
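
To make this concrete, the sketch below implements such a windowed rule in Python (the function name, tokenization, and example sentence are illustrative, not taken from the article). Note that the rule uses only token distance; any deeper syntactic structure is ignored.

```python
import re

def within_n_words(text, a, b, n=5):
    """Return True if term `a` occurs within `n` tokens of term `b`."""
    # Simple regex tokenization is a deliberate simplification.
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, tok in enumerate(tokens) if tok == a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

# The rule fires on distance alone, regardless of sentence structure.
print(within_n_words("The economy did not grow this year", "not", "grow", n=3))  # True
```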

3 We use “knowledge” loosely here to indicate only that the model has information that allows it to make better inferences from text, for example, that the word “not” indicates a negation. Whether or not a complex mathematical model that processes negations correctly can be said to possess language knowledge in an ontological sense is of no concern to this article.

4 For reference, a common BoW representation of a corpus is a document-term matrix, which indicates how often each term occurred in each document. The rows in this matrix (the document vectors) are the one-hot vectors of the terms summed together.
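
As a minimal sketch (the toy corpus and variable names are ours, assuming numpy is available), the rows of a document-term matrix can indeed be built by summing one-hot term vectors:

```python
import numpy as np

docs = [["media", "effects", "media"], ["effects", "research"]]
vocab = sorted({term for doc in docs for term in doc})
index = {term: i for i, term in enumerate(vocab)}

def one_hot(term):
    # A vector of zeros with a single 1 at the term's vocabulary index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[term]] = 1
    return vec

# Each document vector (row) is the sum of its terms' one-hot vectors.
dtm = np.array([sum(one_hot(term) for term in doc) for doc in docs])
print(vocab)  # ['effects', 'media', 'research']
print(dtm)    # [[1 2 0]
              #  [1 0 1]]
```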

5 In practice, the contextualized embeddings are not created with a single self-attention calculation. In state-of-the-art Transformers like BERT (Devlin et al., Citation2019), multiple self-attention operations are computed in parallel by so-called attention heads, in a process called multi-head attention, and multiple multi-head attention layers are stacked on top of each other. The “normal” BERT base model has no fewer than 12 layers with 12 attention heads each. Each of these 144 self-attention mechanisms features close to 150,000 parameters.
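
These figures can be read off the published model configuration. The sketch below uses the Hugging Face transformers library (assuming it is installed and can download the configuration); counting roughly 150,000 parameters per head by including only the query, key, and value projection weights is our own simplification.

```python
# pip install transformers
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
layers = config.num_hidden_layers       # 12 layers
heads = config.num_attention_heads      # 12 heads per layer
head_dim = config.hidden_size // heads  # 768 / 12 = 64

# Query, key, and value projection weights per head (biases omitted).
params_per_head = 3 * config.hidden_size * head_dim

print(layers * heads)   # 144 self-attention mechanisms
print(params_per_head)  # 147456, i.e. close to 150,000
```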

6 Notably, some LLMs can already perform decent zero-shot classification at this stage, meaning that they can, to some extent, recognize classes for which they have not seen any labeled examples.
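
For illustration, a minimal zero-shot classification sketch using the Hugging Face pipeline API with an NLI-based model (the example text and candidate labels are ours):

```python
# pip install transformers
from transformers import pipeline

# The model has never seen labeled examples of these classes;
# classification is reframed as natural language inference.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The government announced new subsidies for wind energy.",
    candidate_labels=["politics", "sports", "entertainment"],
)
print(result["labels"][0])  # most likely label, here "politics"
```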

Additional information

Notes on contributors

Anne Kroon

Anne Kroon is an associate professor at the Amsterdam School of Communication Research at the University of Amsterdam. Her research centers on employing computational techniques to explore the causes and consequences of algorithmic bias in digital job markets.

Kasper Welbers

Kasper Welbers is an assistant professor at the Department of Communication Science at the Vrije Universiteit Amsterdam. His research focuses primarily on how the gatekeeping process of news messages has changed due to the rise of new media technologies, and how we can study this using computational methods.

Damian Trilling

Damian Trilling is an associate professor of political communication and journalism at the University of Amsterdam. He is interested in news use and dissemination and in the adoption and development of computational methods.

Wouter van Atteveldt

Wouter van Atteveldt is professor of Computational Communication Science and Political Communication at the Vrije Universiteit Amsterdam. He focuses on automatic analysis of (political) communication, including both traditional and social media, and the methods and data required for studying this.