Research Article

Automatically Finding Actors in Texts: A Performance Review of Multilingual Named Entity Recognition Tools

References

  • Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., & Vollgraf, R. (2019). FLAIR: An easy-to-use framework for state-of-the-art NLP. Proceedings of the NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota (pp. 54–59). https://doi.org/10.18653/v1/N19-4010
  • Al-Rawi, A. (2017). News values on social media: News organizations’ Facebook use. Journalism, 18(7), 871–889. https://doi.org/10.1177/1464884916636142
  • Aprosio, A. P., & Paccosi, T. (2023). NERMuD at EVALITA 2023: Overview of the named-entities recognition on multi-domain documents task. Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy.
  • Artetxe, M., Ruder, S., & Yogatama, D. (2020). On the cross-lingual transferability of monolingual representations. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4623–4637. https://doi.org/10.18653/v1/2020.acl-main.421
  • Baden, C., Pipal, C., Schoonvelde, M., & van der Velden, M. A. C. G. (2022). Three gaps in computational text analysis methods for social sciences: A research agenda. Communication Methods and Measures, 16(1), 1–18. https://doi.org/10.1080/19312458.2021.2015574
  • Benikova, D., Biemann, C., & Reznicek, M. (2014). NoSta-D named entity annotation for German: Guidelines and dataset. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland (pp. 2524–2531). http://www.lrec-conf.org/proceedings/lrec2014/pdf/276_Paper.pdf
  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the Natural Language Toolkit. O’Reilly Media.
  • Boschee, E., Lautenschlager, J., Shellman, S., & Shilliday, A. (2015). ICEWS dictionaries. https://doi.org/10.7910/DVN/28118
  • Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit. Digital Journalism, 4(1), 8–23. https://doi.org/10.1080/21670811.2015.1096598
  • Burggraaff, C., & Trilling, D. (2020). Through a different gate: An automated content analysis of how online news and print news differ. Journalism, 21(1), 112–129. https://doi.org/10.1177/1464884917716699
  • Buz, C., Promies, N., Kohler, S., & Lehmkuhl, M. (2021). Validierung von NER-Verfahren zur automatisierten Identifikation von Akteuren in deutschsprachigen journalistischen Texten. Studies in Communication & Media, 10(4), 590–627. https://doi.org/10.5771/2192-4007-2021-4-590
  • Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. https://doi.org/10.48550/arXiv.1911.02116
  • Derczynski, L., Nichols, E., van Erp, M., & Limsopatham, N. (2017). Results of the WNUT2017 shared task on novel and emerging entity recognition. Proceedings of the 3rd Workshop on Noisy User-generated Text, Copenhagen, Denmark (pp. 140–147). https://doi.org/10.18653/v1/W17-4418
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. https://doi.org/10.48550/arXiv.1810.04805
  • Ehrmann, M., Romanello, M., Flückiger, A., & Clematide, S. (2020). Extended overview of CLEF HIPE 2020: Named entity processing on historical newspapers. In L. Cappellato, C. Eickhoff, N. Ferro, & A. Névéol (Eds.), CLEF 2020 Working Notes. Working Notes of CLEF 2020 – Conference and Labs of the Evaluation Forum, Thessaloniki, Greece (Vol. 2696). CEUR-WS. https://doi.org/10.5281/zenodo.4117566
  • Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, Michigan (pp. 363–370). https://doi.org/10.3115/1219840.1219885
  • Fogel-Dror, Y., Shenhav, S. R., Sheafer, T., & Van Atteveldt, W. (2019). Role-based association of verbs, actions, and sentiments with entities in political discourse. Communication Methods and Measures, 13(2), 69–82. https://doi.org/10.1080/19312458.2018.1536973
  • Fu, J., Liu, P., Zhang, Q., & Huang, X. (2020). RethinkCWS: Is Chinese word segmentation a solved task? The 2020 Conference on Empirical Methods in Natural Language Processing, Online. https://arxiv.org/abs/2011.06858
  • Gattermann, K. (2018). Mediated personalization of executive European Union politics: Examining patterns in the broadsheet coverage of the European Commission, 1992–2016. The International Journal of Press/Politics, 23(3), 345–366. https://doi.org/10.1177/1940161218779231
  • Grill, C., & Boomgaarden, H. (2017). A network perspective on mediated Europeanized public spheres: Assessing the degree of Europeanized media coverage in light of the 2014 European Parliament election. European Journal of Communication, 32(6), 568–582. https://doi.org/10.1177/0267323117725971
  • Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength natural language processing in Python. https://doi.org/10.5281/zenodo.1212303
  • Hopp, F. R., Fisher, J. T., Cornell, D., Huskey, R., & Weber, R. (2020). The extended moral foundations dictionary (eMFD): Development and applications of a crowd-sourced approach to extracting moral intuitions from text. Behavior Research Methods, 53(1), 232–246. https://doi.org/10.3758/s13428-020-01433-0
  • Jonkman, J. G., Trilling, D., Verhoeven, P., & Vliegenthart, R. (2020). To pass or not to pass: How corporate characteristics affect corporate visibility and tone in company news coverage. Journalism Studies, 21(1), 1–18. https://doi.org/10.1080/1461670X.2019.1612266
  • Joshi, M., Levy, O., Weld, D. S., & Zettlemoyer, L. (2019). BERT for coreference resolution: Baselines and analysis. https://doi.org/10.48550/arXiv.1908.09091
  • Jurafsky, D., & Martin, J. H. (2023). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition [3rd Edition Draft]. https://web.stanford.edu/~jurafsky/slp3/
  • Kripke, S. A. (1972). Naming and necessity. In D. Davidson & G. Harman (Eds.), Semantics of natural language (pp. 253–355). Springer.
  • Kruikemeier, S., Gattermann, K., & Vliegenthart, R. (2018). Understanding the dynamics of politicians’ visibility in traditional and social media. The Information Society, 34(4), 215–228. https://doi.org/10.1080/01972243.2018.1463334
  • Lee, J. S., & Nerghes, A. (2018). Refugee or migrant crisis? Labels, perceived agency, and sentiment polarity in online discussions. Social Media + Society, 4(3), 205630511878563. https://doi.org/10.1177/2056305118785638
  • Lind, F., Eberl, J.-M., Eisele, O., Heidenreich, T., Galyga, S., & Boomgaarden, H. G. (2021). Building the bridge: Topic modeling for comparative research. Communication Methods and Measures, 16(2), 96–114. https://doi.org/10.1080/19312458.2021.1965973
  • Litvyak, O., Fischeneder, A., Balluff, P., Müller, W. C., Kritzinger, S., & Boomgaarden, H. G. (2022). AUTNES automatic content analysis of the media coverage 2019 (SUF edition). https://doi.org/10.11587/ZY7KSQ
  • Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55–60). https://doi.org/10.3115/v1/P14-5010
  • Martin, G. J., & McCrain, J. (2019). Local news and national politics. American Political Science Review, 113(2), 372–384. https://doi.org/10.1017/S0003055418000965
  • Mehrabi, N., Gowda, T., Morstatter, F., Peng, N., & Galstyan, A. (2020). Man is to person as woman is to location: Measuring gender bias in named entity recognition. Proceedings of the 31st ACM Conference on Hypertext and Social Media, Online (pp. 231–232). https://doi.org/10.1145/3372923.3404804
  • Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6), 1–35. https://doi.org/10.1145/3457607
  • Merkley, E. (2020). Are experts (News)worthy? Balance, conflict, and mass media coverage of expert consensus. Political Communication, 37(4), 530–549. https://doi.org/10.1080/10584609.2020.1713269
  • Mohit, B., Schneider, N., Bhowmick, R., Oflazer, K., & Smith, N. A. (2012, April). Recall-oriented learning of named entities in Arabic Wikipedia. In W. Daelemans (Ed.), Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France (pp. 162–173). Association for Computational Linguistics. https://aclanthology.org/E12-1017
  • Mota, C., & Santos, D. (Eds.). (2008). Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: O Segundo HAREM. Linguateca. https://www.linguateca.pt/LivroSegundoHAREM/
  • Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26. https://doi.org/10.1075/li.30.1.03nad
  • Nardulli, P. F., Althaus, S. L., & Hayes, M. (2015). A progressive supervised-learning approach to generating rich civil strife data. Sociological Methodology, 45(1), 148–183. https://doi.org/10.1177/0081175015581378
  • Neudecker, C. (2016, May). An open corpus for named entity recognition in historic newspapers. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Portorož, Slovenia.
  • Newell, C., Cowlishaw, T., & Man, D. (2018). Quote extraction and analysis for news. Proceedings of the KDD Workshop on Data Science, Journalism and Media (DSJM), New York, NY, USA.
  • Nouvel, D., Ehrmann, M., & Rosset, S. (2016, February). Named entities for computational linguistics (Vol. 148). John Wiley & Sons, Inc. https://doi.org/10.1002/9781119268567
  • Oostdijk, N., Reynaert, M., Hoste, V., & van den Heuvel, H. (2013). SoNaR User Documentation (tech. rep). Instituut voor de Nederlandse Taal. https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/sonardocumentatie.pdf
  • Paccosi, T., & Aprosio, A. P. (2021). KIND: An Italian multi-domain dataset for named entity recognition. CoRR. https://doi.org/10.48550/arXiv.2112.15099
  • Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., & Ji, H. (2017). Cross-lingual name tagging and linking for 282 languages. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1946–1958. https://doi.org/10.18653/v1/P17-1178
  • Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual BERT? https://doi.org/10.48550/arXiv.1906.01502
  • Poschmann, P., & Goldenstein, J. (2019). Disambiguating and specifying social actors in big data: Using Wikipedia as a data source for demographic information. Sociological Methods & Research, 51(2), 887–925. https://doi.org/10.1177/0049124119882481
  • Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A python natural language processing toolkit for many human languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online.
  • Ruokolainen, T., Kauppinen, P., Silfverberg, M., & Lindén, K. (2020). A Finnish news corpus for named entity recognition. Language Resources and Evaluation, 54(1), 247–272. https://doi.org/10.1007/s10579-019-09471-7
  • Scheible, R., Thomczyk, F., Tippmann, P., Jaravine, V., Boeker, M., Folz, M., Grimbacher, B., Göbel, J., Klein, C., Nieters, A., Rusch, S., Kindle, G., & Storf, H. (2020). GottBERT: A pure German language model. https://doi.org/10.48550/arXiv.2012.02110
  • Scott, T. A., Ulibarri, N., & Scott, R. P. (2020). Stakeholder involvement in collaborative regulatory processes: Using automated coding to track attendance and actions. Regulation & Governance, 14(2), 219–237. https://doi.org/10.1111/rego.12199
  • Ševčíková, M., Žabokrtský, Z., Straková, J., & Straka, M. (2014). Czech named entity corpus 2.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8
  • Shahzad, M., Amin, A., Esteves, D., & Ngonga Ngomo, A.-C. (2021). InferNER: An attentive model leveraging the sentence-level information for named entity recognition in Microblogs. The International FLAIRS Conference Proceedings, North Miami Beach, Florida (p. 34). https://doi.org/10.32473/flairs.v34i1.128538
  • Simon, E., & Vadász, N. (2021). Introducing NYTK-NerKor, a gold standard Hungarian named entity annotated corpus. In K. Ekstein, F. Pártl, & M. Konopík (Eds.), Text, Speech, and Dialogue – 24th International Conference, Olomouc, Czech Republic (pp. 222–234, Vol. 12848). Springer. https://doi.org/10.1007/978-3-030-83527-9_19
  • Steinberger, R., & Pouliquen, B. (2007). Cross-lingual named entity recognition (S. Sekine & E. Ranchhod, eds.). Lingvisticae Investigationes, 30(1), 135–162. https://doi.org/10.1075/li.30.1.09ste
  • Steinberger, R., Pouliquen, B., Kabadjov, M., Belyaeva, J., & Van Der Goot, E. (2011). JRC-NAMES: A freely available, highly multilingual named entity resource. International Conference Recent Advances in Natural Language Processing, RANLP, Hissar, Bulgaria (pp. 104–110). https://aclanthology.org/R11-1015
  • Straková, J., Straka, M., & Hajič, J. (2013). A new state-of-the-art Czech Named Entity Recognizer. In I. Habernal & V. Matoušek (Eds.), Text, speech, and dialogue (pp. 68–75). Springer. https://doi.org/10.1007/978-3-642-40585-3_10
  • Taulé, M., Martí, M. A., & Recasens, M. (2008, May). AnCora: Multilevel annotated corpora for Catalan and Spanish. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf
  • Tjong Kim Sang, E. F. (2002). Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002). https://aclanthology.org/W02-2024
  • Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147. https://aclanthology.org/W03-0419
  • Traag, V. A., Reinanda, R., & Van Klinken, G. (2015). Elite co-occurrence in the media. Asian Journal of Social Science, 43(5), 588–612. https://doi.org/10.1163/15685314-04305005
  • Tsai, H., Riesa, J., Johnson, M., Arivazhagan, N., Li, X., & Archer, A. (2019). Small and practical BERT models for sequence labeling. https://doi.org/10.48550/arXiv.1909.00100
  • Turc, I., Lee, K., Eisenstein, J., Chang, M.-W., & Toutanova, K. (2021, June). Revisiting the primacy of English in zero-shot cross-lingual transfer. https://doi.org/10.48550/arXiv.2106.16171
  • Turcsányi, R. Q., Karásková, I., Matura, T., & Šimalčík, M. (2019). Followers, challengers, or by-standers? Central European media responses to intensification of relations with China. Intersections East European Journal of Society and Politics, 5(3), 49–67. https://doi.org/10.17356/IEEJSP.V5I3.564
  • van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Communication Methods and Measures, 15(2), 121–140. https://doi.org/10.1080/19312458.2020.1869198
  • Van den Bosch, A., Busser, G., Daelemans, W., & Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, & V. Vandeghinste (Eds.), Selected papers of the 17th computational linguistics in the Netherlands Meeting, Leuven, Belgium (pp. 99–114). LOT.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, California (pp. 6000–6010). https://doi.org/10.5555/3295222.3295349
  • Vliegenthart, R., Boomgaarden, H. G., & Boumans, J. W. (2011). Changes in political news coverage: Personalization, conflict and negativity in British and Dutch newspapers. Political Communication in Postmodern Democracy: Challenging the Primacy of Politics, 92–110. https://doi.org/10.1057/9780230294783
  • Walter, D., & Ophir, Y. (2019). News frame analysis: An inductive mixed-method computational approach. Communication Methods and Measures, 13(4), 248–266. https://doi.org/10.1080/19312458.2019.1639145
  • Watanabe, K. (2017). Newsmap. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487
  • Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., Xue, N., Taylor, A., Kaufman, J., Franchini, M., El-Bachouti, M., Belvin, R., & Houston, A. (2013). OntoNotes release 5.0. https://doi.org/10.35111/XMHB-2B84
  • Welbers, K., & van Atteveldt, W. (2022). Corpustools: Managing, querying and analyzing tokenized text [R Package Version 0.4.10]. https://CRAN.R-project.org/package=corpustools
  • Welbers, K., van Atteveldt, W., Kleinnijenhuis, J., & Ruigrok, N. (2018). A gatekeeper among gatekeepers. Journalism Studies, 19(3), 315–333. https://doi.org/10.1080/1461670X.2016.1190663