Research Article

What’s in a name? The effect of named entities on topic modelling interpretability


References

  • Bischof, J., & Airoldi, E. M. (2012). Summarizing topical content with word frequency and exclusivity. Proceedings of the 29th International Conference on Machine Learning (ICML-12), 201–208. Edinburgh, Scotland.
  • Boukes, M., & Vliegenthart, R. (2020). A general pattern in the construction of economic newsworthiness? Analyzing news factors in popular, quality, regional, and financial newspapers. Journalism, 21(2), 279–300. https://doi.org/10.1177/1464884917725989
  • Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1). https://doi.org/10.18637/jss.v076.i01
  • Chang, J., & Blei, D. (2009). Relational topic models for document networks. In D. van Dyk & M. Welling (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (pp. 81–88). Florida, USA.
  • Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems, 288–296. Vancouver, B.C., Canada.
  • Denny, M., & Spirling, A. (2017, September 27). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. https://doi.org/10.2139/ssrn.2849145
  • Doogan, C., & Buntine, W. (2021). Topic model or topic twaddle? Re-evaluating semantic interpretability measures. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3824–3848.
  • Eberl, J.-M., Boomgaarden, H. G., & Wagner, M. (2017). One bias fits all? Three types of media bias and their effects on party preferences. Communication Research, 44(8), 1125–1148. https://doi.org/10.1177/0093650215614364
  • Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472. https://doi.org/10.1214/ss/1177011136
  • Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
  • Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028
  • Grusky, M., Naaman, M., & Artzi, Y. (2018). Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.
  • Honnibal, M., & Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1373–1378. Lisbon, Portugal.
  • Hoyle, A., Goel, P., Peskov, D., Hian-Cheong, A., Boyd-Graber, J., & Resnik, P. (2021). Is automated topic model evaluation broken?: The incoherence of coherence. arXiv preprint arXiv:2107.02173.
  • Hudson, R. (1994). About 37% of word-tokens are nouns. Language, 70(2), 331–339. https://doi.org/10.2307/415831
  • Hu, L., Li, J., Li, Z., Shao, C., & Li, Z. (2013). Incorporating entities in news topic modeling. In G. Zhou, J. Li, D. Zhao, & Y. Feng (Eds.), Natural language processing and Chinese computing (pp. 139–150). Springer. https://doi.org/10.1007/978-3-642-41644-6_14
  • Jacobi, C., Kleinen-von Königslöw, K., & Ruigrok, N. (2016). Political news in online and print newspapers. Digital Journalism, 4(6), 723–742. https://doi.org/10.1080/21670811.2015.1087810
  • Krasnashchok, K., & Jouili, S. (2018). Improving topic quality by promoting named entities in topic modeling. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 247–253. https://doi.org/10.18653/v1/P18-2040
  • Kripke, S. A. (1972). Naming and necessity. In D. Davidson & G. Harman (Eds.), Semantics of natural language (pp. 253–355). Springer.
  • Kuhr, F., Lichtenberger, M., Braun, T., & Möller, R. (2021). Enhancing relational topic models with named entity induced links. 2021 IEEE 15th International Conference on Semantic Computing (ICSC), 314–317. https://doi.org/10.1109/ICSC50631.2021.00059
  • Kumar, D., & Singh, S. R. (2019). Prioritized named entity driven LDA for document clustering. In B. Deka, P. Maji, S. Mitra, D. K. Bhattacharyya, P. K. Bora, & S. K. Pal (Eds.), Pattern recognition and machine intelligence (pp. 294–301). Springer International Publishing. https://doi.org/10.1007/978-3-030-34872-4_33
  • Lau, J. H., Newman, D., & Baldwin, T. (2014). Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 530–539. Gothenburg, Sweden.
  • Liang, J., & Liu, H. (2013). Noun distribution in natural languages. Poznań Studies in Contemporary Linguistics, 49(4), 509–529. https://doi.org/10.1515/psicl-2013-0019
  • Lundberg, I., Johnson, R., & Stewart, B. M. (2021). What is your estimand? defining the target quantity connects statistical evidence to theory. American Sociological Review, 86(3), 532–565. https://doi.org/10.1177/00031224211004187
  • Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., & Adam, S. (2018). Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. Communication Methods and Measures, 12(2–3), 93–118. https://doi.org/10.1080/19312458.2018.1430754
  • Marrero, M., Urbano, J., Sanchez-Cuadrado, S., Morato, J., & Gomez-Berbis, J. M. (2013). Named entity recognition: Fallacies, challenges and opportunities. Computer Standards and Interfaces, 35(5), 482–489. https://doi.org/10.1016/j.csi.2012.09.004
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 3111–3119. Lake Tahoe, Nevada, USA.
  • Mimno, D., & Lee, M. (2014). Low-dimensional embeddings for interpretable anchor-based topic inference. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1319–1328. Doha, Qatar.
  • Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 262–272. Edinburgh, Scotland, UK.
  • Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26. https://doi.org/10.1075/li.30.1.03nad
  • Nouvel, D., Ehrmann, M., & Rosset, S. (2016). Named entities for computational linguistics. John Wiley & Sons, Inc. https://doi.org/10.1002/9781119268567
  • Řehůřek, R., & Sojka, P. (2011). Gensim: Statistical semantics in Python. Retrieved from https://radimrehurek.com/gensim
  • Rijcken, E., Kaymak, U., Scheepers, F., Mosteiro, P., Zervanou, K., & Spruit, M. (2022). Topic modeling for interpretable text classification from EHRs. Frontiers in Big Data, 5. https://doi.org/10.3389/fdata.2022.846930
  • Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(1), 1–40. https://doi.org/10.18637/jss.v091.i02
  • Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103
  • Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. Baltimore, Maryland, USA.
  • Taddy, M. (2012). On estimation and selection for topic models. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics. La Palma, Canary Islands.
  • Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 667–718. https://doi.org/10.1214/20-BA1221
  • Ying, L., Montgomery, J. M., & Stewart, B. M. (2021). Topics, concepts, and measurement: A crowdsourced procedure for validating topics as measures. Political Analysis, 30(4), 570–589. https://doi.org/10.1017/pan.2021.33