How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Authors

  • Daniel Maier, Freie Universität Berlin (FU)
  • Andreas Niekler, University of Leipzig
  • Gregor Wiedemann, University of Hamburg
  • Daniela Stoltenberg, University of Münster

DOI:

https://doi.org/10.5117/CCR2020.2.001.MAIE

Keywords:

Text analysis, topic model, latent Dirichlet allocation, preprocessing, model selection

Abstract

Topic modeling enables researchers to explore large document corpora. However, accurate model specification requires the calculation of multiple models, which can become infeasibly costly in terms of time and computing resources. To circumvent this problem, we test and propose a strategy that introduces two easy-to-implement modifications to the modeling process: Instead of modeling the full corpus and the whole vocabulary, we use (1) random document samples and (2) an extensively pruned vocabulary. Using three empirical corpora with different origins and characteristics (news articles, websites, and Tweets), we investigate how different sample sizes and pruning strategies affect the resulting topic models as compared to fully modeled corpora. Our test provides evidence that sampling and pruning are cheap and viable strategies to accelerate model specification. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (usually >10%). Also, extensive pruning does not compromise the quality of the resulting topics. Altogether, pruning and sample-based modeling lead to increased performance without impairing model quality.
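As a rough illustration of the two modifications described in the abstract, the following sketch draws a random document sample and prunes the vocabulary by relative document frequency before any topic model is fitted. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `sample_documents` and `prune_vocabulary`, the toy corpus, and the pruning thresholds are all hypothetical and chosen only so the example is easy to follow.

```python
import random
from collections import Counter


def sample_documents(docs, fraction, seed=0):
    """Draw a simple random sample of documents without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(docs) * fraction))
    return rng.sample(docs, k)


def prune_vocabulary(tokenized_docs, min_doc_frac, max_doc_frac):
    """Keep only terms whose relative document frequency lies within
    [min_doc_frac, max_doc_frac]; thresholds are illustrative assumptions."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term once per document.
        doc_freq.update(set(tokens))
    keep = {
        term
        for term, df in doc_freq.items()
        if min_doc_frac <= df / n_docs <= max_doc_frac
    }
    pruned = [[t for t in tokens if t in keep] for tokens in tokenized_docs]
    return pruned, keep


# Toy corpus (hypothetical data, already tokenized).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

# Remove very rare and very frequent terms, then sample half the documents.
pruned_docs, vocabulary = prune_vocabulary(docs, min_doc_frac=0.5, max_doc_frac=0.9)
sample = sample_documents(pruned_docs, fraction=0.5, seed=1)
```

The pruned sample could then be passed to any topic modeling library in place of the full corpus. The abstract reports that samples larger than roughly 10% of the corpus usually resemble the full model closely; the 50% used here only reflects the tiny size of the toy corpus.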

Published

2020-11-02

Section

Articles

How to Cite

Maier, D., Niekler, A., Wiedemann, G., & Stoltenberg, D. (2020). How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models. Computational Communication Research, 2(2), 139-152. https://doi.org/10.5117/CCR2020.2.001.MAIE