Selecting Relevant Documents for Multilingual Content Analysis

An Evaluation of Keyword and Semantic Similarity Search Approaches

Authors

  • Sean Palicki, Technical University of Munich
  • Stefanie Walter
  • Wouter van Atteveldt
  • Alice Beazer
  • Isaac Bravo

DOI:

https://doi.org/10.5117/CCR2023.2.5.PALI

Keywords:

information retrieval, sampling, machine translation, semantic search, multilingual text analysis, computational social science

Abstract

Comparative research in communication often involves selecting and analyzing documents in multiple languages. Machine translation is an effective preprocessing step for automated content analysis; however, its impact on data collection remains under-examined. Using a parallel language corpus of European Parliament debates, this paper evaluates machine translation as an approach for multilingual document retrieval, i.e., selecting documents for analysis. We compare several strategies for retrieving relevant multilingual documents, including 1) expert-validated search queries, 2) machine-translated search queries, and 3) multilingual semantic similarity search. We evaluate each against monolingual searches and describe how these strategies can affect results from topic modeling. Results show that expert-validated search queries achieve reliable results across languages, while the accuracy of machine-translated search queries varies significantly between languages and affects further analyses. Although semantic similarity search retrieved a similar subset of relevant documents across languages, its results were less accurate than those of keyword approaches. In sum, validated translations of search queries can be effective for multilingual document retrieval, but translation errors can lead to systematic bias in the results of further analysis. These results are important for researchers seeking opportunities to introduce, validate, and generalize findings and theories beyond English-speaking countries.
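
As an illustration only, the kind of comparison described in the abstract can be sketched as follows: a keyword query run on one language of a parallel corpus serves as the reference, and a translated query run on the other language is scored against it with precision and recall. The toy corpus, the keyword lists, and the helper names (keyword_search, precision_recall) are assumptions made for this sketch and are not taken from the paper; it is not the authors' implementation.

```python
import re

def keyword_search(docs, keywords):
    """Return ids of documents containing at least one keyword
    (case-insensitive, matched on word boundaries)."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return {doc_id for doc_id, text in docs.items() if pattern.search(text)}

def precision_recall(retrieved, relevant):
    """Score a retrieved id set against a reference id set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy parallel corpus: the same id is the same speech in two languages.
english = {
    1: "Climate change threatens coastal regions.",
    2: "The fishing quota was debated.",
    3: "Global warming requires action.",
}
german = {
    1: "Der Klimawandel bedroht Küstenregionen.",
    2: "Die Fangquote wurde diskutiert.",
    3: "Die globale Erwärmung erfordert Maßnahmen.",
}

# Monolingual reference search (assumed query terms).
reference = keyword_search(english, ["climate change", "global warming"])

# Translated query that misses one synonym, as an incomplete translation might.
translated = keyword_search(german, ["Klimawandel"])

precision, recall = precision_recall(translated, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the translated query retrieves only relevant documents but misses one of them, the kind of language-specific gap that could bias a downstream analysis such as topic modeling.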

Published

2023-09-28

Section

Research Articles (regular issue)

How to Cite

Palicki, S., Walter, S., van Atteveldt, W., Beazer, A., & Bravo, I. (2023). Selecting Relevant Documents for Multilingual Content Analysis: An Evaluation of Keyword and Semantic Similarity Search Approaches. Computational Communication Research, 5(2). https://doi.org/10.5117/CCR2023.2.5.PALI