20th AIAI 2024, 27 - 30 June 2024, Corfu, Greece

From Tweets to Reddit: Leveraging Semi-Supervised Domain Adaptation for Improving Data Filtering

Shelly Gupta, Jumanah Alshehri, Ameen Abdel Hai, Hussain Otudi, Zoran Obradovic

Abstract:

  Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model's generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.  

*** Title, author list and abstract as seen in the Camera-Ready version of the paper that was provided to Conference Committee. Small changes that may have occurred during processing by Springer may not appear in this window.