A Community-Based Code-Switch Discussion Pipeline
Abstract
Social media (SM) facilitates discussions within communities across the globe. To communicate effectively, multilingual users often alternate between languages, a phenomenon known as code-switching (CW). Analysis of discussions that exhibit CW can reveal a community's diversity and provide insight into evolving trends and opinions, and the widespread use of SM makes it possible to track and characterize these discussions for cultural and linguistic analysis. Advanced community detection algorithms, which rely on network structures of followers and friends, interactions such as retweets and mentions, and patterns of hashtag occurrence, largely ignore linguistic cues in the body of posts. For this reason, the applicability of these state-of-the-art approaches to CW analysis has been limited: the resulting communities depend on the attribute types used in detection rather than on the attributes that characterize the significance of the CW, namely the connections among posters, the topics under discussion, and the social context in which the CW occurs. Here we develop a new framework to facilitate understanding and CW processing of high volumes of SM information by 1) detecting community-based multilingual SM discussions, 2) defining evaluation metrics and heuristics to obtain CW discussions, 3) developing word-level language ID algorithms, 4) visualizing user-discussion graphs where component types are extracted based on defined rankings, and 5) representing discussions as trees with first-order nodes as posts, and nonterminal and leaf nodes as responses.
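The tree representation named in step 5 can be illustrated with a minimal sketch. It is not the report's implementation: the input layout (tuples of `post_id`, `parent_id`, `text`), the `Node` class, and the helper names are all assumptions made for illustration. Original posts become first-order nodes, and each response is attached under the post or response it answers.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Node:
    """One post or response in a discussion (hypothetical structure)."""
    post_id: str
    text: str
    children: List["Node"] = field(default_factory=list)


def build_discussion_trees(
    posts: Iterable[Tuple[str, Optional[str], str]]
) -> List[Node]:
    """Group (post_id, parent_id, text) records into discussion trees.

    Assumption: parent_id is None for an original post. A response whose
    parent is not in the input is also treated as a first-order node.
    """
    posts = list(posts)
    nodes = {post_id: Node(post_id, text) for post_id, _, text in posts}
    roots: List[Node] = []
    for post_id, parent_id, _ in posts:
        parent = nodes.get(parent_id) if parent_id is not None else None
        if parent is None:
            roots.append(nodes[post_id])
        else:
            parent.children.append(nodes[post_id])
    return roots


def depth(node: Node) -> int:
    """Number of levels in the tree rooted at node (a lone post has depth 1)."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Illustrative data only: two original posts, one with a chain of responses.
sample = [
    ("p1", None, "original post"),
    ("r1", "p1", "response to p1"),
    ("r2", "r1", "response to r1"),
    ("p2", None, "another original post"),
]
trees = build_discussion_trees(sample)
```

In this sketch `r1` is a nonterminal node and `r2` a leaf, matching the abstract's distinction between the two kinds of response.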
Document Details
- Document Type: Technical Report
- Publication Date: Mar 01, 2021
- Accession Number: AD1126224
Entities
People
- Aaron Harwood
- Lucia Falzon
- Michelle Vanni
- Prarthana Padia
- Shanika Karunasekera
- Sue Kase
Organizations
- United States Army Research Laboratory