Feasibility of Leveraging Crowd Sourcing for the Creation of a Large Scale Annotated Resource for Hindi English Code Switched Data: A Pilot Annotation

Abstract

Linguistic code switching (LCS) occurs when speakers mix multiple languages in the same speech utterance. We find LCS pervasively in bilingual communities. LCS poses a serious challenge to Natural Language and Speech Processing. With the ubiquity of informal genres online, LCS is emerging as a very widespread phenomenon. This paper presents a first attempt at collecting and annotating a large repository of LCS data. We target Hindi English (Hinglish) LCS. We investigate the feasibility of leveraging crowd sourcing as a means for annotating the data on the word level. This paper briefly explains the setup of the experiment and data collection. It also presents statistics representing agreements among annotators over different possible categories of Hinglish words and analyzes the confidence with which a code switched word can be annotated in the correct category by humans.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Nov 01, 2011
Accession Number: ADA562521

Entities

People

Ankit Kamboj
Mona Diab

Organizations

Columbia University

Feasibility of Leveraging Crowd Sourcing for the Creation of a Large Scale Annotated Resource for Hindi English Code Switched Data: A Pilot Annotation

Abstract

Document Details

Entities

People

Organizations

Tags

DTIC Thesaurus Topics

Readers