Feasibility of Leveraging Crowd Sourcing for the Creation of a Large Scale Annotated Resource for Hindi English Code Switched Data: A Pilot Annotation

Abstract

Linguistic code switching (LCS) occurs when speakers mix multiple languages in the same speech utterance. LCS is pervasive in bilingual communities and poses a serious challenge to natural language and speech processing. With the ubiquity of informal genres online, LCS is becoming increasingly widespread. This paper presents a first attempt at collecting and annotating a large repository of LCS data, targeting Hindi-English (Hinglish) LCS. We investigate the feasibility of leveraging crowd sourcing to annotate the data at the word level. The paper briefly explains the experimental setup and data collection. It also presents statistics on agreement among annotators over the possible categories of Hinglish words and analyzes how confidently humans can assign a code-switched word to the correct category.

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2011
Accession Number
ADA562521

Entities

People

  • Ankit Kamboj
  • Mona Diab

Organizations

  • Columbia University

Tags

DTIC Thesaurus Topics

  • Acquisition
  • Agreements
  • Automated Speech Recognition
  • Computer Science
  • Computers
  • Computing-Related Activities
  • Data Sets
  • Education
  • Human Intelligence
  • Information Operations
  • Language
  • Linguistics
  • Mathematics
  • Neurobehavioral Manifestations
  • New York
  • Statistics
  • Switches

Readers

  • Computational Linguistics
  • Computer Networking
  • Naval Mine Countermeasure Systems Development