Dropped Pronoun Recovery in Chinese SMS

Abstract

In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres such as Short Message Service (SMS) messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of the task in which only the person number of each dropped pronoun is identified. After applying a word segmenter, we used a linear-chain conditional random field (CRF) to predict which words start an independent clause. Then, using the independent clause start information together with lexical and syntactic information, we applied a CRF or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person number of the dropped pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese SMS messages. Our machine-learning-based approaches substantially outperformed a rule-based approach derived in part from rules developed by Chung and Gildea in 2010. Features derived from parsing did not help our approaches. We conclude that parse information is largely superfluous for identifying dropped personal pronouns if reasonably accurate independent clause start information is available.
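The two-stage data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the report's implementation: the feature names, the label set (`START`/`O` for clause starts; `NONE`, `P1`, `P2`, `P3` for dropped pronouns), and the two toy taggers are all hypothetical stand-ins for a trained CRF or maximum-entropy model, and the input is assumed to be already word-segmented.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, object]
# A tagger maps one feature dict per token to one label per token.
# In the report this role is played by a trained CRF or max-ent model;
# here it is just a callable so the data flow can be shown.
Tagger = Callable[[List[Features]], List[str]]


def lexical_features(tokens: List[str]) -> List[Features]:
    """Stage 1 input: simple lexical context for each segmented word."""
    feats: List[Features] = []
    for i, tok in enumerate(tokens):
        feats.append({
            "word": tok,
            "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
            "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
        })
    return feats


def pronoun_features(tokens: List[str], clause_starts: List[bool]) -> List[Features]:
    """Stage 2 input: lexical features plus the predicted clause-start flag."""
    feats = lexical_features(tokens)
    for f, is_start in zip(feats, clause_starts):
        f["clause_start"] = is_start
    return feats


def recover(tokens: List[str], clause_tagger: Tagger,
            pronoun_tagger: Tagger) -> List[Tuple[str, str]]:
    """Run both stages; the label on a word describes the slot just before it."""
    stage1 = clause_tagger(lexical_features(tokens))
    starts = [label == "START" for label in stage1]
    stage2 = pronoun_tagger(pronoun_features(tokens, starts))
    return list(zip(tokens, stage2))


# --- Toy stand-ins (assumptions for illustration, not learned models) ---
CLAUSE_BREAKS = {"<BOS>", "，", "。", "？", "！"}
OVERT_PRONOUNS = {"我", "你", "他", "她", "我们", "你们", "他们"}


def toy_clause_tagger(feats: List[Features]) -> List[str]:
    # A word starts a clause if it follows sentence start or punctuation.
    return ["START" if f["prev_word"] in CLAUSE_BREAKS else "O" for f in feats]


def toy_pronoun_tagger(feats: List[Features]) -> List[str]:
    # Guess a dropped first-person pronoun at clause starts with no overt pronoun.
    return ["P1" if f["clause_start"] and f["word"] not in OVERT_PRONOUNS else "NONE"
            for f in feats]
```

For example, calling `recover(["刚", "吃", "完", "，", "我", "马上", "来"], toy_clause_tagger, toy_pronoun_tagger)` labels the slot before "刚" as `P1` and leaves the second clause unlabeled because it already begins with the overt pronoun "我". Keeping the taggers as plain callables means a trained sequence model could be substituted without changing the pipeline code.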

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2015
Accession Number: AD1107806

Entities

People

  • Chris Giannella
  • Ransom Winder
  • Stacy Petersen

Organizations

  • Georgetown University

Tags

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Automata Theory
  • Cognitive Science
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Computer Science
  • Data Sets
  • Language
  • Linguistics
  • Machine Learning
  • Mobile Phones
  • Natural Language Processing
  • Natural Languages
  • Supervised Machine Learning
  • Text Messaging

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments.
  • Computational Linguistics

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation