Improving Automated Lexical and Discourse Analysis of Online Chat Dialog

Abstract

One of the goals of natural language processing (NLP) systems is determining the meaning of what is being transmitted. Although much work has been accomplished in traditional written and spoken language domains, little has been done in the newer computer-mediated communication domain enabled by the Internet, including text-based chat. This is due in part to the lack of annotated chat corpora available to the broader research community. The purpose of our research is to build a chat corpus, initially tagged with lexical and discourse information. Such a corpus could be used to develop stochastic NLP applications that perform tasks such as conversation thread topic detection, author profiling, entity identification, and social network analysis. During the course of our research, we preserved 477,835 chat posts and associated user profiles in an XML format for future investigation. We privacy-masked 10,567 of those posts and part-of-speech tagged a total of 45,068 tokens. Using the Penn Treebank and annotated chat data, we achieved part-of-speech tagging accuracy of 90.8%. We also annotated each of the privacy-masked corpus's 10,567 posts with a chat dialog act. Using a neural network with 23 input features, we achieved 83.2% dialog act classification accuracy.

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2007
Accession Number
ADA473971

Entities

People

  • Eric N. Forsyth

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Autonomy
  • C4I
  • Energy and Power Technologies
  • Human Systems

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automated Speech Recognition
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Computer Science
  • Data Mining
  • Electronic Mail
  • Grammars
  • Intellectual Property
  • Language
  • Linguistics
  • Machine Learning
  • Markov Models
  • Natural Language Processing
  • Network Science
  • Neural Networks

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Geospatial Intelligence and Artificial Intelligence Analytics
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation