Using LSA to Compute Word Sense Frequencies
Abstract
This document describes a project to explore the use of Latent Semantic Analysis (LSA) and statistical clustering techniques for automatically identifying word senses and for estimating word sense frequencies from application-relevant corpora. The hypothesis is that LSA can be used to compute context vectors for ambiguous words, and that these vectors can be clustered, with each cluster corresponding to a different sense of the word. The document is organized as follows. Section 1 includes a short introduction to LSA, an introduction to the context-group discrimination paradigm adopted in the project, and a description of the corpus used in the experiments. Section 2 describes the investigation of the effect of LSA dimensionality on sense discrimination accuracy. Overall, sense discrimination accuracy was relatively low. This motivated four further investigations: the influence of different distance measures; the geometry of the sense clusters in the LSA-based space, through silhouette value analysis; sense discrimination accuracy as a function of the degree of supervision provided during model training; and a comparison of sense discrimination in homonyms versus polysemes. Section 3 describes the investigation of the optimal context size for word sense discrimination, ranging from 3 words (1 word on each side of the target word) to 11 words (5 words on each side). Section 4 describes the use of Minimum Description Length (MDL) to determine the number of word senses. Section 5 provides a project summary. Appendix A provides a literature review and Appendix B provides a source code listing (not included in this published report).
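The ideas of a symmetric context window, a context vector, and silhouette value analysis can be illustrated with a minimal sketch. It is not the report's implementation: the 2-dimensional word vectors are hypothetical stand-ins for LSA-derived vectors, the sentences and sense labels are invented for illustration, and cosine distance is only one of the distance measures the abstract alludes to. All function names are assumptions.

```python
import math

def context_window(tokens, index, half_width):
    """Words within half_width positions of tokens[index], target excluded.
    half_width=1 matches a 3-word context, half_width=5 an 11-word context."""
    left = tokens[max(0, index - half_width):index]
    right = tokens[index + 1:index + 1 + half_width]
    return left + right

def context_vector(words, word_vectors, dim):
    """Sum the vectors of the context words; unknown words contribute nothing."""
    total = [0.0] * dim
    for w in words:
        for k, x in enumerate(word_vectors.get(w, ())):
            total[k] += x
    return total

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def silhouette_values(vectors, labels):
    """Silhouette s(i) = (b - a) / max(a, b), where a is the mean distance to
    the point's own cluster and b the smallest mean distance to another cluster."""
    values = []
    for i, (v, lab) in enumerate(zip(vectors, labels)):
        by_cluster = {}
        for j, (u, other) in enumerate(zip(vectors, labels)):
            if j != i:
                by_cluster.setdefault(other, []).append(cosine_distance(v, u))
        own = by_cluster.pop(lab, [])
        if not own or not by_cluster:
            values.append(0.0)  # convention: singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return values

# Hypothetical stand-ins for LSA word vectors (axis 0: river sense, axis 1: money sense).
WORD_VECTORS = {
    "river": [1.0, 0.0], "water": [0.9, 0.1], "shore": [0.8, 0.2],
    "money": [0.0, 1.0], "loan": [0.1, 0.9], "account": [0.2, 0.8],
}

# Invented example sentences for the ambiguous word "bank".
SENTENCES = [
    "fish swim near the river bank where water flows",
    "the shore of the river bank was muddy",
    "the loan from the bank cost money",
    "he opened a bank account for the loan",
]
LABELS = [0, 0, 1, 1]  # assumed sense labels: 0 = river bank, 1 = financial bank

vectors = []
for sentence in SENTENCES:
    tokens = sentence.split()
    window = context_window(tokens, tokens.index("bank"), half_width=2)
    vectors.append(context_vector(window, WORD_VECTORS, dim=2))

scores = silhouette_values(vectors, LABELS)
print([round(s, 2) for s in scores])
```

Silhouette values near 1 indicate that a context sits well inside its assigned sense cluster, while values near 0 or below indicate overlapping or misassigned clusters. In this toy data the two senses separate cleanly by construction, which is not a claim about the report's results.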
Document Details
- Document Type: Technical Report
- Publication Date: Feb 01, 2008
- Accession Number: ADA481969
Entities
People
- Esther Levin
- Mehrbod Sharifi
Organizations
- City University of New York