Identifying Semantic Components from Cross-Language Variation, Structured Lexical Resources, and Corpora
Abstract
This project addresses the automatic identification of semantic components: sub-lexical elements of linguistic meaning that may be composed in different ways to capture the meanings of words. The project is specifically intended to address aspects of lexical meaning that are not easily captured in corpus-derived distributed semantic representations, but that are an important part of the underlying cognitive structure on which language and language understanding rely. To that end, this project proposes to distill such underlying structure out of three kinds of existing resources, brought into register with each other. The first is a set of cross-language datasets documenting variation in semantic categories across languages; these permit the identification of cross-linguistically recurring semantic components that may form a universal or near-universal repertoire of semantic building blocks, combining differently in different languages. The second is richly detailed lexical resources such as FrameNet and WordNet, which explicitly capture semantic and conceptual relations among words, including underlying conceptual gestalts or "bundles" of meaning that are central to language understanding but are rarely themselves directly expressed in language. The third is corpus-derived word co-occurrence statistics. This project will identify semantic components from the juxtaposition of such resources using methods from machine learning. It will also assess those semantic representations against human word similarity judgments, for comparison with the performance of other approaches to semantic representation.
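The final evaluation step mentioned above, comparing a semantic representation against human word similarity judgments, is commonly carried out by rank-correlating model similarities with human ratings. The following is a minimal sketch of that general idea, not the project's actual method: the choice of cosine similarity and Spearman correlation, the function names, and the toy vectors and ratings are all assumptions invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, judgments):
    """Correlate model similarities with human ratings.

    vectors:   dict mapping word -> vector (hypothetical representation)
    judgments: list of (word1, word2, human_score) triples
    Pairs with a word missing from `vectors` are skipped.
    """
    model, human = [], []
    for w1, w2, score in judgments:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(score)
    return spearman(model, human)


# Toy data, invented purely for illustration (not from any real dataset).
toy_vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.0, 0.8],
    "car": [0.0, 1.0, 0.0],
    "truck": [0.0, 1.0, 0.2],
}
toy_judgments = [
    ("cat", "dog", 8.0),
    ("car", "truck", 8.5),
    ("cat", "car", 1.5),
    ("dog", "truck", 2.0),
]

rho = evaluate(toy_vectors, toy_judgments)
print(f"Spearman correlation with toy human ratings: {rho:.3f}")
```

A rank correlation is used in this sketch because human ratings and model similarities are on different scales, so only the ordering of word pairs is compared. The same evaluation function can be applied to any competing representation to put the results on a common footing.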
Document Details
- Document Type: DoD Grant Award
- Publication Date: Jul 10, 2017
- Source ID: HDTRA11710042
Entities
People
- Collin Baker
Organizations
- Defense Threat Reduction Agency
- University of California, Berkeley