Identifying Semantic Components from Cross-Language Variation, Structured Lexical Resources, and Corpora

Abstract

This project addresses the automatic identification of semantic components: sub-lexical elements of linguistic meaning that may be composed in different ways to capture the meanings of words. The project is specifically intended to address aspects of lexical meaning that are not easily captured in corpus-derived distributed semantic representations, but that are an important part of the underlying cognitive structure on which language and language understanding rely. To that end, this project proposes to distill such underlying structure out of three kinds of existing resources, brought into register with each other. The first is a set of cross-language datasets documenting variation in semantic categories across languages; these permit the identification of cross-linguistically recurring semantic components that may form a universal or near-universal repertoire of semantic building blocks, combining differently in different languages. The second consists of richly detailed lexical resources such as FrameNet and WordNet, which explicitly capture semantic and conceptual relations among words, including underlying conceptual gestalts or "bundles" of meaning that are central to language understanding but are rarely themselves directly expressed in language. The third is corpus-derived word co-occurrence statistics. This project will identify semantic components by juxtaposing these resources and applying methods from machine learning. It will also assess the resulting semantic representations against human word similarity judgments, so that their performance can be compared with that of other approaches to semantic representation.
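The final evaluation step mentioned in the abstract, comparing a semantic representation against human word similarity judgments, is commonly done by rank-correlating model similarities with human ratings. The following is a minimal sketch of that general idea, not the project's actual method: the word vectors, the word pairs, and the human ratings are all hypothetical toy values, and the choice of cosine similarity with Spearman rank correlation is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy representation (assumption, not project data).
vectors = {
    "cat": [1.0, 0.9, 0.1],
    "dog": [0.9, 1.0, 0.2],
    "car": [0.1, 0.2, 1.0],
    "truck": [0.2, 0.1, 0.9],
}

# Hypothetical human similarity ratings on a 0-10 scale (assumption).
human_ratings = {
    ("cat", "dog"): 8.5,
    ("car", "truck"): 8.0,
    ("cat", "car"): 1.5,
    ("dog", "truck"): 2.0,
}

pairs = list(human_ratings)
model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
human_scores = [human_ratings[p] for p in pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

A higher correlation indicates that the representation orders word pairs more like human raters do, which gives a common basis for comparing different approaches to semantic representation.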

Document Details

Document Type: DoD Grant Award
Publication Date: Jul 10, 2017
Source ID: HDTRA11710042

Entities

People

Collin Baker

Organizations

Defense Threat Reduction Agency
University of California, Berkeley
