Innovation Engine for Blog Spaces

Abstract

The goal of this project was to show, as a kind of Turing Test, how well a machine "understands" ongoing group discussions and the interests of the group, and how well it can participate. We had to solve a number of related problems, including the following: (1) Ethical problem: For proper evaluation, we should not reveal that the participant is a machine. We decided to resolve this by restricting the machine to asking questions about news that could be interesting to a community. For example, by using analogies, data mining can discover that an earthquake has long-lasting effects if the road system is poor; the machine can then ask about the quality of the road system, how good it is and how badly it was damaged, and can bring related news for evaluation by human experts. (2) Community problem: In order to show the capabilities of present-day machine learning techniques and natural language processing methods, we needed a relatively narrow topic domain and a well-defined community. (3) Statistical problem: We needed a large and quickly developing database. (4) Problem of contributing: Our original idea, to contribute in blog spaces, turned out to be inappropriate, because blog space is not for active discussions. Twitter was suggested, but it has the same problem. These are all passive options, where one either comments on somebody else's blog or creates one's own blog and waits for it to gain visibility and reactions. The best option that we finally identified is to contribute to and serve forums. This is harder, however, since forum texts are highly imprecise and very short, and they contain slang, topic-related three-letter acronyms (TLAs), and unimportant, topic-irrelevant text fragments. We also found that the number of scientific blogs is small. We therefore decided to move from scientific blogs to blogs on movies, which are harder to handle but solve the community problem. We had to scale up our original crawler architecture to cope with this huge database, and we collected and analyzed the blogs.

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2011
Accession Number
ADA550367

Entities

People

  • Andras Lorincz

Tags

Communities of Interest

  • Autonomy
  • C4I
  • Energy and Power Technologies
  • Materials and Manufacturing Processes
  • Weapons Technologies

DTIC Thesaurus Topics

  • Air Force
  • Computational Science
  • Computer Languages
  • Computer Programming
  • Computers
  • Data Mining
  • Databases
  • Dimensionality Reduction
  • Information Science
  • Information Systems
  • Language
  • Linguistics
  • Machine Learning
  • Natural Language Processing
  • Network Science
  • Ontologies
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Military History of the United States in the 20th Century
  • Neural Network Machine Learning
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms
  • Space