Mining Developer Communications to Create a Web-Scale Repository of Documented and Analyzable Snippets

Abstract

Typical software developer communications, such as chat conversations or forum posts, contain both code snippets and natural language text that describes those snippets. Developer communications on the web are therefore an important resource for large-scale mining of information about the functionality, quality, and other properties of code snippets. Consistent with the goals of the DARPA Mining and Understanding of Software Enclaves (MUSE) program, the goal of this research project was to enable targeted access to the software development knowledge captured in the code snippets embedded within developer communications and in the natural language text describing them. Key research accomplishments of this project include: an understanding of what information about code snippets is available in different kinds of developer communications; an in-depth analysis of two kinds of developer communication to compare their efficacy in supporting software engineering tools; a new technique to extract the available information from a specific kind of developer communication; and a technique to enable web-scale code clone detection and search and, ultimately, curation of documented, analyzable code snippets extracted from developer communications.

Document Details

Document Type
Technical Report
Publication Date
Aug 01, 2018
Accession Number
AD1059446

Entities

People

  • David Shepherd

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Air Force
  • Air Force Research Laboratories
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Data Sets
  • Detection
  • Engineering
  • Java Programming Language
  • Language
  • Markup Languages
  • Natural Language Processing
  • Natural Languages
  • Operating Systems
  • Programming Languages
  • Software Development
  • XML

Fields of Study

  • Computer science
  • Engineering

Readers

  • Computational Linguistics
  • Distributed Systems and Data Platform Development
  • Software Engineering