Filetype Identification Using Long, Summarized N-Grams

Abstract

Past research into file type identification has employed many techniques, including n-gram analysis, to classify files and file fragments accurately. However, naive application of n-grams breaks down for n-grams longer than two bytes because the feature space becomes sparse. As a result, other researchers have generally ignored long n-grams for file type identification. This thesis explores the use of long n-grams for whole-file and file-fragment classification by building feature distributions of commonly occurring n-grams for single file types and using those distributions to classify unknown files and file fragments. This thesis also uses summarized n-grams to "collapse" similar n-grams within a file type into common n-grams. The algorithms developed to generate these distributions and to compare unknown files against them are presented, along with results from an experiment conducted using another researcher's data set.
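
The general idea of profiling file types by their commonly occurring long n-grams can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names (`ngrams`, `build_profile`, `classify`), the n-gram length of 4, the `top_k` cutoff, and the overlap-based score are all assumptions chosen for illustration, and the n-gram summarization step is omitted because the abstract does not describe how it works.

```python
from collections import Counter


def ngrams(data: bytes, n: int):
    """Yield every overlapping n-byte window in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(samples, n=4, top_k=100):
    """Return the set of the top_k most frequent n-grams across one file type's samples.

    Keeping only frequent n-grams is one simple way to cope with the
    sparseness of long n-grams (an assumption, not the thesis's method).
    """
    counts = Counter()
    for sample in samples:
        counts.update(ngrams(sample, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def classify(data: bytes, profiles: dict, n=4):
    """Return the file type whose profile covers the largest share of data's n-grams.

    Returns None when data is shorter than n bytes. Ties (including the
    case where no profile matches at all) resolve to the first profile.
    """
    grams = list(ngrams(data, n))
    if not grams:
        return None

    def score(profile):
        return sum(1 for g in grams if g in profile) / len(grams)

    return max(profiles, key=lambda name: score(profiles[name]))


# Toy training data, made up for this example.
html_samples = [
    b"<html><body><p>hello</p></body></html>",
    b"<html><head></head><body></body></html>",
]
c_samples = [
    b"int main(void) { return 0; }",
    b"#include <stdio.h>\nint main(void) { return 1; }",
]
profiles = {
    "html": build_profile(html_samples),
    "c": build_profile(c_samples),
}

html_guess = classify(b"<body><p>text</p></body>", profiles)
c_guess = classify(b"int main(void) { return 2; }", profiles)
```

A fragment is scored against every profile, so the same routine applies to whole files and to file fragments. A real classifier would need far larger training sets and a rule for rejecting inputs that match no profile.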

Document Details

Document Type
Technical Report
Publication Date
Mar 01, 2011
Accession Number
ADA543322

Entities

People

  • Ryan C. Mayer

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Autonomy
  • C4I

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • C Programming Language
  • Computer Languages
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Data Sets
  • Governments
  • Information Science
  • Machine Learning
  • Named Entity Recognition
  • Programming Languages
  • Statistical Analysis
  • Supervised Machine Learning
  • Two Dimensional
  • Word Processors

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Computer Science
  • Theoretical Analysis