Filetype Identification Using Long, Summarized N-Grams

Abstract

Past research into file type identification has employed many different techniques, including n-gram analysis, in an attempt to accurately classify files and file fragments. However, naive application of n-grams breaks down for n-grams longer than two bytes because the feature becomes sparse. As a result, other researchers have generally ignored long n-grams for filetype identification. This thesis explores the use of long n-grams for whole-file and file-fragment classification by building feature distributions of commonly occurring n-grams for single filetypes and using those distributions to classify unknown files and file fragments. This thesis also uses summarized n-grams to "collapse" similar n-grams within a file type into common n-grams. The algorithms developed to generate these distributions and to compare unknown files against them are presented, along with results from an experiment conducted on another researcher's data set.
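
The general idea of per-filetype distributions of commonly occurring long n-grams can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names (`ngrams`, `build_profile`, `classify`), the n-gram length of 8 bytes, the `top_k` cutoff, and the overlap-count score are all assumptions chosen for illustration, and the summarization ("collapsing") step is omitted because the abstract does not describe how it works.

```python
from collections import Counter

def ngrams(data: bytes, n: int):
    """Yield every overlapping n-byte window of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]

def build_profile(samples, n=8, top_k=50):
    """Hypothetical profile: the top_k n-grams ranked by the number
    of sample files of one type in which they occur."""
    doc_freq = Counter()
    for data in samples:
        # Count each n-gram once per file, so a single repetitive
        # file cannot dominate the profile.
        doc_freq.update(set(ngrams(data, n)))
    return {gram for gram, _ in doc_freq.most_common(top_k)}

def classify(data: bytes, profiles, n=8):
    """Score an unknown file or fragment by how many of each type's
    profile n-grams it contains; return the best label and all scores."""
    seen = set(ngrams(data, n))
    scores = {label: len(seen & profile) for label, profile in profiles.items()}
    return max(scores, key=scores.get), scores
```

Restricting each profile to n-grams that recur across files of the same type is one simple way to cope with sparseness: most of the 256^n possible n-grams never appear, so only the commonly occurring ones are kept as features.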

Document Details

Document Type: Technical Report
Publication Date: Mar 01, 2011
Accession Number: ADA543322

Entities

People

Ryan C. Mayer

Organizations

Naval Postgraduate School
