Filetype Identification Using Long, Summarized N-Grams
Abstract
Past research into file type identification has employed many different techniques, including n-gram analysis, in an attempt to accurately classify files and file fragments. However, naïve application of n-grams breaks down for n-grams longer than two bytes because the feature space becomes sparse. As a result, other researchers have generally ignored long n-grams for file type identification. This thesis explores the use of long n-grams for whole-file and file-fragment classification by building feature distributions of commonly occurring n-grams for single file types and using those distributions to classify unknown files and file fragments. This thesis also uses summarized n-grams to "collapse" similar n-grams within a file type into common n-grams. The algorithms developed to generate these distributions and to compare unknown files against them are presented, along with results from an experiment conducted using another researcher's data set.
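The general idea described in the abstract, building a per-type set of commonly occurring long n-grams and scoring unknown data against each set, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names (`ngrams`, `build_profile`, `classify`), the n-gram length of 4, the `top_k` cutoff, and the coverage-based score are all assumptions chosen for illustration, and the summarization ("collapsing") step is omitted entirely.

```python
from collections import Counter


def ngrams(data: bytes, n: int):
    """Yield every overlapping n-byte window in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(samples, n=4, top_k=100):
    """Return the set of the top_k most frequent n-grams across the samples.

    Hypothetical helper: the cutoff and the use of a plain set are
    assumptions, not details taken from the thesis.
    """
    counts = Counter()
    for sample in samples:
        counts.update(ngrams(sample, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def classify(data: bytes, profiles, n=4):
    """Return the label whose profile covers the largest share of data's n-grams.

    Returns None when data is shorter than n bytes.
    """
    grams = list(ngrams(data, n))
    if not grams:
        return None

    def coverage(label):
        profile = profiles[label]
        return sum(1 for g in grams if g in profile) / len(grams)

    return max(profiles, key=coverage)


# Toy training data standing in for a real corpus (illustrative only).
profiles = {
    "html": build_profile([
        b"<html><body><p>hello</p></body></html>",
        b"<html><head><title>x</title></head></html>",
    ]),
    "csv": build_profile([
        b"id,name,value\n1,a,10\n2,b,20\n",
        b"id,name,value\n3,c,30\n4,d,40\n",
    ]),
}

html_guess = classify(b"<body><p>text</p></body>", profiles)
csv_guess = classify(b"id,name,value\n9,z,90\n", profiles)
```

Keeping only the most frequent n-grams per type is one simple way to sidestep the sparseness that long n-grams cause; a fragment that shares no n-gram with any profile scores zero everywhere, so a practical classifier would also need a rejection threshold.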
Document Details
- Document Type: Technical Report
- Publication Date: Mar 01, 2011
- Accession Number: ADA543322
Entities
People
- Ryan C. Mayer
Organizations
- Naval Postgraduate School