Article

A Comparative Study of Support Vector Machine and Neural Networks for File Type Identification Using n-gram Analysis

Joachim Sester; Darren Hayes; Nhien-An Le-Khac; Mark Scanlon

March 2021, Forensic Science International: Digital Investigation

Contribution Summary

This paper compares Support Vector Machines (SVMs) and Neural Networks (NNs) for file type identification, using n-gram analysis as the feature. The authors examine how input parameters, such as the learning rate and the value of n used for the n-grams, influence the results, and they compare how well the two classifier families scale. The experiments cover two NNs and four SVMs, with linear and RBF kernels, on the RealDC datasets. In general, the SVM-based approaches performed better than the NNs, although their scalability remains a challenge. File type identification matters to anti-virus developers, firewall designers and forensic cybercrime investigators, and the study adds a systematic comparison of machine learning based approaches to that field.

Keywords: File type identification; n-gram analysis; Support Vector Machines; Neural Networks; Cybersecurity; Digital forensics; Machine learning; RealDC dataset

Abstract

File type identification (FTI) has become a major discipline for anti-virus developers, firewall designers and forensic cybercrime investigators. Over the past few years, research has seen the introduction of several classifiers and features. One of these advances is so-called n-gram analysis, which interprets statistical counts within the fragments being classified. Recently, n-gram based approaches have been successfully combined with computational intelligence classifiers. However, the academic literature is scant when it comes to a comprehensive explanation of machine learning based approaches such as neural networks (NN) or support vector machines (SVM), for example, how the input parameters, including the learning rate and different values of n for n-grams, influence the results. In addition, very few studies have compared the scalability of NN vs. SVM approaches. Therefore, systematic research comparing the different approaches is needed to address these questions. This paper undertakes such a comparison, focusing on n-gram analysis as a feature for the two classifiers: SVMs and NNs. It details our experiments with two NNs and four SVMs, using linear kernels and RBF kernels, on RealDC datasets. In general, we found that the SVM-based approaches performed better than the NNs, but their scalability is still a challenge.
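
To make the n-gram feature mentioned in the abstract more concrete, the following is a minimal sketch of how byte n-gram frequencies could be counted for a single file fragment. It is an illustration under stated assumptions, not the authors' pipeline: the function names `ngram_frequencies` and `unigram_vector`, the use of overlapping windows, the normalisation to relative frequencies, and the sample fragment are all hypothetical choices made here.

```python
from collections import Counter
from typing import Dict, List


def ngram_frequencies(fragment: bytes, n: int = 2) -> Dict[bytes, float]:
    """Return the relative frequency of each byte n-gram in a fragment.

    Assumption: overlapping windows and relative frequencies; the paper's
    exact feature definition may differ.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = len(fragment) - n + 1  # number of overlapping windows
    if total <= 0:
        return {}
    counts = Counter(fragment[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def unigram_vector(fragment: bytes) -> List[float]:
    """Dense 256-element feature vector for n = 1, one slot per byte value."""
    freqs = ngram_frequencies(fragment, n=1)
    return [freqs.get(bytes([value]), 0.0) for value in range(256)]


# Hypothetical sample input: a few bytes resembling a PDF header.
sample = b"%PDF-1.7\n%PDF"
bigrams = ngram_frequencies(sample, n=2)
vector = unigram_vector(sample)
```

A fixed-length vector such as `vector` is the kind of input an SVM or NN classifier could be trained on. For larger n the number of possible n-grams grows as 256 to the power n, which is one reason the choice of n is a parameter worth studying.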

BibTeX

@article{sester2021FileIdentification,
	author={Sester, Joachim and Hayes, Darren and Le-Khac, Nhien-An and Scanlon, Mark},
	title="{A Comparative Study of Support Vector Machine and Neural Networks for File Type Identification Using n-gram Analysis}",
	journal="{Forensic Science International: Digital Investigation}",
	year=2021,
	month=mar,
	publisher={Elsevier},
	abstract={File type identification (FTI) has become a major discipline for anti-virus developers, firewall designers and forensic cybercrime investigators. Over the past few years, research has seen the introduction of several classifiers and features. One of these advances is so-called n-gram analysis, which interprets statistical counts within the fragments being classified. Recently, n-gram based approaches have been successfully combined with computational intelligence classifiers. However, the academic literature is scant when it comes to a comprehensive explanation of machine learning based approaches such as neural networks (NN) or support vector machines (SVM), for example, how the input parameters, including the learning rate and different values of n for n-grams, influence the results. In addition, very few studies have compared the scalability of NN vs. SVM approaches. Therefore, systematic research comparing the different approaches is needed to address these questions. This paper undertakes such a comparison, focusing on n-gram analysis as a feature for the two classifiers: SVMs and NNs. It details our experiments with two NNs and four SVMs, using linear kernels and RBF kernels, on RealDC datasets. In general, we found that the SVM-based approaches performed better than the NNs, but their scalability is still a challenge.},
	doi={10.1016/j.fsidi.2021.301121},
}