Incollection

Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees

David Lillis; Frank Breitinger; Mark Scanlon

January 2018 Digital Forensics and Cyber Crime. ICDF2C 2017

Contribution Summary

This paper focuses on improving the MRSH-v2 approximate matching algorithm, a technique used in digital forensics to detect files with high bytewise similarity. The authors propose the use of Hierarchical Bloom Filter Trees (HBFT) to reduce the number of pairwise comparisons required between known-illegal files and files on a seized disk. The HBFT approach is designed to achieve speed benefits over the classic pairwise comparison approach while supporting the identification of specific matching files. The authors conduct a series of experiments to evaluate the effectiveness of the HBFT approach and investigate the factors that affect its performance. The results demonstrate substantial speed gains over the original MRSH-v2 while maintaining effectiveness. The paper contributes to the development of efficient and effective digital forensic techniques for the detection of files with high bytewise similarity.

Keywords: approximate matching; hierarchical bloom filter trees; mrsh-v2; digital forensics; cybersecurity; fuzzy hashing; similarity hashing

Abstract

Perhaps the most common task encountered by digital forensic investigators consists of searching through a seized device for pertinent data. Frequently, an investigator will be in possession of a collection of “known-illegal” files (e.g. a collection of child pornographic images) and will seek to find whether copies of these are stored on the seized drive. Traditional hash matching techniques can efficiently find files that precisely match. However, these will fail in the case of merged files, embedded files, partial files, or if a file has been changed in any way. In recent years, approximate matching algorithms have shown significant promise in the detection of files that have a high bytewise similarity. This paper focuses on MRSH-v2. A number of experiments were conducted using Hierarchical Bloom Filter Trees to dramatically reduce the quantity of pairwise comparisons that must be made between known-illegal files and files on the seized disk. The experiments demonstrate substantial speed gains over the original MRSH-v2, while maintaining effectiveness.

BibTeX

@incollection{lillis2018bloomfiltertrees,
  author = {Lillis, David and Breitinger, Frank and Scanlon, Mark},
  editor = {Matou{\v{s}}ek, Petr and Schmiedecker, Martin},
  title = {{Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees}},
  booktitle = {Digital Forensics and Cyber Crime. ICDF2C 2017},
  year = 2018,
  publisher = {Springer},
  pages = {144--157},
  isbn = {978-3-319-73697-6},
  doi = {10.1007/978-3-319-73697-6_11},
  series = {Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering},
  volume = {216},
  abstract = {Perhaps the most common task encountered by digital forensic investigators consists of searching through a seized device for pertinent data. Frequently, an investigator will be in possession of a collection of ``known-illegal'' files (e.g. a collection of child pornographic images) and will seek to find whether copies of these are stored on the seized drive. Traditional hash matching techniques can efficiently find files that precisely match. However, these will fail in the case of merged files, embedded files, partial files, or if a file has been changed in any way. In recent years, approximate matching algorithms have shown significant promise in the detection of files that have a high bytewise similarity. This paper focuses on MRSH-v2. A number of experiments were conducted using Hierarchical Bloom Filter Trees to dramatically reduce the quantity of pairwise comparisons that must be made between known-illegal files and files on the seized disk. The experiments demonstrate substantial speed gains over the original MRSH-v2, while maintaining effectiveness.},
}