Hierarchical Bloom Filter Trees for Approximate Matching

David Lillis; Frank Breitinger; Mark Scanlon

doi:10.15394/jdfsl.2018.1489

Article

Hierarchical Bloom Filter Trees for Approximate Matching

David Lillis; Frank Breitinger; Mark Scanlon

March 2018 Journal of Digital Forensics, Security and Law

PDF BibTeX DOI Open Access

Contribution Summary

This paper presents an improvement in the runtime efficiency of approximate matching techniques in digital forensics by implementing Hierarchical Bloom Filter Trees (HBFTs). The authors focus on the MRSH-v2 algorithm, which generates a similarity digest for each file using Bloom filters. However, the all-against-all pairwise comparison required to determine if files from a set of desired content are present in a corpus of unanalyzed digital material is not scalable. The proposed HBFT approach reduces the number of pairwise comparisons required, achieving substantial speed gains while maintaining effectiveness. The authors conduct three experiments to evaluate the effectiveness of HBFTs and explore the effects of different configurations of HBFTs. The results show that HBFTs can dramatically reduce the number of pairwise comparisons required, achieving a 100% recall rate for identical files and files with a MRSH-v2 similarity above a reasonable threshold of 40%. The authors also investigate the effects of different configurations of HBFTs, including the use of variable-width and fixed-width Bloom filters.

Keywords: approximate matching; hierarchical bloom filter trees; mrsh-v2; digital forensics; runtime efficiency; scalability; bloom filters; similarity digests

Abstract

Bytewise approximate matching algorithms have in recent years shown significant promise in detecting files that are similar at the byte level. This is very useful for digital forensic investigators, who are regularly faced with the problem of searching through a seized device for pertinent data. A common scenario is where an investigator is in possession of a collection of “known-illegal” files (e.g. a collection of child abuse material) and wishes to find whether copies of these are stored on the seized device. Approximate matching addresses shortcomings in traditional hashing, which can only find identical files, by also being able to deal with cases of merged files, embedded files, partial files, or if a file has been changed in any way. Most approximate matching algorithms work by comparing pairs of files, which is not a scalable approach when faced with large corpora. This paper demonstrates the effectiveness of using a “Hierarchical Bloom Filter Tree” (HBFT) data structure to reduce the running time of collection-against-collection matching, with a specific focus on the MRSH-v2 algorithm. Three experiments are discussed, which explore the effects of different configurations of HBFTs. The proposed approach dramatically reduces the number of pairwise comparisons required, and demonstrates substantial speed gains, while maintaining effectiveness.

BibTeX

@article{lillis2018hierarchicaltrees,
	author={Lillis, David and Breitinger, Frank and Scanlon, Mark},
	title="{Hierarchical Bloom Filter Trees for Approximate Matching}",
	journal="{Journal of Digital Forensics, Security and Law}",
	year=2018,
	month=03,
	volume="13",
	number="1",
	pages="81-96",
	publisher="{ADFSL}",
	doi="https://doi.org/10.15394/jdfsl.2018.1489",
	abstract="Bytewise approximate matching algorithms have in recent years shown significant promise in detecting files that are similar at the byte level. This is very useful for digital forensic investigators, who are regularly faced with the problem of searching through a seized device for pertinent data. A common scenario is where an investigator is in possession of a collection of “known-illegal” files (e.g. a collection of child abuse material) and wishes to find whether copies of these are stored on the seized device. Approximate matching addresses shortcomings in traditional hashing, which can only find identical files, by also being able to deal with cases of merged files, embedded files, partial files, or if a file has been changed in any way. Most approximate matching algorithms work by comparing pairs of files, which is not a scalable approach when faced with large corpora. This paper demonstrates the effectiveness of using a “Hierarchical Bloom Filter Tree” (HBFT) data structure to reduce the running time of collection-against-collection matching, with a specific focus on the MRSH-v2 algorithm. Three experiments are discussed, which explore the effects of different configurations of HBFTs. The proposed approach dramatically reduces the number of pairwise comparisons required, and demonstrates substantial speed gains, while maintaining effectiveness."
}