Automated identification of malicious code variants

Chris Ries, Colby College

Document Type: Honors Thesis (Open Access)


Malicious code is one of the most dynamic threats to computers and computer networks. Authors constantly modify their malicious code to fix bugs, add new features, and evade detection. Some families have over fifty variants in the wild. When a new variant is discovered, correctly identifying its family can be a time-consuming, manual process for security researchers. This project's goal was to create a system that automates family identification. The system built for this project uses run-time analysis to record the API calls that a malicious Win32 binary makes. These calls are then compared to data collected from other malicious code. If the new malicious code is found to be similar enough to any other malicious code, the two are considered variants of the same family. In testing, the system identified the correct family for about 82% of the malicious programs in the dataset. It was also able to provide explanations in cases where different antivirus scanners did not agree on the family of a piece of malicious code.
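The comparison step described above can be illustrated with a minimal sketch. The abstract does not state which similarity measure or threshold the thesis uses, so the Jaccard similarity over sets of API call names, the 0.7 threshold, the function names, and the family names and traces below are all assumptions made for illustration only, not the author's method.

```python
from typing import Dict, List, Optional, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets of API call names (assumed metric)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def identify_family(
    sample_calls: List[str],
    known: Dict[str, List[List[str]]],
    threshold: float = 0.7,  # hypothetical cutoff for "similar enough"
) -> Tuple[Optional[str], float]:
    """Return (family, score) for the best match, or (None, score) if no
    known trace reaches the threshold."""
    sample = set(sample_calls)
    best_family: Optional[str] = None
    best_score = 0.0
    for family, traces in known.items():
        for trace in traces:
            score = jaccard(sample, set(trace))
            if score > best_score:
                best_family, best_score = family, score
    if best_score >= threshold:
        return best_family, best_score
    return None, best_score


# Hypothetical data: API call traces recorded from previously labeled samples.
known = {
    "FamilyA": [["CreateFileA", "WriteFile", "RegSetValueExA", "connect"]],
    "FamilyB": [["FindFirstFileA", "FindNextFileA", "CopyFileA"]],
}
new_sample = ["CreateFileA", "WriteFile", "RegSetValueExA", "connect", "send"]

family, score = identify_family(new_sample, known)
print(family, score)
```

Here the new sample shares four of five distinct calls with the FamilyA trace, so it clears the assumed threshold and is labeled a FamilyA variant. A sample with no sufficiently similar trace is left unlabeled, and the returned score gives an analyst a starting point for explaining why.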