Identification of Obfuscated Function Clones in Binaries using Machine Learning


With the widespread use of higher-level languages for malware, such as C# on Windows or Go for cross-platform malware, the complexity and functionality of malware are ever-increasing. Additionally, obfuscation is used to hide malicious intent from virus scanners and to increase the time a human analyst needs to reverse engineer the binary file. One way to reduce this effort is function clone detection. Like any other software engineering project, malware reuses code and modifies existing code. Detecting whether a binary function is already known, or similar to an existing function, can reduce the time needed to analyze it. Outside of malware, the same function clone detection mechanism can be used to find vulnerable versions of functions in binaries, making it a powerful technique. This thesis introduces an approach for the detection of obfuscated function clones, called Ofci, building on recent advances in machine-learning-based function clone detection. Using the Albert transformer, a size-optimized version of the Bert natural language processing model, on textual disassembly instead of natural language, Ofci can achieve an 83% model size reduction in comparison to state-of-the-art approaches, while causing only an average 7% decrease in the ROC-AUC scores of function pair similarity classification. To additionally tackle the issue of obfuscation, Ofci analyzes the effect of known function calls on function similarity and applies function similarity classification to code obfuscated through virtualization. Instead of trying to match virtualized function pairs statically, Ofci performs function clone detection based on traces of the virtualized function, as a cheap form of dynamic analysis. However, the reduced model size comes at the cost of precision for function clone search, and the evaluation of Ofci discusses the reasons for this and other pitfalls of building function similarity detection tooling.
Besides evaluating the machine learning approach, Ofci establishes a new framework for the extraction and processing of binary functions. By implementing this functionality as a Ghidra plugin, Ofci offers an end-to-end approach for binary function analysis in which every part of the pipeline is open source. Through headless analysis, this framework can scale to analyzing large quantities of binary executables in parallel.

TU Wien
Michael Pucher

My research interests include systems security, malware analysis and embedded security, focusing on software reverse engineering and dynamic binary analysis.