Machine Learning for Malware Detection (Seminar 1.1)
During this seminar, student Eric Muller presented a summary of Machine Learning for Malware Detection. Afterward, our class held an open discussion about the paper and related topics. This blog post covers the main points of Eric's presentation along with a summary of that discussion.
Presentation Summary
As this was the first seminar, the presentation gave a high-level overview of the machine learning space in relation to malware detection. Below are the main ideas Eric presented.
- Malware detection collects data at different phases
- Pre-execution phase data consists of file information prior to execution (file format descriptions, code descriptions, binary data statistics, etc.)
- Post-execution data describes process activity within a system
- The advancement of malware raises the need for machine learning, which can be broken down into unsupervised and supervised learning
- Unsupervised learning works with unlabeled datasets (clustering is an example algorithm), whereas supervised learning works with labeled data and has a training phase and an application phase
- Regarding ML in cybersecurity, you generally want your datasets to be representative (mirroring real-world data) and your models to be interpretable (helping you understand the decisions they make)
- Minimizing the false positive rate is the main metric objective
- It is important to create adaptable models that update as data distributions change
- Malware writers have advanced rapidly since the early days, adopting techniques like server-side polymorphism and file editing to change fingerprints
- Kaspersky addressed that by using locality-sensitive hashes (LSH)
- Regular hashes have no connection to the similarity of files, whereas with LSH very similar files map to the same or nearby hash values
- LSH is naturally unsupervised, so Kaspersky implemented Similarity Hashing to make the process supervised by using labeled file features
- They built a two-step process that combines a similarity hashing method with other algorithms and light features (simple classification)
- The second step is hard classification, which uses a decision tree to define the harder regions
- This two step approach limits false positives, is lightweight on the user’s system, is interpretable, and is easier for retraining
- Kaspersky employs a deep learning model to handle rarer or more targeted attacks
- The approaches above are static analysis (pre-execution), which is safe for the user but has difficulty handling advanced encryption
- Dynamic analysis (post-execution detection) is based on the behavioral log data of objects that are represented as bipartite graphs
- Kaspersky utilizes behavior patterns to create binary vectors that serve as training sets for deep neural networks
- As malware is constantly changing, Kaspersky handles new data by clustering objects and labeling them in real time
- Kaspersky follows a distillation approach, performing maliciousness tests offline and updating its labeled data, which saves computation time and avoids errors on user devices
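The LSH idea mentioned above can be illustrated with a toy simhash over byte n-grams. This is not Kaspersky's actual algorithm, just a minimal sketch of the property that matters: near-identical files get near-identical fingerprints, unlike a regular cryptographic hash.

```python
import hashlib

def simhash(data: bytes, ngram: int = 4, bits: int = 32) -> int:
    """Toy locality-sensitive hash: similar inputs yield fingerprints
    that differ in only a few bits."""
    counts = [0] * bits
    for i in range(len(data) - ngram + 1):
        # hash each overlapping n-gram and let it vote on every output bit
        h = int.from_bytes(hashlib.md5(data[i:i + ngram]).digest()[:4], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing fingerprint bits (lower = more similar)."""
    return bin(a ^ b).count("1")
```

Comparing a file against a lightly edited copy of itself yields a much smaller Hamming distance than comparing it against unrelated bytes, which is exactly what lets similar malware variants land in the same bucket.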
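The two-step light/hard classification flow could look roughly like the sketch below. The bucket value, feature names, and thresholds are all invented for illustration, and a single hand-rolled decision stump stands in for a real trained decision tree.

```python
# Hypothetical set of similarity-hash buckets learned from labeled data.
KNOWN_MALICIOUS_BUCKETS = {0xDEADBEEF}

def light_classify(sim_hash: int):
    """Step 1: light classification -- a cheap constant-time bucket lookup."""
    if sim_hash in KNOWN_MALICIOUS_BUCKETS:
        return "malicious"
    return None  # undecided: escalate to the hard classifier

def hard_classify(features: dict) -> str:
    """Step 2: hard classification -- a decision-stump stand-in for a tree."""
    if features["entropy"] > 7.2 and features["imports_count"] < 3:
        return "malicious"  # packed-looking file with almost no imports
    return "benign"

def classify(sim_hash: int, features: dict) -> str:
    # Only samples the light step cannot resolve pay for the heavier model,
    # which keeps the pipeline lightweight on the user's system.
    return light_classify(sim_hash) or hard_classify(features)
```

The design point the presentation made carries through even in this toy version: most files are resolved by the cheap first step, and only the ambiguous region reaches the more expensive classifier.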
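The behavioral side can be sketched the same way: a fixed vocabulary of behavior patterns (the pattern names here are made up) maps a behavioral log to a fixed-length binary vector that a deep neural network can train on.

```python
# Hypothetical behavior-pattern vocabulary; real systems mine these
# patterns from large volumes of behavioral logs.
PATTERNS = [
    "writes_to_system_dir",
    "spawns_shell",
    "contacts_known_c2",
    "modifies_autorun_key",
]

def to_binary_vector(observed_patterns: set) -> list:
    """Encode a behavior log as a 0/1 vector, one slot per known pattern."""
    return [1 if p in observed_patterns else 0 for p in PATTERNS]
```

Every log, no matter how long, becomes a vector of the same length, which is what makes the representation usable as neural network input.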
Discussion Summary
Our class had a discussion following the presentation. Below are points that were not covered in the presentation or were further discussed.
- AV companies run models statically on user devices and dynamically on the cloud
- Companies distribute more lightweight models to users, while the cloud-side models use many more parameters and weights
- As mentioned in the presentation, the objective should be to reduce false positives as it enhances user experience
- Varied training data is used because we want models to generalize beyond detecting malware in only one kind of binary file
- Antivirus companies utilize user data along with open source data to retrain models when data distributions change
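The false positive rate that both the presentation and the discussion emphasized is simple to compute: of all the benign samples, how many did the model flag as malware? A quick sketch, with labels where 1 = malicious and 0 = benign:

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN): the share of benign samples flagged as malware."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)

# Four benign files, one of them wrongly flagged -> FPR = 1/4
print(false_positive_rate([0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 1, 1]))  # 0.25
```

Even a small FPR hurts at antivirus scale, since a tiny fraction of millions of benign files scanned per day still means many angry users, which is why it is treated as the main metric objective.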
That is all for the summary of seminar 1.1, thanks for reading!