Machine Learning for Malware Detection (Seminar 1.1)
During this seminar, student Eric Muller presented a summary of Machine Learning for Malware Detection. Afterward, our class held an open discussion about the paper and related topics. This blog post covers the main points of Eric's presentation along with a summary of that discussion.
Presentation Summary
As this was the first seminar, the presentation gave a high-level overview of the machine learning space in relation to malware detection. Below are the main ideas Eric presented.
- Malware detection collects data at different phases
- Pre-execution phase data consists of file information prior to execution (file format descriptions, code descriptions, binary data statistics, etc.)
- Post-execution data describes process activity within a system
- The advancement of malware raises the need for machine learning, which can be broken down into unsupervised and supervised learning
- Unsupervised learning works with unlabeled datasets (clustering is an example algorithm), whereas supervised learning works with labeled data and has a training phase and an application phase
- Regarding ML in cybersecurity, you generally want your datasets to be representative (mirroring real-world data) and your models to be interpretable (helping you understand the decisions they make)
- Minimizing the false positive rate is the main metric objective
- It is important to create adaptable models that update as data distributions change
- Malware writers have advanced rapidly since the early days, adopting techniques like server-side polymorphism and file editing to change fingerprints
- Kaspersky addressed that by using locality-sensitive hashes (LSH)
- Regular hashes have no connection to the similarity of files, whereas with LSH very similar files map to the same or nearby hash values
- LSH is naturally unsupervised, so Kaspersky implemented Similarity Hashing to make the process supervised by using labeled file features
- They built a two-step process that combines a similarity hashing method with other algorithms and light features (simple classification)
- The second step is hard classification, which uses a decision tree to define the harder regions
- This two step approach limits false positives, is lightweight on the user’s system, is interpretable, and is easier for retraining
- Kaspersky employs a deep learning model to handle rarer or more targeted attacks
- The approaches above are static analysis (pre-execution), which is safe for the user but has difficulty handling advanced encryption
- Dynamic analysis (post-execution detection) is based on the behavioral log data of objects that are represented as bipartite graphs
- Kaspersky utilizes behavior patterns to create binary vectors that serve as training sets for deep neural networks
- As malware is constantly changing, Kaspersky handles new data by clustering objects and labeling them in real time
- Kaspersky follows a distillation approach, performing maliciousness tests offline and updating its labeled data, which saves computation time and avoids errors on user devices
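The LSH idea mentioned above can be illustrated with a toy simhash over byte n-grams. This is not Kaspersky's actual algorithm, just a minimal sketch of the property that matters: near-identical files get near-identical fingerprints, unlike a regular cryptographic hash.

```python
import hashlib

def simhash(data: bytes, ngram: int = 4, bits: int = 32) -> int:
    """Toy locality-sensitive hash: similar inputs yield fingerprints
    that differ in only a few bits."""
    counts = [0] * bits
    for i in range(len(data) - ngram + 1):
        # hash each overlapping n-gram and let it vote on every output bit
        h = int.from_bytes(hashlib.md5(data[i:i + ngram]).digest()[:4], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing fingerprint bits (lower = more similar)."""
    return bin(a ^ b).count("1")
```

Comparing a file against a lightly edited copy of itself yields a much smaller Hamming distance than comparing it against unrelated bytes, which is exactly what lets similar malware variants land in the same bucket.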
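The two-step light/hard classification flow could look roughly like the sketch below. The bucket value, feature names, and thresholds are all invented for illustration, and a single hand-rolled decision stump stands in for a real trained decision tree.

```python
# Hypothetical set of similarity-hash buckets learned from labeled data.
KNOWN_MALICIOUS_BUCKETS = {0xDEADBEEF}

def light_classify(sim_hash: int):
    """Step 1: light classification -- a cheap constant-time bucket lookup."""
    if sim_hash in KNOWN_MALICIOUS_BUCKETS:
        return "malicious"
    return None  # undecided: escalate to the hard classifier

def hard_classify(features: dict) -> str:
    """Step 2: hard classification -- a decision-stump stand-in for a tree."""
    if features["entropy"] > 7.2 and features["imports_count"] < 3:
        return "malicious"  # packed-looking file with almost no imports
    return "benign"

def classify(sim_hash: int, features: dict) -> str:
    # Only samples the light step cannot resolve pay for the heavier model,
    # which keeps the pipeline lightweight on the user's system.
    return light_classify(sim_hash) or hard_classify(features)
```

The design point the presentation made carries through even in this toy version: most files are resolved by the cheap first step, and only the ambiguous region reaches the more expensive classifier.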
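The behavioral side can be sketched the same way: a fixed vocabulary of behavior patterns (the pattern names here are made up) maps a behavioral log to a fixed-length binary vector that a deep neural network can train on.

```python
# Hypothetical behavior-pattern vocabulary; real systems mine these
# patterns from large volumes of behavioral logs.
PATTERNS = [
    "writes_to_system_dir",
    "spawns_shell",
    "contacts_known_c2",
    "modifies_autorun_key",
]

def to_binary_vector(observed_patterns: set) -> list:
    """Encode a behavior log as a 0/1 vector, one slot per known pattern."""
    return [1 if p in observed_patterns else 0 for p in PATTERNS]
```

Every log, no matter how long, becomes a vector of the same length, which is what makes the representation usable as neural network input.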
Discussion Summary
Our class had a discussion following the presentation. Below are points that were not covered in the presentation or were further discussed.
- AV companies run models statically on user devices and dynamically on the cloud
- Companies distribute more lightweight models to users, while the cloud-side models use many more parameters and weights
- As mentioned in the presentation, the objective should be to reduce false positives as it enhances user experience
- Varied training data is used because we want models to generalize beyond detecting malware in only one kind of binary file
- Antivirus companies utilize user data along with open source data to retrain models when data distributions change
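The false positive rate that both the presentation and the discussion emphasized is simple to compute: of all the benign samples, how many did the model flag as malware? A quick sketch, with labels where 1 = malicious and 0 = benign:

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN): the share of benign samples flagged as malware."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)

# Four benign files, one of them wrongly flagged -> FPR = 1/4
print(false_positive_rate([0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 1, 1]))  # 0.25
```

Even a small FPR hurts at antivirus scale, since a tiny fraction of millions of benign files scanned per day still means many angry users, which is why it is treated as the main metric objective.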
That is all for the summary of seminar 1.1, thanks for reading!