During this seminar, student Eric Muller presented a summary of Machine Learning for Malware Detection, after which our class had an open discussion related to the paper and more. This blog post summarizes both Eric’s presentation and the class discussion.

Presentation Summary

As this was the first seminar, it gave a high-level overview of the machine learning space in relation to malware detection. Below are the main ideas presented by Eric.

  • Malware detection collects data at different phases
  • Pre-execution phase data consists of file information prior to execution (file format descriptions, code descriptions, binary data statistics, etc.)
  • Post-execution phase data describes process activity within a system
  • The advancement of malware raises the need for machine learning, which can be broken down into unsupervised and supervised learning
  • Unsupervised learning works on unlabeled datasets (clustering is an example algorithm), whereas supervised learning uses labeled data with a training phase and an application phase
  • Regarding ML in cybersecurity, you generally want your datasets to be representative (mirrors real life data) and interpretable (helps you understand the decisions made)
  • Minimizing the false positive rate is the main metric objective (see the FPR sketch after this list)
  • It is important to create adaptable models that update as data distributions change
  • Malware writers have advanced rapidly since their early days, with new techniques like server-side polymorphism and editing files to change their fingerprints
  • Kaspersky addressed this by using locality-sensitive hashes (LSH)
  • Regular hashes carry no information about the similarity of files, whereas LSHs do: very similar files map to the same hash value (see the simhash-style sketch after this list)
  • LSH is naturally unsupervised, so Kaspersky implemented Similarity Hashing to make the process supervised by using labeled file features
  • They built a two-step process that combines a similarity hashing method with other algorithms and light features (simple classification); see the two-stage sketch after this list
  • The second step is hard classification, which utilizes a decision tree for a harder region definition
  • This two-step approach limits false positives, is lightweight on the user’s system, is interpretable, and is easier to retrain
  • Kaspersky employs a deep learning model to handle rarer or more targeted attacks
  • The previous approaches are static analysis (pre-execution), which is safe for the user but has difficulty handling advanced encryption
  • Dynamic analysis (post-execution detection) is based on the behavioral log data of objects, which is represented as bipartite graphs
  • Kaspersky utilizes behavior patterns to create binary vectors as training sets for deep neural networks (see the behavior-vector sketch after this list)
  • As malware is constantly changing, Kaspersky handles new data by clustering objects and labeling them in real time (see the clustering sketch after this list)
  • Kaspersky follows a distillation approach, performing maliciousness tests offline and updating their labeled data, which saves computation time and avoids errors on user devices
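
Since minimizing false positives comes up throughout, here is a minimal sketch of how the false positive rate is computed from predictions (the labels below are made up; 1 = malware, 0 = benign):

```python
# Minimal sketch: false positive rate (FPR) = FP / (FP + TN),
# i.e. the fraction of benign samples that get flagged as malware.
def false_positive_rate(y_true, y_pred):
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

y_true = [0, 0, 0, 0, 1, 1]  # ground truth (made-up data)
y_pred = [0, 1, 0, 0, 1, 1]  # model verdicts
print(false_positive_rate(y_true, y_pred))  # 0.25: one of four benign files flagged
```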
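
To make the LSH idea concrete, below is a minimal simhash-style sketch (my own illustration, not Kaspersky's actual Similarity Hash). Unlike a regular cryptographic hash, similar feature sets produce hashes that differ in only a few bits; the file features here are invented:

```python
# Simhash-style LSH sketch: each feature votes on the bits of the
# final hash, so similar feature sets usually end up with nearly
# identical hashes, while unrelated sets diverge.
import hashlib

def simhash(features, bits=32):
    counts = [0] * bits
    for f in features:
        h = int(hashlib.md5(f.encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

f1 = {"section:.text", "import:CreateFileW", "entropy:high", "packer:upx"}
f2 = f1 | {"import:WriteFile"}           # a very similar file
f3 = {"section:.data", "import:printf"}  # an unrelated file
print(hamming(simhash(f1), simhash(f2)))  # usually a small distance
print(hamming(simhash(f1), simhash(f3)))  # usually much larger
```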
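
The two-step process might look roughly like the following hypothetical sketch: a cheap first stage resolves files that land in clearly malicious or clearly benign similarity regions, and only the ambiguous "hard" region is handed to a decision tree trained on heavier features (bucket values and features are made up):

```python
# Two-stage sketch of the light/hard classification idea.
from sklearn.tree import DecisionTreeClassifier

# Stage 1: light classification by similarity-hash bucket (illustrative).
KNOWN_MALWARE_BUCKETS = {0xA1, 0xB7}  # buckets seen only for malware
KNOWN_BENIGN_BUCKETS = {0x03, 0x4C}   # buckets seen only for benign

def light_stage(bucket):
    if bucket in KNOWN_MALWARE_BUCKETS:
        return "malware"
    if bucket in KNOWN_BENIGN_BUCKETS:
        return "benign"
    return "hard"  # ambiguous region, defer to stage 2

# Stage 2: decision tree over heavier features for the hard region.
X_hard = [[0.9, 12, 1], [0.1, 2, 0], [0.8, 9, 1], [0.2, 3, 0]]
y_hard = [1, 0, 1, 0]  # 1 = malware (made-up training data)
tree = DecisionTreeClassifier(max_depth=3).fit(X_hard, y_hard)

def classify(bucket, heavy_features):
    verdict = light_stage(bucket)
    if verdict != "hard":
        return verdict
    return "malware" if tree.predict([heavy_features])[0] == 1 else "benign"

print(classify(0xA1, [0.9, 12, 1]))   # resolved cheaply by stage 1
print(classify(0xFF, [0.85, 10, 1]))  # falls through to the tree
```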
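
The dynamic-analysis pipeline can be sketched as follows, with hypothetical event and pattern names: behavioral logs connect objects to the events they trigger (the bipartite graph), and the observed patterns are indexed into binary vectors that a deep neural network could train on:

```python
# Hypothetical sketch: behavioral logs as a bipartite graph between
# objects and events, then binary behavior vectors for training.
logs = {
    "sample_a.exe": {"CreateRemoteThread", "RegSetValue:Run", "connect:1.2.3.4"},
    "sample_b.exe": {"connect:1.2.3.4", "WriteFile:temp"},
}

# Bipartite graph: an edge links an object to each event it produced.
edges = [(obj, ev) for obj, events in logs.items() for ev in events]

# Behavior patterns (here just the event vocabulary, sorted for a
# stable index) define the positions of the binary vector.
PATTERNS = sorted({ev for _, ev in edges})

def to_vector(obj):
    return [1 if p in logs[obj] else 0 for p in PATTERNS]

print(PATTERNS)
print(to_vector("sample_a.exe"))  # [1, 1, 0, 1]
print(to_vector("sample_b.exe"))  # [0, 0, 1, 1]
```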
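
Finally, a small illustration of the real-time labeling idea, using scikit-learn's KMeans as a stand-in (the features and the label-propagation rule are invented): once one file in a cluster is confirmed malicious, its label can be propagated to the cluster's neighbors:

```python
# Illustrative sketch: cluster newly seen, unlabeled files so a
# confirmed label can be propagated across a whole cluster at once.
from sklearn.cluster import KMeans

X_new = [[0.91, 11], [0.88, 12], [0.12, 2], [0.15, 3], [0.90, 10]]
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_new)

# Suppose offline analysis confirms the first file is malicious:
# every file in the same cluster inherits that verdict.
mal_cluster = clusters[0]
labels = ["malware" if c == mal_cluster else "unknown" for c in clusters]
print(labels)  # ['malware', 'malware', 'unknown', 'unknown', 'malware']
```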

Discussion Summary

Our class had a discussion following the presentation. Below are points that were not covered in the presentation or were discussed further.

  • AV companies run models statically on user devices and dynamically in the cloud
  • Companies distribute more lightweight models to users, since the cloud models use far more parameters and weights
  • As mentioned in the presentation, the objective should be to reduce false positives, as this enhances the user experience
  • Varied training data is used because we want models to generalize beyond detecting malware in only one kind of binary file
  • Antivirus companies utilize user data along with open source data to retrain models when data distributions change

That is all for the summary of seminar 1.1, thanks for reading!

