Malware Detection on Highly Imbalanced Data through Sequence Modeling (Seminar 1.2)

NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.

During this seminar, students Akshat Punjabi and Akshat Pandey presented the paper "Malware Detection on Highly Imbalanced Data through Sequence Modeling". After their presentation, our class had an open discussion about the paper and related topics. This blog post summarizes both the presentation and our class discussion.

Presentation Summary

This seminar covers malware detection through dynamic analysis of Android OS activity sequences. Below are the main ideas presented:

Static vs Dynamic Analysis

Static analysis consists of analyzing an application without it being executed (program structure and used utilities)
Dynamic analysis consists of analyzing an application in a sandbox based on actual API calls and execution
This paper covers malware detection using dynamic analysis

Traditional Malware Detection Technique: Rule-based approach

Uses explicit rules to classify software as benign or malicious
Recognizes threats through known signatures and patterns
Problems:
- Tends to have a high false positive rate (FPR) due to inconsistencies in system behavior
- Hard to generate heuristics (best guesses or estimations) without ample domain knowledge
- Easy to evade with adversarial techniques (an attacker can disguise malware as benign within a sequence of actions)
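
To make the rule-based approach and its last weakness concrete, here is a minimal sketch in Python. The activity names and the signature are hypothetical (they are not from the paper); the point is only that an exact-match rule can be evaded by slipping a benign-looking action into the sequence.

```python
# Hypothetical signature: a contiguous run of activities treated as malicious.
SIGNATURE = ("read_contacts", "open_socket", "send_data")


def matches_signature(sequence, signature=SIGNATURE):
    """Return True if `signature` appears as a contiguous run in `sequence`."""
    n = len(signature)
    return any(
        tuple(sequence[i:i + n]) == signature
        for i in range(len(sequence) - n + 1)
    )


plain = ["launch", "read_contacts", "open_socket", "send_data"]
# Same behavior, but with a benign-looking action inserted to break the pattern.
disguised = ["launch", "read_contacts", "show_toast", "open_socket", "send_data"]

print(matches_signature(plain))      # True
print(matches_signature(disguised))  # False
```

The rule catches the plain sequence but misses the disguised one, even though the harmful actions are all still there.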

Natural Language Processing Techniques for Malware Detection

NLP approaches can significantly help malware detection because both fields analyze sequences
Long Short-Term Memory (LSTM)
- Has already been shown to be effective in log file analysis and anomaly detection (activity sequences can be modeled much like language)
- A form of recurrent neural network that is good for processing and clustering long sequences of data
- Incorporates information from previous inputs when processing the current input, combating the issue of forgetting information
- Utilizes gates to retain essential information and discard unnecessary information
Bidirectional Encoder Representations from Transformers (BERT)
- Transformers use attention mechanisms for sequence modeling
- Considers the whole sentence instead of individual words to develop context
- Utilizes a Masked Language Model (MLM) objective that randomly masks parts of the input and predicts each masked word from its context
- Looks at words in sentences in both directions
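
The masking step behind MLM can be sketched in a few lines of Python. This is a simplified illustration, not the paper's code: the activity names, the `mask_tokens` helper, and the masking rate are assumptions. The original BERT recipe masks roughly 15% of tokens and sometimes keeps or swaps the chosen token instead of masking it, which this sketch leaves out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK and return (masked, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it had to predict.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical activity sequence standing in for a sentence.
activities = ["launch", "read_contacts", "open_socket", "send_data", "close"]
masked, labels = mask_tokens(activities, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each gap, it has to use context from both directions to fill it in.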

Background on Datasets and Pre-processing

Malware samples are rare, which creates imbalanced datasets
The paper uses Android OS activity sequences generated by WildFire
An activity refers to any API call
Machine learning models operate on numbers, so each action in a sequence is mapped to a number
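
A minimal sketch of that mapping step, assuming made-up activity names and a hypothetical `<unk>` ID for actions not seen during training (the paper's actual vocabulary and tokenization are not described here):

```python
def build_vocab(sequences, unk_token="<unk>"):
    """Assign each distinct activity an integer ID in first-seen order."""
    vocab = {unk_token: 0}
    for seq in sequences:
        for activity in seq:
            # setdefault only inserts when the activity is new.
            vocab.setdefault(activity, len(vocab))
    return vocab


def encode(sequence, vocab, unk_token="<unk>"):
    """Replace each activity with its ID, falling back to the unknown ID."""
    return [vocab.get(activity, vocab[unk_token]) for activity in sequence]


# Hypothetical training logs.
train_logs = [
    ["launch", "open_file", "close"],
    ["launch", "open_socket"],
]
vocab = build_vocab(train_logs)
print(vocab)
# {'<unk>': 0, 'launch': 1, 'open_file': 2, 'close': 3, 'open_socket': 4}
print(encode(["launch", "open_socket", "delete_file"], vocab))  # [1, 4, 0]
```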

Approaches in Analyzing Android Activities

N-gram
- Processes a group of several activities/sequences together
- N-grams are redundant and do not guarantee locality
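
For reference, n-gram extraction over an activity sequence can be sketched as below. The sequence is made up for illustration. Note how each activity shows up in several overlapping n-grams, which is one way to see the redundancy mentioned above.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return every contiguous window of n activities as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Hypothetical activity sequence.
seq = ["launch", "open_socket", "send_data", "open_socket", "send_data"]
bigrams = ngrams(seq, 2)
counts = Counter(bigrams)

print(len(bigrams))                          # 4
print(counts[("open_socket", "send_data")])  # 2
```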
BERT
- Using a pretrained BERT model works better than training from scratch
- Performs well for Android malware detection
- Learns sequence patterns that are relevant to Android maliciousness

Experiments

Evaluation metrics include accuracy, precision, recall, and F1 score
Baseline detection methods such as clustering, autoencoder, DAGMM, and DeepLog were also evaluated, but their performance was not ideal
Pretrained and trained-from-scratch BERT models were evaluated on Android activity logs along with LSTM
- Pretrained BERT had the highest F1 score
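
To see why F1 matters on highly imbalanced data, here is a small sketch that computes the four metrics from scratch. The labels are synthetic (98 benign, 2 malicious) and are not results from the paper; they only show that a detector that never flags anything can still score high accuracy.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic, imbalanced labels: 1 = malicious, 0 = benign.
y_true = [0] * 98 + [1] * 2
always_benign = [0] * 100

metrics = classification_metrics(y_true, always_benign)
print(metrics)
# {'accuracy': 0.98, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

Accuracy looks excellent at 98%, yet the detector catches no malware at all, which the F1 score of 0 makes obvious.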

My Thoughts

This is a very interesting paper that offers strong approaches to detecting malware by representing activity logs as sequences. I have had experience with all of these NLP models, but have only trained them on corpora of text. The way this presentation and paper show that NLP models such as BERT can effectively classify activity logs has taught me a lot. I’ve learned that activity logs and sequences of text are actually quite similar, and applying NLP models to logs really helps in finding the context of an event relative to its surrounding events. This idea of contextual clarity is very important, as one event could affect future events. Analyzing activity logs in relation to time and position could significantly help in determining which features make up malware and why they appear at that point in a sequence. Lastly, as malware detection is mainly a classification process (classifying as benign or malicious), it is important to consider all classification metrics (accuracy, precision, recall, and F1). I believe that F1 is the most robust metric as it accounts for both precision and recall. However, in this field of ML for cybersecurity there is never a one-size-fits-all answer, so it is important to experiment and deduce from there.

Discussion Summary

We can utilize a confusion matrix to understand evaluation metrics (however, labels change in reality)
False positives can be more important to minimize in some contexts (e.g., Coca Cola shutting down production because of a false positive)
Transformers are great for malware activity detection as they look at the whole sequence of activity and can determine the context of a single action based on previous actions while predicting future actions
What determines if something is malware? Not features alone. Rather, when training a model, the labeled data reflects that the majority of people believe this data point is goodware and that data point is malware; thus malware is recognized (not really detected) statistically, based on correlation
Transformers can be applied to malware detection even though they are mainly used for text. For instance, malware detection can be modeled with transformers by mapping a binary file to an image, using the file bytes in place of pixels
Another way to model malware detection is as a Markov chain or Markov decision process over a state space (reinforcement learning can then be used too)
Dynamic analysis is better because the sequence you see dynamically is not the sequence you see statically
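
The byte-to-pixel idea from the discussion can be sketched as follows. This shows one simple layout (fixed-width rows of grayscale values with zero padding); the `bytes_to_image` helper and the width of 16 are assumptions for illustration, not something specified in class.

```python
def bytes_to_image(data, width=16, pad=0):
    """Arrange raw bytes into rows of `width` grayscale pixels (0-255)."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        # Pad the final row so every row has the same width.
        row += [pad] * (width - len(row))
        rows.append(row)
    return rows


# Stand-in for file contents; a real file could be read with open(path, "rb").
sample = bytes(range(40))
image = bytes_to_image(sample, width=16)

print(len(image), len(image[0]))  # 3 16
print(image[2])
# [32, 33, 34, 35, 36, 37, 38, 39, 0, 0, 0, 0, 0, 0, 0, 0]
```

The choice of width is arbitrary here: bytes that end up vertically adjacent in the image are `width` positions apart in the file, so the layout itself decides which bytes look like neighbors.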

My Thoughts

While this discussion helped in reiterating the main concepts from the presentation, it also stirred up new insights and ideas on malware detection as a whole. For instance, I’ve had experience with classification metrics (accuracy, precision, recall, F1) and confusion matrices in my machine learning course, but have never really thought about how they’re applied in the cybersecurity space. As noted in the first bullet point, in a traditional ML setting we look at a confusion matrix to see whether a data point is classified correctly, such as an apple being predicted as an apple or as an orange. These data points always keep the same labels, as they’re derived from natural features. However, that is not the case with malware. Malware is constantly evolving and changing, and thus its labels change too. That is something we must account for in ML for cybersecurity.

Additionally, there is an emphasis on minimizing false positives. We must think about the consumer and the consequences of false positives when considering machine learning models in cybersecurity. That was hard for me to grasp, as my experience is with traditional machine learning. On top of representing activities as sequences, I found the point about mapping binary files to images to be very interesting. This is a great step in easing the ML for cybersecurity process by reusing existing methods, but there are still various cases to cover. One of these is the concept of pixel locality in images, where neighboring pixels tend to be related. Bytes in binary files do not share the same property. Modeling malware detection as a reinforcement learning problem could also open new frontiers in this field. At the end of the day, reinforcement learning entails maximizing a reward for an agent moving through an environment via a sequence of actions. Similarly, malware logs are sequences of actions where an action in one state impacts future states. Lastly, the comparison between static and dynamic analysis in this discussion gave me good insights. The class converged on the idea that dynamic analysis is overall better, and I agree. Dynamic analysis actually steps through the execution of a file or software, whereas static analysis just looks at all of its features. Thus, dynamic analysis ultimately shows how malware would behave in reality.
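
As a small illustration of the state-based view of activity logs, the sketch below estimates Markov chain transition probabilities from a few made-up sequences. It is only a plain Markov chain over hypothetical activity names, not a full Markov decision process or a reinforcement learning setup, which would also need actions and rewards.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next activity | current activity) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each activity with the one that follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(followers.values())
                for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical activity logs.
logs = [
    ["launch", "open_socket", "send_data"],
    ["launch", "open_file", "close"],
    ["launch", "open_socket", "close"],
]
probs = transition_probabilities(logs)
print(probs["open_socket"])  # {'send_data': 0.5, 'close': 0.5}
print(probs["open_file"])    # {'close': 1.0}
```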

Thanks for reading my summary and thoughts on seminar 1.2!
