NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.

During this seminar, Bhavan Dondapati, Vishal Vardhan Adepu, and Rohith Yogi Nomula presented Fast & Furious: On the modelling of malware detection as an evolving data stream. After their presentation our class had an open discussion related to the paper and more. This blog post will cover a summary of the information presented as well as a summary of our class discussion.

Presentation Summary


Background

  • The paper explores the impact of concept drift on malware classifiers with experiments using Android datasets
  • Concept drift denotes how malware detectors struggle to identify new and evolving malware as a result of their changing characteristics
  • Various strategies for mitigating concept drift are explored in this paper
  • A novel data stream pipeline that combats concept drift is proposed and discussed

Data

  • Android datasets are used:
    • Android is globally the most used operating system (more than 2 billion active devices monthly)
    • It has surpassed Windows with a 40% market share
    • Malware attacks, evolution, and distribution follow from this large user base
  • DREBIN dataset:
    • Approximately 130,000 apps
    • Static features
    • 123,453 goodware and 5,560 malware
  • AndroZoo
    • Approximately 350,000 apps
    • Provides both static and dynamic features; the paper focuses on the static ones
    • Collected over 9 years (2009-2018)
    • 267,342 goodware and 80,102 malware
    • Temporal information is incorporated through the first seen date on VirusTotal
  • Both datasets are heavily imbalanced toward goodware

Threat Models and Assumptions

  • The threat model considers an AV for Android, since Android's market leadership means malware can affect a large number of users
  • The detection model is completely static (features are retrieved directly from APK files)
    • Static malware detection is the most popular and fastest way to triage malware samples
    • The most prevalent approach for detecting malware within the Android environment
  • The goal of the paper is to highlight the need for updating classifier-based ML models, not to implement an actual AV
  • The behavior of an online AV is simulated through offline experiments over data streams

Data Stream

The proposed data stream pipeline updates the feature extractor in addition to the classifier (a minimal sketch of this loop follows the list):

  1. Obtain a new sample X from the raw data stream
  2. Use the feature extractor E (trained on previous data) to extract features from X
  3. Predict the class of X with the classifier C (trained on previous data)
  4. Using C’s prediction, update the drift detector D to check the drift level
  5. Based on the drift level, one of three steps occurs (all of them restart the pipeline):
     a. Normal: incrementally update C with X
     b. Warning: incrementally update C with X and add X to a buffer
     c. Drift: retrain both E and C using only the data collected during the warning level (the buffer built during that level), creating a new extractor and classifier
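
A minimal sketch of this loop is shown below. It assumes scikit-multiflow for the drift detector and streaming classifier and scikit-learn for TF-IDF; the bootstrap data and `raw_stream` of (text, label) pairs are hypothetical toy placeholders, not the paper's actual implementation.

```python
# Minimal sketch of the five-step pipeline above (not the paper's actual code).
from sklearn.feature_extraction.text import TfidfVectorizer
from skmultiflow.drift_detection import DDM
from skmultiflow.meta import AdaptiveRandomForestClassifier

# Toy "documents" of textual APK attributes and labels (1 = malware, 0 = goodware)
initial_texts = ["SEND_SMS READ_SMS sendTextMessage", "INTERNET ACCESS_WIFI_STATE openConnection"]
initial_labels = [1, 0]
raw_stream = [("SEND_SMS sendTextMessage INTERNET", 1), ("ACCESS_WIFI_STATE openConnection", 0)] * 50

extractor = TfidfVectorizer(max_features=100)      # feature extractor E
classifier = AdaptiveRandomForestClassifier()      # classifier C
detector = DDM()                                   # drift detector D
warning_buffer = []                                # samples collected at the warning level

# Bootstrap E and C with some initial data
X_init = extractor.fit_transform(initial_texts).toarray()
classifier.partial_fit(X_init, initial_labels, classes=[0, 1])

for text, label in raw_stream:                     # 1. obtain new sample X
    x = extractor.transform([text]).toarray()      # 2. extract features with E
    y_pred = classifier.predict(x)                 # 3. predict the class of X with C
    detector.add_element(int(y_pred[0] != label))  # 4. update D with the prediction error

    if detector.detected_change() and warning_buffer:    # 5c. drift: retrain E and C from the buffer
        buf_texts, buf_labels = zip(*warning_buffer)
        extractor = TfidfVectorizer(max_features=100)
        X_buf = extractor.fit_transform(buf_texts).toarray()
        classifier = AdaptiveRandomForestClassifier()
        classifier.partial_fit(X_buf, list(buf_labels), classes=[0, 1])
        detector = DDM()
        warning_buffer = []
    elif detector.detected_warning_zone():                # 5b. warning: update C and buffer X
        classifier.partial_fit(x, [label])
        warning_buffer.append((text, label))
    else:                                                 # 5a. normal: incrementally update C
        classifier.partial_fit(x, [label])
```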

Representation

  • Uses two types of feature representation, Word2Vec and TF-IDF (a small sketch follows this list):
    • Both are widely used for text classification
    • Word2Vec:
      • Converts textual attributes of malware, such as API calls or system calls, into dense vectors
      • The vectors capture semantic similarities between attributes
    • TF-IDF:
      • Assigns weights to attributes based on their frequency within a document and across the corpus
      • Highlights the significance of terms within each sample
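
The sketch below illustrates both representations over hypothetical textual attributes (API call names), using scikit-learn's TfidfVectorizer and gensim's Word2Vec with the paper's 100-dimension setting; averaging a sample's token vectors is one common pooling choice and may differ from the paper's exact aggregation.

```python
# Turn each app's textual attributes (one "document" per app) into fixed-size vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

samples = [
    "sendTextMessage getDeviceId openConnection",   # hypothetical API-call documents
    "openConnection getPackageInfo",
    "sendTextMessage getSubscriberId",
]

# TF-IDF: one weight per vocabulary term, emphasizing terms that are frequent
# in a sample but rare across the corpus.
tfidf = TfidfVectorizer(max_features=100)
X_tfidf = tfidf.fit_transform(samples).toarray()        # shape: (n_samples, vocab_size)

# Word2Vec: learn a dense 100-d embedding per token; each sample is then
# represented as the average of its token vectors.
tokenized = [s.split() for s in samples]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, seed=42)
X_w2v = np.array([np.mean([w2v.wv[t] for t in toks], axis=0) for toks in tokenized])

print(X_tfidf.shape, X_w2v.shape)
```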

Concept Drift Detectors

  • Four detectors are considered: Drift Detection Method (DDM), Early Drift Detection Method (EDDM), ADaptive WINdowing (ADWIN), and Kolmogorov-Smirnov WINdowing (KSWIN); a usage sketch follows this list
  • DDM and EDDM
    • Online supervised methods based on sequential error monitoring (each incoming example is processed separately to estimate the sequential error rate)
    • They assume that an increase in the consecutive error rate suggests the occurrence of concept drift. DDM directly uses the error rate, while EDDM uses the distance-error rate, which measures the number of examples between two errors
    • They have two trigger levels: Warning and Drift
    • The Warning level suggests that the concept is starting to drift; an alternative classifier is then updated using the samples observed at this level
    • The Drift level suggests that concept drift has occurred, and the alternative classifier built during the warning level replaces the current classifier
  • ADWIN
    • Keeps statistics over a sliding window of variable size and detects change by cutting the window at different points and comparing the averages of the resulting sub-windows
    • If the difference between the two sub-windows exceeds a predefined threshold, concept drift is signaled
  • KSWIN
    • Uses a sliding window of fixed size and compares the most recent samples of the window with the remaining ones via the Kolmogorov–Smirnov (KS) statistical test
    • When a change occurs, only the most recent samples of the window are kept and used to retrain the classifier and feature extractor
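
The sketch below shows how these four detectors might be fed a stream of prediction errors, assuming scikit-multiflow's API (add_element / detected_warning_zone / detected_change). The error stream is synthetic, and only DDM and EDDM raise a meaningful warning level before drift.

```python
# Synthetic demonstration: the classifier is accurate for a while, then starts failing.
from skmultiflow.drift_detection import ADWIN, DDM, EDDM, KSWIN

error_stream = [0] * 200 + [1] * 100   # 1 = misclassified sample, 0 = correct

for name, detector in [("DDM", DDM()), ("EDDM", EDDM()), ("ADWIN", ADWIN()), ("KSWIN", KSWIN())]:
    for i, err in enumerate(error_stream):
        detector.add_element(err)
        if detector.detected_warning_zone():     # only DDM/EDDM set this level
            print(f"{name}: warning at sample {i}")
        if detector.detected_change():
            print(f"{name}: drift detected at sample {i}")
            break
```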

Classifiers

  • Adaptive Random Forest
    • An adaptation of random forest for data streams
    • Has the best overall performance and is widely used in malware detection
  • Stochastic Gradient Descent
    • One of the fastest online classifiers available in scikit-learn (a sketch of both classifiers follows this list)
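
Here is a minimal sketch of training both classifiers one sample at a time, using scikit-multiflow's AdaptiveRandomForestClassifier and scikit-learn's SGDClassifier with partial_fit; the 100-dimensional feature vectors and labels are synthetic stand-ins for TF-IDF features, not the paper's data or tuning.

```python
# Toy incremental (test-then-train) run of both streaming classifiers.
import numpy as np
from sklearn.linear_model import SGDClassifier
from skmultiflow.meta import AdaptiveRandomForestClassifier

rng = np.random.default_rng(0)
X_stream = rng.random((300, 100))                  # synthetic feature vectors
y_stream = (X_stream[:, 0] > 0.5).astype(int)      # synthetic labels (1 = malware)

arf = AdaptiveRandomForestClassifier()             # random forest adapted to streams
sgd = SGDClassifier()                              # fast linear online learner

arf_hits = sgd_hits = 0
for i in range(len(X_stream)):
    x, y = X_stream[i:i + 1], y_stream[i:i + 1]
    if i > 0:                                      # predict before learning (prequential style)
        arf_hits += int(arf.predict(x)[0] == y[0])
        sgd_hits += int(sgd.predict(x)[0] == y[0])
    arf.partial_fit(x, y, classes=[0, 1])          # incremental update, one sample at a time
    sgd.partial_fit(x, y, classes=[0, 1])

print(f"ARF prequential accuracy: {arf_hits / (len(X_stream) - 1):.2f}")
print(f"SGD prequential accuracy: {sgd_hits / (len(X_stream) - 1):.2f}")
```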

Experiments

  • The best-case scenario for AVs (ML cross-validation):
    • The goal is to classify all samples together to compare feature extraction algorithms and establish baseline results
    • Tested various parameters for TF-IDF and Word2Vec algorithms
    • Fixed vocabulary size at 100 for TF-IDF and created projections with 100 dimensions for Word2Vec
    • Employed 10-fold cross-validation to evaluate models, reducing biases and simulating a mixed view of past and future threats for AV operations
    • ML classifiers performed well when trained on mixed datasets
      • Indicates their ability to learn features from various periods
      • TF-IDF is preferred due to higher performance
    • TF-IDF outperformed Word2Vec in most metrics across both the DREBIN and AndroZoo datasets
    • TF-IDF had great recall for malware detection, while Word2Vec had slightly higher precision for goodware detection
    • TF-IDF had higher accuracy and F1-score
  • On classification failure (temporal classification)
    • Significant drop in all metrics compared to cross-validation in both DREBIN and AndroZoo datasets
    • Indications of concept drift in malware samples, evidenced by notably smaller recall rates
    • Bias toward goodware detection due to dataset imbalance
  • Real-world scenario (windowed classifier):
    • AVs constantly update and refine their systems as they identify new samples
    • The Incremental Windowed Classifier (IWC) is an ML method that uses incremental stream learning to continuously refine classifiers in a windowed fashion (a sketch of this windowed protocol appears at the end of this section)
    • Evaluation of IWC:
      • DREBIN and a subset of AndroZoo are used
      • Training data is collected until a specific month and testing data is formed one month later
      • Uses Adaptive Random Forest and Stochastic Gradient Descent as the classifiers
      • Remove months with no samples
      • Retrains classifiers and feature extractors monthly
      • Evaluated IWC using precision and recall
  • Concept drift detection using data stream pipeline (Fast & Furious – F&F)
    • Continuous learning is effective for threat detection, but concept drift is still a challenge as seen in AndroZoo dataset results
    • Evaluation of drift detector methods:
      • DREBIN and a subset of AndroZoo are used
      • Uses Adaptive Random Forest and Stochastic Gradient Descent as the classifiers
      • Drift detectors: DDM, EDDM, ADWIN, and KSWIN
      • Used TF-IDF as a feature extractor with 100 features for each textual attribute
      • Initialized the base classifier in the DREBIN dataset with data from the first year and in the AndroZoo dataset with data from the first month
      • Evaluated drift detection algorithms in two scenarios during classifier creation:
        • Classifier updated with new samples (collected in warning level or ADWIN window) for a rapid response to new threats
        • Extracted all features from raw data, retrained both classifier and feature extractor from scratch, building a new vocabulary based on the words in the training set for each textual attribute (a more comprehensive approach, but time-consuming)
      • DroidEvolver: Implemented a version of DroidEvolver for comparison
  • Multiple Time Spans
    • There is potential bias from the use of the same training and test sets
    • The use of multiple time spans can mitigate biased evaluations
    • Experiment setup:
      • Uses cumulative training set
      • Split the dataset into eleven folds of equal size, simulating k-fold cross-validation
      • Incrementally added training data while removing the corresponding test data in each iteration
      • Evaluated both classifiers (Adaptive Random Forest, SGD) and drift detectors (DDM, EDDM, ADWIN, KSWIN) with both update and retrain methods
      • IWC’s performance declines as the data volume in the stream increases
      • KSWIN with retrain is recommended for static Android malware detection streams
      • The varied testing conditions reduce the risk of biased results
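
As referenced above, here is a minimal sketch of the IWC-style windowed protocol (cumulative monthly training, testing on the following month). It assumes a hypothetical pandas DataFrame with 'month', 'text', and 'label' columns, and swaps in scikit-learn's batch RandomForestClassifier as a simple stand-in for the paper's streaming classifiers.

```python
# Windowed evaluation: train on all data up to month m, test on month m+1,
# retraining both the feature extractor and the classifier each round.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score

def iwc_evaluation(df: pd.DataFrame) -> pd.DataFrame:
    months = sorted(df["month"].unique())
    rows = []
    for train_end, test_month in zip(months[:-1], months[1:]):
        train = df[df["month"] <= train_end]            # cumulative training window
        test = df[df["month"] == test_month]            # the following month
        if test.empty or train["label"].nunique() < 2:
            continue                                    # skip months without usable data

        extractor = TfidfVectorizer(max_features=100)   # retrained every month
        clf = RandomForestClassifier(n_estimators=100)  # retrained every month
        clf.fit(extractor.fit_transform(train["text"]), train["label"])

        y_pred = clf.predict(extractor.transform(test["text"]))
        rows.append({
            "test_month": test_month,
            "precision": precision_score(test["label"], y_pred, zero_division=0),
            "recall": recall_score(test["label"], y_pred, zero_division=0),
        })
    return pd.DataFrame(rows)

# Usage (with a suitably prepared DataFrame): results = iwc_evaluation(df)
```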

My Thoughts

This paper was very helpful in deepening my understanding of concept drift detection and the impact of different ML models and feature extraction techniques. Starting with performance metrics and results, this seminar showed that F1-score, recall, and precision should be considered when evaluating malware detection effectiveness, not just accuracy. This has been mentioned before, but it should continue to be reiterated. As demonstrated through the experiments and results, continuous learning methods combat the issue of fixed models (which struggle to predict and classify future threats). However, continuous learning methods don’t eliminate the issue of drift, which calls for the use of drift detectors.

The paper introduces a novel data stream pipeline to address this by continuously updating feature extractors and classifiers in response to new samples, employing strategies like incremental updates and retraining upon drift detection. Using static features from APK files and popular feature representation methods like Word2Vec and TF-IDF, alongside concept drift detectors like DDM, EDDM, ADWIN, and KSWIN, the study highlights the importance of adaptive learning in maintaining high detection rates amid malware evolution. Concept drift is not going away; malware threats are dynamic and evolving by nature, so the suggestion to continuously update malware detection systems is essential. As demonstrated, incorporating temporal information can also significantly reduce false positives.


Discussion Summary

  • Everything in the data stream pipeline is automated
    • Someone must still provide ground-truth labels for the drift detector (otherwise, how would it know whether a sample is malware or not?)
    • Dynamic analysis is used to automate the pipeline
  • Warning on the data pipeline denotes that a drift is about to occur (error rate is increasing, for instance)
  • Key message of this paper is that we should also focus on re-training the feature extractor, not just the classifier
    • Features change over time (they cause concept drift themselves)
  • Reasons for concept drift happening: attacker decides to change, features in software change, both attackers and defenders are dynamic
  • SMS malware: malware that persists on the system and sends SMS messages from the victim’s device
  • Word2Vec and TF-IDF are used for embedding string information in binaries (like API calls and system calls)

My Thoughts

Understanding how the data stream pipeline works is important. This is accomplished through dynamic analysis, as the retraining process of the model is performed within a controlled environment like a sandbox. Each new sample is therefore executed in the sandbox without causing actual harm. This execution allows the feature extractor to pull any new features and update and retrain the model itself. As mentioned in the seminar, it is important to retrain the feature extractor on top of the classifier. The classifier uses the features to make a prediction, but as mentioned in this discussion, features are what cause concept drift. Samples are composed of features and are identified by them, so if a classifier does not learn the new sample’s features properly, concept drift is not really solved. It is also important to see that both attackers and defenders are dynamic. Sure, attackers can continuously evolve malware and its features, but defenders can continuously accommodate those changes as well. Effective concept drift detectors can go a long way.


That is all for this seminar, thanks for reading!

