NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.

During this seminar, Vasudeva Vallabhajosyula presented DroidEvolver: Self-Evolving Android Malware Detection System. After their presentation our class had an open discussion related to the paper and more. This blog post will cover a summary of the information presented as well as a summary of our class discussion.

Presentation Summary


Introduction

  • It is difficult to build Android malware detection systems that are effective at detecting new malware while being trained with old Android applications
  • Existing systems face the challenges of determining when to retrain a detection system, the cost of manually labeling new malware, and retraining with cumulative datasets
  • DroidEvolver, a system that keeps malware detection accurate over time by making necessary updates to its detection models with an evolving feature set, is proposed
    • Maintains a model pool of different detection models that are initialized with a set of labeled applications using various online learning algorithms
    • Intuition is that different models are unlikely to age at the same pace in malware detection, even when initialized with the same dataset
    • Performs weighted votes among “young” detection models to classify each application based on its Android API calls
    • “Young” is determined through the Juvenilization Indicator (JI), which computes the similarity between the detected app and a batch of previously processed apps that the model assigned the same prediction label
    • DroidEvolver dynamically updates its feature set and aging detection models using JI to identify drifting applications, ensuring effective adaptation to new API calls or usage patterns without manual intervention
    • Updates its model pool for malware detection without needing true labels for applications, utilizing pseudo labels for identified drifting applications to update aging models, thus minimizing manual labeling and associated costs
    • When no drift is detected, all models in the pool contribute to the classification without any updates, ensuring efficient and cost-effective malware detection evolution
    • DroidEvolver leverages online learning algorithms for its detection models, enabling incremental learning over streaming data without the need for periodic retraining, making it highly efficient for continuous malware detection
    • It updates only aging models when a drifting application is detected, using pseudo labels instead of true labels, enhancing practicality and efficiency by promptly juvenilizing models as necessary without extensive manual intervention
    • DroidEvolver is evaluated rigorously using 34,722 malicious apps and 33,294 benign apps
    • It has an F-measure that is 2.11 times higher than MAMADROID’s, and its F-measure declines by 1.06% per year whereas MAMADROID’s declines by 13.52% per year
  • This paper covers the efficacy and efficiency of DroidEvolver and compares it with the state-of-the-art detection system: MAMADROID

Design of DroidEvolver

  • There are two phases: initialization and detection
  • During the initialization phase, DroidEvolver processes a set of known applications with true labels to generate a set of features and detection models, which are then utilized in the detection phase
  • In the detection phase, DroidEvolver evaluates applications with unknown labels, applying the previously generated models to output prediction labels for these applications
  • Initialization phase:
    • The initialization phase of DroidEvolver involves decompiling application APK files to disassembled dex bytecode using apktool, allowing the preprocessor module to identify API calls within each application
    • A feature extraction module then identifies and records the binary presence of Android APIs across all applications, constructing an initial ordered feature set and corresponding feature space through a 1-to-1 mapping of features to dimensions
    • The vector generation module produces a feature vector for each application relative to the detection models, mapping detected features to the feature space, with present features assigned to one and absent features to zero
    • The model pool construction module creates an initial set of detection models from the generated feature vectors of input applications, using various online learning algorithms based on their true labels and feature vectors
    • After the initialization phase, there is a transition to the detection phase with an established feature set and model pool, where each model is marked with a feature set indicator reflecting the number of processable features, potentially adjusting for expansion during detection
  • Detection phase:
    • DroidEvolver dynamically updates its feature set and adapts its detection models to classify each unknown application as malicious or benign, incorporating new features discovered during the detection phase
    • The detection phase mirrors the initialization phase’s first three modules, with modifications including dynamic feature set updates, construction of model-specific feature spaces based on feature set indicators, and generation of tailored feature vectors for each model
    • Feature extraction module identifies Android APIs across consistent API families (android, java, javax), ensuring comprehensive detection of new API calls within these families despite the significant increase in API packages from Android version 1.0 to 8.1
    • The classification and evolvement module classifies each unknown application as malicious or benign and identifies aging models, updating the feature set with new Android API calls from the application without altering existing feature ordinals
    • For aging models, DroidEvolver adjusts the feature set indicator to match the updated feature set’s size and updates these models based on the application’s classification result and its updated feature vector
  • Specifics of model pool construction:
    • Model pool is used, each initialized with online learning algorithms during the initialization phase, to overcome the limitations and biases of single-model detection, ensuring more accurate and reliable malware detection results
    • Each detection model is built on a distinct online learning algorithm; models process applications individually, with complexity linear in the input size, in contrast to batch learning’s requirement of processing all data simultaneously
    • DroidEvolver processes a sequence of N applications, each represented by a d-dimensional feature vector and a binary label (malicious or benign), using online learning algorithms where each application influences a d-dimensional weight vector in the model, reflecting the importance of features for classification
    • The model pool consists of five linear online learning algorithms: Passive Aggressive, Online Gradient Descent, Adaptive Regularization of Weight Vectors, Regularized Dual Averaging, and Adaptive Forward-Backward Splitting
    • They encompass a broad range of online learning strategies, from first-order and second-order learning to learning with regularization, addressing diverse computational and optimization needs
    • First-order algorithms focus on optimizing objective functions using gradient information, offering linear computational complexity, while second-order algorithms, utilizing more detailed gradient information, aim for faster optimization convergence but at a higher computational cost, especially with high-dimensional data
    • DroidEvolver’s adaptability is enhanced through regularization algorithms which leverage data sparsity to manage high-dimensional challenges, and the varied update policies, learning rates, optimization methods, and loss functions of these algorithms allow for differential aging within the model pool
  • Classification and evolvement:
    • This module performs classification for each unknown application and makes necessary updates to its feature set and model pool
    • Three steps are involved: drifting application identification, classification and pseudo label generation, and aging model juvenilization
    • DroidEvolver combats concept drift by detecting “drifting applications” that deviate from previously processed data, identifying and updating “aging models” accordingly to sustain its malware detection efficacy
    • A juvenilization indicator (JI) is used to assess if an unknown application represents a concept drift, by measuring its similarity to a batch of previously processed applications using a detection model
    • DroidEvolver maintains an app buffer of size K, storing feature vectors from a subset of processed applications, and updates this buffer by replacing one vector with the new application’s vector, ensuring the app buffer remains current for accurate JI calculations
    • Each unknown application is classified as “malicious” or “benign” based on a weighted voting system involving all models in the pool, where the classification depends on whether the sum of weighted feature vectors is non-negative or not
    • If an application is deemed a drifting application, DroidEvolver excludes aging models from the weighted voting process for classification, using the result as a pseudo label to update the feature set and aging models
    • When all models are aging or none are, DroidEvolver includes all models in the weighted voting for classification but does not update the models
    • Drifting applications reveal new features or patterns not present in previously processed applications, hence requiring updates to the feature set and to the model structures of aging models
    • DroidEvolver updates its feature set with new features from the drifting application and individually adjusts each aging model’s structure and feature set indicator based on the application and its pseudo label to ensure the models remain effective over time
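The classification-and-evolvement loop described above can be sketched in a few dozen lines. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the pool is reduced to identical Passive-Aggressive-style learners, the JI is approximated with Jaccard similarity over binary API feature sets, the app-buffer replacement policy is omitted, and all API names, thresholds, and class names (`OnlineModel`, `classify`) are hypothetical.

```python
def jaccard(a, b):
    """Similarity between two binary feature sets (a stand-in for the paper's JI similarity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class OnlineModel:
    """A linear detection model with a Passive-Aggressive-style update
    (one of the five online learning algorithms in the pool)."""
    def __init__(self):
        self.w = {}        # sparse weight vector: feature -> weight
        self.buffer = []   # (feature set, label) pairs: the per-model app buffer

    def score(self, feats):
        return sum(self.w.get(f, 0.0) for f in feats)

    def predict(self, feats):
        return 1 if self.score(feats) >= 0 else -1  # 1 = malicious, -1 = benign

    def ji(self, feats):
        """Juvenilization Indicator: mean similarity to buffered apps
        that received the same prediction label from this model."""
        label = self.predict(feats)
        same = [f for f, l in self.buffer if l == label]
        if not same:
            return 1.0
        return sum(jaccard(feats, f) for f in same) / len(same)

    def update(self, feats, label):
        """Passive-Aggressive step: move the weights only if the margin is violated."""
        loss = max(0.0, 1.0 - label * self.score(feats))
        if loss > 0 and feats:
            tau = loss / len(feats)  # squared norm of a binary vector = number of present features
            for f in feats:
                self.w[f] = self.w.get(f, 0.0) + tau * label

def classify(pool, feats, drift_threshold=0.3):
    """Weighted vote among 'young' models; aging models are excluded from the
    vote and juvenilized with the pseudo label, as in the paper's detection phase."""
    young = [m for m in pool if m.ji(feats) >= drift_threshold]
    drifting = 0 < len(young) < len(pool)        # all-aging or none-aging: vote, no update
    voters = young if drifting else pool
    label = 1 if sum(m.score(feats) for m in voters) >= 0 else -1
    if drifting:
        for m in pool:
            if m not in young:
                m.update(feats, label)           # pseudo label: no manual labeling needed
    return label

# Initialization phase: seed the pool with a few labeled apps (toy data, hypothetical API names)
pool = [OnlineModel() for _ in range(3)]
labeled = [({"sendTextMessage", "getDeviceId"}, 1), ({"onDraw", "onCreate"}, -1)]
for m in pool:
    for feats, y in labeled:
        m.update(feats, y)
        m.buffer.append((feats, y))
```

After initialization, `classify(pool, {"sendTextMessage"})` votes across the pool; a real implementation would also append the new feature vector to each model's buffer and grow the feature set with unseen APIs.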

Experimental Settings and Parameter Tuning

  • Data collection:
    • A set of applications was collected in July 2017 from an open Android app collection project
    • Labels were determined and obtained from VirusTotal (an application is labeled benign if no alarm was raised and malicious if it received at least 15 alarms from the 63 scanners)
    • Application timestamps were determined by the packaging date found in the apk file’s dex file, with the dataset covering applications from the years 2011 to 2016, providing a comprehensive temporal span for analysis
  • Metrics and measurements:
    • DroidEvolver is assessed using F-measure
    • Validation set is used to avoid over-fitting
    • The dataset spans six time periods from 2011 to 2016, with applications for each period randomly divided into five equal parts; three parts are used for training, one for validation, and one for testing, ensuring a comprehensive evaluation across different temporal segments
  • Parameter tuning:
    • Two thresholds are used to identify drifting applications and aging models according to JI values in the detection phase
    • App buffer is used as previously mentioned where the default size is K=500
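The evaluation split described above (each yearly period divided randomly into five equal parts, with three for training, one for validation, and one for testing) can be sketched as follows; the app identifiers and seed are hypothetical toy data.

```python
import random

def temporal_split(apps_by_year, seed=0):
    """Split each period's apps into five equal parts: 3 train, 1 validation, 1 test."""
    rng = random.Random(seed)
    splits = {}
    for year, apps in apps_by_year.items():
        apps = list(apps)
        rng.shuffle(apps)
        k = len(apps) // 5
        splits[year] = {
            "train": apps[:3 * k],
            "val": apps[3 * k:4 * k],
            "test": apps[4 * k:5 * k],
        }
    return splits

# Toy usage: six periods (2011-2016) of ten hypothetical app IDs each
data = {year: [f"app_{year}_{i}" for i in range(10)] for year in range(2011, 2017)}
splits = temporal_split(data)
```

Splitting within each period (rather than across the whole dataset) preserves the temporal structure needed to test detection over time.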

Evaluation and Analysis

  • Detection in the same time period:
    • DroidEvolver outperforms MAMADROID consistently and significantly, achieving 15.80% higher F-measure, 12.97% higher precision, and 17.57% higher recall on average
    • By learning from drifting applications and adapting to new changes, DroidEvolver achieves 96.15% F-measure on average using Android APIs as detection features
  • Detection over time:
    • DroidEvolver significantly outperforms MAMADROID in detecting malware over time, with models trained on older data and tested on newer applications
    • DroidEvolver’s average F-measure remains high (between 87.17% and 92.32%) when tested on applications developed up to five years after the training period, while MAMADROID’s performance drastically declines (from 68.01% to 8.81%)
    • The performance of DroidEvolver stabilizes after two to three years, demonstrating its capability to adapt by learning from new applications and updating its features and models
    • The effectiveness of DroidEvolver is particularly notable in scenarios with older and smaller training sets, indicating its superior adaptability to evolving malware trends compared to MAMADROID
  • Impact of model updates:
    • By updating its model with at least 60% of the labeled drifting applications, DroidEvolver maintains an average F-measure above 92%, demonstrating high detection performance with updates from just about 6.74% of true label applications annually
    • The naive solution still outperforms MAMADROID considerably due to the weighted voting
  • Feature evolvement:
    • DroidEvolver’s feature set dynamically expands by integrating new features from drifting applications, adapting to changes in app development and the Android framework, demonstrated by the growth from 14,327 to 52,001 features over six years
    • Both the overall feature set and individual model indicators within DroidEvolver exhibit similar growth patterns, highlighting the system’s ability to continuously update and adapt to new features for effective malware detection
  • False positives and false negatives:
    • DroidEvolver assigns a weight value to each feature of an application to indicate its significance in the classification process, with the sum of all feature weights determining the final classification result
    • In an evaluation involving 11,566 applications from 2012, DroidEvolver demonstrated high performance with an F-measure of 95.69%, accuracy of 95.91%, true positive rate (TPR) of 93.39%, and false positive rate (FPR) of 1.70%
    • A detailed analysis of feature weight distribution for true positives, false negatives, true negatives, and false positives is conducted, showing variability in feature contributions across different classifications through box and whisker plots
    • DroidEvolver reports 98 false positives (FPR = 1.70%) from 5,789 benign applications, largely because these applications have a high percentage (over 40%) of non-negative weight features, including APIs commonly used by malware, leading to incorrect predictions
    • From 5,777 malicious applications, DroidEvolver produces 382 false negatives (FNR = 6.61%), mainly because these applications have a higher proportion of negative weight features, making it challenging for DroidEvolver to accurately detect them without examining beyond API calls
  • Runtime:
    • DroidEvolver is significantly faster than MAMADROID in all modules except in classification and evolvement
    • Although DroidEvolver leverages app buffer to strike a balance between effectiveness and efficiency, it still takes more time than other steps to calculate JI value for each unknown application, which requires comparison with all applications included in the app buffer
    • DroidEvolver has high efficiency due to its online learning algorithms to update aging models and requires no true labels to update the model pool
  • Robustness
    • DroidEvolver is robust against common code obfuscations, including identifier renaming, junk code insertion, code reordering, and data encryption
    • Robustness is tested by applying DroidChameleon
  • Limitations and extensions:
    • DroidEvolver’s static analysis approach, focused on Android API call detection, misses malware identifiable only through more complex features or dynamic behaviors, and is susceptible to poisoning attacks where attackers manipulate the initial dataset to compromise detection
    • Enhancements to DroidEvolver can include integrating more sophisticated static analysis features like API call graphs, extending to native and dynamic analysis, to improve detection accuracy albeit with potential performance trade-offs

My Thoughts

This paper proposed a unique and insightful drift detection solution that I had not thought about before. There are various aspects of DroidEvolver that effectively address the concept drift problem and other machine learning problems to boost efficiency. First of all, the use of a model pool is a standout design choice. As malware is constantly adapting, one model can only perform well for so long. Thus, having a pool of models that can work together through voting and be updated when they appear to age significantly improves drift detection accuracy. Additionally, the use of online learning as opposed to batch learning makes the system extremely efficient while also generating individualized feature vectors for each application.

DroidEvolver is a big upgrade from traditional malware detection systems. It does not need to retrain or use true labels which makes it more practical in environments that are constrained by resources and availability of labels. This paper demonstrates that it performs extremely well against the existing MAMADROID, especially in classifying and detecting evolved malware over longer periods of time. It would be very cool to see DroidEvolver improve to utilize dynamic analysis.


Discussion Summary


  • An important message is that a system that never retrains will suffer significant performance decline over time
  • In reality you should have multiple models: train one in the background while another operates in real time, then keep swapping. The key challenge is that maintaining a pool of models costs resources
  • The number of models depends on drift, and many other factors
  • Attack opportunity window: the drift detector has more difficulty detecting drift in between retraining windows, which is why its accuracy is not perfect over time
  • Ideally you have an analyst or sandbox producing labels, but if you don’t have access to those labels in the meantime, you can use other classifiers to get pseudo labels (not as good as ideal labels, but good for now). Retrain with pseudo labels, then train again when the new sandbox labels arrive
  • This area is creating a lot of new ML positions (MLOps for maintaining drift detection pipelines and more, ML penetration testing)
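The pseudo-label workflow from the discussion (update now with the best available label, correct later when the sandbox or analyst delivers the true label) can be sketched minimally. Everything here is a hypothetical illustration: the classifier, the feature name, and the idea of down-weighting pseudo-label updates relative to true-label ones are my assumptions, not from the paper.

```python
class DriftAwareClassifier:
    """Toy linear scorer: update immediately with pseudo labels,
    then correct once authoritative labels arrive."""
    def __init__(self):
        self.weights = {}  # feature -> running score; positive leans 'malicious'

    def predict(self, feats):
        return 1 if sum(self.weights.get(f, 0.0) for f in feats) >= 0 else -1

    def update(self, feats, label, weight=1.0):
        # 'weight' lets pseudo-label updates count less than true-label ones (an assumption)
        for f in feats:
            self.weights[f] = self.weights.get(f, 0.0) + weight * label

clf = DriftAwareClassifier()
sample = {"sendTextMessage"}           # hypothetical API feature
pseudo = clf.predict(sample)           # best available label right now
clf.update(sample, pseudo, weight=0.5) # low-confidence pseudo-label update
# Later, the sandbox / analyst delivers the true label:
true_label = -1
clf.update(sample, true_label, weight=1.0)  # correct with the authoritative label
```

The point of the sketch is the ordering: the model stays current between labeling cycles, and the true label overrides the pseudo label once it exists.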

My Thoughts

This discussion helped to expand on the dense information from this paper. The paper demonstrated the F-measure comparison of both DroidEvolver and MAMADROID. MAMADROID declined significantly over a five-year span whereas DroidEvolver was mostly stable. This is largely due to MAMADROID not retraining on new malware with updated features. I’ve learned that utilizing multiple models is essential in various aspects: keeping systems alive while accurately detecting malware, boosting efficiency in updating models, and diversifying across different features. However, a model pool can be resource-intensive, which should be kept in mind. The plots in the paper also show that DroidEvolver does still have dips in performance throughout the five-year span. This is due to the drift detector’s ability to properly detect the drift and update in time. After some time, however, DroidEvolver does appear to improve and stabilize. Pseudo labels are also great in addressing the problem of true labels being widely unavailable. While they are not perfect, pseudo labels aid in updating models with the best estimate possible, which can lead to better improvement later on once the true labels are defined. Learning about the pipeline in malware detection and ML has sparked a lot of interest for me. It will motivate me to look into MLOps positions.


That is all, thanks for reading!

