NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.

During this seminar, Ali Ayati presented Transcending TRANSCEND: Revisiting Malware Classification in the Presence of Concept Drift. After his presentation our class had an open discussion related to the paper and more. This blog post will cover a summary of the information presented as well as a summary of our class discussion.

Presentation Summary


Review

  • A new, unseen sample is fed into a conformal evaluator rather than directly into a classifier
    • A conformal evaluator computes p-values (which signify how well a sample fits into a class) and uses them to assess the new sample
  • The results from the conformal evaluator are transferred to Transcend
    • Transcend computes per-class thresholds to separate reliable from unreliable decisions, as sketched below
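
To make this pipeline concrete, below is a minimal sketch (not the authors' code) of the test-time accept/reject decision: the new sample's p-value for its predicted class is compared against that class's calibrated threshold. The threshold values and function names are hypothetical placeholders.

```python
# Minimal sketch of Transcend's test-time accept/reject step.
# The per-class thresholds below are made-up placeholders; in practice
# they come from the calibration phase discussed later.

def accept_or_reject(p_value, predicted_class, thresholds):
    """Keep the classifier's decision only if the sample's p-value for the
    predicted class clears that class's threshold; otherwise flag it as
    unreliable (likely drifted)."""
    return p_value >= thresholds[predicted_class]

# Hypothetical per-class thresholds (0 = goodware, 1 = malware).
thresholds = {0: 0.25, 1: 0.40}

print(accept_or_reject(0.62, 1, thresholds))  # True  -> keep the prediction
print(accept_or_reject(0.10, 1, thresholds))  # False -> reject as unreliable
```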

Overview of the paper

  • Revisit the conformal evaluator and Transcend to determine optimal operational settings
  • Propose the TRANSCENDENT framework which builds upon and outperforms the original Transcend
  • The authors make the following contributions:
    • Formal treatment
      • Analysis of conformal evaluation’s effectiveness
    • Novel Conformal Evaluators
      • They develop two novel conformal evaluators: inductive conformal evaluator (ICE) and cross-conformal evaluator (CCE)
      • Formalize Transcend’s calibration procedure as an optimization problem
    • Operational Guidance
      • Evaluation of the proposals
      • Datasets include Android, Windows, and PDFs with different classifiers

Formal Treatment

  • Conformal evaluators rely on a notion of non-conformity
    • Quantifies how dissimilar the new sample is compared to the history of past samples
    • This helps in rejecting a new example that cannot be reliably classified
    • Past examples are called “calibration points”
    • For example, computing the Euclidean distance of a new example to a cluster centroid
    • Different classification algorithms have different non-conformity measures, for example:
      • Nearest centroid: distance from centroid
      • SVM: distance from hyperplane
      • KNN: proportion of nearest neighbors
      • Random forest: proportion of decision trees
      • QDA: probability of belonging to a class
      • MLP: probability output by the final sigmoid activation layer
  • P-values
    • Conformal evaluators use p-values to determine whether or not a new example belongs in the prediction region formed by past elements (see the sketch after this list)
  • Conformal evaluation is built on top of conformal prediction which is a method for providing predictions that are correct with some guaranteed confidence
    • Output prediction sets with guaranteed confidence 1 - ε
    • A conformal predictor produces a prediction region given a classifier, a new example, and a significance level ε
    • A prediction region is a set of labels from the label space that is guaranteed to contain the correct label y with probability at least 1 − ε
    • Conformal prediction assumes exchangeability: the order in which the examples appear does not affect the joint distribution of the data
    • The higher the confidence level, the more labels the prediction region will contain
    • Confidence is the greatest 1 − ε for which the prediction region contains a single label, and equals one minus the second-highest computed p-value
    • Credibility is the greatest ε for which the prediction region is empty, and corresponds to the largest computed p-value
      • Quantifies how relevant the training set is to the prediction
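
To illustrate the non-conformity measure and p-value described above, here is a minimal sketch using the nearest-centroid example (Euclidean distance to the class centroid). The toy data and function names are my own, not the paper's; the sketch only shows how a p-value is derived from calibration scores.

```python
import numpy as np

def nonconformity(x, centroid):
    """Nearest-centroid non-conformity measure: Euclidean distance to the
    class centroid. A larger distance means the sample conforms less."""
    return np.linalg.norm(x - centroid)

def p_value(new_score, calibration_scores):
    """Fraction of calibration points at least as non-conforming as the new
    sample; a small p-value means the sample fits the class poorly."""
    return float(np.mean(np.asarray(calibration_scores) >= new_score))

# Toy calibration points for one class (made-up 2-D features).
rng = np.random.default_rng(0)
calibration_points = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(100, 2))
centroid = calibration_points.mean(axis=0)
calibration_scores = [nonconformity(p, centroid) for p in calibration_points]

x_new = np.array([2.5, 2.5])  # a drifted-looking sample
print(p_value(nonconformity(x_new, centroid), calibration_scores))  # near 0
```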

Novel Conformal Evaluators

  • Transductive Conformal Evaluator (TCE)
    • Every training point is also used as a calibration point
    • Does not scale to larger datasets
  • Approximate TCE (Approx-TCE)
    • Calibration points are extracted in batches (using k folds) rather than individually
    • Repeating k times, one fold is used as the calibration fold and the remaining folds are used as the bag
    • Batch processing makes it more efficient than TCE
  • Inductive Conformal Evaluator (ICE)
    • Splits the training set into two partitions: a proper training set and a calibration set (sketched below)
    • Cost efficient, but not information efficient
  • Cross-Conformal Evaluator (CCE)
    • Combination of Approx-TCE and ICE
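
A minimal sketch of the inductive split behind ICE: the training data is divided once into a proper training set (used to fit the classifier) and a calibration set (used to compute non-conformity scores). The split ratio, classifier, and SVM-margin non-conformity orientation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy data standing in for extracted malware features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# ICE: a single split into a proper training set and a calibration set.
X_proper, X_cal, y_proper, y_cal = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

clf = LinearSVC().fit(X_proper, y_proper)

# One possible SVM non-conformity: signed distance from the hyperplane,
# oriented so that larger values mean "less conforming" to the true class.
margins = clf.decision_function(X_cal)
cal_scores = np.where(y_cal == 1, -margins, margins)
print(cal_scores[:5])
```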

Thresholds

  • Alpha assessment analysis is used to evaluate the non-conformity measure: it plots the distribution of p-values for each class, further split by whether the prediction was correct or incorrect
  • In Transcend there is a calibration phase (training) and a test phase
    • Calibration phase searches for a set of per-class thresholds and analyzes credibility
    • Can be accomplished manually or through grid search, for example
  • Improvements to threshold search
    • Calibration procedure is modeled as an optimization problem which aims to maximize a performance metric
    • The objective metric is typically maximized subject to constraints on other metrics (e.g., the rejection rate)
    • One improvement, for example, is to use random search instead of grid search, as sketched below
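
To show what calibration-as-optimization might look like, here is a minimal random-search sketch: sample per-class credibility thresholds at random and keep the pair that maximizes F1 on the kept calibration samples, subject to a cap on the rejection rate. The metric, the 15% cap, the trial count, and the toy data are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

def search_thresholds(p_vals, y_pred, y_true, n_trials=1000, max_reject=0.15, seed=0):
    """Random search over per-class thresholds (t0 for goodware, t1 for malware),
    maximizing F1 on kept samples while rejecting at most `max_reject`."""
    rng = np.random.default_rng(seed)
    best_thresholds, best_f1 = None, -1.0
    for _ in range(n_trials):
        t0, t1 = rng.uniform(0.0, 1.0, size=2)
        keep = p_vals >= np.where(y_pred == 1, t1, t0)  # accept high-credibility decisions
        if keep.sum() == 0 or 1.0 - keep.mean() > max_reject:
            continue                                    # violates the rejection constraint
        f1 = f1_score(y_true[keep], y_pred[keep], zero_division=0)
        if f1 > best_f1:
            best_thresholds, best_f1 = (t0, t1), f1
    return best_thresholds, best_f1

# Toy calibration data: credibility p-values, predictions, ground truth.
rng = np.random.default_rng(1)
p_vals = rng.uniform(size=500)
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.uniform(size=500) < 0.85, y_true, 1 - y_true)  # ~85% accurate

print(search_thresholds(p_vals, y_pred, y_true))
```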

Experiments and Evaluation

  • The setup is as follows:
    • Android uses DREBIN with 260k apps and is classified with linear SVM
    • Windows PE uses EMBER with 117k binaries and is classified with a Gradient Boosted Decision Tree
    • PDF uses Hidost with 189k documents and is classified with a Random Forest
    • K-fold cross-validation is used with 10 folds
    • Thresholds target an F1 score of 0.9 on kept elements with a rejection rate of less than 15%
  • The authors noticed sampling bias
  • Probability for Approx-TCE decreases over months of testing (same with ICE)
  • CCE has a more stable probability over the testing period but still drops near the end
  • Overall, ICE had the shortest runtime in CPU hours
  • Random search with 10,000 trials performed better than grid search with 1,317,520 trials

My Thoughts

Last seminar we learned about Transcend, which is already novel in itself, using conformal evaluators and thresholds to determine the reliability of a new sample. However, the introduction of the TRANSCENDENT framework, which builds upon and optimizes the original Transcend by proposing novel conformal evaluators and formalizing Transcend’s calibration procedure as an optimization problem, further illustrates a commitment to pushing the boundaries of current machine learning capabilities. Not only does this approach advance our ability to evaluate the effectiveness of conformal evaluation, but it also provides operational guidance through the evaluation of diverse datasets and classifiers, showing its potential for widespread use. The formal treatment of non-conformity measures strengthens the theoretical underpinnings of Transcend and yields clear, actionable guidance. By addressing the limitations of previous models and introducing a method that balances cost efficiency with information efficiency, this work signifies a notable leap forward in the endeavor to make malware detectors more reliable.


Discussion Summary


  • The goal is to minimize the FPR, which is why the F1-measure is used: it accounts for false positives (this paper optimizes for the F1-measure)
  • The goal of this paper is to show formal guarantees for probability-based drift detectors and to demonstrate how to improve upon the Transcend approach discussed in the previous paper
  • These are the same authors as the “Do’s and don’ts…” paper discussed before; in this work they noticed a flaw of their own: sampling bias
    • They had merged two datasets, which can cause sampling bias (it does not account for temporal information and is also a form of data snooping)
    • They found their flaw and fixed it
  • Never one size fits all in malware detection
  • They mention various kinds of malware/payload
    • PDF: very dynamic; malicious JavaScript can be stored in the file and executes when the PDF is opened/rendered. Malware can be stored in Word documents as well
    • Important to classify documents, not only applications
    • Dynamic analysis can be used to detect malware in PDFs; all of the methods previously discussed for applications can be applied to PDFs as well
  • In ML for cyber we aren’t building black-box solutions, but rather a pipeline of various components
  • When choosing which files to protect as an AV, for example, we are threat modeling (reasoning about why we chose to protect those files)

My Thoughts

As previously mentioned, the work proposed in this paper further explores and confirms the underlying theory behind Transcend’s effectiveness, while also providing improvements such as novel conformal evaluators and better threshold search algorithms. I find it very cool that the authors were victims of sampling bias when creating their data, yet still discovered their flaw and fixed it. If anything, this reiterates how difficult machine learning in malware detection is, especially when a plethora of data is not publicly available. The experimentation on PDF files is very interesting as well. Similar to binaries, PDFs are dynamic and can store malware. Thus, similar approaches, such as dynamic analysis, can be used to detect it. As technology advances, malicious intent expands to other, less common file formats. Therefore, it is important to develop a robust threat model that clearly defines the security context and trusted/untrusted components in order to prioritize likely points of breach and file formats.


That is all, thanks for reading!

