NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.

During this seminar, Ali Ayati presented Transcend: Detecting Concept Drift in Malware Classification Models. After his presentation our class had an open discussion related to the paper and more. This blog post will cover a summary of the information presented as well as a summary of our class discussion.

Presentation Summary


  • Probability of fit is a well-known approach for qualitatively assessing the decisions of a learning model
    • Because probabilities must sum to 1, there is a high chance of the results being skewed
  • New assessment techniques (the conformal evaluator) have been developed that analyze objects statistically rather than probabilistically
  • As opposed to probabilistic measures, a statistical assessment captures how likely it is that a test object belongs to a class compared to all of that class’s other members

Non-Conformity Measure

  • A non-conformity measure is a scoring function that measures how different an object is from a group of objects belonging to the same class (two concrete measures are sketched in code after this list)
    • In essence, it evaluates how “strange” a prediction is according to the different possibilities available
  • For example, in SVM the non-conformity measure is the distance of a test sample from the decision boundary
    • If a new malware sample lies close to the decision boundary, it has a high non-conformity score (the SVM is not certain whether it belongs to class 1 or class 2)
  • Another example: for random forest, the non-conformity score can be computed using the proportion of decision trees that classify a sample as malware
    • If 40% of the trees classify a new test sample as class 1 and 60% classify as class 2, then there is a high non-conformity score
    • The sample does not conform to the majority of either class
    • If 90% of the trees classify as class 1, then the sample likely highly conforms to class 1 data
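
Below is a minimal sketch of these two non-conformity measures, assuming scikit-learn models. The function names, the tiny synthetic dataset, and the exact scoring choices are my own illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

def svm_nonconformity(svm, x):
    """SVM-style non-conformity: the closer a sample is to the decision
    boundary, the less it conforms, so negate the absolute margin
    (decision_function is proportional to the distance from the hyperplane)."""
    margin = abs(svm.decision_function(np.asarray(x).reshape(1, -1))[0])
    return -margin  # small margin -> score near 0 -> high non-conformity

def forest_nonconformity(forest, x, label):
    """Random-forest-style non-conformity: the fraction of trees whose top
    vote disagrees with `label` (0.1 if 90% of trees agree, 0.6 if only 40% do)."""
    x = np.asarray(x).reshape(1, -1)
    label_idx = list(forest.classes_).index(label)
    against = sum(
        int(np.argmax(tree.predict_proba(x)[0]) != label_idx)
        for tree in forest.estimators_
    )
    return against / len(forest.estimators_)

# Tiny synthetic example, purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
svm = LinearSVC().fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(svm_nonconformity(svm, X[0]))
print(forest_nonconformity(forest, X[0], label=y[0]))
```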

P-Value as a Similarity Metric

  • Measures how well a sample fits into a single class
    • The p-value of an object is the proportion of objects in the class that are at least as dissimilar to the other objects in the set as the object itself
    • Conformal evaluator computes a p-value for each class, for each test element
  • It is essentially the ratio of the number of training elements in the class that are at least as dissimilar as the element under test to the total number of elements in that class (see the sketch after this list)
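
As a rough sketch of that definition (the helper function and the non-conformity scores below are made up for illustration, not taken from the paper): given per-class arrays of non-conformity scores for the training data, the p-value of a test element for a class is simply the fraction of that class's scores that are at least as large as the test element's score.

```python
import numpy as np

def p_value(test_score, class_scores):
    """Proportion of the class's training objects that are at least as
    dissimilar (non-conforming) as the test object."""
    class_scores = np.asarray(class_scores)
    return float(np.mean(class_scores >= test_score))

# Made-up non-conformity scores for two classes
scores_by_class = {
    "malicious": np.array([0.10, 0.35, 0.40, 0.90]),
    "benign":    np.array([0.20, 0.25, 0.30, 0.50]),
}
# Non-conformity of one test element with respect to each class
test_scores = {"malicious": 0.85, "benign": 0.60}

# The conformal evaluator computes one p-value per class for each test element
p_values = {c: p_value(test_scores[c], scores_by_class[c]) for c in scores_by_class}
print(p_values)  # both p-values low -> the element fits neither class well
```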

P-Values vs Probabilities

  • Probabilities must sum to 1.0
    • If probabilities are used for decision assessment, then an incorrect conclusion might be reached for previously unseen samples
  • P-values are not constrained by the same limitations
    • It is possible for both p-values to be low for a previously unseen sample (illustrated after this list)
  • When computing a probability for a test sample, only information belonging to the test sample itself is used (its distance to a hyperplane, for example)
  • A p-value is computed by comparing the test sample’s score against the scores of all samples in the class
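
A small, hypothetical illustration of this difference (the numbers are made up, not from the paper):

```python
import numpy as np

# Hypothetical drifted sample that fits NEITHER known class well.

# Probabilities are forced to sum to 1, so one class still "wins" and the
# sample gets assigned somewhere, hiding that it is unlike both classes.
probs = np.array([0.55, 0.45])
print(probs.sum())   # 1.0 by construction

# P-values carry no such constraint: both can be low at the same time,
# which is exactly the signal that the sample belongs to neither class.
p_vals = np.array([0.03, 0.05])
print(p_vals.sum())  # need not sum to 1
```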

Transcend

  • Extracts the non-conformity measure from the decision-making algorithm
  • Builds p-values for all training samples
  • Computes per-class threshold to divide reliable predictions from unreliable ones
    • Constrained in two dimensions: the desired performance level and the proportion of samples in an epoch that the malware analysis team is willing to manually investigate
    • For example, the higher the desired performance level, the higher the threshold (more borderline predictions are rejected as unreliable)
    • Conversely, the smaller the manual-investigation budget, the lower the threshold, so that fewer predictions are flagged as unreliable (a simplified threshold search is sketched after this list)
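
My rough reading of that constraint, sketched as a simplified per-class threshold search (an illustrative interpretation, not the paper's exact algorithm; the function name and the made-up calibration data are my own): given the p-values of predictions assigned to a class and whether each prediction was correct, pick a cut-off that meets the desired performance on the kept predictions without rejecting more than the analysis team can handle.

```python
import numpy as np

def per_class_threshold(p_vals, correct, target_perf=0.95, max_reject=0.2):
    """Smallest p-value cut-off such that accuracy among *kept* predictions
    (p-value >= cut-off) reaches target_perf while at most max_reject of the
    predictions are pushed to manual analysis."""
    p_vals = np.asarray(p_vals, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    for thr in np.unique(p_vals):            # unique values, ascending
        keep = p_vals >= thr
        if 1.0 - keep.mean() > max_reject:
            break  # rejecting more samples than the analysts' budget allows
        if correct[keep].mean() >= target_perf:
            return thr  # boundary between reliable and unreliable predictions
    return None  # no cut-off satisfies both constraints

# Made-up calibration data: p-values and whether each prediction was correct,
# with correctness loosely correlated with the p-value
rng = np.random.default_rng(0)
p = rng.uniform(size=200)
ok = p > rng.uniform(size=200) * 0.3
print(per_class_threshold(p, ok))  # a cut-off, or None if no cut-off satisfies both constraints
```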

Case Studies

  • Case study 1: Binary
    • Android malware detection
    • Reimplemented Drebin algorithm using static features
    • Used linear SVM to compute non-conformity measure
    • Drebin dataset with data from 2010 to 2012 is used
      • 123,435 benign samples and 5,560 malicious samples
    • The Marvin dataset, spanning 2010 to 2014, is also used; its data from 2010 to 2012 is merged with Drebin
      • 9,592 benign samples and 9,179 malicious samples
    • For training, Drebin is used
    • For testing, 4,500 benign and 4,500 malicious random samples from Marvin are used
      • The model is affected by concept drift, reporting low precision and recall for the positive class representing malicious objects
    • The model is also trained on Drebin and tested on Marvin with p-value-based threshold filtering
      • Enforcing cut-off quality thresholds significantly improves performance (a toy version of this filtering step is sketched after this list)
    • The case study then retrains the simulation with the training samples of Drebin along with the filtered-out elements of Marvin from the previous experiment
      • This brings a slight improvement
  • Case study 2: Multiclass
    • Microsoft malware classification algorithm
    • Based on a solution to the Microsoft Kaggle competition
    • Static features from Windows PE binaries
    • Used random forests for computing non-conformity measures
    • Drift here takes the form of family discovery (previously unseen malware families appearing)
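
A toy version of the filtering idea from case study 1, using synthetic stand-in data rather than the actual Drebin/Marvin feature sets (the non-conformity function, the fixed cut-off, and the helper names are simplified illustrations, not the paper's setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import LinearSVC

# Synthetic stand-ins: "older" training data and a drifted test set
X_train, y_train = make_classification(n_samples=2000, n_features=20, random_state=0)
X_test, y_test = make_classification(n_samples=1000, n_features=20, shift=0.5, random_state=1)

svm = LinearSVC().fit(X_train, y_train)
y_pred = svm.predict(X_test)

def ncm(model, X, labels):
    """Distance-based non-conformity: a margin on the 'wrong' side of the
    hyperplane for the given label means the sample conforms poorly."""
    margins = model.decision_function(X)
    return np.where(labels == 1, -margins, margins)

train_scores = {c: ncm(svm, X_train[y_train == c], y_train[y_train == c]) for c in (0, 1)}
test_scores = ncm(svm, X_test, y_pred)

# Per-class p-value of each prediction against the training scores of its predicted class
p_vals = np.array([np.mean(train_scores[c] >= s) for s, c in zip(test_scores, y_pred)])

keep = p_vals >= 0.1  # illustrative fixed cut-off; Transcend derives per-class thresholds
print("precision, all predictions:  ", precision_score(y_test, y_pred))
print("precision, kept predictions: ", precision_score(y_test[keep], y_pred[keep]))
print("recall, kept predictions:    ", recall_score(y_test[keep], y_pred[keep]))
print("fraction sent to analysts:   ", (~keep).mean())
```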

My Thoughts

This seminar and paper did a great job of explaining the benefits of statistical methods over probabilistic methods in ML for malware detection, especially for novices like myself. As this seminar mentioned, p-values are an especially useful metric in decision assessment. If a sample has probabilities of belonging to defined classes, then those probabilities must sum to 1. This can create bias or lead to pointless conclusions for unseen samples. P-values, on the other hand, use proportions within each class to determine how well a sample fits into a class. This is a much more useful tool for assessing decision similarity.

Transcend, as proposed in the paper, is a very novel framework that utilizes the non-conformity measures discussed in this seminar. While it is not a deployed model yet, the results from the case studies demonstrate its effectiveness, which suggests potential for deployment. I am specifically intrigued by how it covers corner cases in malware classification. Transcend evaluates the robustness of predictions while also evaluating the quality of the computed non-conformity measure. This accounts for the non-conformity measure not always being representative, and hence helps determine whether or not new, unseen samples really conform to a class.


Discussion Summary


  • Transcend is a state-of-the-art concept drift detector (it is still being discussed today)
  • Statistics and probability are typically used interchangeably, but there is a difference:
    • Probability: chances or what we expect in the future based on distributions (ex: flipping a coin thousands of times)
    • Statistics: analysis on past data
    • Differences between metrics appear all the time (like the difference between accuracy and precision)
    • Why do we need both statistics and probability? Probabilities must sum to 100%, which can be a problem because a sample can be forced into a class. Statistics tell us how much we know about a given sample, and from there we can use probability to assign it to a class
  • If the statistics say that a sample’s confidence is high but the probability is low, then there is concept drift
    • Feature prevalence is changing (features are being used differently, so the classifier doesn’t know what to do)
    • Difference in frequency: confidence is low because the distributions differ
  • P-value: what is the chance that a value is wrong?
  • The big question is how much we can trust these results; a lot of experiments rely on p-values
  • Another approach being developed is classification with rejection (you don’t blindly classify every sample); out-of-distribution detection is a similar line of work (a minimal rejection example is sketched after this list)
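
A generic example of classification with rejection (not Transcend's own mechanism; the model and the confidence cut-off are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

confidence = clf.predict_proba(X).max(axis=1)
reject = confidence < 0.75            # arbitrary cut-off for illustration

# Rejected samples are not forced into a class; mark them for manual review
decisions = np.where(reject, -1, clf.predict(X))
print("rejected fraction:", reject.mean())
```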

My Thoughts

This discussion was very insightful and helped reiterate the benefits of statistical methods over probabilistic ones in machine learning for malware detection. The discussion provided a great high-level overview of how probabilistic methods can be detrimental in classification. Utilizing probabilities can force a new sample into a class to which it does not truly belong. That would defeat the purpose of concept drift detectors, as malware is always evolving and hence the labels change over time as well. Thus, the use of statistical measures such as p-values significantly helps in determining how well a sample belongs to a class, without skewing the decision. This does not mean that probabilities should be disregarded, but rather that models should also incorporate statistics.

The discussion of the inverse relationship between statistical confidence and probability suggesting concept drift is also food for thought. The way I see it, concept drift occurs when the statistical properties of the target variable the model is trying to predict change over time. In that situation the model predicts with high certainty, but the outcomes turn out to be incorrect more often than its confidence levels would suggest.


That is it for my summary, thanks for reading!

