NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.

This seminar was presented by none other than me! During it, I presented Adversarial Machine Learning in Image Classification: A Survey Toward the Defender’s Perspective. After my presentation, our class had an open discussion about the paper and related topics. This blog post covers a summary of the information presented as well as a summary of our class discussion.

Presentation Summary


Introduction

  • Convolutional Neural Networks (CNNs) are a state-of-the-art DL algorithm for image classification in the field of Computer Vision
  • CNNs have made significant advancements recently, even in security-critical applications
  • Researchers have found that DL models across various tasks are vulnerable to adversarial attacks
    • Adversarial attacks in Computer Vision: subtle modifications generated by an optimization algorithm that are inserted into an image and can trick CNNs with high confidence
  • There are no established solutions for securing DL models yet, nor any fully accepted explanations for why adversarial examples exist
  • This paper provides an exhaustive review of adversarial ML in image classification from a defender’s perspective that can further help research in the field

Main Contributions

  • Updates to existing taxonomies to categorize different types of adversarial images and novel attack approaches raised in literature
  • Discussion and organization of defenses against adversarial attacks based on a novel taxonomy
  • An overview of understudied and overstudied scenarios using the introduced taxonomies, as well as the discussion of promising research paths for future works

Background

  • CNNs
    • CNNs are inspired by the structure of the brain and operate on images represented as 2D arrays
    • CNNs learn features through
      • Convolution, which applies an nxn filter over each region of the input, computes the dot product of the filter and each region, and produces a feature map
      • Pooling layers, which extract the most useful features from the images and reduce their spatial dimensionality
    • The fully connected layer(s) come after feature learning
      • Work similarly to an ordinary neural network
      • Output a vector of class probabilities, where the class with the highest probability is the prediction (a minimal code sketch appears after this list)
  • Autoencoders
    • Approximate the identity function of an input x: from a learned compressed representation, they generate an output x̂ that is as similar as possible to x
    • Autoencoders try to learn the inner representations of the input
    • Autoencoders are useful for two main purposes:
      • Dimensionality reduction (retaining the most important features)
      • Data generation process
  • Generative adversarial networks:
    • GANs are a framework for building generative models that resemble the data distribution used in the training set
      • Can be used to improve the representation of data to perform unsupervised learning and to create defenses against adversarial attacks
    • GANs consist of two models
      • Generator: receives an input (typically random noise) and tries to generate samples that follow the training data’s probability distribution
      • Discriminator: produces a label that determines whether a given sample came from the real data or from the generator
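
To make the convolution → pooling → fully connected pipeline above concrete, here is a minimal sketch in PyTorch. This is my own illustration, not code from the paper; the layer sizes, the 28×28 grayscale input, and the 10-class output are arbitrary assumptions.

# A minimal sketch (assumption: PyTorch, 28x28 grayscale inputs, 10 classes)
# of the pipeline described above: convolution -> pooling -> fully connected
# layer -> vector of class probabilities.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # n x n filters produce feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling halves the spatial dimensions
        )
        self.classifier = nn.Linear(16 * 14 * 14, num_classes)  # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.features(x)                  # feature learning
        logits = self.classifier(feats.flatten(1))
        return torch.softmax(logits, dim=1)       # probability vector; argmax = prediction

model = TinyCNN()
probs = model(torch.randn(1, 1, 28, 28))          # one random 28x28 "image"
print(probs.argmax(dim=1))                        # class with the highest probability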

Adversarial Images and Attacks

  • What is an adversarial image?
    • Let f be a classification model trained with legitimate images and x be a legitimate input image
    • From x an image x’ is crafted such that x’ = x + 𝛿x, where 𝛿x is the perturbation needed to make x cross the decision boundary
    • 𝛿x can also be seen as a vector where its magnitude represents the level of perturbation required to move the image point x across the decision boundary in the space
    • In other words, an adversarial image is a modification made to an image that tricks the classifier into predicting an incorrect output
    • An adversarial image is optimal if:
      • The perturbations are undetectable by human eyes
      • The perturbations can trick the classifier, preferably with high confidence
  • Taxonomy of Adversarial Images
    • Perturbation Scope
      • Adversarial images may contain individual- or universal-scoped perturbations:
        • Individual scoped perturbations: Generated individually for each input image
        • Universal scoped perturbations: Generated independently from any input sample
          • Often able to lead models to misclassification
          • Easier to conduct in real-world scenarios
    • Perturbation Visibility
      • Perturbation efficiency and visibility can be organized as follows:
        • Optimal perturbations: Imperceptible to human eyes, useful to trick DL models usually with high confidence
        • Indistinguishable perturbations: Imperceptible but can’t fool DL models
        • Visible perturbations: Can fool DL models but are easily spotted by humans
        • Physical perturbations: Physically added to real-world objects themselves, typically performed in object detection tasks
        • Fooling images: Images corrupted until they are unrecognizable to humans, yet classifiers still assign them a class, sometimes with high confidence
        • Noise: Non-malicious/non-optimal corruptions that could be present or inserted into an image
    • Perturbation Measurement
      • p-norms are the measures most commonly used to control the size and amount of perturbation
        • A p-norm computes the distance in the input space between a legitimate image and the resulting adversarial sample
  • Taxonomy of Attacks and Attackers
    • Attacker’s influence
      • Describes how much influence the attacker has over the DL model’s learning process; there are two types of attacks: poisoning and evasive
        • Poisoning: Attacker influences the model’s learning process during the training stage
        • Evasive: Attacker crafts malicious inputs at the testing stage to evade the already-trained model
          • Most common type of attack
          • Can also have an exploratory nature
    • Attacker’s Knowledge
      • Depending on the attacker’s knowledge of the targeted model, three attacks can be performed: white box, black box, and grey box
        • White box: Attacker has full access to the model and defense method, most powerful, but least frequent
        • Black box: Attacker has no access to or knowledge of the model, more representative of real-world scenarios
        • Grey box: Attacker has access to the classification model but not to the defense method
    • Security Violation
      • Security violations caused by adversarial attacks can affect the integrity, availability, and privacy of the targeted classifiers
        • Integrity violation: The model’s performance is degraded on attacker-chosen inputs without compromising its normal operation
        • Availability violation: Model becomes unusable, causing a denial of service
        • Privacy violation: Attacker gains private information such as model architecture, parameters, and even training data
    • Attack Specificity
      • Attacker can perform a targeted or untargeted attack:
        • Targeted: the attacker chooses, beforehand, the specific class the classifier should mispredict the sample as
        • Untargeted: the attacker seeks to have the classifier choose any class other than the ground-truth label of the original sample
    • Attack Computation
      • The algorithms that compute perturbations can be one-step or iterative:
        • One-step: Uses the gradients of the model’s loss for the legitimate image to find, in a single step, the most prominent pixels that will maximize the error when perturbed
        • Iterative: Uses more iterations to form and fine-tune the perturbations
          • Comes with more expensive computation
          • Perturbations are usually smaller and have greater success in fooling classifiers
    • Attack Approach
      • An adversarial attack can be based on gradient, transferability/score, decision, and approximation
        • Gradient: Makes use of detailed information of the target model regarding its gradient for a given input
        • Transferability/Score: Depends on obtaining access to the training data or to the scores predicted by the model; these are used to fit a substitute model and create perturbations for real images
        • Decision: Queries the softmax layer and iteratively computes smaller perturbations using a process of rejection sampling
        • Approximation: Uses a differentiable function that approximates the outputs from a random layer of a model to feed gradient-based attacks
  • Algorithms for Generating Adversarial Images
    • Computer vision algorithms to generate adversarial perturbations are optimization techniques that explore and expose flaws in pretrained models
    • Fast Gradient Sign Method (FGSM)
      • One-step algorithm that explains why adversarial samples can exist
      • Its main advantage is low computational cost, since the image is perturbed in a single step that maximizes the model’s error (see the code sketch after this list)
        • The trade-off is larger perturbations and a lower success rate in fooling models than iterative algorithms
    • Basic Iterative Method (BIM)
      • Iterative version of FGSM
        • BIM executes several small steps of size α
        • Total size of the perturbation is limited by a bound defined by the attacker
        • BIM is a recursive method
    • DeepFool
      • Finds the nearest decision boundary of a legitimate image, then subtly perturbs the image to cross that boundary
        • During each iteration, it linearizes the classifier around an intermediate x’ and updates x’ by a small step in the optimal direction until the decision boundary is crossed
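
As promised above, here is a hedged sketch of FGSM and BIM in PyTorch, along with the p-norms used to measure the perturbation δx. This is my own illustration, not the paper’s code: `model` is assumed to be any differentiable classifier that returns logits, and the ε, α, and step-count values are arbitrary.

# Hedged sketch of FGSM (one-step) and BIM (iterative) plus p-norm measurement.
# Assumptions: `model` is a differentiable classifier returning logits; the
# epsilon, alpha, and step-count values are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    # One step along the sign of the loss gradient w.r.t. the input image.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def bim(model, x, y, eps=0.03, alpha=0.005, steps=10):
    # Several small FGSM steps, projected back into an eps-ball around x
    # (the attacker-defined bound on the total perturbation).
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv

def perturbation_size(x, x_adv):
    # p-norm distance between the legitimate image and the adversarial sample.
    delta = (x_adv - x).flatten(1)
    return delta.norm(p=2, dim=1), delta.abs().max(dim=1).values  # L2, L-infinity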

Defense Against Adversarial Attacks

  • Defense Objective
    • The main objective of a defense can be proactive or reactive
      • Proactive: Defenses aim to correctly classify an adversarial image as if it were legitimate (robustness)
      • Reactive: Acts as a filter that detects malicious images before they reach the classifier, then discards them or sends them to a recovery procedure
  • Defense Approach
    • There are various approaches when forming defenses against adversarial images
    • The most relevant proactive and reactive countermeasures are categorized as: gradient masking, auxiliary detection models, statistical methods, preprocessing techniques, ensemble of classifiers, and proximity measurements
    • Gradient Masking:
      • Produces models that have smoother gradients, preventing the generation of useful gradients for adversarial samples
        • The masked gradients these defenses produce can be organized into shattered gradients, stochastic gradients, and exploding/vanishing gradients
          • Shattered gradients: caused by non-differentiable defenses, resulting in nonexistent or incorrect gradients
          • Stochastic gradients: caused by randomized proactive/reactive defenses or randomized preprocessing on input prior to feeding into the classifier
          • Exploding/vanishing gradients: caused by defenses with very deep architectures that require multiple iterations of neural network evaluation
        • Two main strategies of gradient masking are adversarial training and defensive distillation
      • Adversarial training
        • Considered a brute force approach
        • Increases a classifier’s robustness by training it on both legitimate and adversarial images (see the first sketch after this list)
          • Use an attack algorithm to create perturbed images from legitimate training images, then augment the training set with the perturbed images and retrain
        • Adversarial training weaknesses
          • The defense is strongly coupled to the attack algorithm used during training
          • Adversarial training is computationally inefficient
        • Potential solution to the weaknesses
          • Training on generated adversarial samples using Projected Gradient Descent (PGD) attack
      • Defensive Distillation:
        • A proactive defense based on transfer of knowledge among learning models known as distillation
          • Learning distillation: the knowledge gained by a complex model after training is transferred to a smaller model
          • Defensive distillation: uses the knowledge of a model (probabilistic vectors from a first training) to perform a second training of the original model
      • Auxiliary Detection Models (ADM)
        • Gradient masking-based defenses produce models with smoother gradients, making it difficult for attackers to find optimal directions to perturb images
          • However, attackers can train a surrogate/substitute model (black box) and transfer adversarial samples crafted on it
        • ADMs
          • A reactive method that uses adversarial training to create an auxiliary binary model/filter that detects whether an input image is legitimate or adversarial before it reaches the classifier
      • Statistical Methods
        • Compare distributions of legitimate and adversarial images
        • A reactive defense approximates the Maximum Mean Discrepancy (MMD) hypothesis test with Fisher’s permutation test
          • MMD is used to verify whether a legitimate dataset and another dataset that potentially contains adversarial images belong to the same distribution
          • The elements of the two datasets are permuted into two new datasets (Fisher’s test), and the new datasets are again compared via MMD
          • If the first comparison differs from the second, the null hypothesis is rejected, concluding that the datasets belong to different distributions
          • The p-value is estimated as the fraction of permutations in which the null hypothesis was rejected
        • Kernel Density Estimation (KDE)
          • Verifies distribution similarity using Gaussian Mixture Models to analyze outputs of a DNN’s logits layer
      • Preprocessing Techniques:
        • Some works crafted defense based on preprocessing techniques, such as image transformations, noise layers, autoencoders, and dimensionality reduction
          • Random Resizing and Padding (RRP) adds a resizing and a padding layer at the beginning of a DNN: the resizing layer randomly modifies the input image’s dimensions, and the padding layer randomly inserts null values around the resized image
          • Applying transformations like total variance minimization (TVM) and image quilting
            • TVM randomly selects pixels from an input image and iteratively optimizes to find an image consistent with the selected pixels
            • Image quilting reconstructs an image with small patches from a training database using KNN, ultimately crafting an image without any adversarial perturbations
            • TVM and image quilting have the best performances
          • Feature squeezing, a reactive defense that uses two dimensionality reduction techniques: color bit depth reduction and spatial smoothing (blurring) (see the second sketch after this list)
      • Ensemble of classifiers:
        • Defenses based on two or more classifiers that can be chosen at runtime
        • Each model can compensate for the weaknesses in other models
        • Different ensemble methods can be used
          • Bayesian algorithm to choose an optimal model
          • Ensembles of specialist models that detect and classify input images by majority vote
          • Using adversarial training to train the main classifier with adversarial images generated by a separate ensemble
      • Proximity measurements:
        • Defenses based on proximity measurements among legitimate and adversarial images to the decision boundary
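
Below is the first sketch referenced above: a minimal adversarial-training loop that augments each batch with perturbed copies before the usual optimizer step. It is a hedged illustration, not the paper’s code; it reuses the hypothetical fgsm helper from the attack sketch, `model`, `optimizer`, and `loader` are assumed to exist, and a stronger attack such as PGD could be swapped in for FGSM, as the survey notes.

# First sketch: adversarial training (hedged illustration, not the paper's code).
# Assumptions: `model`, `optimizer`, `loader`, and the fgsm helper from the
# earlier attack sketch exist; PGD could replace FGSM for stronger training.
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=0.03):
    model.train()
    for x, y in loader:
        x_adv = fgsm(model, x, y, eps=eps)   # craft perturbed copies of the batch
        batch_x = torch.cat([x, x_adv])      # legitimate + adversarial images
        batch_y = torch.cat([y, y])
        optimizer.zero_grad()                # discard gradients left over from crafting
        loss = F.cross_entropy(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()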
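
And the second sketch referenced above: feature squeezing’s two squeezers (color bit-depth reduction and a median blur) with the detection rule of comparing the model’s prediction on the original and the squeezed input. The 4-bit depth, the 3×3 median filter, and the 0.5 threshold are illustrative assumptions rather than the defense’s published settings, and `model` is again assumed to return logits.

# Second sketch: feature squeezing as a reactive detector (hedged illustration).
# The bit depth, kernel size, and threshold are assumptions, not the defense's
# published settings; `model` is assumed to return logits.
import torch
import torch.nn.functional as F

def reduce_bit_depth(x, bits=4):
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels            # quantize pixel values

def median_smooth(x, k=3):
    pad = k // 2
    patches = F.unfold(F.pad(x, [pad] * 4, mode="reflect"), kernel_size=k)
    patches = patches.view(x.size(0), x.size(1), k * k, -1)
    return patches.median(dim=2).values.view_as(x)     # 3x3 median blur

def looks_adversarial(model, x, threshold=0.5):
    p_original = torch.softmax(model(x), dim=1)
    p_squeezed = torch.softmax(model(median_smooth(reduce_bit_depth(x))), dim=1)
    gap = (p_original - p_squeezed).abs().sum(dim=1)   # L1 gap between predictions
    return gap > threshold                             # large gap -> flag as adversarial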

Explanations for the Existence of Adversarial Samples

  • High Non-linearity hypothesis:
    • Adversarial examples exist due to high non-linearity in DNNs, which contributes to the formation of low-probability pockets in the data manifold
    • Pockets are a result of deficiencies of objective functions, training procedures, and datasets limited in size and diversity of training samples
  • Linearity hypothesis:
    • Contradiction to non-linearity hypothesis
    • DNNs exhibit largely linear behavior, due to activation functions like ReLU and sigmoids operating near their linear regions
    • Classifier robustness is independent of the training procedure
      • The distance between classes in higher-order classifiers is larger than in linear ones
  • Boundary Tilting hypothesis:
    • A learned class boundary lies close to the training samples manifold, but is tilted
      • Adversarial samples can stem from modifying legitimate samples toward the boundary until it is crossed
      • The perturbation required decreases as the degree of tilting increases, which is how perturbations can fool classifiers with high confidence while remaining imperceptible to humans
      • Could be a result of an overfitted model
  • High Dimension Manifold:
    • Adversarial examples stem from the data’s high dimensionality
    • An experiment trained a model on a synthetic dataset
    • Correctly classified inputs were found to lie close to misclassified adversarial inputs
      • Learning models are vulnerable to adversarial samples, regardless of training procedure
  • Lack of Enough Training Data:
    • Models must generalize strongly with the help of robust optimizations
    • The existence of adversarial examples is not the classifier’s fault, but is an inevitable consequence of working in a statistical setting
  • Non-robust Feature Hypothesis:
    • Adversarial perturbations stem from images’ features, not flaws in models or training processes
    • Two categories of features: robust and non-robust
      • Robust features lead models to correctly predict an input’s true class even when perturbed
      • Non-robust features are features derived from patterns in the data distribution that are highly predictive, yet brittle, remaining undetectable by humans and being easier to perturb
  • Explanations for Adversarial Transferability:
    • Adversarial transferability occurs when adversarial samples crafted to fool a target model also fool other models, even ones with differing architectures
    • Two categories of transferability: intra-technique and cross-technique
    • The direction of perturbations may be a crucial factor in transferability, since different models acquire similar functions through training and are therefore fooled by similar perturbation directions

Principles for Designing and Evaluating Defense

  • Define a detailed threat model that restricts the attacker’s capabilities
  • Simulate adaptive adversaries, considering every attack scenario and setting
  • Develop provable lower bounds of robustness, ensuring the performance of the evaluated defense will never fall below that level
  • Perform basic sanity tests, helping to identify anomalies that lead defenses to incorrect conclusions
  • Publicize source code, allowing the community to review for correctness

My Thoughts

This paper was very interesting and provided great insight into adversarial machine learning by surveying and expanding on various attack and defense methods. At the end of the day, attackers are just trying to bypass and fool the model, which in this case serves the purpose of image classification. Image classification models are not invincible, and they can be exploited even when their inner workings are not exposed, as black-box attacks demonstrate. As discussed in the paper, it is essential to consider all edge cases when developing a model and its defense.


Discussion Summary (What I remember from the discussion)

  • Why is this paper incorporated into the class if it surveys adversarial ML for image classification?
    • These attack and defense approaches can be directly translated to the malware detection space
    • Binary information can be transformed into representations that are similar to images
  • Who are the paper’s authors?
    • Members of the Brazilian Military
    • The authors have direct experience with the information they presented
  • The authors use the terms ‘adversarial attack’ and ‘adversarial example’ a lot; what is the difference?
    • There is no established difference yet across the community

My Thoughts

In general, the models used in malware detectors are no different from those used for other machine learning tasks. They simply operate on a different set of data and features while still striving to classify successfully. For instance, DNNs can be used in both natural language processing and malware detection. Thus, the hypothesized causes of adversarial examples, such as high non-linearity, apply equally to both domains. My takeaway is that using approaches from other ML/AI domains can be beneficial in the ML for cyber defenses domain. Additionally, it is eye-opening to see that the authors are members of the Brazilian military and have actually experienced these forms of attacks. This demonstrates the practicality of their survey and the push to develop widely established solutions.


That is all, thanks for reading!

