DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification (Seminar 12.2)
NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.
During this seminar, Richa Sharma presented DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification. After their presentation our class had an open discussion related to the paper and more. This blog post will cover a summary of the information presented as well as a summary of our class discussion.
Presentation Summary
Introduction
- The traditional defense against malware relies on analysts manually examining samples and crafting signatures for antivirus updates; this process is slow, and malware developers can keep their software undetected and spreading by making minimal code modifications
- Several attempts at automatic malware signature generation have been made, targeting specific vulnerabilities or malware features; however, these methods often fail as malware authors can slightly alter their software to bypass these signatures
- This paper introduces a new method using deep belief networks (DBNs) to generate robust, invariant malware signatures that effectively detect new variants without relying on specific malware attributes, demonstrating a high classification accuracy of 98.6% in tests
Related Works
- Conventional methods for malware signature generation are often ineffective against zero-day malware, as they generally rely on detecting known behaviors or traffic patterns, which can be easily modified by new malware variants
- Various approaches, such as Autograph, Honeycomb, and PAYL, attempt to enhance signature detection by analyzing network traffic and generating signatures based on traffic patterns, frequent byte sequences, or anomaly detection in network flows
- Many anti-virus programs focus on analyzing executable files to identify malware, but new techniques like Auto-Sign improve resilience by generating multiple signatures from executable segments; however, malware can bypass this by encrypting executables or slightly altering code
Proposed Signature Generation Method
- Program behavior as binary vector
- Sandboxes are specialized environments used to log the behavior of programs, such as malware, capturing details like API calls, file activities, and network accesses, with logs typically saved as text files
- Analysts typically review these sandbox logs manually when developing malware signatures; to process them automatically, the logs must first be converted into fixed-size inputs, for example with natural language processing techniques such as unigram extraction
- The proposed method simplifies the process by treating sandbox logs as plain text and extracting unigrams directly, including markup and tags, to form a dictionary of frequently occurring terms
- This dictionary is used to transform each sandbox log into a binary vector based on the presence of these unigrams, with irrelevant data expected to be filtered out by the learning system during analysis.
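The unigram-to-bit-vector step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_dictionary` and `to_binary_vector`, the whitespace tokenization, and the toy log strings are all hypothetical, and dropping terms that appear in every log is a choice made here because such terms cannot help tell samples apart.

```python
from collections import Counter

def build_dictionary(logs, max_terms=20000):
    """Pick the most frequent unigrams, skipping ones present in every log."""
    doc_freq = Counter()
    for log in logs:
        # Count each term once per log (document frequency).
        doc_freq.update(set(log.split()))
    # Assumption: a term found in every log carries no information.
    candidates = [(t, c) for t, c in doc_freq.items() if c < len(logs)]
    # Most frequent first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    return [t for t, _ in candidates[:max_terms]]

def to_binary_vector(log, dictionary):
    """Return one bit per dictionary term: 1 if the term occurs in the log."""
    present = set(log.split())
    return [1 if term in present else 0 for term in dictionary]

# Toy stand-ins for sandbox logs (hypothetical content, markup kept as plain text).
logs = [
    "<call> CreateFileW </call> <call> RegSetValueExW </call>",
    "<call> CreateFileW </call> <call> connect </call>",
    "<call> connect </call> <call> WriteProcessMemory </call>",
]
dictionary = build_dictionary(logs, max_terms=4)
vectors = [to_binary_vector(log, dictionary) for log in logs]
```

Markup tokens such as `<call>` are tokenized like any other word, matching the idea of treating the log as plain text; here they drop out only because they occur in every toy log.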
- Training a deep belief network
- With program behavior encoded as a binary vector, the method trains a deep stack of denoising autoencoders on these vectors, aiming to produce a signature that is resilient to small code changes in malware
- Autoencoders are trained to recreate their input at the output layer, using a smaller number of neurons in the hidden layer to learn a higher level representation, with denoising autoencoders introducing noise to improve generalization and reduce overfitting
- The training involves constructing a deep belief network through layer-wise training of multiple autoencoders, where each layer’s weights are “frozen” before training the next, resulting in a hierarchical structure that transforms large input vectors into compact high-level outputs
- The final output of this deep network serves as an invariant “signature” of a program: the network maps a 20,000-sized input vector to a 30-sized output vector that compactly represents the program's behavior and is intended to stay stable across malware variants.
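To make the layer-wise idea concrete, here is a minimal NumPy sketch of greedy training for a stack of denoising autoencoders. It is not the paper's implementation: the helper names, sigmoid activations, cross-entropy reconstruction loss, plain gradient descent, random toy data, and tiny layer sizes (40 inputs reduced to 3 outputs) are all assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_layer(X, n_hidden, noise=0.2, epochs=200, lr=0.5, seed=0):
    """Train one denoising autoencoder; return its encoder weights and bias."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))  # encoder weights
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.1, size=(n_hidden, n_in))  # decoder weights
    c = np.zeros(n_in)
    for _ in range(epochs):
        # Corrupt the input by zeroing roughly `noise` of its entries.
        mask = rng.random(X.shape) >= noise
        X_noisy = X * mask
        H = sigmoid(X_noisy @ W + b)
        X_hat = sigmoid(H @ V + c)
        # Cross-entropy with sigmoid outputs: the gradient at the output
        # pre-activation is (reconstruction - clean input).
        d_out = (X_hat - X) / len(X)
        d_hid = (d_out @ V.T) * H * (1.0 - H)
        V -= lr * H.T @ d_out
        c -= lr * d_out.sum(axis=0)
        W -= lr * X_noisy.T @ d_hid
        b -= lr * d_hid.sum(axis=0)
    return W, b

def train_stack(X, layer_sizes, **kwargs):
    """Greedy layer-wise training: each trained layer is frozen, then feeds the next."""
    layers, inputs = [], X
    for n_hidden in layer_sizes:
        W, b = train_denoising_layer(inputs, n_hidden, **kwargs)
        layers.append((W, b))  # never updated again in this sketch
        inputs = sigmoid(inputs @ W + b)
    return layers

def encode(X, layers):
    """Run inputs through the frozen encoders to obtain compact signatures."""
    out = X
    for W, b in layers:
        out = sigmoid(out @ W + b)
    return out

# Toy binary "behavior vectors" (random data, for illustration only).
rng = np.random.default_rng(1)
X = (rng.random((60, 40)) < 0.3).astype(float)
layers = train_stack(X, layer_sizes=[16, 8, 3], epochs=50)
signatures = encode(X, layers)
```

The decoder of each autoencoder is discarded once its layer is trained; only the encoders are kept and chained, which is what turns a wide input into a short signature.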
Implementation and Experimental Results
- Malware dataset and sandbox
- Dataset Composition: The dataset comprises 1,800 malware samples, split into six categories (Zeus, Carberp, SpyEye, Cidox, Andromeda, and DarkComet) with 300 variants in each category, provided by C4 Security
- Malware Impact: These malware categories have caused significant global damage, with hundreds of variants developed to evade detection by antivirus programs, remaining undetected until specifically identified and analyzed
- Criminal Use and Enforcement: The malware has been used for various criminal activities, infecting millions of computers worldwide. Despite over a hundred arrests by global law enforcement agencies, these malware variants are still actively used
- Technical Details: Variants often incorporate code from other malware, especially Zeus, complicating classification. Each sample in the dataset is analyzed in a Cuckoo sandbox, and the outputs are converted into 20,000-sized bit-strings for machine learning purposes
- Research Setup: The dataset is split into 1,200 training samples and 600 test samples, with 200 and 100 samples from each category respectively, to facilitate the development of a learning module that can predict the correct category of malware based on sandbox analysis outputs
- Training the DBN
- The deep denoising autoencoder is trained with eight layers using dropout for regularization, where during training, approximately half of the hidden units are randomly omitted to prevent reliance on other units and allow efficient model averaging
- Instead of traditional activation functions like logistic or tanh, rectified linear units (ReLU) are used to enhance training speed and alleviate the problem of gradient vanishing in deep networks
- The network setup includes training on a GPU for efficiency, with parameters like a noise ratio of 0.2, 1000 training epochs per layer, and a gradually decreasing learning rate, culminating in the generation of a 30-sized vector as a program’s signature
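The three training ingredients mentioned above (ReLU activations, dropout of about half the hidden units, and a decreasing learning rate) can be illustrated with a short NumPy sketch. The function names, the "inverted" dropout scaling, the exponential decay schedule, and the numeric values passed in are assumptions made for this example, not settings confirmed by the paper.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return np.maximum(0.0, z)

def dropout(h, rate, rng, training=True):
    """Randomly zero a fraction `rate` of units during training.

    Uses inverted dropout (an assumption): surviving units are scaled up
    so that no rescaling is needed at test time.
    """
    if not training or rate == 0.0:
        return h
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)

def decayed_lr(initial_lr, epoch, decay=0.995):
    """Hypothetical schedule: shrink the learning rate a little every epoch."""
    return initial_lr * decay ** epoch

rng = np.random.default_rng(0)
h = relu(rng.normal(size=(4, 1000)))                    # toy hidden activations
h_train = dropout(h, rate=0.5, rng=rng)                 # training-time pass
h_test = dropout(h, rate=0.5, rng=rng, training=False)  # test-time pass

# Fraction of active units that dropout silenced in the training pass.
dropped = float(np.mean(h_train[h > 0] == 0))
```

Because a different random half of the units is removed on every pass, no unit can rely on a specific partner being present, which is the intuition behind dropout acting like an average over many thinned networks.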
- Experimental results
- DeepSign converts the 1,800 malware vectors into 30-sized signatures; a t-SNE visualization of these signatures shows samples clustering by malware family, which suggests the signatures capture invariant characteristics of each family
- Supervised classification on these signatures using an SVM classifier achieves 96.4% accuracy, while a k-nearest neighbor approach yields 95.3% accuracy, demonstrating that the signatures are effective for classifying malware by family
- Further improvement is explored by using the weights from the DBN as initial weights for a deep supervised neural network, resulting in a classification accuracy of 98.6% on test data
- The authors acknowledge the remaining clustering and classification errors, attributing them to code shared among malware families and to the general difficulty of telling those families apart, even with advanced antivirus tools
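As a rough picture of how classification on top of the signatures works, the sketch below runs a k-nearest neighbor classifier over 30-dimensional vectors. Everything here is synthetic: the signatures are random Gaussian clusters standing in for six families, and the helper names `knn_predict` and `sample`, the value of `k`, and the sample counts are assumptions. The accuracy it produces says nothing about the figures reported in the paper.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Label each test vector by majority vote among its k nearest training vectors."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

rng = np.random.default_rng(0)
n_families, dim = 6, 30  # six families, 30-sized signatures
centers = rng.normal(0.0, 1.0, size=(n_families, dim))

def sample(n_per_family):
    """Draw synthetic signatures scattered around each family's center."""
    X = np.concatenate(
        [c + 0.3 * rng.normal(size=(n_per_family, dim)) for c in centers]
    )
    y = np.repeat(np.arange(n_families), n_per_family)
    return X, y

train_X, train_y = sample(20)  # stand-in for the training split
test_X, test_y = sample(10)    # stand-in for the held-out test split
pred = knn_predict(train_X, train_y, test_X)
accuracy = float(np.mean(pred == test_y))
```

A classifier this simple only works well when same-family signatures already sit close together, which is why decent k-NN or SVM accuracy is read as evidence that the unsupervised signatures themselves are informative.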
Concluding Remarks
- The paper reviews past methods of generating malware signatures and introduces a new approach using deep belief networks (DBNs), addressing the issue of malware variants evading detection by modifying code
- The novel method involves running malware in a sandbox, converting the log to a binary bit-string, and processing it through an 8-layer deep neural network to produce a 30-value output used as the malware signature
- Experimental results demonstrate the effectiveness of this unsupervised deep learning approach in generating invariant signatures for malware detection, suggesting its applicability to challenging domains beyond traditional areas like computer vision and speech recognition