NOTE: This blog is for a special topics course at Texas A&M (ML for Cyber Defenses). During each lecture a student presents information from the assigned paper. This blog summarizes and further discusses each topic.

During this seminar, Nhat Nguyen presented "Examining Zero-Shot Vulnerability Repair with Large Language Models." After their presentation, our class had an open discussion related to the paper and more. This blog post will cover a summary of the information presented as well as a summary of our class discussion.

Presentation Summary


Introduction

  • LLMs show promise in assisting with coding tasks, including translating between programming languages and explaining code, leveraging vast training datasets that include both secure and insecure code snippets
  • The paper explores the potential of “off-the-shelf” LLMs to generate security patches without additional training, focusing on fixing security vulnerabilities, influenced by context in prompts and aiming for zero-shot generation
  • Results show that while LLMs can produce security fixes in simple scenarios, their effectiveness in real-world applications is limited

Background and Motivation

  • Security bugs
    • Developers often write insecure code, which leads to security bugs that are hard to detect and are cataloged in databases like MITRE’s CWE
    • Tools like static analyzers (e.g., those cataloged by OWASP) and run-time sanitizers help identify and understand these bugs, with sanitizers instrumenting code to catch errors early
    • While patches are currently developed mostly by hand, there is ongoing research into using LLMs for automatic program repair to make securing software more efficient
  • Prompting LLMs
    • LLMs predict sequences of tokens in response to prompts, similar to advanced autocomplete, to complete tasks like function body completion
    • They use byte pair encoding (BPE) for efficient text processing, with tokens representing common character sequences and allowing customization of vocabulary size
    • Outputs are tailored through parameters like temperature and top p, generating realistic continuations up to a chosen length or until a logical stopping point is reached (see the sampling sketch at the end of this list)
  • Studied off-the-shelf LLMs
    • The paper evaluates the performance of several LLMs for code replacement in program repair, including OpenAI’s Codex, AI21’s Jurassic-1, and the ‘polycoder’ model
    • The LLMs studied are treated as black boxes, mainly due to their complexity and commercial value
    • GitHub Copilot is excluded from the study due to access restrictions and its reliance on Codex, since its performance would largely overlap with the directly evaluated Codex models
  • Design of ‘gpt2-csrc’
    • Utilizing a local LLM allows unlimited sample generation without API constraints and full control over the model for detailed experimentation
    • The gpt2-csrc LLM was trained with a unique dataset of C/C++ code from popular Debian packages, using a specialized BPE tokenizer for efficient code representation
    • Training involved a GPT2-774M model on NVIDIA hardware over four GPU-months, employing specific techniques for optimized learning
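To make the temperature and top p knobs concrete, here is a minimal sketch of sampling a code completion with the HuggingFace transformers library. The public "gpt2" checkpoint is used only as a stand-in for the models in the paper, and the parameter values are illustrative, not the paper's:

```python
# Minimal sketch: sampling a code completion with temperature / top-p.
# The public "gpt2" checkpoint stands in for the models studied in the
# paper (Codex, Jurassic-1, polycoder, gpt2-csrc); values are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "// copy the user-supplied name into a fixed-size buffer\n"
    "int copy_name(char *dst, const char *src) {\n"
)
inputs = tokenizer(prompt, return_tensors="pt")

completion = model.generate(
    **inputs,
    do_sample=True,          # sample tokens instead of greedy decoding
    temperature=0.8,         # higher temperature = more diverse continuations
    top_p=0.95,              # nucleus sampling: keep the top 95% probability mass
    max_new_tokens=64,       # stop after a chosen length
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(completion[0], skip_special_tokens=True))
```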

From Prompts to Prompt Engineering

  • LLMs can perform many tasks in a zero-shot setting when prompts are carefully constructed, though because predictions are probabilistic, the output may fail to align with the user’s intent
  • Prompt engineering is emerging as a field, with studies showing that the outputs of models like GitHub’s Copilot are sensitive to the composition of prompts, including both code and comments
  • Effective prompting involves selecting relevant parts of the source file, such as comments, imports, existing code, and the target completion area, to guide the model’s recommendations
  • The entire source file can serve as a prompt if within the model’s token limit; otherwise, selection is crucial for specific recommendations, suggesting modifications like adding comments or code snippets
  • Recommendations for prompts include starting with comments, data, or code, emphasizing the importance of style, verbosity, and naming conventions, with comments serving multiple purposes such as marking bugs or announcing fixes (a small example of such a prompt follows this list)
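Below is an illustrative sketch of a comment-annotated repair prompt in the spirit described above. The comment wording and the CWE-89 snippet are made up for this example, not taken from the paper:

```python
# Illustrative repair prompt: it ends right where the model is expected to
# write the fixed code, and comments mark the flagged line and announce the
# fix. The comment wording and the CWE-89 example are invented for this sketch.
REPAIR_PROMPT = '''\
import sqlite3

def get_user(db, username):
    # BUG (CWE-89, SQL injection): user input is concatenated into the query
    # cursor = db.execute("SELECT * FROM users WHERE name = '" + username + "'")
    # FIXED VERSION:
'''

# The model is asked to continue REPAIR_PROMPT and would ideally produce a
# parameterized query such as:
#     cursor = db.execute("SELECT * FROM users WHERE name = ?", (username,))
```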

Synthetic Experimentation

  • Overview
    • Generated a large synthetic collection of buggy programs and tested repair capabilities of Codex LLMs with various parameters, later expanding the study to other LLMs and domains
    • Conducted experiments in two stages: first, identifying optimal parameters for LLM code repair, and second, examining the effect of different prompt structures on the quality of code fixes provided by LLMs
    • Utilized a desktop PC setup for the experiments and adopted CodeQL for static bug analysis, with all code and data made openly available
  • Model parameter sweep: vulnerable program generation
    • Parameter impact: studied how “temperature” and “top p” parameters affect the generation of vulnerable code by LLMs, focusing on consistent bug types and prompt texts
    • CWE Selection: focused on two high-impact CWEs: CWE-787 (Out-of-bounds Write) and CWE-89 (SQL Injection), chosen for their severity and direct detectability from code
    • Synthetic generation: used OpenAI Codex to generate vulnerable programs in C and Python for CWE-787 and CWE-89, respectively, by completing provided program beginnings and assessing them for vulnerabilities
    • Testing and evaluation: generated programs were compiled and evaluated for both functionality and security flaws using unit tests and CodeQL, yielding a dataset of unique, functional but vulnerable programs (a rough sketch of this generate-and-check loop appears after this section’s list)
    • Findings: generated 95 unique programs with CWE-787 and 22 with CWE-89, noting that higher temperatures resulted in fewer CWE-787 vulnerabilities but more CWE-89 vulnerabilities, reflecting Codex’s training data biases
  • Model parameter sweep: vulnerable program repair
    • Used the 95 CWE-787 and 22 CWE-89 programs, creating repair scenarios from the faulty code identified by CodeQL, with repair prompts that included CodeQL’s findings as comments
    • Conducted experiments to find optimal generation parameters, resulting in 47,500 and 11,000 repair scenarios for CWE-787 and CWE-89, respectively
    • Noted that original generation without bugs performed better, suggesting further exploration of repair prompt contents is needed
    • Determined that no single temperature/top p setting works for all scenarios, opting for varied temperature settings with top p fixed for future experiments
  • Prompt engineering and hand-crafted vulnerable code
    • Expanded experiment scope by increasing prompt variety, exploring complex scenarios, and comparing multiple LLMs
    • Designed scenarios around MITRE’s “Top 25” security weaknesses, focusing on high-impact and concrete bugs from Python web development and C programming
    • Developed various repair prompt templates with differing amounts of context to test the LLMs’ ability to fix bugs, inspired by user guides and GitHub searches (sketched after this list)
    • Results showed variable LLM performance across different prompts and scenarios, but each scenario was successfully repaired by at least one LLM and prompt combination
    • High-context prompts generally yielded better results for complex bug-fixes, suggesting more detailed prompts improve the LLMs’ generation of secure, functional code
    • OpenAI Codex models performed best at generating successful patches, highlighting the importance of broad training data and detailed prompt design
  • Repairing hardware CWEs
    • Studied LLMs on code completion for Verilog, evaluating ability to fix hardware CWEs
    • Designed experiments with two Hardware CWEs, focusing on simplicity for LLM understanding
    • Found LLMs less adept at Verilog than C/Python, implemented post-processing for improvement
    • Observed better LLM performance with less context, suggesting models favor simpler, correct solutions
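The generate-and-check loop from the parameter sweep can be pictured roughly as below. The complete, passes_tests, and flags_cwe helpers are placeholders standing in for the LLM API, the functional unit tests, and the CodeQL scan respectively, and the parameter grids are illustrative rather than the paper's exact values:

```python
import itertools

# Hypothetical wrappers: complete() calls whichever LLM is being swept, while
# passes_tests() and flags_cwe() stand in for the unit tests and the CodeQL
# scan used in the paper. Bodies are omitted on purpose.
def complete(prompt: str, temperature: float, top_p: float) -> str: ...
def passes_tests(program: str) -> bool: ...
def flags_cwe(program: str, cwe: str) -> bool: ...

# Illustrative parameter grids for the sweep.
TEMPERATURES = [0.00, 0.25, 0.50, 0.75, 1.00]
TOP_PS = [1.00, 0.75, 0.50, 0.25]

def sweep(prompt: str, cwe: str, samples_per_setting: int = 10):
    """Collect functional-but-vulnerable completions for every parameter pair."""
    vulnerable = []
    for temperature, top_p in itertools.product(TEMPERATURES, TOP_PS):
        for _ in range(samples_per_setting):
            program = prompt + complete(prompt, temperature, top_p)
            # Keep only programs that work (pass the unit tests) yet still
            # contain the targeted weakness according to the static analyzer.
            if passes_tests(program) and flags_cwe(program, cwe):
                vulnerable.append((temperature, top_p, program))
    return vulnerable
```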
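The repair-prompt templates with differing amounts of context can be sketched like this; the template names and comment text are invented for illustration, while the paper defines its own set of templates:

```python
# Illustrative repair-prompt templates with increasing amounts of context.
# Template names and wording are invented for this sketch; the paper's own
# templates range from "no help" up to prompts that embed the buggy code and
# the analyzer's message as comments.
def build_prompt(template: str, prefix: str, buggy_line: str, bug_msg: str) -> str:
    if template == "no_context":
        # Just the code up to the faulty region; the model regenerates it blind.
        return prefix
    if template == "comment_only":
        # Tell the model that a fix follows, but not what was wrong.
        return prefix + "# BUGFIX:\n"
    if template == "full_context":
        # Include the flagged line (commented out) and the analyzer's message.
        return (
            prefix
            + f"# BUG ({bug_msg}):\n"
            + f"# {buggy_line}\n"
            + "# FIXED VERSION:\n"
        )
    raise ValueError(f"unknown template: {template}")
```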

Experiments with Real-World CVEs

  • Overview
    • Investigating real-world CVEs in public projects to assess LLMs’ code repair abilities, highlighting larger, more realistic scenarios
    • Challenges include the inability to provide full context due to token limits, necessitating a pre-generation reduction step
  • ExtractFix dataset
    • Selected 12 vulnerabilities from the ExtractFix dataset for analysis, based on criteria like proof-of-concept availability, localized patches, and comprehensive test suites
    • Prepared for repair by identifying patches, building projects with appropriate sanitizers, and running regression tests to ensure fixes didn’t break functionality
    • Used developer-provided patches to localize vulnerabilities, highlighting the focus on using language models for zero-shot bug fixing in the context of security vulnerabilities
  • Code reduction and suggestion consolidation
    • Token limits of models constrain the code that can be input and generated, affecting complex real-world scenarios
    • To fit within these limits, the code is reduced and summarized while preserving the relevant context, guided by assumptions about what a developer would reasonably know about the bug
    • After generation, the model’s response is integrated with the existing code, either by finding overlap with the code that follows for continuity or by inserting the fix directly if no overlap exists (see the consolidation sketch after this section’s list)
    • New templates were introduced for comment styles and error message handling, improving the bug-patching process
  • Findings
    • An LLM repaired 8 out of 12 projects, where “repaired” means the program passes its functional tests and no longer crashes on the vulnerability-triggering input
    • Compared to the state-of-the-art ExtractFix tool, which repaired 10 out of 12, some LLM fixes were implausible, fixing bugs but introducing others
    • Despite not being trained for repair and working in a zero-shot setting, the LLM’s performance is remarkable, even fixing a scenario ExtractFix could not
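Here is a minimal sketch of the suggestion-consolidation step, assuming a simple line-overlap heuristic; the function and its parameters are illustrative rather than the paper's exact procedure. In this setting, prefix would be the reduced code up to the patch location and suffix the code after it:

```python
def splice_suggestion(prefix: str, suggestion: str, suffix: str,
                      min_overlap_lines: int = 3) -> str:
    """Merge an LLM completion back into the original source file.

    If the tail of the suggestion reproduces the code that originally followed
    the patched region, cut the suggestion there so nothing is duplicated;
    otherwise insert the suggestion as-is in front of the remaining code.
    """
    sug_lines = suggestion.splitlines()
    suf_lines = suffix.splitlines()

    # Search for the longest run of trailing suggestion lines that matches
    # the beginning of the suffix (the unchanged code after the patch site).
    for overlap in range(min(len(sug_lines), len(suf_lines)),
                         min_overlap_lines - 1, -1):
        if sug_lines[-overlap:] == suf_lines[:overlap]:
            sug_lines = sug_lines[:-overlap]
            break

    return "\n".join([prefix.rstrip("\n"), *sug_lines, *suf_lines]) + "\n"
```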

Discussion on LLMs’ Reliability

  • LLMs successfully repaired 100% of synthetic and hand-crafted programs for specific security issues, generating thousands of potential patches across various test scenarios
  • In real-world projects with historical CVEs, LLMs achieved a lower success rate, fully repairing 8 out of 12 projects with nearly a thousand effective patches out of over 19,000 attempts
  • Despite the promising results in controlled scenarios, the reliability of LLMs for automatic program repair in complex, real-world contexts remains uncertain, highlighting the need for further development and evaluation

Discussion Summary


  • The main problem addressed in this paper is not only finding bugs, but also repairing them
  • LLMs know how to code, so people expect them to be able to fix code
    • Not always true
    • However, the assumption is that if an LLM knows how to code, then it knows when its code is wrong
    • That assumption is again not always true, but as demonstrated in this paper, the LLM was successful on 8 of the 12 real-world projects
  • The use of LLMs is not perfect but has extremely high potential
  • Prior to LLMs how were bugs found?
    • Code analysis
    • Fuzzing (generating random inputs to trigger paths that make the program crash)
    • Fuzzers have virtually no context about the program under test
    • LLMs are much more accessible
    • Regression testing: a piece of the pipeline that ensures no new bugs or errors are introduced when a new feature is pushed or updated
    • We should also perform security regression testing (something still to be developed for software engineering)
  • This past weekend (3/30-4/1) brought news of a very impressive backdoor attack on XZ (a compression tool)
    • The attack was prepared over roughly two years, with the attacker gaining legitimate maintainer access to the project and inserting the backdoor through that trusted position
    • It was discovered essentially by accident, by an engineer who noticed unusual performance while running unrelated benchmarks
    • Had the backdoored XZ release shipped in stable Linux distributions, it would have reached nearly every system running an SSH server
  • How efficient should LLMs be?
    • They don’t need to be perfect, as long as they eventually become better than humans at the task
  • Bugs that can be found in HDL
    • Errors in the logic description (i.e., in the described functionality)
    • Race conditions
    • Deadlocks
    • Glitches in hardware (logic bits shifting), sometimes due to silicon not being perfect
    • LLMs can’t detect those problems, so other verification tools are needed
  • A pipeline of solutions is the way to go

That is all, thanks for reading!

