Primary supervisor
Sanoop Mallissery
Can a Transformer understand a software patch and predict whether it truly fixes a vulnerability, introduces a new weakness, or leaves the system still exploitable?
This is much more specific than normal vulnerability detection.
Instead of asking:
“Is this code vulnerable?”
we ask:
“Did this security patch actually fix the problem safely?”
In real software projects, security patches are often rushed. A patch may:
- fix only part of the vulnerability,
- introduce a new bug,
- miss an edge case,
- break compatibility,
- silently create another weakness,
- or require a second/follow-up patch later.
This project uses Transformer models to study code changes, not just static code.
The model will read:
- the old (vulnerable) code,
- the new (patched) code,
- the commit message,
- the CVE/CWE description,
- the security advisory,
- and possibly the issue discussion.
Then it predicts whether the patch is likely to be:
- Complete fix
- Partial fix
- Risky fix
- Regression-prone fix
- Possibly still vulnerable
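The inputs and output labels listed above can be sketched as a small data structure. This is only an illustration under stated assumptions, not a prescribed format: the names `PatchRecord`, `PatchLabel`, and `build_model_input` are hypothetical, and the `</s>` separator is a placeholder whose real value would depend on the tokenizer of the chosen model.

```python
from dataclasses import dataclass
from enum import Enum


class PatchLabel(Enum):
    """The five outcome classes from the list above (identifier names are illustrative)."""
    COMPLETE_FIX = 0
    PARTIAL_FIX = 1
    RISKY_FIX = 2
    REGRESSION_PRONE_FIX = 3
    POSSIBLY_STILL_VULNERABLE = 4


@dataclass
class PatchRecord:
    """One training example: the code change plus its textual context."""
    old_code: str
    new_code: str
    commit_message: str
    cve_cwe_description: str = ""
    security_advisory: str = ""
    issue_discussion: str = ""  # optional, often unavailable


# Assumed separator string; replace with the chosen tokenizer's separator token.
SEP = " </s> "


def build_model_input(record: PatchRecord) -> str:
    """Join the non-empty fields into one text sequence, text context first, code last."""
    parts = [
        record.commit_message,
        record.cve_cwe_description,
        record.security_advisory,
        record.issue_discussion,
        record.old_code,
        record.new_code,
    ]
    return SEP.join(p.strip() for p in parts if p.strip())


if __name__ == "__main__":
    example = PatchRecord(
        old_code="strcpy(buf, src);",
        new_code="strncpy(buf, src, sizeof(buf) - 1);",
        commit_message="Fix overflow",
    )
    print(build_model_input(example))
    print([label.name for label in PatchLabel])
```

Placing the natural-language fields before the code is an arbitrary choice in this sketch; long patches may exceed a model's input limit, so the ordering and truncation strategy would need to be decided experimentally.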
Aim/outline
The aim is to build a Transformer-based security patch analysis system that learns from real-world security patches and predicts patch quality/risk.
You will be working on:
- Collecting security patches from GitHub, CVE-linked commits, NVD references, or security advisories.
- Representing each patch as a code diff: deleted lines, added lines, surrounding context, and commit message.
- Fine-tuning a Transformer/code model such as CodeBERT, GraphCodeBERT, CodeT5, or a lightweight Transformer.
- Predicting whether a patch is clean, incomplete, risky, or likely to require follow-up fixes.
- Mapping patch behaviour to CWE categories where possible.
- Comparing Transformer results with simpler baselines such as TF-IDF, Random Forest, BiLSTM, or static-analysis signals.
- Optionally explaining which changed lines made the patch appear risky.
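The patch-representation step above (deleted lines, added lines, surrounding context) can be sketched with Python's standard `difflib` module. This is a minimal sketch that assumes the old and new versions of a file or function are already available as strings; the function name `split_patch` and the returned dictionary keys are hypothetical, and a real pipeline would more likely read diffs directly from Git.

```python
import difflib


def split_patch(old_code: str, new_code: str, context: int = 3) -> dict:
    """Split a code change into deleted lines, added lines, and a unified diff.

    The unified diff keeps `context` unchanged lines around each hunk,
    which serves as the "surrounding context" of the patch.
    """
    old_lines = old_code.splitlines()
    new_lines = new_code.splitlines()

    # Opcodes describe how to turn old_lines into new_lines.
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])

    unified = "\n".join(
        difflib.unified_diff(old_lines, new_lines, lineterm="", n=context)
    )
    return {"deleted": deleted, "added": added, "unified_diff": unified}


if __name__ == "__main__":
    old = "void f(char *s) {\n    char buf[16];\n    strcpy(buf, s);\n}"
    new = (
        "void f(char *s) {\n    char buf[16];\n"
        "    strncpy(buf, s, sizeof(buf) - 1);\n}"
    )
    patch = split_patch(old, new)
    print("deleted:", patch["deleted"])
    print("added:  ", patch["added"])
    print(patch["unified_diff"])
```

The deleted and added lines could feed a simple baseline (for example, token counts), while the unified diff text is one possible input for a Transformer model.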
Most AI security tools try to find bugs. This project goes one step deeper:
It asks whether the fix itself can be trusted.
That is very relevant to real-world software engineering, open-source security, DevSecOps, and supply-chain security.
As a student you can say: “My project builds an AI model that reads security patches and predicts whether the patch is likely to be complete, risky, or still vulnerable.”
That is a more distinctive contribution than yet another generic vulnerability classifier.
URLs/references
You may use:
- CVE-linked GitHub commits
- NVD references
- GitHub Security Advisories
- Big-Vul patch pairs
- CVEfixes dataset
- Defects4J-style bug-fix pairs, if security mapping is possible
- Linux kernel or Apache security commits, if scoped carefully
Security patch and vulnerability-fix datasets
- CVEfixes Dataset: CVE-linked vulnerability-fixing commits from open-source projects. Useful for extracting vulnerable code, fixed code, commit metadata, CVE, and CWE information.
- MoreFixes Dataset: large-scale dataset of CVE fix commits, useful for collecting real-world security patches across many GitHub projects.
- MegaVul Dataset: large C/C++/Java vulnerability dataset with vulnerable functions and fix commits, useful for patch-pair and vulnerability-type analysis.
- CodeXGLUE Code Refinement Dataset: useful for learning bug-fix/code-repair patterns, although not purely security-specific.
- GitHub Security Advisories: useful for collecting security advisory text, affected packages, severity, and linked patch commits.
- National Vulnerability Database (NVD): useful for CVE descriptions, CVSS scores, CWE mappings, references, and affected products.
Transformer / Code model references
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages: suitable for code-diff and commit-message representation.
- GraphCodeBERT: Pre-training Code Representations with Data Flow: useful if the project includes data-flow-aware patch understanding.
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code: useful for code understanding and possible patch explanation/repair generation.
- CodeXGLUE benchmark paper: useful as a reference for code intelligence tasks and evaluation design.
Required knowledge
Essential
- Python programming
- Basic machine learning and deep learning
- Cybersecurity fundamentals
- Secure coding concepts
- Software vulnerabilities such as buffer overflow, SQL injection, command injection, path traversal, insecure API use, and improper input validation
- Git/GitHub basics
- Basic understanding of commits, patches, and code diffs
- Evaluation metrics such as accuracy, precision, recall, F1-score, AUC, and confusion matrix
Useful
- PyTorch or TensorFlow
- Hugging Face Transformers
- CodeBERT, GraphCodeBERT, or CodeT5
- Pandas, NumPy, Scikit-learn
- Basic software engineering and version control
- C/C++, Java, or Python code-reading ability
- CVE, CWE, CVSS, and GitHub Security Advisories
- Static analysis tools such as Semgrep, CodeQL, or SonarQube
Nice to Have
- Program slicing
- Data-flow/control-flow analysis
- Explainable AI for code models
- Patch correctness analysis
- Security regression testing
- Automated program repair
- DevSecOps pipeline knowledge
- Docker and experiment tracking tools