GSoC Project Proposal: GlitchWitcher: AI-assisted Bug Prediction
GlitchWitcher: AI-assisted Bug Prediction
Description
Defects exist in source code. Some are easily found during code review; others are revealed by proper unit or integration testing. But even before code is reviewed or executed during testing, there are other ways to detect where bugs may be lurking.
The intent of this project is to trial and compare two approaches to predicting where defects live in source code. Many research papers discuss algorithms for finding defects; this project will focus on implementing two of them.
Approach 1: Predicting Faults from Cached History
This first approach is a relatively simple, inexpensive technique for predicting where bugs live. It is outlined in this research paper [1]. We would revive an earlier prototype called BugTool [2], a utility that applies the BugCache/FixCache algorithm to selected GitHub repositories and returns a score for each file in the repository, as per the algorithm outlined in the paper. Files with the highest hit rates are the most likely to contain defects, so they are the most important to cover thoroughly with tests.
This phase of the project is not expected to take very long; it is mainly a warm-up for analyzing source code and for integrating a new verification 'check' into the workflow of a GitHub repository. The utility would report the 10 files most likely to contain defects.
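The core of FixCache can be sketched in a few lines. The sketch below is a simplified illustration, not the BugTool implementation: it keeps a fixed-size LRU cache of files, counts a "hit" whenever a bug-fix commit touches a cached file, and loads the fixed file into the cache on a miss. The real algorithm in [1] also preloads large/recently changed files and fetches co-changed files; those refinements are omitted here, and the function name is our own.

```python
from collections import OrderedDict

def fixcache_hit_rates(commits, cache_size):
    """Toy FixCache. `commits` is a chronological list of
    (files_changed, is_bug_fix) tuples. Returns per-file hit counts
    for bug-fix commits, using LRU eviction."""
    cache = OrderedDict()   # file -> None, ordered oldest-to-newest
    hits = {}
    for files, is_bug_fix in commits:
        for f in files:
            if is_bug_fix:
                if f in cache:
                    hits[f] = hits.get(f, 0) + 1   # hit: file was already cached
                # hit or miss, (re)load the faulty file as most recent
                cache.pop(f, None)
                cache[f] = None
                while len(cache) > cache_size:
                    cache.popitem(last=False)      # evict least recently used
            elif f in cache:
                cache.move_to_end(f)               # refresh recency on a normal change
    return hits
```

Files would then be ranked by hit count and the top 10 reported, mirroring the utility's intended output.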
Approach 2: Reconstruction Error Probability Distribution (REPD) model
The second approach utilizes the anomaly detection/classification model outlined in this research paper [3] to categorize code as defective or non-defective. This approach is much more involved. Section 3 of the paper describes the model in use, while section 4 describes the methodology. The authors train against datasets from the NASA ESDS Data Metrics project [4].
As part of this project, participants are asked to reproduce the REPD model described in the paper, apply it to the data used by the researchers to see whether similar results are found, and then apply it to a separate C/C++ code base such as OpenJ9 [5] or OpenJDK [6] (or both).
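To make the REPD idea concrete, here is a heavily simplified sketch of its classification step. The model in [3] trains an autoencoder, then fits a probability distribution to each class's reconstruction errors and labels new examples by comparing likelihoods. In this sketch, a PCA projection stands in for the autoencoder and plain Gaussians stand in for the paper's fitted distributions, purely to keep it dependency-free; all function names are illustrative, not from the paper's code.

```python
import numpy as np

def reconstruction_errors(X, X_train_clean, k=1):
    """Stand-in reconstructor: project onto the top-k principal
    components of the non-defective training data and return the L2
    reconstruction error per row (an autoencoder in the real model)."""
    mu = X_train_clean.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train_clean - mu, full_matrices=False)
    P = Vt[:k]                      # top-k principal directions
    Xc = X - mu
    return np.linalg.norm(Xc - Xc @ P.T @ P, axis=1)

def fit_repd(errors_clean, errors_defective):
    """Fit one error distribution per class (Gaussians here; the
    paper fits richer distributions to the reconstruction errors)."""
    fit = lambda e: (e.mean(), e.std() + 1e-9)
    return fit(errors_clean), fit(errors_defective)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_defective(error, params_clean, params_defective):
    """Label an example defective when its reconstruction error is more
    likely under the defective-class distribution than the clean one."""
    return gaussian_pdf(error, *params_defective) > gaussian_pdf(error, *params_clean)
```

Reproducing the paper would mean swapping the PCA stand-in for the autoencoder and distribution choices described in sections 3 and 4, then feeding in the NASA datasets.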
Ideally, one outcome of this project would be a comparison of the results of Approach 1 versus Approach 2 when applied against the same codebase. A second outcome of this work would be to incorporate an interim verification check against a source code repository, perhaps run on the cadence of every new tag being applied.
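One simple way to quantify such a comparison is the overlap between the two approaches' top-N risk rankings. The helper below is a sketch we propose for illustration (the function name and metric choice are ours, not from either paper): it computes the overlap coefficient between the top-N files of two rankings.

```python
def top_n_overlap(scores_a, scores_b, n=10):
    """Overlap coefficient between the top-n files of two risk
    rankings. scores_a/scores_b map file path -> risk score,
    where a higher score means a riskier file."""
    top = lambda s: set(sorted(s, key=s.get, reverse=True)[:n])
    a, b = top(scores_a), top(scores_b)
    return len(a & b) / max(1, min(len(a), len(b)))
```

An overlap near 1.0 would suggest the two approaches flag largely the same files; a low overlap would suggest they capture complementary signals.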
Reference Links
[1] https://web.cs.ucdavis.edu/~devanbu/teaching/289/Schedule_files/Kim-Predicting.pdf
[2] https://github.com/adoptium/aqa-test-tools/tree/master/BugPredict/BugTool
[3] https://www.sciencedirect.com/science/article/abs/pii/S0164121220301138
[4] https://www.earthdata.nasa.gov/about/data-metrics
[5] https://github.com/eclipse-openj9/openj9
[6] https://github.com/adoptium/jdk
Links to Eclipse Projects / Repositories
https://projects.eclipse.org/projects/adoptium.aqavit
https://projects.eclipse.org/projects/adoptium.temurin
https://projects.eclipse.org/projects/technology.openj9
https://github.com/adoptium/aqa-tests
https://github.com/adoptium/aqa-test-tools
https://github.com/eclipse-openj9/openj9
https://github.com/adoptium/jdk (mirror of upstream repository)
Expected outcomes
- Trialing 2 different approaches (implemented as static analysis 'utilities') to predict source code defects in a given source code base
- A comparison of the 2 approaches (do they identify the same files in a code base as 'most likely' containing bugs?)
- An additional way to flag areas of code that need more scrutiny during code reviews and a greater emphasis during testing
- A verification check (or workflow, a.k.a. GlitchWitcher) that runs these static analysis utilities against pull requests in a repository
Skills required/preferred
- Languages & Frameworks: Python (for ML and automation), Git APIs, NLP libraries (e.g., spaCy, BERT, GPT-based models). Awareness of different classifiers (Gaussian Naive Bayes, logistic regression, k-nearest neighbors, decision tree, and hybrid SMOTE-ensemble) and statistical analysis will be helpful.
- CI/CD Integration: GitHub Actions, Jenkins
- Database & Storage: MongoDB (or PostgreSQL/MySQL) for storing historical build data and test results.
- Deployment: integration with current development workflows and pipelines
Project size
350 hours
Possible mentors:
- Lan Xia lan_xia@ca.ibm.com
- Longyu Zhang longyu.zhang@ibm.com
- Shelley Lambert slambert@redhat.com
Rating
medium - hard