An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction
Dublin Core
Title
An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction
Subject
software defect prediction; machine learning; classification algorithm; imbalanced data; resampling
Description
Software defect prediction is a practical approach to improving the quality and efficiency of software testing processes. However, establishing robust and trustworthy models for software defect prediction is quite challenging because of the limited historical datasets that most developers are capable of collecting. The inherently imbalanced nature of most software defect datasets poses another problem. Therefore, an insight into how to properly construct software defect prediction models on a small, yet imbalanced, dataset is required. The objective of this study is to provide that insight by investigating and comparing a number of resampling techniques, classification algorithms, and evaluation measurements (metrics) for building software defect prediction models on the CM1 NASA PROMISE data as a representation of a small yet imbalanced dataset. This study is comparative descriptive research. It follows a positivist (quantitative) approach. Data were collected by observing experiments on four categories of resampling techniques (oversampling, undersampling, ensemble, and combined) combined with three categories of machine learning classification algorithms (traditional, ensemble, and neural network) to predict defective software modules on the CM1 NASA PROMISE dataset. Training processes were carried out twice, using 5-fold cross-validation and a 70% training and 30% testing data split (holdout), respectively.
Our results show that the combined and oversampling techniques have a positive effect on the performance of the models. In the context of classification models, ensemble-based algorithms that extend the decision tree classification mechanism, such as Random Forest and eXtreme Gradient Boosting, achieved sufficiently good performance for predicting defective software modules. Regarding the evaluation measurements, the combined and rank-based performance metrics yielded modest variance values and are therefore deemed suitable for evaluating the performance of the models in this context.
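The workflow summarized in the description (a holdout split, resampling, and evaluation with combined and rank-based metrics) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the record does not state which tools or specific resamplers were used, so random oversampling stands in for the oversampling category, F1 is assumed as an example of a combined metric, ROC AUC is assumed as an example of a rank-based metric, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def stratified_holdout(X, y, test_ratio=0.3, seed=0):
    """Split rows per class so train and test keep the original class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the majority count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def f1_score(y_true, y_pred, positive=1):
    """F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores, positive=1):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy data (NOT the CM1 dataset): 100 modules, 10 of them defective.
X = [[i] for i in range(100)]
y = [1] * 10 + [0] * 90

X_train, y_train, X_test, y_test = stratified_holdout(X, y, test_ratio=0.3)
# Resample the training part only, so duplicated rows never leak into the test set.
X_bal, y_bal = random_oversample(X_train, y_train)
print(Counter(y_train), Counter(y_bal), Counter(y_test))
```

Resampling only the training portion is a deliberate choice in this sketch: the test portion keeps the original class imbalance, so the reported metrics reflect the distribution a deployed model would face. The same idea applies within each fold when 5-fold cross-validation is used instead of a holdout split.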
Creator
Agung Fatwanto, Muh Nur Aslam, Rebbecah Ndugi, Muhammad Syafrudin
Source
https://jurnal.iaii.or.id/index.php/RESTI/article/view/5910/973
Publisher
Informatics Department, Faculty of Science and Technology, UIN Sunan Kalijaga, Yogyakarta, Indonesia
Date
14-10-2024
Contributor
FAJAR BAGUS W
Format
PDF
Language
ENGLISH
Type
TEXT
Files
Collection
Citation
Agung Fatwanto, Muh Nur Aslam, Rebbecah Ndugi, Muhammad Syafrudin, “An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction,” Repository Horizon University Indonesia, accessed January 26, 2026, https://repository.horizon.ac.id/items/show/10439.