Detailed Analyses and Efficient Identification of Malware Evidence in CLaMP Dataset based on Machine Learning Approaches

Dublin Core

Title

Detailed Analyses and Efficient Identification of Malware Evidence in CLaMP Dataset based on Machine Learning Approaches

Subject

Malware Identification; Machine Learning Algorithms; Feature Selection; Windows PE headers

Description

Malware is malicious software used to launch different types of attacks in computer networks and cyberspace. Several signature-based and machine learning-based approaches have been used to identify malware types in the past. However, signature-based detection approaches have been reported to have serious limitations, which has made machine learning-based malware identification techniques more popular. Despite the promise of ML methods in identifying malware evidence, some ML approaches in the literature have poor detection rates, which may result from the size and nature of the patterns in the datasets used. This study used a dataset named CLaMP for training and testing the malware classification models. First, comprehensive exploratory analyses of the dataset were carried out to better understand its data distributions and to make informed decisions on how to pre-process and apply it for malware identification. During the experiments, two scenarios were established before feeding the data into the learning algorithms. Scenario 1 involves building malware identification models without data cleaning or feature selection, while scenario 2 involves cleaning the data and selecting promising features for building the models. In scenario 2, the Recursive Feature Elimination (RFE) technique was used to select the promising attributes, which were then used to build the two malware classification models. Naive Bayes (NB) and Logistic Regression (LR) algorithms were used to build the models. The hyperparameters of the two selected algorithms were varied, and the models were tested and validated several times before optimal performance was reached. The results of the models were compared on the selected metrics, namely accuracy, precision, recall, F1-score and Area Under the Curve (AUC).
Experimental results showed that in scenario 1, where the dataset was not pre-processed and all attributes were used for model building, both models obtained poor results on all metrics except recall. However, the NB-based malware identification model performed slightly better than the LR-based model on all metrics. Both the NB-based and LR-based malware identification models performed well in scenario 2, when the dataset was pre-processed and promising features were selected using RFE. The study concluded that detailed exploratory analyses, data cleaning and feature subset selection helped in achieving promising results from the malware identification models.
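
The five comparison metrics named in the description can be illustrated with a minimal Python sketch. The function name `binary_metrics`, the 0.5 decision threshold, and the label and score values are all assumptions made for illustration; they are not taken from the study or from the CLaMP dataset. The second call shows one way recall can stay high while the other metrics are poor (a model that flags nearly every sample as malware); the description does not say whether this is what happened in scenario 1.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, recall, F1 and AUC for binary labels (1 = malware)."""
    # Turn scores into hard predictions using the (assumed) threshold.
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # AUC as the probability that a random malware sample scores higher
    # than a random benign sample (ties count as half).
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


# Made-up labels and classifier scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

metrics = binary_metrics(y_true, y_score)

# A threshold of 0.0 flags every sample as malware: recall is perfect,
# but accuracy and precision drop.
flag_all = binary_metrics(y_true, y_score, threshold=0.0)
```

AUC is computed from the raw scores, so it does not change with the threshold, while the other four metrics do.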

Creator

M. O. Ayinla, A. M. Oyelakin, U. A. Adeniyi, K. O. Tajudeen, O. J. Olaleye

Source

www.ijcit.com

Date

March 2025

Contributor

peri irawan

Format

pdf

Language

English

Type

text

Citation

M. O. Ayinla, A. M. Oyelakin, U. A. Adeniyi, K. O. Tajudeen, O. J. Olaleye, “Detailed Analyses and Efficient Identification of Malware Evidence in CLaMP Dataset based on Machine Learning Approaches,” Repository Horizon University Indonesia, accessed June 6, 2025, https://repository.horizon.ac.id/items/show/9190.