Date of Award

8-1-2024

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Mathematics

First Advisor

Hansapani Rodrigo

Second Advisor

Sanjeev Kumar

Third Advisor

Zhuanzhuan Ma

Abstract

Class imbalance is a common issue in real-world machine learning applications such as medical diagnosis and fraud detection, where one class significantly outnumbers the other(s). Conventional methods often yield biased models that favor the majority class, which degrades performance on the minority class. To address this issue, techniques such as Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and NearMiss have been employed to adjust the class distribution. However, these techniques may be less effective when the feature space is noisy, high-dimensional, or contains irrelevant features. This thesis presents a novel approach that combines class rebalancing techniques with feature elimination strategies and then passes the selected features through a Random Forest (RF) and an Artificial Neural Network (ANN) for in-depth analysis. The study focuses on feature elimination methods such as Chi-Square, Information Gain, Logistic Regression, Recursive Feature Elimination (RFE), LASSO, and Decision Tree-based importance to identify and discard non-informative features, thus streamlining the models and potentially reducing overfitting. The distinctive feature of this study is a dual-modeling approach that combines the strengths of RF and ANN to analyze feature importance rankings and complex patterns in imbalanced datasets. By passing each selected feature through both models, we provide a deeper understanding of feature behavior and model performance. The study uses four datasets (Heart Disease, Fraud Detection, Breast Cancer, and IT Customer Churn), each presenting its own challenges and class imbalance scenarios, for a comprehensive evaluation of the proposed methods.
Moreover, a thorough benchmarking analysis was conducted comparing the performance of conventional classifiers on the original imbalanced datasets with those using our integrated approach of class rebalancing and feature elimination. This comparative assessment not only demonstrates the effectiveness of our method in various class imbalance scenarios but also evaluates the impact of each class rebalancing technique when combined with advanced predictive modeling. This study presents an integrated solution that addresses class imbalance through established resampling techniques and enhances predictive modeling using a unique feature elimination and dual-modeling approach. The findings of this study provide valuable insights and practical guidance for practitioners dealing with imbalanced datasets, aiming to improve model accuracy, interpretability, and generalization in real-world applications.
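
As a rough illustration of the two building blocks named in the abstract, the sketch below shows a SMOTE-style oversampler and a Pearson chi-square score used to rank features. It is not the thesis implementation, which this page does not describe; in practice, libraries such as imbalanced-learn and scikit-learn are commonly used for these steps. The function names (`smote_like_oversample`, `chi_square_score`) and the toy data are hypothetical and use only the Python standard library.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, SMOTE-style.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def chi_square_score(feature, labels):
    """Pearson chi-square statistic of a categorical feature vs. the class label."""
    n = len(labels)
    score = 0.0
    for f in set(feature):
        for y in set(labels):
            observed = sum(1 for a, b in zip(feature, labels) if a == f and b == y)
            expected = feature.count(f) * labels.count(y) / n
            score += (observed - expected) ** 2 / expected
    return score


# Toy data (assumed for illustration, not taken from the thesis datasets).
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_like_oversample(minority, n_new=6, k=2)

labels = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "informative": [1, 1, 1, 0, 0, 0, 0, 0],  # mirrors the label exactly
    "noisy": [1, 0, 1, 0, 1, 0, 1, 0],        # close to independent of the label
}
scores = {name: chi_square_score(col, labels) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(len(new_points), ranked)
```

A feature elimination step in this style would keep the top-ranked features and drop the rest before fitting a classifier. Whether resampling is applied before or after feature ranking is a design choice this sketch leaves open.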

Comments

Supplemental Material Asare iterations_Updated_06_14_2024.xlsx (261 kB)

Recommended Citation

Asare, M. (2024). Evaluating Feature Selection Methods in Machine Learning With Class Imbalance [Master's thesis, The University of Texas Rio Grande Valley]. ScholarWorks @ UTRGV. https://scholarworks.utrgv.edu/etd/1590

Evaluating Feature Selection Methods in Machine Learning With Class Imbalance
