Theses and Dissertations
Date of Award
8-1-2024
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Mathematics
First Advisor
Hansapani Rodrigo
Second Advisor
Sanjeev Kumar
Third Advisor
Zhuanzhuan Ma
Abstract
Class imbalance is a common issue in real-world machine learning applications such as medical diagnosis and fraud detection, where one class significantly outnumbers the other(s). Conventional methods often produce biased models that favor the majority class, which can degrade predictive performance on the minority class. To address this issue, techniques such as the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and NearMiss have been employed to adjust the class distribution. However, these techniques may be less effective when the feature space is high-dimensional or contains noisy or irrelevant features. This thesis presents a novel approach that combines class rebalancing techniques with feature elimination strategies and then passes the selected features through a Random Forest (RF) and an Artificial Neural Network (ANN) for in-depth analysis. The study focuses on feature elimination methods such as Chi-Square, Information Gain, Logistic Regression, Recursive Feature Elimination (RFE), LASSO, and Decision Tree-based importance to identify and discard non-informative features, thus streamlining the models and potentially reducing overfitting. The distinctive feature of this study is its dual-modeling approach, which draws on the strengths of both RF and ANN, namely feature importance ranking and complex pattern recognition, in the context of imbalanced datasets. By passing each selected feature through both models, we provide a deeper understanding of feature behavior and model performance. The study uses four datasets (Heart Disease, Fraud Detection, Breast Cancer, and IT Customer Churn), each presenting its own challenges and class imbalance scenarios, for a comprehensive evaluation of the proposed methods.
Moreover, a thorough benchmarking analysis compared the performance of conventional classifiers on the original imbalanced datasets with that of classifiers using our integrated approach of class rebalancing and feature elimination. This comparative assessment not only demonstrates the effectiveness of our method in various class imbalance scenarios but also evaluates the impact of each class rebalancing technique when combined with advanced predictive modeling. This study presents an integrated solution that addresses class imbalance through established resampling techniques and enhances predictive modeling using a unique feature elimination and dual-modeling approach. The findings provide valuable insights and practical guidance for practitioners dealing with imbalanced datasets, aiming to improve model accuracy, interpretability, and generalization in real-world applications.
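For readers unfamiliar with the rebalancing step named in the abstract, the following is a minimal sketch of SMOTE-style oversampling: each synthetic minority sample is placed on the line segment between a minority point and one of its k nearest minority neighbours. It is an illustration only, not the implementation used in the thesis; the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and only the Python standard library is used.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of equal-length numeric feature vectors from one class.
    (Hypothetical helper for illustration; not taken from the thesis.)
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 3 minority samples against 6 majority samples.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5]]
majority_count = 6
new_points = smote_sketch(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))
```

In a full pipeline of the kind the abstract describes, the rebalanced data would then go through feature elimination and on to the RF and ANN models; those stages are omitted here.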
Recommended Citation
Asare, Martha, "Evaluating Feature Selection Methods in Machine Learning With Class Imbalance" (2024). Theses and Dissertations. 1590.
https://scholarworks.utrgv.edu/etd/1590
Comments
Copyright 2024 Martha Asare. https://proquest.com/docview/3116456316