Theses and Dissertations

Date of Award

8-1-2024

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Mathematics

First Advisor

Hansapani Rodrigo

Second Advisor

Sanjeev Kumar

Third Advisor

Zhuanzhuan Ma

Abstract

Class imbalance, where one class significantly outnumbers the other(s), is a common issue in real-world machine learning applications such as medical diagnosis and fraud detection. Conventional methods often yield biased models that favor the majority class, which degrades predictive performance on the minority class. To address this issue, techniques such as the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and NearMiss have been employed to adjust the class distribution. However, these techniques may not be effective when the feature space is noisy, high-dimensional, or contains irrelevant features. This thesis presents a novel approach that combines class rebalancing techniques with feature elimination strategies and then passes the selected features through a Random Forest (RF) and an Artificial Neural Network (ANN) for in-depth analysis. The study focuses on feature elimination methods such as Chi-Square, Information Gain, Logistic Regression, Recursive Feature Elimination (RFE), LASSO, and Decision Tree-based importance to identify and discard non-informative features, thus streamlining the models and potentially reducing overfitting. The distinctive feature of this study is a dual-modeling approach that combines the strengths of both RF and ANN, namely feature importance ranking and complex pattern recognition, in the context of imbalanced datasets. By passing each selected feature through both models, we provide a deeper understanding of feature behavior and model performance. The study utilizes four datasets (Heart Disease, Fraud Detection, Breast Cancer, and IT Customer Churn), each presenting its own challenges and class imbalance scenarios, for a comprehensive evaluation of the proposed methods.
Moreover, a thorough benchmarking analysis was conducted, comparing the performance of conventional classifiers on the original imbalanced datasets with that of classifiers using our integrated approach of class rebalancing and feature elimination. This comparative assessment not only demonstrates the effectiveness of our method in various class imbalance scenarios but also evaluates the impact of each class rebalancing technique when combined with advanced predictive modeling. This study presents an integrated solution that addresses class imbalance through established resampling techniques and enhances predictive modeling using a unique feature elimination and dual-modeling approach. The findings provide valuable insights and practical guidance for practitioners dealing with imbalanced datasets who aim to improve model accuracy, interpretability, and generalization in real-world applications.
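
As an illustration of the class rebalancing step mentioned in the abstract, the following is a minimal sketch of SMOTE-style oversampling written with the Python standard library only. It is not the thesis's implementation: the function name `smote_like_oversample`, the parameters `k` and `seed`, and the toy data are assumptions made for this example. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy data (made up for illustration): 8 majority points, 3 minority points.
majority = [[float(x), float(y)] for x in range(4) for y in range(2)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints: 8 8
```

In practice, a maintained implementation such as the one in the imbalanced-learn library would typically be used instead, and resampling would be applied to the training split only so that the test data keeps its original class distribution.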

Comments

Copyright 2024 Martha Asare. https://proquest.com/docview/3116456316
