Publication

Imbalanced Learning: A comparative study of oversampling and undersampling techniques

Name: TCDMAA3129.pdf
Size: 3.73 MB
Format: Adobe PDF

Abstract

Imbalanced class distributions are a recurrent and challenging problem in classification, because most algorithms are designed under the assumption of balanced data. The imbalance often results in poor predictive performance on the minority class despite acceptable overall accuracy. A common and easily implemented remedy is resampling, which falls into three categories: oversampling, undersampling, and hybrid methods that combine both. However, the effectiveness of these techniques varies with dataset characteristics such as imbalance ratio, class overlap, and dimensionality. This study evaluates 10 resampling techniques on 35 benchmark datasets from various domains, using 4 different classifiers to mitigate classifier bias. Unlike many studies that focus on a single type of resampling, it examines all three categories together. It also offers a detailed analysis of average scores and rankings, which clarifies each technique's relative performance, and provides specific guidelines for selecting a resampling method based on the characteristics of each dataset. These findings aim to help practitioners make informed decisions when applying resampling to improve classification performance on imbalanced data.
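
For readers unfamiliar with the three resampling categories named in the abstract, the following is a minimal sketch of the simplest variant of each: purely random oversampling, undersampling, and a hybrid of the two. It is not taken from the dissertation, and the abstract does not say which 10 techniques were evaluated. All function names (`resample_to`, `random_oversample`, `random_undersample`, `hybrid_resample`), the toy data, and the choice of the midpoint as the hybrid target are assumptions made for illustration.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Map each class label to the list of row indices carrying it."""
    groups = {}
    for i, label in enumerate(y):
        groups.setdefault(label, []).append(i)
    return groups


def resample_to(X, y, target, seed=0):
    """Resample every class to exactly `target` rows (hypothetical helper).

    Classes with at least `target` rows are randomly undersampled without
    replacement; smaller classes keep all rows and are topped up with
    random duplicates (sampling with replacement).
    """
    rng = random.Random(seed)
    chosen = []
    for label, idx in _indices_by_class(y).items():
        if len(idx) >= target:
            chosen.extend(rng.sample(idx, target))
        else:
            chosen.extend(idx + rng.choices(idx, k=target - len(idx)))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def random_oversample(X, y, seed=0):
    """Duplicate minority rows until every class matches the largest class."""
    return resample_to(X, y, max(Counter(y).values()), seed)


def random_undersample(X, y, seed=0):
    """Drop majority rows until every class matches the smallest class."""
    return resample_to(X, y, min(Counter(y).values()), seed)


def hybrid_resample(X, y, seed=0):
    """Assumed hybrid: every class ends at the midpoint of the extreme sizes."""
    counts = Counter(y).values()
    return resample_to(X, y, (max(counts) + min(counts)) // 2, seed)


# Toy data: 90 majority rows (label 0) and 10 minority rows (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10

for name, fn in [("over", random_oversample),
                 ("under", random_undersample),
                 ("hybrid", hybrid_resample)]:
    _, y_res = fn(X, y)
    print(name, dict(Counter(y_res)))
```

On the toy data, oversampling yields 90 rows per class, undersampling 10 per class, and the hybrid 50 per class. In practice, resampling should be applied only to the training split, so that evaluation still reflects the original class distribution.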

Description

Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science

Keywords

Imbalanced Learning; Class Imbalance; Oversampling; Undersampling; Resampling
