G-SOMO : an oversampling approach based on self-organized map oversampling and geometric SMOTE

Rauch, Rene

http://hdl.handle.net/10362/63811

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TAA0033.pdf		2.06 MB	Adobe PDF

Authors

Rauch, Rene

Advisor(s)

Bação, Fernando José Ferreira Lucas

Abstract(s)

Traditional supervised machine learning classifiers struggle to learn highly skewed data distributions because they are designed to expect all classes to contribute equally to the minimization of the classifier's cost function. Moreover, classifier design assumes equal misclassification costs, causing a bias against underrepresented classes. Researchers have therefore proposed different strategies to handle the issue. Modifying the data set has become an established approach because the procedure generalizes to all classifiers. Various algorithms that rebalance the data distribution by creating synthetic instances have been proposed in the past. In this paper, we propose a new oversampling algorithm named G-SOMO, a method inspired by our previous research. The algorithm identifies optimal areas in which to create artificial data instances in an informed manner and uses a geometric region during data generation to increase variability and avoid correlation. Our experimental setup compares the performance of G-SOMO with a benchmark of effective oversampling methods. The oversampling methods are repeatedly validated with multiple classifiers on 69 datasets. Different metrics are used to compare the insights obtained. To aggregate performance over all datasets, a mean ranking is introduced. G-SOMO consistently outperforms the competing oversampling methods. The statistical significance of our results is confirmed.

Description

Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics

Keywords

Oversampling; Imbalanced Learning; Clustering; Synthetic Data Generation

URI

http://hdl.handle.net/10362/63811

Collections

NIMS - Master's Dissertations in Data Science and Advanced Analytics (Ciência de Dados e Métodos Analíticos Avançados)

CC License

CC BY