| Name | Description | Size | Format |
|---|---|---|---|
| | | 2.06 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Traditional supervised machine learning classifiers struggle to learn from highly skewed
data distributions because they are designed to expect all classes to contribute equally to the
minimization of the classifier's cost function. Moreover, the classifier design assumes equal
misclassification costs, which biases predictions against underrepresented classes. Researchers
have therefore proposed different strategies to handle the issue. Modifying the data set has
become an established approach because the procedure generalizes to all classifiers.
Various algorithms that rebalance the data distribution by creating synthetic
instances have been proposed. In this paper, we propose a new oversampling
algorithm named G-SOMO, a method that is inspired by our previous research. The
algorithm identifies optimal areas to create artificial data instances in an informed manner
and utilizes a geometric region during the data generation to increase variability and to
avoid correlation.
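The idea of generating data inside a geometric region can be illustrated with a minimal sketch. This is not the G-SOMO algorithm itself: it skips the step that identifies suitable areas and simply draws each synthetic point uniformly from a hypersphere around a randomly chosen minority instance. The function names, the choice of a hypersphere, and the radius rule (distance to the nearest other minority instance) are all assumptions made for illustration.

```python
import math
import random


def sample_in_hypersphere(center, radius, rng):
    """Draw one point uniformly from the ball of the given radius around center."""
    dim = len(center)
    # A normalized vector of independent Gaussian draws gives a uniform direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling by u ** (1 / dim) spreads points uniformly over the volume,
    # instead of concentrating them near the center.
    scale = radius * rng.random() ** (1.0 / dim) / norm
    return [c + scale * d for c, d in zip(center, direction)]


def oversample(minority, n_new, seed=0):
    """Create n_new synthetic points around randomly chosen minority instances.

    Assumption: the region around each base instance is a hypersphere whose
    radius is the distance to the nearest other minority instance.
    Requires at least two minority instances.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        radius = min(
            math.dist(base, other) for other in minority if other is not base
        )
        synthetic.append(sample_in_hypersphere(base, radius, rng))
    return synthetic
```

Calling `oversample([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_new=20)` returns 20 new points, each lying within distance 1 of the minority instance it was generated from. Sampling from a region rather than along a line segment between two instances is one way to increase the variability of the generated data.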
Our experimental setup compares the performance of G-SOMO with a benchmark of
effective oversampling methods. The oversampling methods are repeatedly validated with
multiple classifiers on 69 datasets. Several metrics are used to compare the
results. To aggregate performance over all datasets, we use a
mean ranking.
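A mean ranking of this kind can be sketched as follows: rank the methods on each dataset by their score, then average each method's rank over all datasets. The helper names, the tie handling (tied methods share the average rank), and the assumption that higher scores are better are illustrative choices, not details taken from the dissertation.

```python
def rank_descending(scores):
    """Rank methods so the highest score gets rank 1; ties share the average rank."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        # Extend j to cover the run of methods tied with the one at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mean_ranking(results):
    """Map {dataset: {method: score}} to {method: mean rank over datasets}."""
    collected = {}
    for scores in results.values():
        for method, rank in rank_descending(scores).items():
            collected.setdefault(method, []).append(rank)
    return {method: sum(r) / len(r) for method, r in collected.items()}


# Made-up scores for three hypothetical methods on two hypothetical datasets.
toy_results = {
    "dataset_1": {"method_a": 0.9, "method_b": 0.8, "method_c": 0.7},
    "dataset_2": {"method_a": 0.6, "method_b": 0.6, "method_c": 0.5},
}
print(mean_ranking(toy_results))
```

Ranks are used instead of raw scores because scores on different datasets are not directly comparable, while a rank only records how the methods order on each dataset.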
G-SOMO consistently outperforms the competing oversampling methods, and the
results are shown to be statistically significant.
Description
Dissertation presented in partial fulfillment of the requirements for a Master's degree in Data Science and Advanced Analytics
Keywords
Oversampling; Imbalanced Learning; Clustering; Synthetic Data Generation
