Research project

Untitled

Authors

Publications

Improving Active Learning Performance through the Use of Data Augmentation
Publication . Fonseca, João; Bação, Fernando; NOVA Information Management School (NOVA IMS); Information Management Research Center (MagIC) - NOVA Information Management School; Wiley
Active learning (AL) is a well-known technique to optimize data usage in training by interactively selecting, from a large pool of unlabeled data, the observations to be labeled by a supervisor. Its focus is to find the unlabeled observations that, once labeled, will maximize the informativeness of the training dataset, therefore reducing data-related costs. The literature describes several methods to improve the effectiveness of this process. Nonetheless, there is a paucity of research on the application of artificial data sources in AL, especially outside image classification or NLP. This paper proposes a new AL framework, which relies on the effective use of artificial data. It may be used with any classifier, generation mechanism, and data type and can be integrated with multiple other state-of-the-art AL contributions. This combination is expected to increase the ML classifier’s performance and reduce both the supervisor’s involvement and the amount of required labeled data at the expense of a marginal increase in computational time. The proposed method introduces a hyperparameter optimization component to improve the generation of artificial instances during the AL process as well as an uncertainty-based data generation mechanism. We compare the proposed method to the standard framework and an oversampling-based active learning method for more informed data generation in an AL context. The models’ performance was tested using four different classifiers, two AL-specific performance metrics, and three classification performance metrics over 15 different datasets. We demonstrated that the proposed framework, using data augmentation, significantly improved the performance of AL, both in terms of classification performance and data selection efficiency (all code and preprocessed data developed for this study are available at https://github.com/joaopfonseca/publications/).
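As a rough illustration of the selection step in an uncertainty-based AL iteration, the sketch below ranks pool observations by the least-confidence criterion. It is not the framework proposed in the paper: the function name `least_confident`, the example probabilities, and the choice of criterion are all assumptions made for illustration.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], n_queries: int = 1) -> List[int]:
    """Return the indices of the pool observations the classifier is least sure about."""
    # Least-confidence uncertainty: 1 minus the highest predicted class probability.
    uncertainty = [1.0 - max(row) for row in probabilities]
    # Rank pool indices from most to least uncertain.
    ranked = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    return ranked[:n_queries]


# Hypothetical class probabilities predicted for three unlabeled observations.
pool_probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
]

# Indices of the two observations that would be sent to the supervisor for labeling.
selected = least_confident(pool_probabilities, n_queries=2)
print(selected)
```

In a full AL loop, the selected observations would be labeled, added to the training set, and the classifier refit before the next selection round.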
Geometric SMOTE for imbalanced datasets with nominal and continuous features
Publication . Fonseca, Joao; Bacao, Fernando; Information Management Research Center (MagIC) - NOVA Information Management School; NOVA Information Management School (NOVA IMS); Elsevier Science B.V., Amsterdam.
Imbalanced learning can be addressed in three different ways: resampling, algorithmic modifications, and cost-sensitive solutions. Resampling, and specifically oversampling, is a more general approach than algorithmic and cost-sensitive methods. Since the proposal of the Synthetic Minority Oversampling TEchnique (SMOTE), various SMOTE variants and neural network-based oversampling methods have been developed. However, the options to oversample datasets with nominal and continuous features are limited. We propose Geometric SMOTE for Nominal and Continuous features (G-SMOTENC), based on a combination of G-SMOTE and SMOTENC. Our method modifies SMOTENC’s encoding and generation mechanism for nominal features while using G-SMOTE’s data selection mechanism to determine the center observation and its k-nearest neighbors, and G-SMOTE’s generation mechanism for continuous features. G-SMOTENC’s performance is compared against SMOTENC’s along with two other baseline methods, a state-of-the-art oversampling method and no oversampling. The experiment was performed over 20 datasets with varying imbalance ratios, numbers of metric and non-metric features, and target classes. We found a significant improvement in classification performance when using G-SMOTENC as the oversampling method. An open-source implementation of G-SMOTENC is made available in the Python programming language.
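For readers unfamiliar with how SMOTE-style oversamplers handle mixed feature types, the following minimal sketch may help. It shows the classic idea only: continuous features are interpolated between a center observation and one neighbor, and nominal features take the most frequent value among the neighbors. It does not reproduce G-SMOTENC's modified encoding or geometric generation region, and the function names and example data are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def interpolate_continuous(center: Sequence[float], neighbor: Sequence[float],
                           rng: random.Random) -> List[float]:
    """Place a synthetic point on the line segment between center and neighbor."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [c + gap * (n - c) for c, n in zip(center, neighbor)]


def majority_nominal(neighbors_nominal: Sequence[Sequence[str]]) -> List[str]:
    """For each nominal feature, take the most frequent value among the neighbors."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*neighbors_nominal)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical minority-class observations (continuous part).
center_continuous = [0.0, 0.0]
neighbor_continuous = [2.0, 4.0]

# Hypothetical nominal features of the center's nearest neighbors.
neighbors_nominal = [["red", "small"], ["red", "large"], ["blue", "large"]]

synthetic_continuous = interpolate_continuous(center_continuous, neighbor_continuous, rng)
synthetic_nominal = majority_nominal(neighbors_nominal)
print(synthetic_continuous, synthetic_nominal)
```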
G-SOMO
Publication . Douzas, Georgios; Rauch, Rene; Bação, Fernando; Information Management Research Center (MagIC) - NOVA Information Management School; NOVA Information Management School (NOVA IMS); Elsevier Science B.V., Amsterdam.
Traditional supervised machine learning classifiers struggle to learn from highly skewed data distributions, as they are designed to expect classes to contribute equally to the minimization of the classifier’s cost function. Moreover, the classifier’s design expects equal misclassification costs, causing a bias toward overrepresented classes. Different strategies have been proposed to correct this issue. Modifying the data set has become a common practice since the procedure is generalizable to all classifiers. Various algorithms to rebalance the data distribution through the creation of synthetic instances have been proposed in the past. In this paper, we propose a new oversampling algorithm named G-SOMO. The algorithm identifies optimal areas to create artificial data instances in an informed manner and utilizes a geometric region during the data generation process to increase their variability. Our empirical results on 69 datasets, validated with different classifiers and metrics against a benchmark of commonly used oversampling methods, show that G-SOMO consistently outperforms competing oversampling methods. Additionally, the statistical significance of our results is established.
Tabular and latent space synthetic data generation
Publication . Fonseca, Joao; Bacao, Fernando; Information Management Research Center (MagIC) - NOVA Information Management School; NOVA Information Management School (NOVA IMS); Springer
The generation of synthetic data can be used for anonymization, regularization, oversampling, semi-supervised learning, self-supervised learning, and several other tasks. Such broad potential motivated the development of new algorithms, specialized in data generation for specific data formats and Machine Learning (ML) tasks. However, one of the most common data formats used in industrial applications, tabular data, is generally overlooked: literature analyses are scarce, state-of-the-art methods are spread across domains or ML tasks, and there is little to no distinction among the main types of mechanism underlying synthetic data generation algorithms. In this paper, we analyze tabular and latent space synthetic data generation algorithms. Specifically, we propose a unified taxonomy as an extension and generalization of previous taxonomies, review 70 generation algorithms across six ML problems, distinguish the main generation mechanisms identified into six categories, describe each type of generation mechanism, discuss metrics to evaluate the quality of synthetic data, and provide recommendations for future research. We expect this study to help researchers and practitioners identify relevant gaps in the literature and design better and more informed practices with synthetic data.
Characterization of the Firm-Firm Public Procurement Co-Bidding Network from the State of Ceará (Brazil) Municipalities
Publication . Lyra, Marcos da Silva; Curado, António; Damásio, Bruno; Bação, Fernando; Pinheiro, Flávio L.; NOVA Information Management School (NOVA IMS); Information Management Research Center (MagIC) - NOVA Information Management School; Springer Nature
Fraud in public funding can have deleterious consequences for societies’ economic, social, and political well-being. Fraudulent activity associated with public procurement contracts accounts for losses of billions of euros every year. Thus, it is of utmost relevance to explore analytical frameworks that can help public authorities identify agents that are more susceptible to irregular activities. Here, we use standard network science methods to study the co-bidding relationships between firms that participate in public tenders issued by the 184 municipalities of the State of Ceará (Brazil) between 2015 and 2019. We identify 22 groups/communities of firms with similar patterns of procurement activity, defined by their geographic and activity scopes. The profiling of the communities allows us to highlight organizations that are more susceptible to market manipulation and irregular activities. Our work reinforces the potential application of network analysis in policy to unfold the complex nature of relationships between market agents in a scenario of scarce data.
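A co-bidding network of the kind described above can be sketched as a projection of tender participation onto firm pairs. The example below is an illustration under stated assumptions, not the authors' pipeline: the tender and firm names are invented, and weighting each edge by the number of shared tenders is one plausible choice among several.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def co_bidding_edges(tenders: Dict[str, List[str]]) -> Counter:
    """Count, for every pair of firms, the number of tenders in which both placed a bid."""
    edges: Counter = Counter()
    for bidders in tenders.values():
        # Sort and deduplicate so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(bidders)), 2):
            edges[pair] += 1
    return edges


# Hypothetical tenders mapped to the firms that bid on them.
tenders = {
    "tender_1": ["firm_a", "firm_b", "firm_c"],
    "tender_2": ["firm_a", "firm_b"],
    "tender_3": ["firm_c", "firm_d"],
}

edges = co_bidding_edges(tenders)
for (firm_u, firm_v), weight in sorted(edges.items()):
    print(firm_u, firm_v, weight)
```

The resulting weighted edge list could then be passed to a community detection method to look for groups of firms with similar bidding patterns.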

Organizational units

Description

Keywords

Contributors

Funders

Funding entity

Fundação para a Ciência e a Tecnologia

Funding programme

3599-PPCDT

Award number

DSAIPA/DS/0116/2019

ID