Different Approaches of Machine Learning Models in Credit Risk: A Case Study on Default on Credit Cards

Gonsalves, Eduardo Barreto Sulz

http://hdl.handle.net/10362/150753

Use this identifier to cite or link to this record.

Name:	Description:	Size:	Format:
TEGI2849..pdf		1.04 MB	Adobe PDF

Author

Gonsalves, Eduardo Barreto Sulz

Advisor(s)

Damásio, Bruno Miguel Pinto

Abstract

Credit scoring is an important process for banks. It allows credit analysts to estimate the probability that a client will default on a payment within a specific time horizon. This helps a bank manage its assets, prepare ahead of time for possible defaults, and decide whether to grant or deny a loan to a new client. Several machine learning classifiers can be used to estimate the probability of default. Studies have shown that no single model is best in all circumstances; each model's performance depends on the dataset. In this study, six machine learning models are applied to datasets to classify and predict which clients are more likely to default on credit. The six models, namely Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, k-NN and Naïve Bayes, were chosen because they are among the most frequently used techniques in this field and because of the lack of studies comparing these six specifically. The goal of the comparison is to identify whether any model consistently outperforms the others. Three datasets are used. The first is the German Credit Data, with socioeconomic information about clients requesting a loan. The second is the Credit Card Default Dataset, with historical information about clients' previous credit card invoice payments. Both datasets are from the UCI Machine Learning Repository. The last dataset, obtained from Kaggle, concerns credit concession and contains sociodemographic information about the clients. AUC is the main metric used to compare the models, followed by the confusion matrix. The random forest model presents the highest AUC on all datasets, while the ranking of the other models varies with the dataset.
Finally, the decision tree presented a poor AUC, since it does not calculate probabilities, but it had one of the best accuracies of all models on two of the three datasets.
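
The two evaluation metrics named in the abstract, AUC and the confusion matrix, can be illustrated with a minimal sketch. This is not the dissertation's code, which is not reproduced on this page. The function names, the toy labels and the toy scores below are all hypothetical. The sketch computes AUC as the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, and it shows one possible reading of the remark about the decision tree: when a model supplies only hard 0/1 labels instead of graded probabilities, many pairs tie and the AUC tends to drop.

```python
from typing import Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_matrix(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels, where 1 means default."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, y_pred):
        if y == 1 and p == 1:
            tp += 1
        elif y == 1 and p == 0:
            fn += 1
        elif y == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical toy data: 1 = client defaulted, 0 = client did not default.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
# Hypothetical predicted probabilities of default from some classifier.
proba = [0.1, 0.2, 0.6, 0.3, 0.8, 0.7, 0.4, 0.55]
# The same predictions reduced to hard labels at a 0.5 threshold.
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = roc_auc(y_true, proba)   # graded scores
auc_from_hard = roc_auc(y_true, hard)     # hard labels only, many ties
tn, fp, fn, tp = confusion_matrix(y_true, hard)
accuracy = (tn + tp) / len(y_true)

print(f"AUC (probabilities): {auc_from_proba:.3f}")
print(f"AUC (hard labels):   {auc_from_hard:.3f}")
print(f"Confusion matrix (tn, fp, fn, tp): {(tn, fp, fn, tp)}")
print(f"Accuracy: {accuracy:.3f}")
```

On this toy data the hard-label AUC is lower than the probability-based AUC even though both come from the same predictions, which is why AUC and accuracy can rank the same model differently.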

Description

Dissertation presented as a partial requirement for obtaining a Master's degree in Statistics and Information Management, with a specialization in Risk Analysis and Management

Keywords

Credit Scoring; Logistic Regression; Random Forest; Decision Tree; k-NN; SVM; Naïve Bayes

Collections

NIMS - Dissertações de Mestrado em Estatística e Gestão da Informação (Statistics and Information Management)

CC License

CC BY (Creative Commons Attribution)
