Internship at the Bank of Portugal: Development of a dashboard to track corrections in granular credit data and a gender classification model

Vigueras, Julio César Rojas

http://hdl.handle.net/10362/175320

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TCDMAA3741.pdf		2.91 MB	Adobe PDF

Authors

Vigueras, Julio César Rojas

Advisor(s)

Henriques, Roberto André Pereira

Abstract(s)

The objective of my internship at the Bank of Portugal was to enhance data analytics capabilities by building tools to assess and improve the quality of the Central Credit Register database, thereby supporting data-driven decision-making. To this end, I collaborated on a dashboard that tracks corrections made to a copy of the Central Credit Register database. The dashboard is intended to help identify the institutions that need to improve the quality of the data they report to the Bank of Portugal, as well as the specific variables that require improvement. In addition, I collaborated on a gender classification model, based on a person's first name and nationality, to correct a variable in a table used by a public dashboard of the Bank of Portugal that draws its data from the Central Credit Register database. To create the dashboard, I followed the methodology described by Cole Nussbaumer Knaflic (Nussbaumer Knaflic, 2020), which consists of the following steps: considering the intended audience; crafting and polishing the message; planning the content; choosing the visuals; reviewing the result and dropping unnecessary information; refining it to highlight key information; improving the design; and planning the storytelling. To build a visually engaging dashboard, I carefully selected the preattentive attributes (size, colour, position, and shape, among others) of every element of the graphs and text, so as to direct the audience's attention to the intended points. To extract the required data, I ran SQL queries in the Power Query editor in Power BI.
To build the gender classification model, I first researched existing solutions to the problem and applied an algorithm from the Python library nomquamgender. I then collected information from different sources to build a list of names labelled with gender; in addition, a team member provided a very useful Harvard table of names and genders. Next, I created my own gender classification model. To do so, I cleaned the data to keep the first names in capital letters, preprocessed the names by romanizing them and dropping diacritics, and encoded the names as vectors. I then applied Multinomial Naive Bayes, Logistic Regression, Random Forest, and k-Nearest Neighbours, of which Random Forest performed best. After that, I compared the performance of nomquamgender, a model built by a team member, and the model I proposed against the Harvard table. Finally, we built a pipeline for the final model that combines our best models and solutions in order, starting with the highest-performing models and ending with the model that performed worst but could assign a gender to any name. We used Microsoft SQL Server to extract, collect, and aggregate the data, and Python to compare the models, calculate their performance, and create the final model. This report explains the process followed to create two tools designed to help the team assess and improve the data quality of the Central Credit Register database, thus contributing to the Bank of Portugal by reducing decision-making time and enhancing the quality of that database.
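
For illustration only, the name cleaning and the ordered pipeline described in the abstract might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation from the dissertation: the function names, the toy lookup table, and the catch-all rule are all hypothetical, and romanization of non-Latin scripts is left out (only diacritics are dropped). The idea shown is that stages are tried in order of reliability, and a stage that cannot decide returns `None` so that the next one is consulted.

```python
import unicodedata
from typing import Callable, Optional, Sequence

# A stage maps a cleaned first name to a label, or None if it cannot decide.
Classifier = Callable[[str], Optional[str]]


def normalize_first_name(full_name: str) -> str:
    """Keep the first name only, drop diacritics, and upper-case it."""
    parts = full_name.strip().split()
    if not parts:
        return ""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", parts[0])
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()


def classify_with_fallback(
    name: str, classifiers: Sequence[Classifier]
) -> Optional[str]:
    """Try each stage in the given order and return the first answer found."""
    for classify in classifiers:
        label = classify(name)
        if label is not None:
            return label
    return None


# Hypothetical toy data standing in for a curated table of names.
TOY_LOOKUP = {"MARIA": "F", "JOAO": "M"}


def lookup_stage(name: str) -> Optional[str]:
    """High-precision stage: answers only for names present in the table."""
    return TOY_LOOKUP.get(name)


def catch_all_stage(name: str) -> Optional[str]:
    """Placeholder last stage that always answers (a made-up toy rule)."""
    return "F" if name.endswith("A") else "M"


# Most reliable stage first, always-answering stage last.
STAGES = [lookup_stage, catch_all_stage]

if __name__ == "__main__":
    for raw in ["João Pedro", "María José", "Andrea"]:
        cleaned = normalize_first_name(raw)
        print(raw, cleaned, classify_with_fallback(cleaned, STAGES))
```

In a setting like the one the abstract describes, the early stages would be the highest-performing tables or models, and the final stage would be the lower-performing model that can label any name.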

Description

Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science

Keywords

Gender classification model; dashboard; credit data; SDG 10 - Reduced inequalities

URI

http://hdl.handle.net/10362/175320

Collections

NIMS - Dissertações de Mestrado em Ciência de Dados e Métodos Analíticos Avançados (Data Science and Advanced Analytics)

CC License

CC BY
