NIMS - Master's Dissertations in Data Science and Advanced Analytics

Formerly: Master's Dissertations in Advanced Analytics

Recent submissions

Showing 1 - 10 of 765
  • Time-space analysis of car accidents in Portugal
    Publication . Ribeiro, Tomás Francisco; Painho, Marco Octávio Trindade; Costa, Ana Cristina Marinho da
    Road traffic accidents represent a persistent public health and socio-economic challenge, making the understanding of their spatial and temporal dynamics essential for effective prevention. This study analyzes road accidents in Portugal between 2021 and 2023 using an integrated spatiotemporal approach supported by Geographic Information Systems and advanced analytical methods. An Exploratory Data Analysis provides an initial characterization of accident patterns, followed by the construction of a space–time cube to examine temporal trends and spatial concentrations. Spatiotemporal hotspot analysis, space-time cluster and outlier analysis, and time series clustering are applied to identify persistent, emerging, and anomalous accident patterns across the national territory. A focused assessment of high-risk areas further contrasts accident characteristics in critical zones with those observed elsewhere. The results reveal clear spatial concentrations, seasonal behaviors, and exposure-driven patterns, particularly in metropolitan regions. Overall, the study offers a comprehensive understanding of accident dynamics in Portugal and provides evidence-based insights to support targeted road safety interventions and policy development.
  • Urban Sentinel Signals for Extreme Electricity Demand Ramp Forecasting: Evidence from Portugal’s Power System
    Publication . Pereira, Ricardo Nunes; Neves, Maria de Fátima dos Santos Trindade
    The rapid expansion of variable renewable energy has made extreme electricity demand ramp events (abrupt, large-magnitude changes in consumption over hourly intervals) a growing threat to grid stability in systems such as Portugal's, where renewable penetration exceeded 68% of national demand in 2025. Despite advances in short-term load forecasting, most existing approaches focus on average demand behavior and rarely address extreme ramp events as a distinct prediction problem. This study investigates whether machine learning models can effectively predict extreme electricity demand ramps in Portugal and whether urban-scale signals from Lisbon can improve predictive performance at the national level. To address this problem, a two-stage machine learning pipeline was developed: Stage 1 formulates ramp event detection as a binary classification task, whilst Stage 2 estimates the signed ramp magnitude through regression. Two datasets were evaluated: a national baseline derived from E-REDES grid data and an enhanced dataset incorporating Lisbon substation-level load statistics and local meteorological variables. Multiple models were tested, and XGBoost achieved the strongest classification performance on the baseline dataset, reaching a PR-AUC of 0.73 and a recall of 0.85, recovering 86 of 101 extreme ramp events despite a severe 19:1 class imbalance. For ramp value estimation, LightGBM achieved the best results on the enhanced dataset, with a MAE of 49.5 kWh and R² of 0.94, whilst Random Forest produced the strongest baseline regression results. SHAP analysis confirmed that ramp volatility, temporal patterns, and urban substation variability were key predictive drivers. However, the severe underestimation of the April 2025 Iberian blackout event highlighted the limitations of data-driven forecasting models when confronted with historically unprecedented disruptions.
Overall, the findings demonstrate meaningful but constrained predictive capacity for extreme demand ramp forecasting in Portugal. The results provide partial support for the proposed urban sentinel hypothesis, suggesting that Lisbon substation signals may contribute incremental contextual information for national ramp prediction, particularly in the regression task. The study further contributes to the emerging literature on explainable, urban-aware early-warning systems for smart grid resilience under increasing renewable volatility.
  • Understanding Online Purchasing Behavior Under Fraud Risk: An extended Theory of Planned Behavior model integrating gamification, fraud perception, influencers, and FoMO
    Publication . Azinheira, Ana Marta Oliveira; Tam Chuem Vai, Carlos
    With the fast growth of e-commerce, online fraud has emerged as a major challenge that threatens consumer confidence and shapes online purchasing behavior. This study examines how fraud perception influences consumers’ attitudes, intentions, and behavior in online shopping contexts. Grounded in the theory of planned behavior, the model is tested using survey data from 407 respondents. The results confirm the explanatory power of attitude, subjective norms, and perceived behavioral control in predicting behavioral intention. Fraud perception negatively influences attitude but does not directly reduce intention, operating instead through indirect effects. Influencers significantly affect purchase behavior but not behavioral intention, acting as situational triggers. These findings offer relevant theoretical extensions and practical insights.
  • Intelligent Chatbot to Support Access to PT2030 Incentives: An Application in the Context of IAPMEI
    Publication . Oliveira, António José Pratas de; Neves, Maria de Fátima dos Santos Trindade
    This dissertation presents the development and evaluation of a chatbot designed to support access to information related to PT2030 incentives and IAPMEI procedures. The proposed system is based on a Retrieval-Augmented Generation (RAG) architecture that combines a Large Language Model (LLM) with a hybrid retrieval strategy integrating dense semantic search and sparse lexical matching. The knowledge base was constructed by web scraping official institutional sources, followed by data cleaning, preprocessing, chunking, and vectorization. To optimize system performance, multiple configurations were systematically tested, including variations in embedding models, chunk size and overlap, dense-sparse retrieval weighting, number of retrieved context chunks, and the inclusion of reranking mechanisms. Performance evaluation was conducted using the RAGAS framework, assessing Answer Relevancy, Faithfulness, Context Precision, and Context Recall, and was complemented by a questionnaire-based stakeholder evaluation to capture perceptions of usability, response quality, and overall satisfaction. The results indicate that hybrid retrieval strategies with balanced or slightly sparse-weighted configurations achieved the most robust performance, while appropriately tuned chunking strategies contributed to stronger answer relevancy and faithfulness across tested scenarios. Reranking mechanisms, by contrast, introduced additional latency without providing consistent performance improvements. Stakeholder feedback further suggested that the chatbot was perceived as intuitive, useful, and valuable as an institutional support tool. The findings demonstrate the practical viability of RAG-based conversational systems for domain-specific public sector information retrieval, while highlighting key considerations for optimizing retrieval architectures in institutional chatbot deployments.
  • Urban Pressures and Rental Market Affordability in Lisbon: An Interpretable Machine Learning Approach
    Publication . Ferreira, Miguel Dias Fonseca; Neves, Maria de Fátima dos Santos Trindade
    The housing market of Lisbon has experienced a deterioration in rental affordability over the last decade, driven by the interaction of multiple urban pressures rather than a single factor. While tourism and the expansion of short-term rentals have received substantial attention, rental affordability is also shaped by broader structural forces, including demographic change, housing stock constraints, urban amenities, and construction activity. This study applied a data-driven, interpretable machine learning framework to model advertised rental affordability at the parish-month level in Lisbon between 2016 and 2025. The analysis integrated heterogeneous data sources capturing market signals, tourism pressures, urban dynamics, and structural sociodemographic characteristics. Among the evaluated tree-based algorithms, an Augmented LightGBM model achieved superior predictive performance (R² of 0.90). Model outputs were interpreted using explainable artificial intelligence (XAI) techniques, namely SHAP values and Individual Conditional Expectation (ICE) plots. The results identify Short-Term Rental (STR) density as the dominant inflationary driver, establishing a price ceiling, while also highlighting the restrictive impact of vacant housing stock retention and the moderating potential of urban rehabilitation pipelines. Methodologically, the study bridges predictive accuracy and policy relevance. In practice, it delivers the Urban Policy Simulator, a dynamic Decision Support System (DSS) that allows municipal policymakers to simulate the financial impacts of housing policies in real time, effectively transforming a complex machine learning pipeline into a transparent, actionable urban auditing tool.
  • Explainable Deep Learning in Lung Cancer Oncology: A Multimodal Late Fusion Architecture Using Grad-CAM
    Publication . Notø, Solveig Ødegaard; Castelli, Mauro
    Lung cancer is a leading cause of cancer-related mortality, making early detection through Computed Tomography (CT) critical. Although deep learning models show strong potential for diagnosis, their lack of transparency limits trust in clinical settings. This study develops a multimodal, multi-task 3D deep learning framework to simultaneously classify non-small cell lung cancer histological subtypes and predict binary survival risk using CT scans and clinical metadata from the NSCLC-Radiomics dataset. The methodology uses a late fusion architecture combining a pre-trained 3D ResNet-18 for spatial features and a multi-layer perceptron for clinical history, alongside 3D Grad-CAM to visualize the results. Quantitatively, the model achieved 72.00% accuracy for binary histology classification and 43.86% for survival risk stratification, demonstrating the limitations of predicting long-term survival from a single static scan. The qualitative evaluation via 3D Grad-CAM revealed shortcut learning: despite statistical convergence, the network frequently based its predictions on irrelevant background noise instead of true tumor pathology. These results highlight a gap between statistical accuracy and true clinical reasoning, indicating that relying on quantitative metrics alone is risky. Incorporating spatial explainability and tracking patient data over time is essential for the safe use of artificial intelligence in clinical oncology.
  • Multimodal Matching of Jobs and VET Courses in Portugal: Integrating Semantic and Geospatial Analysis
    Publication . Cordeiro, Jorge Miguel Estanqueiro Galo da Silva; Henriques, Roberto André Pereira; Santos, Ricardo Miguel Costa
    Aligning Vocational Education and Training (VET) with labor market demands is critical for economic development and workforce readiness. However, accurately mapping educational course syllabi to real-world job descriptions remains challenging, particularly in data-sparse institutional environments. Traditional retrieval methods relying on exact keyword matching suffer from vocabulary mismatch, while modern Cross-Encoder architectures capture deep semantic context but introduce prohibitive quadratic computational complexity. To address these bottlenecks, this thesis proposes a highly scalable recommendation pipeline leveraging Bi-Encoder architectures (Siamese Networks). Independently mapping Portuguese VET courses and job descriptions into a shared dense vector space enables rapid and efficient semantic matching via cosine similarity. Furthermore, to overcome prohibitive manual annotation costs, the study introduces an "LLM-as-a-Judge" methodology to evaluate semantic alignment on a nuanced 4-point scale, rigorously validated against a human-audited, stratified sample across five key economic sectors. Benchmarking multiple embedding models across varied input normalization strategies revealed that the Language-agnostic BERT Sentence Embedding (LaBSE) model, paired with spaCy-based syntactic keyword extraction, achieved the highest performance, yielding a peak Mean Normalized Discounted Cumulative Gain (MNDCG)@10 of 0.7873 and a Mean Precision (MP@10) of 0.464. Notably, this syntactic keyword preprocessing consistently outperformed both raw text inputs and generative LLM-based normalization. Ultimately, this research provides a robust, automated framework for vocational recommendation, offering institutions a highly efficient, data-driven tool to enhance the alignment between educational offerings and modern labor market needs.
  • Churn Prediction in Digital Service Platforms
    Publication . Simões, Mara Cordeiro; Castelli, Mauro
    Customer churn prediction has become an important task for companies operating in competitive digital environments, particularly in non-contractual digital platforms where churn is not directly observable and must be inferred from patterns of user inactivity. This study develops and evaluates machine learning models to predict customer churn in a Portuguese digital service platform characterised by irregular and heterogeneous user activity patterns. Churn is defined using a 180-day inactivity threshold, supported by the distribution of inter-purchase intervals. The project follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) and includes data preparation, feature engineering, and model comparison across several machine learning algorithms, including Logistic Regression, Random Forest, Gradient Boosting, XGBoost, LightGBM, Neural Networks, and a Stacking Ensemble. Special attention is given to class imbalance, as the dataset presents a reversed imbalance structure in which active users represent the minority class. The results show that models trained on the original imbalanced data achieve misleadingly strong performance by favouring the majority class, while the application of SMOTE leads to more balanced predictions across both classes. Among the evaluated models, LightGBM achieved the best overall performance, obtaining the highest F1-score while maintaining good generalisation and computational efficiency. The results also show the importance of handling class imbalance appropriately, selecting suitable evaluation metrics, and designing features that capture customer engagement patterns. In addition, engineered transactional features were shown to provide useful predictive information for churn prediction in non-contractual digital platforms. Overall, the study shows that machine learning models can effectively predict churn in environments characterised by irregular user activity patterns and non-standard class distributions.
  • Predicting Career Change and First Job Sector Using LinkedIn Data: A Machine Learning Approach
    Publication . Costa, Pedro Daniel Nunes dos Santos Rosário da; Henriques, Roberto André Pereira
    The transition from education to the labour market has become increasingly non-linear and individualized. Rather than following a single, predictable path, many professionals move across sectors throughout their careers, driven by technological change, personal ambition, and globalization. Understanding these trajectories can help raise awareness of the prevalence of non-linear careers and support individuals, academic institutions, and policymakers in preparing for the future of work. This thesis investigates whether publicly available LinkedIn profile data can be used to predict two key early-career outcomes: whether graduates remain in the sector associated with their field of study or move into a different sector, and the sector of their first job. To address these questions, a dataset of 3,600 LinkedIn profiles from a public GitHub repository was curated through data cleaning, preprocessing, and feature engineering. Weak labels were assigned using domain-specific keywords, and a pre-trained BERT model was fine-tuned to classify job positions into eleven main sectors. The resulting features included counts of experiences, education entries, and skills, as well as highest education level and the ratio of experiences to education entries. These features, together with the BERT-based sector predictions, were used to train logistic regression models for binary and multi-class classification tasks. Given the severe class imbalance in the dataset, with only around 3% of individuals remaining in their original academic sector, class weighting and repeated stratified cross-validation were applied. For the binary task, the weighted logistic regression model achieved a macro F1-score of 0.56 and a balanced accuracy of 0.84, outperforming a baseline that always predicted sector change. For the multi-class task, the model achieved a macro F1-score of 0.13. 
Overall, the findings show that cross-sector career transitions are common in this dataset and highlight both the potential and the limitations of using small-scale social media data to model early-career outcomes. The main contribution of this thesis is the development and evaluation of an interpretable machine learning framework that demonstrates the feasibility of using LinkedIn data to analyse early-career mobility and predict both career change and first-job sector.
  • Agentic AI in Central Public Administration: Exploring Opportunities and Challenges for Workflow Integration and Task Automation
    Publication . Cavaco, José António Rodrigues; Santos, Vítor Manuel Pereira Duarte dos
    This thesis examines the application of Agentic AI in Central Public Administration and proposes practical guidelines for its adoption. The study is motivated by persistent inefficiencies in public administration, including fragmented workflows, coordination challenges, and repetitive administrative tasks. While AI adoption has increased, most implementations remain limited to assistive or predictive systems, with limited exploration of autonomous, agent-based approaches. To address this gap, a Design Science Research methodology was applied. A literature review was conducted to assess the current state of Agentic AI in public administration, identifying key challenges, technologies, and use cases. Based on these findings, a set of guidelines was developed to identify suitable processes for Agentic AI application, along with a structured implementation framework covering use case selection, readiness assessment, implementation, deployment, and evaluation. The framework was illustrated through a simplified use case and evaluated through semi-structured interviews with three experts. The results indicate that the proposed approach is useful and applicable, particularly due to its structured and gradual implementation. Some improvements were identified, namely in process selection and monitoring. Overall, this research contributes a practical framework to support the integration of Agentic AI into public administration workflows.