NIMS - Master's Dissertations in Data Science and Advanced Analytics

Permanent URI for this collection:

Formerly: Master's Dissertations in Advanced Analytics

Browse

Recent entries

Showing 1 - 10 of 741
  • Exploring a Genetic Algorithm Approach to Determine Optimal Technical Indicator Parameters for Algorithmic Trading in the Forex Market
    Publication . Onorieru, Evans Akpo; Bravo, Jorge Miguel Ventura
    This thesis investigates the application of Genetic Algorithms (GAs) to optimize technical indicator parameters in algorithmic trading, focusing on the Relative Strength Index (RSI) and Average Directional Index (ADX) within the Forex market. The central motivation is to address the limitations of static calibration methods, which often fail to adapt to the non-stationary and structurally complex nature of financial markets. The GA was implemented in Python using the DEAP framework, with reproducibility ensured through fixed random seeds. Historical EUR/USD daily data from 2009–2019 was used for in-sample optimization, while out-of-sample testing covered 2020–2023. The GA searched over RSI and ADX parameter ranges using tournament selection, blend crossover, and polynomial bounded mutation, with fitness defined by cumulative return and annualized Sharpe ratio. Empirical findings show that the GA converged on RSI(7) and ADX(7), achieving an in-sample cumulative return of 18.73%, outperforming conventional benchmarks. However, out-of-sample testing revealed a negative cumulative return of –20.42% and poor risk-adjusted performance, underscoring the challenge of generalization. Robustness tests produced mixed results: while performance was preserved under randomized data and weekly timeframes, profitability eroded under randomized trade order and across alternative currency pairs. Slippage and rolling-window analyses further highlighted the sensitivity of the strategy to trading frictions and regime shifts. These results affirm the potential of GAs as a heuristic for navigating complex financial optimization problems but also emphasize the need for rigorous validation frameworks and adaptive retraining. Future research should extend this approach by incorporating alternative risk-adjusted objectives, macroeconomic covariates, and multi-objective evolutionary frameworks that balance profitability, risk, interpretability, and regulatory compliance.
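The evolutionary loop the abstract describes can be illustrated with a minimal pure-Python sketch. The fitness function below is a toy stand-in for the thesis's backtest objective (cumulative return and Sharpe ratio are not computed), and the search bounds, population size, and operator settings are illustrative assumptions rather than the thesis's DEAP configuration:

```python
import random

random.seed(42)  # fixed seed, mirroring the thesis's reproducibility setup

LOW, HIGH = 2, 50            # assumed search bounds for both lookback periods
POP, GENS, TOURN = 30, 40, 3

def fitness(ind):
    # Toy objective: reward parameters near (7, 7), the values the GA
    # converged on in the thesis. A real fitness would run a backtest.
    rsi_p, adx_p = ind
    return -((rsi_p - 7) ** 2 + (adx_p - 7) ** 2)

def tournament(pop):
    # tournament selection: best of TOURN random individuals
    return max(random.sample(pop, TOURN), key=fitness)

def crossover(a, b):
    # blend-style crossover, rounded back to integer periods and clipped
    alpha = random.random()
    child = [round(alpha * x + (1 - alpha) * y) for x, y in zip(a, b)]
    return [min(max(g, LOW), HIGH) for g in child]

def mutate(ind, rate=0.2):
    # bounded integer perturbation standing in for polynomial bounded mutation
    return [min(max(g + random.randint(-3, 3), LOW), HIGH)
            if random.random() < rate else g
            for g in ind]

pop = [[random.randint(LOW, HIGH), random.randint(LOW, HIGH)]
       for _ in range(POP)]
best = max(pop, key=fitness)
for _ in range(GENS):
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(POP)]
    gen_best = max(pop, key=fitness)
    if fitness(gen_best) > fitness(best):
        best = gen_best
print("best (RSI period, ADX period):", best)
```

With the toy fitness, selection pressure pulls the population toward short lookback periods; swapping in a real backtest as `fitness` recovers the thesis's setup in spirit.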
  • Evaluating Gains on Blockchain Analysis for Money Laundering (AML) detection through Machine Learning Operations (MLOps) adoption
    Publication . Seganfredo, Henrique; Agostinho, Nuno Filipe Rosa; Scott, Ian James
    The complexity and scale of blockchain transaction data present significant challenges for Anti-Money Laundering (AML) detection, particularly in the absence of labeled ground truth. This thesis explores how Machine Learning Operations (MLOps) practices can improve the design, reproducibility, and performance of AML detection pipelines built upon blockchain data. A modular pipeline is proposed to process real-time USDC token transfers from the Ethereum mainnet using Hyperledger Besu, Apache Kafka, and Neo4j. Transactions are enriched with graph-based features to expose laundering typologies such as smurfing and funneling. Unsupervised learning is applied through Isolation Forests trained on synthetically generated data. MLOps tools - including MLflow for model tracking, Feast for feature management, and Prefect for orchestration - are assessed to enable versioning, monitoring, and automated retraining in response to concept drift. The pipeline's effectiveness is demonstrated through anomaly detection experiments, and a comparative analysis quantifies the measurable benefits introduced by MLOps adoption. This research contributes a practical framework that bridges data science and production, illustrating how operational maturity enhances the traceability, robustness, and interpretability of AML systems in blockchain contexts.
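The graph-based features mentioned above can be sketched in pure Python. The edge list and the `funnel_ratio` indicator below are hypothetical illustrations of the kind of per-address features (fan-in degree, flow totals) that could feed an anomaly detector; they are not the thesis's actual feature set:

```python
from collections import defaultdict

# Hypothetical token-transfer edge list (sender, receiver, amount) --
# a stand-in for USDC transfers streamed from the Ethereum mainnet.
transfers = [
    ("A", "X", 9.9), ("B", "X", 9.8), ("C", "X", 9.7),  # fan-in (smurfing)
    ("X", "Y", 29.0),                                    # funnel outflow
    ("D", "E", 100.0),
]

def graph_features(edges):
    # aggregate per-address degree and amount features from the edge list
    feats = defaultdict(lambda: {"in_deg": 0, "out_deg": 0,
                                 "in_amt": 0.0, "out_amt": 0.0})
    for sender, receiver, amount in edges:
        feats[sender]["out_deg"] += 1
        feats[sender]["out_amt"] += amount
        feats[receiver]["in_deg"] += 1
        feats[receiver]["in_amt"] += amount
    # simple funnel indicator: many inflows relative to outflows
    for f in feats.values():
        f["funnel_ratio"] = f["in_deg"] / (f["out_deg"] + 1)
    return dict(feats)

features = graph_features(transfers)
```

Vectors like these, one row per address, are the natural input to an unsupervised model such as an Isolation Forest.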
  • Designing and Evaluating a Conformal Finance Pipeline with ECI-Enhanced Visualization Dashboard: A DSR Approach to Uncertainty Quantification in Financial Forecasting
    Publication . Freitas, Leonardo Fernandes Machado de; Bravo, Jorge Miguel Ventura
    This master’s thesis investigates the Conformal Finance Pipeline with Visualization Dashboard, an artifact developed using the Design Science Research methodology to address uncertainty in financial forecasting. Empirical validation is conducted under both static-split and rolling-window (walk-forward) evaluation protocols. While the static-split method enables comprehensive hyperparameter testing across multiple test-window lengths (41–252 days) and crisis periods (2014, COVID-19, 2022 monetary tightening), the rolling-window protocol - aligned with the original ECI authors’ online adaptive framework - serves as the primary benchmark for real-world application, despite its higher computational demands. Results show that XGBoost and RandomForest consistently produce the narrowest valid intervals in regression tasks, while the classification pipeline achieves strong directional accuracy and well-calibrated prediction-set sizes. Conditional coverage analysis by volatility buckets and Winkler (WIS) scoring further supports the adaptive performance of the ECI-Integral and ECI-Cutoff variants across low-, medium-, and high-volatility regimes. The study also uncovers stable patterns in interval dynamics - particularly in Diff_qs_NCM and intraday-range ratios - that point to promising avenues for uncertainty-aware trading strategies. All results are integrated into a Power BI dashboard, providing analytical insights through prediction intervals, empirical coverage rates (ECR), average interval widths (AIW), and the Winkler Score (WIS), a proper scoring rule that penalizes both miscoverage and excessive width.
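The core mechanism behind conformal prediction intervals can be shown in a few lines. This is a sketch of plain split conformal regression (not the adaptive ECI variants the thesis evaluates), with assumed toy calibration residuals:

```python
import math

def split_conformal_interval(cal_residuals, y_pred, alpha=0.1):
    # With n calibration residuals, a (1 - alpha) interval uses the
    # ceil((n + 1) * (1 - alpha))-th smallest absolute residual as its
    # half-width, giving finite-sample marginal coverage guarantees.
    n = len(cal_residuals)
    scores = sorted(abs(r) for r in cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[min(k, n) - 1]
    return y_pred - q, y_pred + q

# toy calibration residuals (assumed, not thesis data)
residuals = list(range(1, 20))          # 1 .. 19
lo, hi = split_conformal_interval(residuals, y_pred=100.0, alpha=0.1)
print(lo, hi)
```

The average interval width (AIW) reported in the dashboard is simply the mean of `hi - lo` over the test set; the ECI variants replace the fixed quantile with an online, volatility-adaptive one.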
  • Folding Using Optimization
    Publication . Gorjão, Afonso Mexia; Vanneschi, Leonardo; Araújo, Nuno
    Polyhedrons can be unfolded into flat templates called nets. The study of these nets and their folding can lead to the design of new materials and structures with programmable properties, with applications ranging from medicine to space exploration, and encompasses a growing area of research in material design: self-folding. This thesis attempts to fill a gap in the existing literature: given a specific net, which structures can be folded from it? To solve this problem, a function constructed to have minima in the angle space defined by a net is proposed. The two terms of this function are Γ1, which sums the distances between closest edges in the net, and Γ2, which penalizes faces from overlapping by summing the inverse distances between the face centers of adjacent faces in the net. The relative strength of these two terms is controlled by a coefficient 𝛾 that multiplies Γ2. To minimize the proposed function, two different optimization methods are combined: a Genetic Algorithm is developed to output points in the angle space near minima of the function, and these points are then passed to Powell's Method, which arrives at solutions close to the minima (up to some tolerance). Appropriate values of 𝛾 are searched by brute force, applying the Genetic Algorithm multiple times per 𝛾 value and analysing the behaviour of the terms Γ1 and Γ2 in the last generation produced. Understanding the dependence of 𝛾 on the problem variables lies beyond the scope of this thesis. The Genetic Algorithm is tuned using Bayesian optimization. The results show strong evidence that the proposed function has minima corresponding to closed structures in the angle space, as it was able to find closed structures for all the nets tested, encompassing nets of the Platonic solids and nets of other irregular polyhedrons. For many of the nets tested, the algorithm found more than one closed structure, and more than one way of folding the same closed structure. Two different ways of folding the same structure can correspond to distinct minima of the function due to the way Γ2 is constructed. Furthermore, the results also show evidence that closed structures are not the only minima of the function, with some outlier anomalies not corresponding to closed structures being identified. This entails the necessity of a third, global face term that accounts for the distances between non-adjacent face centers in the net.
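The shape of the two-term objective can be sketched abstractly. The sketch below assumes edge positions are summarized by midpoints and works in 2D for brevity; the actual geometry (3D coordinates derived from fold angles) and the exact definition of "closest edges" follow the thesis, not this toy:

```python
import math

def gamma1(joining_edges, midpoints):
    # Γ1: sums distances between pairs of edge midpoints that should
    # coincide when the net folds into a closed structure
    return sum(math.dist(midpoints[i], midpoints[j])
               for i, j in joining_edges)

def gamma2(adjacent_faces, centers):
    # Γ2: penalizes overlap by summing inverse distances between the
    # centers of adjacent faces (blows up as faces approach each other)
    return sum(1.0 / math.dist(centers[i], centers[j])
               for i, j in adjacent_faces)

def objective(joining_edges, adjacent_faces, midpoints, centers, gamma=0.1):
    # Γ = Γ1 + γ·Γ2, the function minimized over the angle space
    return (gamma1(joining_edges, midpoints)
            + gamma * gamma2(adjacent_faces, centers))

# toy configuration: one pair of edges 1 apart, one pair of faces 2 apart
mids = {0: (0.0, 0.0), 1: (0.0, 1.0)}
cents = {0: (0.0, 0.0), 1: (2.0, 0.0)}
val = objective([(0, 1)], [(0, 1)], mids, cents, gamma=0.1)
```

Driving Γ1 to zero closes the structure, while Γ2 keeps the fold from passing faces through each other; γ trades off the two pressures.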
  • Dynamic definition of binding contracts for high demand business processes
    Publication . Rodrigo, Gustavo Azevedo; Caldeira, João Carlos Palmela Pinheiro
    Service Level Agreements (SLAs) often rely on static thresholds derived from historical averages, yet such fixed targets fail to reflect the variability of modern business processes. This work investigates how Process Mining and Machine Learning can support dynamic, context-aware SLA definition. Using real event-log data from a financial institution, the study reconstructs process behaviour, extracts prefix-based features, and develops models to estimate the total duration of a case as it unfolds. Two predictive approaches are examined. The first is a class-based conditional estimator that predicts the final outcome class using Random Forest and XGBoost and assigns a duration based on class-specific historical statistics. All percentiles from 1 to 100 are evaluated to determine the most accurate estimator for each prefix. The second approach trains LSTM networks separately for prefixes 1 through 5, learning temporal patterns from the first k events and using a log transformation to stabilise the skewed duration distribution. Results show that the class-based estimator consistently outperforms the static global-mean SLA baseline, reducing error by nearly 50% at the earliest prefixes. Optimal percentiles vary across prefixes, highlighting the limitations of fixed SLA thresholds. While the LSTM performs poorly at very short prefixes, it improves at later ones and demonstrates the potential of deep learning for duration prediction. Overall, the findings support the shift from static to predictive, data-driven SLA targets, enabling more accurate, adaptive, and operationally meaningful performance expectations.
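The class-based percentile scan described above can be sketched in pure Python. The nearest-rank percentile and the toy duration data are illustrative assumptions; the thesis evaluates this over prefix-specific class statistics from real event logs:

```python
import math

def percentile(data, p):
    # nearest-rank percentile, 1 <= p <= 100
    s = sorted(data)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def best_percentile(class_history, actual_durations):
    # scan percentiles 1..100 of the class's historical durations and
    # keep the one minimizing mean absolute error against held-out cases
    best_p, best_mae = None, float("inf")
    for p in range(1, 101):
        est = percentile(class_history, p)
        mae = (sum(abs(est - a) for a in actual_durations)
               / len(actual_durations))
        if mae < best_mae:
            best_p, best_mae = p, mae
    return best_p, best_mae

history = list(range(1, 101))   # hypothetical durations for one outcome class
actuals = [50, 50, 50]          # hypothetical held-out case durations
p, mae = best_percentile(history, actuals)
```

The selected percentile then becomes the dynamic SLA target for cases whose prefix is classified into that class, replacing a single static global mean.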
  • Evaluating the Impact of Large Language Model-Generated Synthetic Data on Recommender Systems Performance
    Publication . Felisberto, Matheus; Henriques, Roberto André Pereira
    The rapid expansion of digital catalogs necessitates effective Recommender Systems (RSs) to guide users to relevant items. However, less popular items often suffer from the cold-start problem in RSs. With the rise of Large Language Models (LLMs), it is now possible to generate synthetic user–item interaction data to alleviate this issue. This thesis evaluates how LLM-generated samples impact RS performance in cold-start scenarios. We distilled the Amazon Books dataset (10,000 users × 37,000 items) and used an LLM to produce synthetic interactions, augmenting two models: Neural Matrix Factorization (NeuMF) and Wide & Deep. Each model was evaluated over five cross-validation runs with different random seeds, on both augmented and non-augmented versions, employing Recall@10, nDCG@10, and F1-Score as evaluation metrics. A one-sided Wilcoxon signed-rank test (𝑝 < 0.05) was applied to the F1-Score to assess the statistical significance of performance differences. In cold-start settings, augmentation yielded improvements of 12 Percentage Points (pp) in Recall@10 and 15 pp in nDCG@10. For warm-start items, a moderate decrease was observed (6 pp Recall@10, 10 pp nDCG@10), indicating a performance trade-off. These results confirm that LLM-based augmentation can help mitigate cold-start challenges. Future work may explore richer LLM pipelines (e.g., Retrieval Augmented Generation (RAG)) and benchmark against simpler content-similarity approaches.
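The two ranking metrics used above have standard definitions that fit in a few lines. This sketch assumes binary relevance and hypothetical item names; it is the textbook formulation, not the thesis's evaluation harness:

```python
import math

def recall_at_k(recommended, relevant, k=10):
    # fraction of the user's relevant items recovered in the top-k list
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k=10):
    # binary-relevance nDCG: log-discounted gain over the ideal ordering
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2)
               for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0

# hypothetical top-3 recommendations for one user
recs = ["book_1", "book_2", "book_3"]
relevant = {"book_1", "book_3"}
```

Averaging these per-user scores over each cross-validation run yields the per-run values to which a paired test such as the Wilcoxon signed-rank can be applied.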
  • How blockchain technology can prevent fake degrees-diplomas: a UTAUT2 model approach
    Publication . Kukreja, Armaan; Scott, Ian James
    Fake academic degrees are a growing global problem, affecting the credibility of institutions and employers. Blockchain technology offers a potential solution due to its security, decentralization, and transparency. This study investigates the factors influencing the public’s intention to adopt blockchain-based academic credentials using the UTAUT2 model. A quantitative method was applied, and data was collected from 207 participants through an online survey. The analysis, conducted using PLS-SEM, examined the relationships between key constructs. The results show that Information Literacy, Facilitating Conditions, and Hedonic Motivation significantly influence behavioural intention to adopt the technology. In contrast, Performance Expectancy, Effort Expectancy, Social Influence, Trust, Subjective Knowledge, and Objective Knowledge did not have a direct effect, although both knowledge variables positively affected Performance Expectancy and Effort Expectancy. These findings indicate that improving awareness, digital literacy, and support systems may help increase acceptance of blockchain for academic credentials in the future.
  • Idea Engineering: Design and Implementation of a Decision Support System for Generating Research Topics
    Publication . Rodrigues, Carolina Ochoa Gomes; Damásio, Bruno Miguel Pinto
    The selection of research topics is one of the most challenging stages in the development of master's theses, traditionally requiring extensive manual review of the literature to identify gaps in knowledge. In addition, the exponential growth of academic production makes this manual process impractical, underscoring the need for automated systems capable of synthesizing prospective knowledge contained in large volumes of scientific documents. This research developed a topic recommendation system that combines Text Mining techniques with Generative Artificial Intelligence, enabling the automatic extraction and transformation of future research proposals into academically structured suggestions. The methodology integrates a pipeline that operates through the automated collection of ‘Future Work’ sections from documents extracted from the OpenAlex and Unpaywall databases whose abstracts include the filtered topic (e.g., ‘Text Mining in Higher Education’), rigorous pre-processing, and subsequent application of the Latent Dirichlet Allocation algorithm to identify latent topics optimized by the 𝐶𝑣 Coherence Score metric, ending with linguistic synthesis via the Gemini API controlled by Prompt Engineering. The analysis of 242 DOIs resulted in 22 final documents, identifying 8 distinct latent topics with a coherence of 0.4729 and a 99.58% reduction in vocabulary. The system generated 3 linguistically fluid and academically appropriate proposals for each of the 8 topics, demonstrating the feasibility of integrating unsupervised pattern discovery with advanced linguistic synthesis. It also validated the applicability of hybrid Latent Dirichlet Allocation-Large Language Model architectures in academic guidance, offering a scalable approach to automating knowledge discovery processes.
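The vocabulary reduction step that precedes LDA can be sketched as document-frequency filtering plus stopword removal. The toy documents and the tiny stopword list are assumptions; the thesis's actual pre-processing pipeline and thresholds are not reproduced here:

```python
from collections import Counter

STOPWORDS = frozenset({"the", "of", "in", "was"})  # tiny illustrative list

def filter_vocabulary(docs, min_df=2):
    # keep tokens appearing in at least min_df documents, minus stopwords --
    # the kind of pruning behind a large reported vocabulary reduction
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count document frequency, not raw counts
    vocab = {t for t, n in df.items() if n >= min_df and t not in STOPWORDS}
    return [[t for t in doc if t in vocab] for doc in docs], vocab

docs = [
    "future work should explore text mining in education".split(),
    "text mining of future research in education".split(),
    "the grading rubric was unrelated".split(),
]
filtered, vocab = filter_vocabulary(docs)
```

The surviving token lists are what an LDA implementation would consume, with topic counts chosen by scanning a coherence score such as C_v.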
  • Analysis and Evaluation of the Model for Impact Assessment, for a social impact NGO: Validity Testing in a Real World Context
    Publication . Vital, Ana Leonor Gonçalves; Costa, Maria Manuela Simões Aparício da
    This dissertation evaluates the validity and robustness of the Entity B model for measuring social impact in non-profit housing interventions and examines its ability to capture multidimensional change. Guided by two research questions, this study applies a mixed-methods approach structured under the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework. Quantitative analysis was conducted using multiple linear regression on survey data, while qualitative insights were integrated through sentiment analysis of beneficiaries’ open-ended responses. Results confirm strong internal coherence and explanatory power (R-squared = 0.90), with physical indicators emerging as the dominant predictors of perceived impact. Yet the social and financial dimensions exhibit weak explanatory capacity, and the professional dimension had to be excluded due to data constraints - indicating that the model only partially achieves its multidimensional ambition. The sentiment analysis corroborates these findings, revealing predominantly positive perceptions associated with housing improvements but limited evidence of relational or economic transformation. Theoretically, this research advances methodological innovation by validating a hybrid framework that combines data mining processes with computational text analysis, offering a replicable approach for multidimensional impact assessment. Practically, it provides actionable recommendations for non-profit organisations - baseline data collection, longitudinal follow-up, and stakeholder triangulation - to strengthen accountability and learning mechanisms. Limitations include the absence of baseline data, reliance on self-reported perceptions, short observation windows, and the exclusion of the professional dimension. Future research should address these gaps through longitudinal designs, reintegration of omitted dimensions, refinement of relational indicators, and benchmarking against established frameworks such as Social Return on Investment (SROI) and Most Significant Change (MSC).
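The R-squared statistic at the heart of the quantitative analysis is easy to compute from scratch. This sketch fits a one-predictor OLS line on invented scores (the dissertation uses multiple predictors and real survey data):

```python
def r_squared(x, y):
    # one-predictor OLS fit and coefficient of determination
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx                    # slope
    alpha = my - beta * mx              # intercept
    ss_res = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# hypothetical scores: a physical-dimension indicator vs. perceived impact
physical = [1, 2, 3, 4, 5]
impact = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared(physical, impact)
```

An R-squared near 1, as here, means the predictor explains most of the variance in perceived impact, which is the sense in which the physical indicators dominate the model.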
  • Evaluation of LSTM Networks for Stock Price Forecasting: A Benchmark Against traditional econometric models (ARIMA and PROPHET)
    Publication . Kuznetsov, Stepan; Castelli, Mauro
    This thesis explores the effectiveness of Long Short-Term Memory (LSTM) neural networks for predicting daily stock closing prices and compares their performance with several traditional time series models, namely AutoRegressive Integrated Moving Average (ARIMA), its extended versions, and Facebook PROPHET. The analysis is based on historical stock data from a selected group of companies listed in the S&P 500 index, collected from publicly available Yahoo Finance databases. All models are trained using a consistent preprocessing and forecasting pipeline with the objective of predicting the next day's closing price. Selected accuracy metrics are used to evaluate both the size of prediction errors and how well the models capture trends and movements. The results indicate that while traditional models can capture linear relationships and seasonal changes, they tend to underperform on the noisy, nonlinear dynamics seen in financial markets. In contrast, the LSTM model can learn complex temporal patterns directly from the data, without the need for assumptions such as stationarity. This study highlights the growing relevance of machine learning in finance, especially forecasting. It offers a comparative analysis that reflects the strengths and limitations of traditional, hybrid, and modern approaches. The results may be useful for researchers and practitioners, who can leverage the presented findings and model designs in their own time series forecasting applications, including closing stock price prediction.
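Two metrics of the kind the abstract alludes to - one for error size, one for trend-spotting - can be defined in a few lines. The closing prices and forecasts below are invented for illustration; the thesis's exact metric suite is not specified here:

```python
import math

def rmse(actual, predicted):
    # root-mean-squared error of next-day price forecasts
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def directional_accuracy(actual, predicted):
    # fraction of days where the forecast move (relative to yesterday's
    # actual close) has the same sign as the realized move
    hits = sum(
        1 for i in range(1, len(actual))
        if (actual[i] - actual[i - 1]) * (predicted[i] - actual[i - 1]) > 0
    )
    return hits / (len(actual) - 1)

closes = [100.0, 101.0, 99.0, 102.0]    # hypothetical daily closes
preds = [100.0, 100.5, 100.0, 98.0]     # hypothetical model forecasts
```

Reporting both matters: a model can have a small RMSE while still guessing the direction of the next move no better than chance, and vice versa.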