| Name | Description | Size | Format |
|---|---|---|---|
| | | 5.66 MB | Adobe PDF |
Advisor(s)
Abstract(s)
Digital libraries are a central technology for the dissemination and sharing of knowledge; vast quantities of documents are stored and accessed through them. However, the efficiency of the associated search systems and their ability to identify relevant documents remain a bottleneck and are not keeping pace with the ever-increasing volume of stored data. In this thesis, we present Network TD-SOM, a systematic process that offers a practical method for organizing, searching, visualising, discovering, and extracting knowledge from a vast corpus. Network TD-SOM combines topic modelling with Self-Organizing Maps and Network Analysis algorithms to provide a visually rich environment where the user can explore and interact with a corpus and find relevant documents. We test two topic modelling algorithms, LDA and BERTopic, separately and use their topic vectors to produce a Self-Organizing Map, which in turn is simplified with a hierarchical clustering algorithm. We apply Network Analysis to the documents using the three best topics of each document and visualise the relations between the documents. Finally, the Network TD-SOM methodology is evaluated on the master's thesis dataset from NOVA IMS. Both LDA and BERTopic successfully uncovered the thematic structure and extracted helpful knowledge from the dataset. For clustering, BERTopic achieves better results and provides a more meaningful solution. In the network analysis, by contrast, although the two thesis networks are arranged similarly, the one modelled with topics from LDA presents better results.
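The network step described in the abstract (linking documents through their three best topics) can be illustrated with a small sketch. This is not the thesis implementation: the abstract does not say how edges are defined, so the rule used here (two documents are linked when they share at least one top-three topic, weighted by the number of shared topics) is an assumption, and the function names and topic weights are hypothetical.

```python
from itertools import combinations


def top_topics(weights, k=3):
    """Return the indices of the k largest topic weights for one document."""
    ranked = sorted(range(len(weights)), key=lambda t: weights[t], reverse=True)
    return set(ranked[:k])


def build_edges(doc_topics, k=3):
    """Link every pair of documents that share at least one top-k topic.

    The edge weight is the number of shared top-k topics. This linking
    rule is an assumption made for illustration only.
    """
    tops = {doc: top_topics(w, k) for doc, w in doc_topics.items()}
    edges = {}
    for a, b in combinations(sorted(tops), 2):
        shared = len(tops[a] & tops[b])
        if shared:
            edges[(a, b)] = shared
    return edges


# Toy document-topic weights over five topics (made up, not from the dataset).
doc_topics = {
    "thesis_a": [0.50, 0.30, 0.10, 0.05, 0.05],
    "thesis_b": [0.40, 0.35, 0.05, 0.15, 0.05],
    "thesis_c": [0.02, 0.03, 0.05, 0.40, 0.50],
}
edges = build_edges(doc_topics)
for (a, b), weight in edges.items():
    print(a, b, weight)
```

In practice the weights would come from the topic model (LDA or BERTopic), and the resulting edge list could be handed to a graph library for layout and analysis.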
Description
Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Keywords
Corpus; Visualisation; Topic modelling; Clustering; Network analysis; SDG 4 - Quality education
