Automating news classification with large language models: Exploring fine-tuning, dataset size, and architecture

Yesilyurt, Burcu

Please use this identifier to cite or link to this record: http://hdl.handle.net/10362/190563

Title:	Automating news classification with large language models: Exploring fine-tuning, dataset size, and architecture
Author:	Yesilyurt, Burcu
Advisor:	Bação, Fernando José Ferreira Lucas
Keywords:	Large Language Models; Text Classification; News Classification; Fine-Tuning; Hyperparameter Optimization; BERT; SDG 4 - Quality education; SDG 9 - Industry, innovation and infrastructure; SDG 16 - Peace, justice and strong institutions
Defense date:	29-Oct-2025
Abstract:	This work presents a comprehensive evaluation that benchmarks Large Language Models against traditional machine learning algorithms for automatic news classification on three standard datasets: BBC News, 20 Newsgroups, and AG News. We implement traditional models, including Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest, to provide clear and interpretable baselines using manual term-frequency and syntactic features. We then fine-tune transformer architectures, including BERT, RoBERTa, T5, GPT, and their distilled variants, to quantify improvements in predictive accuracy, resource efficiency, and explainability. Performance is measured via 5-fold cross-validation using F1 and accuracy metrics, and statistical significance is assessed with a Friedman test followed by Holm's correction. Results show that transformer models consistently outperform classical approaches, with BERT achieving the highest scores under both balanced and imbalanced conditions. Distilled models rival or surpass full-size transformers on larger datasets while reducing memory requirements and maintaining comparable inference latency. Attention-based attribution methods provide semantic explanations on par with feature-importance metrics, supporting the conclusion that LLMs deliver superior accuracy, adaptability, and transparency in news classification. Future work should investigate multilingual pretraining, multilabel classification, and ensemble techniques to further strengthen real-time, explainable news-analysis pipelines.
Description:	Dissertation presented as partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
URI:	http://hdl.handle.net/10362/190563
Degree:	Master's in Data Science and Advanced Analytics, specialization in Data Science
Appears in collections:	NIMS - Dissertações de Mestrado em Ciência de Dados e Métodos Analíticos Avançados (Data Science and Advanced Analytics)
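
The abstract names a Friedman test followed by Holm's correction as the significance procedure. The following is only a minimal sketch of those two steps in plain Python, not the dissertation's code: the function names, the layout of the score table (one row per cross-validation fold, one column per model), and the example numbers are all assumptions made for illustration. The statistic below omits the tie-correction factor that statistical libraries usually apply.

```python
from typing import List, Sequence


def average_ranks(row: Sequence[float]) -> List[float]:
    """Rank one fold's scores (1 = highest score); tied scores share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: -row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic for an n_folds x k_models table (no tie correction)."""
    n = len(scores)
    k = len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for model, rank in enumerate(average_ranks(row)):
            rank_sums[model] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - position) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Made-up F1 scores for three hypothetical models over five folds;
    # these are NOT results from the dissertation.
    example_scores = [
        [0.95, 0.91, 0.88],
        [0.94, 0.92, 0.87],
        [0.96, 0.90, 0.89],
        [0.95, 0.93, 0.86],
        [0.97, 0.92, 0.88],
    ]
    print("Friedman statistic:", friedman_statistic(example_scores))
    # Made-up pairwise p-values, adjusted for multiple comparisons.
    print("Holm-adjusted p-values:", holm_adjust([0.01, 0.04, 0.03]))
```

In practice the statistic would be compared against a chi-square distribution with k - 1 degrees of freedom to obtain a p-value, and Holm's adjustment would then be applied to the pairwise post-hoc comparisons.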

Files in this record:

File	Description	Size	Format	Access
TCDMAA4245.pdf		7.33 MB	Adobe PDF	Restricted access; request a copy from the author
