IDENTIFICATION OF LANGUAGES AND DIALECTS IN HARD CONTEXTS

Pereira, Tomás Rocha

Use this identifier to cite or link to this record: http://hdl.handle.net/10362/176243

Full record

DC Field	Value	Language
dc.contributor.advisor	Silva, Joaquim	-
dc.contributor.author	Pereira, Tomás Rocha	-
dc.date.accessioned	2024-12-05T15:25:27Z	-
dc.date.available	2024-12-05T15:25:27Z	-
dc.date.issued	2022-02	-
dc.identifier.uri	http://hdl.handle.net/10362/176243	-
dc.description.abstract	Language identification (LI) is required by a growing number of web and mobile applications. It is useful not only for identifying the language of a text before translation, as social media applications do, but also for general text mining tasks such as sentiment analysis, extraction of expressions that are characteristic of a language, and identification of the entities needed to answer queries in chatbot applications. The problem addressed in this thesis goes a step beyond pure language identification: the objective is not only to distinguish correctly between language variants such as European Portuguese and Brazilian Portuguese, but also to identify the language of short texts such as tweets or text messages, and to reject objects that do not belong to any of the known classes (here, each class represents a language or a variant). The implemented solution is based on machine learning: an algorithm is trained on an arbitrary number of languages and documents and then analyzes the texts submitted for classification, extracting relevant sequences of characters (character n-grams) and comparing them with those extracted from the training documents in order to determine the language or variant of each text.	pt_PT
dc.description.abstract	Language identification in written texts is a capability needed by more and more web and mobile applications, since it is useful not only for distinguishing languages that will later be translated, as the algorithms of social media applications do, but also for analyzing and extracting information from the text itself, for example analysis of sentiments implicit in the text, expressions characteristic of a language, or entities for answering queries in applications that involve chatbots. The more concrete problem addressed goes a step beyond language identification: this thesis also aims to analyze and correctly distinguish language variants, such as European and Brazilian Portuguese, to identify the language of short texts, such as tweets or text messages, and to reject objects that do not belong to any of the known classes (here, the classes represent languages or language variants). The implemented solution consists of a machine learning algorithm that is trained with arbitrary languages and documents and analyzes the texts submitted for classification, extracting relevant sets of characters (character n-grams) and comparing them with the documents from the training phase in order to determine the language or variant of each text.	pt_PT
dc.language.iso	eng	pt_PT
dc.rights	openAccess	pt_PT
dc.subject	Machine learning	pt_PT
dc.subject	character n-grams	pt_PT
dc.subject	classification	pt_PT
dc.subject	language variant	pt_PT
dc.subject	cluster	pt_PT
dc.title	IDENTIFICATION OF LANGUAGES AND DIALECTS IN HARD CONTEXTS	pt_PT
dc.type	masterThesis	pt_PT
thesis.degree.name	MASTER IN COMPUTER SCIENCE	pt_PT
dc.subject.fos	Domain/Scientific Area::Engineering and Technology::Electrical Engineering, Electronics and Informatics	pt_PT
Appears in collections:	FCT: DI - Dissertações de Mestrado
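
The abstract describes classification by extracting character n-grams from a text and comparing them with those seen in training, with the option to reject texts that match no known class. The following is only a minimal sketch of that general idea, not the thesis's method: the trigram size, the cosine similarity measure, the rejection threshold of 0.1, and all function names (`char_ngrams`, `build_profile`, `identify`) are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, lowercased and space-padded."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(texts, n=3):
    """Build a relative-frequency profile of character n-grams (assumed representation)."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify(text, profiles, n=3, threshold=0.1):
    """Return the best-matching class label, or None to reject the text.

    The threshold is a hypothetical value; a real system would tune it.
    """
    query = build_profile([text], n)
    scores = {label: cosine(query, prof) for label, prof in profiles.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Toy training data, far too small for real use.
profiles = {
    "en": build_profile(["the cat is on the table", "this is a short message"]),
    "pt": build_profile(["o gato está em cima da mesa", "isto é uma mensagem curta"]),
}
print(identify("the cat is on the table", profiles))
print(identify("12345 67890", profiles))
```

Separating close variants such as European and Brazilian Portuguese would need far more training data and probably a more discriminative comparison than this sketch provides.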

Files in this record:

File	Description	Size	Format
Pereira_2022.pdf		4.19 MB	Adobe PDF
