IDENTIFICATION OF LANGUAGES AND DIALECTS IN HARD CONTEXTS

Pereira, Tomás Rocha

Please use this identifier to cite or link to this item: http://hdl.handle.net/10362/176243

Title:	IDENTIFICATION OF LANGUAGES AND DIALECTS IN HARD CONTEXTS
Author:	Pereira, Tomás Rocha
Advisor:	Silva, Joaquim
Keywords:	Machine learning; character n-grams; classification; language variant; cluster
Defense Date:	Feb-2022
Abstract:	Language identification (LI) is required by a growing number of web and mobile applications. It is useful not only for identifying the language of a text before translation, as social media applications do, but also for general text mining tasks such as sentiment analysis, extraction of expressions characteristic of a language, and identification of the entities needed to answer queries in chatbot applications. This thesis goes one step beyond pure language identification: the objective is not only to distinguish correctly between language variants such as European Portuguese and Brazilian Portuguese, but also to identify languages in short texts such as tweets or text messages, and to reject objects that do not belong to any of the known classes (here, each class represents a language or a variant). The implemented solution is based on machine learning: an algorithm is trained on an arbitrary number of languages and documents and then analyzes the texts submitted for classification, extracting relevant sets of characters (character n-grams) and comparing them with those extracted from the training documents in order to determine the language or variant of each text.
Language identification in written texts is a functionality needed by more and more web and mobile applications, since it is useful not only for distinguishing languages for later translation, as done for example by the algorithms of social media applications, but also for analyzing and extracting information from the text itself, such as sentiments implicit in the text, expressions characteristic of a language, or entities for answering queries in applications involving chatbots. The more concrete problem addressed goes one step beyond language identification: in addition, this thesis aims to analyze and correctly distinguish language variants, such as European and Brazilian Portuguese, to identify the language of short texts, such as tweets or text messages, and to be able to reject objects that do not belong to any of the known classes (here the classes represent languages or language variants). The implemented solution consists of a machine learning algorithm that is trained with arbitrary languages and documents and analyzes the texts given for classification, extracting relevant sets of characters (character n-grams) and comparing them with the training-phase documents in order to determine the language or variant of each text.
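The abstract does not specify how the n-grams are selected or compared, so the following is only a minimal sketch of the general idea, not the method implemented in the thesis. It assumes character trigrams, cosine similarity between n-gram count vectors, and a fixed rejection threshold; the function names, the `reject_below` parameter, and the toy training sentences are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(labelled_docs, n=3):
    """Merge the n-gram counts of all training documents of each class."""
    profiles = {}
    for label, doc in labelled_docs:
        profiles.setdefault(label, Counter()).update(char_ngrams(doc, n))
    return profiles


def classify(text, profiles, n=3, reject_below=0.1):
    """Return the most similar class, or None when no class is similar enough."""
    grams = char_ngrams(text, n)
    label, score = max(
        ((lab, cosine(grams, prof)) for lab, prof in profiles.items()),
        key=lambda pair: pair[1],
        default=(None, 0.0),
    )
    # Rejection: a text that resembles no known class is left unlabelled.
    return label if score >= reject_below else None


# Toy training data (hypothetical); real use would need far larger corpora.
training = [
    ("pt-PT", "o autocarro chegou atrasado e eu perdi o comboio para o trabalho"),
    ("pt-BR", "o ônibus chegou atrasado e eu perdi o trem para o trabalho"),
    ("en", "the bus arrived late and i missed the train to work"),
]
profiles = train(training)

print(classify("the train was late", profiles))
print(classify("perdi o autocarro", profiles))
print(classify("zzzz qqqq xxxx", profiles))
```

With so little training text the variant decision rests on a handful of distinguishing words, which illustrates why short texts and closely related variants are the hard cases the abstract refers to.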
URI:	http://hdl.handle.net/10362/176243
Designation:	MASTER IN COMPUTER SCIENCE
Appears in Collections:	FCT: DI - Dissertações de Mestrado

Files in This Item:

File	Description	Size	Format
Pereira_2022.pdf		4,19 MB	Adobe PDF