SUPERVISED CLASSIFICATION AND REJECTION OF DOCUMENTS FOR LIMITED DATASETS IN CHALLENGING CONTEXTS – A LANGUAGE-INDEPENDENT APPROACH

Oliveira, Pedro Miguel Rocha Correia de

http://hdl.handle.net/10362/190515

Use this identifier to cite or link to this record.

Name:	Description:	Size:	Format:
Oliveira_2024.pdf		5.88 MB	Adobe PDF

Authors

Oliveira, Pedro Miguel Rocha Correia de

Advisor(s)

Silva, Joaquim

Abstract(s)

Attributing authorship to a text poses a multifaceted challenge for both experts and Artificial Intelligence (AI) systems. The intricacy arises from diverse factors, including capturing distinct writing styles, handling texts from the same era and languages, managing distinct heteronyms of the same writer, or discerning the author’s gender. Traditionally, solutions for Authorship Attribution necessitated the extraction of numerous attributes, often facilitated by specialized linguistic tools, and relied on extensive training documents. The emergence of Deep Learning Transformers has exacerbated this dependency on data quantity. Conventional classification approaches typically assign a class to documents, even if they deviate significantly from the learned classes during the training phase. However, it is imperative to reject such anomalous texts based on established methodologies to bolster the reliability of classifiers. This thesis introduces a language-independent approach to Authorship Attribution, capable of rejecting unusual samples in challenging contexts. By evaluating the discriminant ability of each attribute, the final set of features can be significantly streamlined.
The field of document classification, with its expanding applications, has experienced a proliferation of studies, leading to continuous advancements in proposals and methodologies. The classification of documents diverges based on diverse problem-solving approaches, introducing variations in how data is presented. Notably, supervised classification tends to outperform unsupervised methods, benefiting from its reliance on prior class-labeled data. Within the realm of document classification, Authorship Attribution and plagiarism verification, while distinct in objectives, converge in their shared aim of extracting information about an author from the document’s data.
Fundamentally, any classification problem operates on the premise that unique attributes enable the differentiation of objects. However, identifying the attributes essential for distinguishing authors poses a formidable challenge. The desired outcome entails representing an author through a set of attributes that vividly characterize them, thereby facilitating the grouping of authors based on these carefully selected characteristics.
This dissertation aims to develop a system proficient in Authorship Attribution while also capable of rejecting documents significantly different from any prototypes learned during the training phase. The system undergoes a training phase where it receives documents from each author, extracts representative information, and later, when presented with new documents, endeavors to attribute them to one of the authors from the training phase. If a document deviates substantially from any learned prototypes, the system must reject assigning authorship based on a reasoned approach.

Attributing authorship to a text poses a multifaceted challenge for both experts and Artificial Intelligence (AI) systems. The complexity stems from several factors, including capturing distinct writing styles, handling texts from the same era and languages, managing distinct heteronyms of the same author, and identifying the author's gender. Traditionally, solutions for Authorship Attribution required the extraction of numerous attributes, often obtained with specialized linguistic tools, and depended on extensive training documents. The emergence of Transformers based on Deep Learning models has aggravated the dependence on data quantity. Conventional classification approaches generally assign a class to documents even when they deviate significantly from the classes learned during the training phase. However, it is imperative to reject anomalous texts on the basis of established methodologies in order to strengthen the reliability of classifiers. This thesis proposes a language-independent approach to Authorship Attribution, capable of rejecting samples that are foreign to the classification system in challenging contexts. By evaluating the discriminant ability of each attribute, the final set of features can be significantly simplified. The field of document classification, with its expanding applications, has seen a growing number of studies, resulting in continuous improvements in proposals and methodologies. Document classification diverges according to different problem-solving approaches, introducing variations in how data is presented. Notably, supervised classification tends to outperform unsupervised methods, benefiting from its reliance on previously labeled data.
Within document classification, Authorship Attribution and plagiarism verification, although distinct in their objectives, converge on the common goal of extracting information about an author from the document's data. Fundamentally, any classification problem assumes that unique attributes allow objects belonging to distinct classes to be differentiated. However, identifying the essential attributes for distinguishing authors, if such attributes exist, proves a considerable challenge. The desired outcome involves representing an author through a set of attributes that characterize them distinctly, thereby facilitating the grouping (clustering) of authors based on these previously selected characteristics. This dissertation aims to develop a system proficient in Authorship Attribution that is also capable of rejecting documents significantly different from any prototypes learned during the training phase. The system goes through a training phase in which it receives documents from each author and extracts representative information; later, when presented with new documents, it tries to attribute them to one of the authors known from the training phase. If a document, represented in terms of the attributes captured within the text, deviates substantially from every learned prototype, the system must be able to reject the Authorship Attribution, following a reasoned approach.
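
The train, attribute, and reject cycle described in the abstract can be illustrated with a minimal sketch. This is not the method of the thesis: the abstract does not state which attributes, distance measure, or rejection rule are used. Character trigram frequencies, cosine distance, per-author averaged prototypes, the 0.5 threshold, and all function names below are assumptions chosen only because they need no language-specific tooling.

```python
import math
from collections import Counter


def char_ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (assumed feature set)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def train(docs_by_author, n=3):
    """Build one prototype per author by averaging document profiles."""
    prototypes = {}
    for author, docs in docs_by_author.items():
        summed = Counter()
        for doc in docs:
            for gram, freq in char_ngram_profile(doc, n).items():
                summed[gram] += freq
        prototypes[author] = {g: v / len(docs) for g, v in summed.items()}
    return prototypes


def attribute(text, prototypes, reject_threshold=0.5, n=3):
    """Return (author, distance), or (None, distance) when rejected.

    reject_threshold is a hypothetical value; in practice it would have
    to be tuned on held-out data.
    """
    profile = char_ngram_profile(text, n)
    best_author, best_dist = min(
        ((a, cosine_distance(profile, p)) for a, p in prototypes.items()),
        key=lambda pair: pair[1],
    )
    if best_dist > reject_threshold:
        return None, best_dist  # too far from every learned prototype
    return best_author, best_dist
```

Calling `train` with a dictionary mapping author names to lists of texts yields the prototypes, and `attribute` then returns either the nearest author or `None` when the document lies farther than the threshold from every prototype. A single global threshold is the simplest possible rejection rule; per-author thresholds would be a natural refinement under the same assumptions.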

Collections

FCT: DI - Master's Dissertations
