Entity Recognition and Linking for Biomedical Documents: Applying recent Transformer-based Entity Recognition and Linking Algorithms for the Biomedical Domain, to a Multi-Lingual Scenario

Gonçalves, Rodrigo Miguel Gameiro Vilhena

http://hdl.handle.net/10362/182370

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
Goncalves_2024.pdf		3.11 MB	Adobe PDF

Authors

Gonçalves, Rodrigo Miguel Gameiro Vilhena

Advisor(s)

Lamúrias, André

Abstract(s)

In the ever-expanding domain of Biomedicine, a wealth of crucial information is embedded in a vast number of free-text documents. However, the unstructured nature of this text, coupled with the intricacy of biomedical terminology, makes it difficult for automated systems to extract valuable insights efficiently. Furthermore, most resources are devoted to English, although lower-resource languages also contain valuable information. The current state-of-the-art models in multilingual biomedical Natural Language Processing lag behind their general-domain and English-specific counterparts. This disparity emphasizes the need for approaches that can tackle the complexity of biomedical literature across languages. This dissertation focuses on two Natural Language Processing tasks within information extraction: Named Entity Recognition (NER) and Entity Linking (EL). To tackle these tasks, we used the following methodology: first, we investigated the efficacy of transfer learning approaches by adapting pre-trained Transformer-based models to the complexities of multilingual biomedical texts. Second, we employed a data augmentation technique to enrich the training data, with the aim of improving the performance of those models. We evaluated our approaches on the SympTEMIST, CANTEMIST and MultiCardioNER shared tasks, competitions that provide a benchmark for evaluating NER and EL techniques within the biomedical domain for various languages. We obtained competitive results for both NER and EL, consistently surpassing the mean and median results for those shared tasks, and even establishing a new state-of-the-art score for DrugTEMIST English in NER. Our methodology can be easily extended to other languages and datasets.

In an ever-expanding domain such as Biomedicine, much of the information is found in free-text documents. However, the unstructured nature of this textual reservoir, together with the complexity of biomedical terminology, poses a significant challenge for automated systems attempting to extract valuable information efficiently. In addition, most resources are available in English, but other lower-resource languages also hold information that can be quite valuable. The current state of the art of Natural Language Processing models in the biomedical domain and in a multilingual scenario falls short of models with no specific domain that focus on the English language. This disparity highlights the need to develop approaches that can handle the complexity of biomedical literature in several languages. This dissertation focuses on two Natural Language Processing tasks within information extraction: Named Entity Recognition (NER) and Entity Linking (EL). To address these tasks, we used the following methodology: first, we investigated the effectiveness of Transfer Learning approaches, adapting pre-trained models based on the Transformer architecture to the complexities of multilingual biomedical texts. Second, we developed a Data Augmentation technique to enrich the training data and try to improve the performance of these models. We evaluated our approaches on the SympTEMIST, CANTEMIST and MultiCardioNER shared tasks, competitions dedicated to evaluating NER and EL techniques in the biomedical domain across several languages. We obtained competitive results for both NER and EL, consistently surpassing the mean and median results of these shared tasks, and establishing a new state of the art, namely for DrugTEMIST English in NER. Our methodology can easily be extended to other languages and datasets.

Keywords

Natural Language Processing; Biomedicine; Multilingual; Transformers; Transfer Learning; Data Augmentation

Collections

FCT: DI - Master's Dissertations
