Automatic Completion of Text-based Tasks

Henriques, Daniel Filipe Rodrigues

http://hdl.handle.net/10362/92296

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
Henriques_2019.pdf		2.29 MB	Adobe PDF

Authors

Henriques, Daniel Filipe Rodrigues

Advisor(s)

Freitas, João

Correia, Rui

Abstract(s)

Crowdsourcing is a widespread problem-solving model that consists of assigning tasks to an existing pool of workers in order to solve a problem. It is a scalable alternative to hiring a group of experts to label high volumes of data, and it can provide results of similar quality faster and more efficiently. Modern approaches to crowdsourcing use Machine Learning models to label the data and ask the crowd to validate the results. Such approaches can only be applied if the data on which the model was trained (source data) and the data that needs labeling (target data) are related. Furthermore, since the model is not adapted to the target data, its predictions may contain a substantial number of errors. Consequently, validating these predictions can be very time-consuming. In this thesis, we propose an approach that leverages in-domain data, which is a labeled portion of the target data, to adapt the model. The remainder of the data is labeled based on the model's predictions. The crowd is tasked with generating the in-domain data and validating the model's predictions. Under this approach, we train the model in two settings: with in-domain data only, and with both in-domain data and data from an outer domain. We apply these learning settings with the intent of optimizing a crowdsourcing pipeline for Natural Language Processing, more concretely for the task of Named Entity Recognition (NER). This optimization concerns the effort required from the crowd to perform the NER task. The results of the experiments show that using in-domain data achieves effort savings ranging from 6% to 53%. Furthermore, we observe such savings on nine distinct datasets, which demonstrates the robustness and broad applicability of this approach. In conclusion, the in-domain data approach is capable of optimizing a crowdsourcing pipeline for NER. Furthermore, it has a broader range of use cases than reusing a model to generate predictions on the target data.

Keywords

Crowdsourcing; Transfer Learning; Machine Learning; Named Entity Recognition; Natural Language Processing

URI

http://hdl.handle.net/10362/92296

Collections

FCT: DI - Master's Dissertations
