| Name | Description | Size | Format |
|---|---|---|---|
| | | 2.29 MB | Adobe PDF |
Advisor(s)
Abstract(s)
Crowdsourcing is a widespread problem-solving model that consists of assigning tasks
to an existing pool of workers in order to solve a problem. It is a scalable alternative to
hiring a group of experts to label high volumes of data, and it can provide results of
similar quality in a faster and more efficient manner. Modern approaches to crowdsourcing
use Machine Learning models to label the data and ask the crowd to validate the results.
Such approaches can only be applied if the data on which the model was trained
(source data) and the data that needs labeling (target data) share some relation. Furthermore,
since the model is not adapted to the target data, its predictions may contain a
substantial number of errors. Consequently, the validation of these predictions can be
very time-consuming. In this thesis, we propose an approach that leverages in-domain
data, which is a labeled portion of the target data, to adapt the model. The remainder
of the data is labeled based on the model's predictions. The crowd is tasked with the
generation of the in-domain data and the validation of the model's predictions. Under
this approach, we train the model in two settings: with only in-domain data, and with
both in-domain data and data from an outer domain.
We apply these learning settings with the intent of optimizing a crowdsourcing pipeline
for the area of Natural Language Processing, more concretely for the task of Named Entity
Recognition (NER). This optimization relates to the effort required by the crowd to
perform the NER task. The results of the experiments show that the usage of in-domain
data achieves effort savings ranging from 6% to 53%. Furthermore, we observe such savings in
nine distinct datasets, which demonstrates the robustness and breadth of application of this
approach.
In conclusion, the in-domain data approach is capable of optimizing a crowdsourcing
pipeline for NER. Furthermore, it has a broader range of use cases than
reusing a model to generate predictions on the target data.
Description
Keywords
Crowdsourcing; Transfer Learning; Machine Learning; Named Entity Recognition; Natural Language Processing
