| Name | Description | Size | Format |
|---|---|---|---|
| | | 2.59 MB | Adobe PDF |
Advisor(s)
Abstract(s)
Extracting relevant fields from documents has been an active research topic for decades. Although
well-established algorithms for this task have existed since the late twentieth century, the field has
attracted renewed attention with the rapid growth of deep learning models and transfer learning.
One such model is LayoutLM, a Transformer-based architecture pre-trained with
additional features that represent the 2D position of the words.
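
As an illustration of the 2D position features, the sketch below scales a word's pixel-space bounding box to an integer 0 to 1000 grid, a convention commonly used when preparing LayoutLM inputs. The function name `normalize_box` and the example page size are assumptions made for this sketch; the dissertation's exact pre-processing may differ.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space box (x0, y0, x1, y1) to a 0-1000 integer grid.

    `box` holds the left, top, right and bottom pixel coordinates of a word.
    The 0-1000 scale is an assumed convention, not taken from the abstract.
    """
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical word box on a 1000 x 2000 pixel page image.
print(normalize_box((50, 100, 200, 150), page_width=1000, page_height=2000))
```

Normalizing by page size keeps positions comparable across images of different resolutions.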
In this dissertation, LayoutLM is fine-tuned on a set of invoices to extract some of their relevant fields,
such as company name, address, and document date, among others. Because the objective is to deploy the
model in a company’s internal accounting software, an end-to-end machine learning pipeline is
presented. The training layer receives batches of document images and their corresponding
annotations and fine-tunes the model for a sequence labeling task. The production layer takes images as input
and predicts the relevant fields.
The images are pre-processed by extracting the whole document text and its bounding boxes using OCR. To
automatically label the samples in a Transformers-based input format, the text is labeled using an
algorithm that searches for parts of the text that are equal or highly similar to the annotations.
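
The labeling idea can be sketched as a fuzzy search over windows of OCR words. This is a minimal illustration, not the dissertation's algorithm: the function `label_words`, the `"O"` tag for unlabeled words, the similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def label_words(words, annotations, threshold=0.8):
    """Assign a field label to each OCR word by fuzzy-matching annotation text.

    words: OCR tokens in reading order.
    annotations: dict mapping a field name to its annotated string.
    threshold: minimum similarity ratio to accept a match (hypothetical value).
    """
    labels = ["O"] * len(words)  # "O" marks words outside any field
    for field, target in annotations.items():
        target_norm = target.lower()
        n_target = len(target.split())
        best_score, best_span = 0.0, None
        # Try window sizes near the annotation's word count so that OCR
        # splits or merges of words can still be matched.
        for size in range(max(1, n_target - 1), n_target + 2):
            for start in range(len(words) - size + 1):
                candidate = " ".join(words[start:start + size]).lower()
                score = SequenceMatcher(None, candidate, target_norm).ratio()
                if score > best_score:
                    best_score, best_span = score, (start, start + size)
        # Label only the best window, and only if it is similar enough.
        if best_span is not None and best_score >= threshold:
            for i in range(best_span[0], best_span[1]):
                labels[i] = field
    return labels


# Hypothetical OCR output with one misread character ("O5" instead of "05").
ocr_words = ["ACME", "Lda", "Date:", "2021-03-O5", "Total", "12.50"]
fields = {"company": "ACME Lda", "date": "2021-03-05", "total": "12.50"}
print(label_words(ocr_words, fields))
```

Accepting highly similar rather than only identical spans lets the labeling tolerate small OCR errors, at the cost of choosing a threshold.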
In addition, a new dataset supporting this work is created and made publicly available. The dataset consists
of 813 pictures and the annotation text for every relevant field: company name,
company address, document date, document number, buyer tax number, seller tax number, total
amount, and tax amount.
The models are fine-tuned and compared with two baseline models, showing performance very close
to that reported by the model's authors. A sensitivity analysis is carried out to understand the impact of two
datasets with different characteristics. In addition, the learning curves for different datasets show
empirically that 100 to 200 samples are enough to fine-tune the model and achieve top performance.
Based on the results, a strategy for model deployment is defined. Empirical results show that the
already fine-tuned model is enough to guarantee top performance in production without the need for
online learning algorithms.
Description
Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Keywords
Document data extraction; Deep Learning; Transformers; Invoice dataset
