Comparing Multimodal LLMS and Traditional Neural Networks for Table Extraction From PDFs and Images: An Evaluation of Structure and Content Extraction from Table in Images

Nunes, Guilherme Guerra Marques

http://hdl.handle.net/10362/191038

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TCDMAA4517.pdf		7.85 MB	Adobe PDF

Authors

Nunes, Guilherme Guerra Marques

Advisor(s)

Baptista, Márcia Lourenço

Abstract(s)

Extracting tables from images, such as cropped sections of PDFs or screenshots of spreadsheets, remains a challenging task because table layouts vary widely and images carry no structural metadata. Traditional OCR-based systems, such as the Table Transformer (TATR) combined with PaddleOCR, rely on explicit structure detection and text recognition. More recently, multimodal Large Language Models (LLMs) such as GPT-4o, GPT-4o Mini, Granite Vision, and PHI-3 Vision have introduced an alternative approach, generating structured outputs directly from images without relying on traditional OCR pipelines. This thesis compares both strategies using 2,000 annotated tables from the PubTables-1M dataset, evenly split between simple and complex cases. Evaluation focuses on structural accuracy, content fidelity, and layout robustness, with GriTSCon used as a unified metric. Results show that GPT-4o performs best among multimodal LLMs on simple tables (GriTSCon F1 = 89.6%), while TATR-OCR outperforms all models on complex tables (GriTSCon F1 = 85.5%). GPT-4o achieves higher cell-content accuracy at exact-match thresholds on simple layouts but drops by about 17 points when handling complex structures. In contrast, TATR-OCR maintains high accuracy across both scenarios, with low failure rates and stable structure recognition. These findings highlight the limitations of current multimodal LLMs in complex visual tasks and support the potential of hybrid approaches that combine the strengths of OCR-based systems with LLM reasoning capabilities.

Extracting tables from images, such as crops from PDFs or screenshots of spreadsheets, remains a challenging task because of the variability of layouts and the absence of structural metadata. Traditional OCR-based systems, such as the Table Transformer (TATR) combined with PaddleOCR, depend on explicit structure detection and text recognition. More recently, multimodal large language models (MLLMs) such as GPT-4o, GPT-4o Mini, Granite Vision, and PHI-3 Vision have introduced an alternative approach, generating structured outputs directly from images without using a traditional OCR pipeline. This dissertation compares the two strategies using 2,000 annotated tables from the PubTables-1M dataset, divided equally between simple and complex cases. The evaluation focuses on structural accuracy, content fidelity, and layout robustness, using GriTSCon as a unified metric. The results show that GPT-4o performs best among the MLLMs on simple tables (GriTSCon F1 = 89.6%), while TATR-OCR outperforms all models on complex tables (GriTSCon F1 = 85.5%). GPT-4o achieves higher content accuracy at exact-match thresholds on simple layouts but suffers a drop of approximately 17 points on more complex structures. In contrast, TATR-OCR maintains high accuracy in both scenarios, with low failure rates and stable structure recognition. These results highlight the limitations of current LLMs on complex visual tasks and point to the potential of hybrid approaches that combine the advantages of OCR-based systems with the reasoning capability of LLMs.

Description

Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science

Keywords

Table Extraction; Optical Character Recognition; Multimodal Large Language Models; Table Transformer; GriTSCon; SDG 4 - Quality education; SDG 9 - Industry, innovation and infrastructure; SDG 16 - Peace, justice and strong institutions; SDG 17 - Partnerships for the goals

URI

http://hdl.handle.net/10362/191038

Collections

NIMS - Dissertações de Mestrado em Ciência de Dados e Métodos Analíticos Avançados (Data Science and Advanced Analytics)

CC License

CC BY
