NewsMQA: A Multimodal Question Answering Benchmark over News Pieces

Lopes, Carolina Magalhães

http://hdl.handle.net/10362/163255

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
Lopes_2023.pdf		23.28 MB	Adobe PDF


Authors

Lopes, Carolina Magalhães

Advisor(s)

Semedo, David

Abstract(s)

News articles are one of the main sources of information available to people around the world. They are highly diverse documents that cover a wide range of topics and refer to various entities, relationships and concepts spanning a multitude of domains. The illustrations accompanying these articles play a crucial role in capturing readers’ attention and adding to the storytelling. In complex domains such as news, question answering (QA) can be an effective way to serve precise information-seeking needs. Due to the open-domain and multimodal nature of news, where images complement the text, QA systems must dynamically draw on both visual and linguistic sources to ground their answers. Creating such a system requires combining natural language and visual information to find an answer. Moreover, the model needs to answer open-ended questions and learn multimodal representations that correlate image and text elements. In this thesis, we propose NewsMQA, a novel dataset and benchmark for the task of Multimodal Question Answering for News. NewsMQA differs from existing datasets by fully covering the multimodal facets of news and improving on the quality vs. scale trade-off. We adopt a two-part approach that combines human annotation with synthetic question-answer generation through answer roundtrip consistency. We comprehensively study the created dataset, highlighting its unique characteristics, features, quality, and the research challenges of the task it supports. To benchmark the dataset, we leverage pre-trained Transformers and propose different strategies to extend them with visual information extracted from the corresponding images. We conduct an extensive evaluation of the intricacies and challenges of the dataset and provide insights into the impact of enriching the input of these models with image-related information. Finally, we provide a critical discussion of the best-performing approaches and of the task's open challenges.

News articles are one of the main sources of information available to people around the world. They are documents that cover a wide range of topics and refer to various entities, relationships and concepts spanning a diversity of domains. The illustrations accompanying these articles play a crucial role in capturing readers' attention and contribute to enriching the narrative. In complex domains such as news, question-answering methods can be quite effective at providing information to the user. Due to the multimodal nature of news, these systems should take into account both the visual and the linguistic sources in order to answer correctly. Creating such a system requires addressing tasks such as natural language processing and vision. In addition, it must answer open-ended questions and learn multimodal representations, correlating images and text. Existing datasets for news are quite limited and hide the complexity of the problem. We propose NewsMQA, a dataset for the task of Multimodal Question Answering for News. It differs from existing datasets by covering the multimodal facets of news and seeking a good quality-to-size trade-off. Regarding the latter, we suggest an approach that combines human annotation with synthetic generation of questions and answers. We provide a comprehensive analysis of the introduced dataset, highlighting its characteristics and the challenges it supports. To evaluate our dataset, we use pre-trained Transformers and propose extending these models to support multimodality by incorporating information extracted from the images into the input sequence provided to them. We carry out a set of analyses and studies with which we evaluate and discuss the complexity and challenges of the dataset, and we offer our view on the best information that can be used to enrich the models and improve their performance on the formulated task.

Keywords

News Media; Dataset Creation; Transformers; Multimodal Question Answering; Natural Language Processing

Collections

FCT: DI - Dissertações de Mestrado
