Where and When? Large Vision-Language Models for Entity Prediction in Multimodal News Events

Calvo, Bernardo Alexandre Vaz

http://hdl.handle.net/10362/182403

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
Calvo_2024.pdf		3.69 MB	Adobe PDF


Author(s)

Calvo, Bernardo Alexandre Vaz

Advisor(s)

Semedo, David

Müller-Budack, Eric

Abstract(s)

News today is communicated in a multimedia format that combines text, images, and video. Understanding multimodal events, however, requires combining vision, language, and event knowledge, which motivates work on event-related tasks in the news domain, where such tasks have diverse applications. The recent emergence and success of Large Vision-Language Models has opened new possibilities for addressing complex challenges in different areas of research, as these models have been shown to encode and use world knowledge effectively across several tasks. However, the question remains whether they can reason over real-world event contexts and be further used in event-related tasks. The core motivation of this thesis is to explore the behavior of state-of-the-art models on multimodal event data, focusing on temporal and spatial information, and thereby to contribute to the development of ubiquitous Artificial Intelligence assistants. We proposed a novel proxy task, involving both general and fine-grained evaluations and diverse settings aligned with the study’s objectives, together with a new dataset created specifically for this task, containing multimodal news data on both past events and more recent or future ones. Additionally, we conducted an agreement study to assess whether powerful models can evaluate the task as humans would. The key findings are that multimodal models leverage visual cues and outperform their Large Language Model backbones. Furthermore, In-Context Learning prompting strategies improve performance over zero-shot prompts, and fine-tuning is the most appropriate approach to achieve even better results. In general, models perform better on spatial information than on temporal information and generalize well to event and location entities when reasoning about unseen data, but they show clear limitations with recent dates. Therefore, this thesis provides insights into the potential of Large Vision-Language Models for multimodal event-related tasks.

The contemporary era has introduced the multimedia format in news communication, including text, images, and video. However, understanding multimodal events requires combining vision, language, and general knowledge about those events, which motivates event-related tasks that have diverse applications. The recent success of Large Vision-Language Models offers new possibilities for addressing challenges in different areas of research, since these models have proven capable of encoding and using world knowledge to perform various tasks. Nevertheless, the question remains whether these models can reason over real-world event contexts and whether they can be used in event-related tasks. The main motivation of this thesis is to explore the behavior of state-of-the-art models on multimodal event data, focusing on temporal and spatial information, contributing to the development of Artificial Intelligence assistants. We proposed a new task, involving general and more detailed evaluations, and several settings aligned with the objectives of this study. We created a new dataset specifically for this task, containing multimodal news data on past events as well as more recent or future events. Additionally, we conducted an agreement study to assess whether more advanced models can evaluate the task in the same way humans do. The findings reveal that multimodal models take advantage of the images and outperform their respective Large Language Models. Moreover, prompting strategies such as In-Context Learning improve performance compared with zero-shot, and fine-tuning is the most appropriate approach for achieving even better results. Overall, the models perform better on space than on time and generalize well to event and location entities in new data, but show limitations with recent dates. Therefore, this thesis provides new insights into the potential of these models for multimodal event-related tasks.

Keywords

Real-world events; Time and space; Large Vision-Language Models; Masked Entity Prediction; Prompting; Supervised fine-tuning


Collections

FCT: DI - Master's Dissertations
