Bridging Vision and Language over Time with Neural Cross-modal Embeddings

Semedo, David Fernandes

Please use this identifier to cite or link to this record: http://hdl.handle.net/10362/101656

Title:	Bridging Vision and Language over Time with Neural Cross-modal Embeddings
Author:	Semedo, David Fernandes
Advisor:	Magalhães, João
Keywords:	Temporal embeddings; Cross-modal embeddings; Multimedia understanding; Vision and language; Neural networks
Defense Date:	15-Jul-2020
Abstract:	Giving computers the ability to understand multimedia content is one of the goals of Artificial Intelligence systems. While humans excel at this task, it remains a challenge for machines, because it requires bridging vision and language, which inherently have heterogeneous computational representations. Cross-modal embeddings tackle this challenge by learning a common space that unifies these representations. However, to grasp the semantics of an image, one must look beyond the pixels and consider its semantic and temporal context, which are defined by the image's textual descriptions and its time dimension, respectively. External causes (e.g. emerging events) change the way humans interpret and describe the same visual element over time, leading to the evolution of visual-textual correlations. In this thesis we investigate models that capture patterns of visual and textual interactions over time by incorporating time in cross-modal embeddings: 1) in a relative manner, where using pairwise temporal correlations to aid data structuring yielded a model that provides better visual-textual correspondences on dynamic corpora, and 2) in a diachronic manner, where the temporal dimension is fully preserved, thus capturing the evolution of visual-textual correlations under a principled approach that jointly models vision+language+time. Rich insights stemming from data evolution were extracted from a large-scale dataset spanning 20 years. Additionally, to improve the effectiveness of these embedding learning models, we proposed a novel loss function that increases the expressiveness of the standard triplet loss by making it adaptive to the data at hand. With our adaptive triplet loss, in which triplet-specific constraints are inferred and scheduled, we achieved state-of-the-art performance on the standard cross-modal retrieval task.
URI:	http://hdl.handle.net/10362/101656
Degree:	Doctor of Philosophy in Computer Science
Appears in Collections:	FCT: DI - Teses de Doutoramento

Files in this record:

File	Description	Size	Format
Semedo_2020.pdf		22.74 MB	Adobe PDF
