Please use this identifier to cite or link to this item:
http://hdl.handle.net/10362/101656
Title: Bridging Vision and Language over Time with Neural Cross-modal Embeddings
Author: Semedo, David Fernandes
Advisor: Magalhães, João
Keywords: Temporal Embeddings; Cross-modal embeddings; Multimedia understanding; Vision and language; Neural networks
Defense Date: 15-Jul-2020
Abstract: Giving computers the ability to understand multimedia content is one of the goals of Artificial Intelligence systems. While humans excel at this task, it remains a challenge, requiring the bridging of vision and language, which inherently have heterogeneous computational representations. Cross-modal embeddings are used to tackle this challenge, by learning a common space that unifies these representations. However, to grasp the semantics of an image, one must look beyond the pixels and consider its semantic and temporal context, with these defined by the images' textual descriptions and their time dimension, respectively. As such, external causes (e.g. emerging events) change the way humans interpret and describe the same visual element over time, leading to the evolution of visual-textual correlations. In this thesis we investigate models that capture patterns of visual and textual interactions over time, by incorporating time in cross-modal embeddings: 1) in a relative manner, in which pairwise temporal correlations aid data structuring, yielding a model that provides better visual-textual correspondences on dynamic corpora, and 2) in a diachronic manner, in which the temporal dimension is fully preserved, thus capturing the evolution of visual-textual correlations under a principled approach that jointly models vision+language+time. Rich insights stemming from data evolution were extracted from a large-scale dataset spanning 20 years. Additionally, towards improving the effectiveness of these embedding learning models, we proposed a novel loss function that increases the expressiveness of the standard triplet-loss, by making it adaptive to the data at hand. With our adaptive triplet-loss, in which triplet-specific constraints are inferred and scheduled, we achieved state-of-the-art performance on the standard cross-modal retrieval task.
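For context on the loss function discussed in the abstract: the standard triplet-loss that the proposed adaptive variant builds on is commonly written, for cross-modal retrieval, as a bidirectional hinge loss. The formulation below is a textbook sketch with a fixed margin, illustrative only and not taken from the thesis:

\mathcal{L}(i, c) = \max\big(0,\ \alpha - s(i, c) + s(i, c^{-})\big) + \max\big(0,\ \alpha - s(i, c) + s(i^{-}, c)\big)

where i is an image and c its matching caption, i^{-} and c^{-} are non-matching (negative) examples, s(\cdot,\cdot) is the similarity between the two items' projections into the learned common space, and \alpha is a fixed margin shared by all triplets. Per the abstract, the adaptive triplet-loss replaces this single fixed constraint with triplet-specific constraints that are inferred from the data and scheduled during training.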
URI: http://hdl.handle.net/10362/101656
Designation: Doctor of Philosophy in Computer Science
Appears in Collections: FCT: DI - Teses de Doutoramento
Files in This Item:
File | Size | Format
---|---|---
Semedo_2020.pdf | 22,74 MB | Adobe PDF