Please use this identifier to cite or link to this item:
http://hdl.handle.net/10362/101656
Title: Bridging Vision and Language over Time with Neural Cross-modal Embeddings
Author: Semedo, David Fernandes
Advisor: Magalhães, João
Keywords: Temporal Embeddings; Cross-modal embeddings; Multimedia understanding; Vision and language; Neural networks
Defense Date: 15-Jul-2020
Abstract: Giving computers the ability to understand multimedia content is one of the goals of Artificial Intelligence systems. While humans excel at this task, it remains a challenge, requiring the bridging of vision and language, which inherently have heterogeneous computational representations. Cross-modal embeddings are used to tackle this challenge, by learning a common space that unifies these representations. However, to grasp the semantics of an image, one must look beyond the pixels and consider its semantic and temporal context, with these defined by the images' textual descriptions and their time dimension, respectively. As such, external causes (e.g. emerging events) change the way humans interpret and describe the same visual element over time, leading to the evolution of visual-textual correlations. In this thesis we investigate models that capture patterns of visual and textual interactions over time, by incorporating time in cross-modal embeddings: 1) in a relative manner, in which pairwise temporal correlations aid data structuring, yielding a model that provides better visual-textual correspondences on dynamic corpora, and 2) in a diachronic manner, in which the temporal dimension is fully preserved, thus capturing the evolution of visual-textual correlations under a principled approach that jointly models vision+language+time. Rich insights stemming from data evolution were extracted from a large-scale dataset spanning 20 years. Additionally, towards improving the effectiveness of these embedding learning models, we proposed a novel loss function that increases the expressiveness of the standard triplet-loss, by making it adaptive to the data at hand. With our adaptive triplet-loss, in which triplet-specific constraints are inferred and scheduled, we achieved state-of-the-art performance on the standard cross-modal retrieval task.
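For context on the loss function discussed in the abstract: the standard triplet-loss that the proposed adaptive variant builds on is commonly written, for cross-modal retrieval, as a bidirectional hinge loss. The formulation below is a textbook sketch with a fixed margin, illustrative only and not taken from the thesis:

\mathcal{L}(i, c) = \max\big(0,\ \alpha - s(i, c) + s(i, c^{-})\big) + \max\big(0,\ \alpha - s(i, c) + s(i^{-}, c)\big)

where i is an image and c its matching caption, i^{-} and c^{-} are non-matching (negative) examples, s(\cdot,\cdot) is the similarity between the two items' projections into the learned common space, and \alpha is a fixed margin shared by all triplets. Per the abstract, the adaptive triplet-loss replaces this single fixed constraint with triplet-specific constraints that are inferred from the data and scheduled during training.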
URI: http://hdl.handle.net/10362/101656
Designation: Doctor of Philosophy in Computer Science
Appears in Collections: FCT: DI - Teses de Doutoramento
Files in This Item:
File | Size | Format
---|---|---
Semedo_2020.pdf | 22,74 MB | Adobe PDF