Use this identifier to cite or link to this record:
http://hdl.handle.net/10362/101656
Title: | Bridging Vision and Language over Time with Neural Cross-modal Embeddings |
Author: | Semedo, David Fernandes |
Advisor: | Magalhães, João |
Keywords: | Temporal embeddings; Cross-modal embeddings; Multimedia understanding; Vision and language; Neural networks |
Defense Date: | 15-Jul-2020 |
Abstract: | Giving computers the ability to understand multimedia content is one of the goals of Artificial Intelligence systems. While humans excel at this task, it remains a challenge, requiring bridging vision and language, which inherently have heterogeneous computational representations. Cross-modal embeddings are used to tackle this challenge, by learning a common space that unifies these representations. However, to grasp the semantics of an image, one must look beyond the pixels and consider its semantic and temporal context, with these being defined by the image's textual descriptions and time dimension, respectively. As such, external causes (e.g. emerging events) change the way humans interpret and describe the same visual element over time, leading to the evolution of visual-textual correlations. In this thesis we investigate models that capture patterns of visual and textual interactions over time, by incorporating time in cross-modal embeddings: 1) in a relative manner, where, by using pairwise temporal correlations to aid data structuring, we obtained a model that provides better visual-textual correspondences on dynamic corpora, and 2) in a diachronic manner, where the temporal dimension is fully preserved, thus capturing the evolution of visual-textual correlations under a principled approach that jointly models vision+language+time. Rich insights stemming from data evolution were extracted from a large-scale dataset spanning 20 years. Additionally, towards improving the effectiveness of these embedding learning models, we proposed a novel loss function that increases the expressiveness of the standard triplet-loss, by making it adaptive to the data at hand. With our adaptive triplet-loss, in which triplet-specific constraints are inferred and scheduled, we achieved state-of-the-art performance on the standard cross-modal retrieval task. |
URI: | http://hdl.handle.net/10362/101656 |
Degree: | Doctor of Philosophy in Computer Science |
Appears in collections: | FCT: DI - Teses de Doutoramento |
Files in this record:
File | Description | Size | Format | |
---|---|---|---|---|
Semedo_2020.pdf | | 22.74 MB | Adobe PDF | View/Open |
All records in the repository are protected by copyright laws, with all rights reserved.