| Name | Description | Size | Format |
|---|---|---|---|
|  |  | 22.21 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Giving computers the ability to understand multimedia content is one of the goals
of Artificial Intelligence systems. While humans excel at this task, it remains a challenge for machines,
because it requires bridging vision and language, which inherently have heterogeneous
computational representations. Cross-modal embeddings tackle this challenge
by learning a common space that unifies these representations. However, to grasp
the semantics of an image, one must look beyond the pixels and consider its semantic
and temporal context, defined by the image's textual descriptions and
time dimension, respectively. External causes (e.g. emerging events) change the
way humans interpret and describe the same visual element over time, leading to the
evolution of visual-textual correlations.
In this thesis we investigate models that capture patterns of visual and textual interactions
over time by incorporating time in cross-modal embeddings: 1) in a relative manner,
where using pairwise temporal correlations to aid data structuring yields a
model that provides better visual-textual correspondences on dynamic corpora, and 2) in
a diachronic manner, where the temporal dimension is fully preserved, thus capturing
the evolution of visual-textual correlations under a principled approach that jointly models
vision+language+time. Rich insights stemming from data evolution were extracted from
a large-scale dataset spanning 20 years. Additionally, to improve the effectiveness of these
embedding learning models, we proposed a novel loss function that increases the expressiveness
of the standard triplet loss by making it adaptive to the data at hand. With our
adaptive triplet loss, in which triplet-specific constraints are inferred and scheduled, we
achieved state-of-the-art performance on the standard cross-modal retrieval task.
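
For readers unfamiliar with the triplet loss mentioned in the abstract, the following is a minimal sketch of its standard hinge form, max(0, margin + d(anchor, positive) - d(anchor, negative)). The cosine distance, the toy two-dimensional embeddings, and the margin values are assumptions chosen for illustration. The second call only shows what a margin that differs per triplet might look like; how the thesis actually infers and schedules its triplet-specific constraints is not described in the abstract and is not reproduced here.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge triplet loss: max(0, margin + d(a, p) - d(a, n))."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# Hypothetical toy embeddings in a shared image-text space.
image = [1.0, 0.0]
matching_text = [1.0, 0.0]
unrelated_text = [0.0, 1.0]

# Fixed margin: the negative is already far enough away, so the loss is zero.
fixed = triplet_loss(image, matching_text, unrelated_text, margin=0.2)

# Assumed larger margin for this particular triplet: the constraint is now
# violated, so the loss is positive and would drive further separation.
per_triplet = triplet_loss(image, matching_text, unrelated_text, margin=1.5)

print(fixed, per_triplet)
```

With a single fixed margin, every triplet is held to the same separation requirement; letting the margin vary per triplet is one simple way a loss could become adaptive to the data, which is the general idea the abstract points to.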
Description
Keywords
Temporal embeddings; Cross-modal embeddings; Multimedia understanding; Vision and language; Neural networks
