A carregar...
Projeto de investigação
Converging broadcast and user generated content for interactive ultra-high definition services
Financiador
Autores
Publicações
Bridging Vision and Language over Time with Neural Cross-modal Embeddings
Publication . Semedo, David Fernandes; Magalhães, João
Giving computers the ability to understand multimedia content is one of the goals
of Artificial Intelligence systems. While humans excel at this task, it remains a challenge,
requiring bridging vision and language, which inherently have heterogeneous
computational representations. Cross-modal embeddings are used to tackle this challenge,
by learning a common space that uni es these representations. However, to grasp
the semantics of an image, one must look beyond the pixels and consider its semantic
and temporal context, with the latter being de ned by images’ textual descriptions and
time dimension, respectively. As such, external causes (e.g. emerging events) change the
way humans interpret and describe the same visual element over time, leading to the
evolution of visual-textual correlations.
In this thesis we investigate models that capture patterns of visual and textual interactions
over time, by incorporating time in cross-modal embeddings: 1) in a relative manner,
where by using pairwise temporal correlations to aid data structuring, we obtained a
model that provides better visual-textual correspondences on dynamic corpora, and 2) in
a diachronic manner, where the temporal dimension is fully preserved, thus capturing
visual-textual correlations evolution under a principled approach that jointly models
vision+language+time. Rich insights stemming from data evolution were extracted from
a 20 years large-scale dataset. Additionally, towards improving the e ectiveness of these
embedding learning models, we proposed a novel loss function that increases the expressiveness
of the standard triplet-loss, by making it adaptive to the data at hand. With our
adaptive triplet-loss, in which triplet speci c constraints are inferred and scheduled, we
achieved state-of-the-art performance on the standard cross-modal retrieval task.
A benchmark of visual storytelling in social media
Publication . Marcelino, Gonçalo; Semedo, David; Mourão, André; Blasi, Saverio; Mrak, Marta; Magalhães, João; NOVALincs
Media editors in the newsroom are constantly pressed to provide a "like-being there" coverage of live events. Social media provides a disorganised collection of images and videos that media professionals need to grasp before publishing their latest news updated. Automated news visual storyline editing with social media content can be very challenging, as it not only entails the task of finding the right content but also making sure that news content evolves coherently over time. To tackle these issues, this paper proposes a benchmark for assessing social media visual storylines. The SocialStories benchmark, comprised by total of 40 curated stories covering sports and cultural events, provides the experimental setup and introduces novel quantitative metrics to perform a rigorous evaluation of visual storytelling with social media data.
Diachronic cross-modal embeddings
Publication . Semedo, David; Magalhães, João; NOVALincs
Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time. Under this paradigm, a new embedding is needed that structures visual-textual interactions according to the temporal dimension, thus, preserving data's original temporal organisation. This paper introduces a novel diachronic cross-modal embedding (DCM), where cross-modal correlations are represented in embedding space, throughout the temporal dimension, preserving semantic similarity at each instant t. To achieve this, we trained a neural cross-modal architecture, under a novel ranking loss strategy, that for each multimodal instance, enforces neighbour instances' temporal alignment, through subspace structuring constraints based on a temporal alignment window. Experimental results show that our DCM embedding successfully organises instances over time. Quantitative experiments, confirm that DCM is able to preserve semantic cross-modal correlations at each instant t while also providing better alignment capabilities. Qualitative experiments unveil new ways to browse multimodal content and hint that multimodal understanding tasks can benefit from this new embedding.
Cross-modal subspace learning with scheduled adaptive margin constraints
Publication . Semedo, David; Magalhães, João; NOVALincs
Cross-modal embeddings, between textual and visual modalities, aim to organise multimodal instances by their semantic correlations. State-of-the-art approaches use maximum-margin methods, based on the hinge-loss, to enforce a constant margin m, to separate projections of multimodal instances from different categories. In this paper, we propose a novel scheduled adaptive maximum-margin (SAM) formulation that infers triplet-specific constraints during training, therefore organising instances by adaptively enforcing inter-category and inter-modality correlations. This is supported by a scheduled adaptive margin function, that is smoothly activated, replacing a static margin by an adaptively inferred one reflecting triplet-specific semantic correlations while accounting for the incremental learning behaviour of neural networks to enforce category cluster formation and enforcement. Experiments on widely used datasets show that our model improved upon state-of-the-art approaches, by achieving a relative improvement of up to approximate to 12.5% over the second best method, thus confirming the effectiveness of our scheduled adaptive margin formulation.
Unidades organizacionais
Descrição
Palavras-chave
Contribuidores
Financiadores
Entidade financiadora
European Commission
Programa de financiamento
H2020
Número da atribuição
687605
