Research project
Untitled
Funder
Authors
Publications
Cross-modal subspace learning with scheduled adaptive margin constraints
Publication . Semedo, David; Magalhães, João; NOVALincs
Cross-modal embeddings, between textual and visual modalities, aim to organise multimodal instances by their semantic correlations. State-of-the-art approaches use maximum-margin methods, based on the hinge loss, to enforce a constant margin m that separates projections of multimodal instances from different categories. In this paper, we propose a novel scheduled adaptive maximum-margin (SAM) formulation that infers triplet-specific constraints during training, therefore organising instances by adaptively enforcing inter-category and inter-modality correlations. This is supported by a scheduled adaptive margin function that is smoothly activated, replacing the static margin with an adaptively inferred one that reflects triplet-specific semantic correlations, while accounting for the incremental learning behaviour of neural networks to enforce category cluster formation. Experiments on widely used datasets show that our model improves upon state-of-the-art approaches, achieving a relative improvement of up to approximately 12.5% over the second-best method, thus confirming the effectiveness of our scheduled adaptive margin formulation.
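The idea of replacing the constant margin with a scheduled, triplet-specific one can be illustrated with a minimal sketch. The function names, the linear schedule, and the way the margin widens with category dissimilarity are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def sam_hinge_loss(anchor, positive, negative, sim_pn, epoch, total_epochs,
                   base_margin=0.2, max_margin=0.5):
    """Hinge loss with a scheduled adaptive margin (illustrative sketch).

    `sim_pn` is the semantic similarity between the positive and negative
    categories: dissimilar categories get a wider target margin. A schedule
    weight in [0, 1] smoothly activates the adaptive part as training
    progresses, starting from the static `base_margin`.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarities between the anchor and the positive/negative projections.
    s_pos, s_neg = cos(anchor, positive), cos(anchor, negative)

    # Triplet-specific target margin: wider when the categories are less similar.
    adaptive = base_margin + (max_margin - base_margin) * (1.0 - sim_pn)

    # Linear schedule: 0 at the start of training, reaching 1 at the end.
    w = epoch / total_epochs
    margin = (1.0 - w) * base_margin + w * adaptive

    # Standard hinge: penalise when the negative is not separated by `margin`.
    return max(0.0, margin - s_pos + s_neg)
```

A well-separated triplet yields zero loss, while a violating one is penalised by the (scheduled) margin plus the similarity gap.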
Diachronic cross-modal embeddings
Publication . Semedo, David; Magalhães, João; NOVALincs
Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time. Under this paradigm, a new embedding is needed that structures visual-textual interactions according to the temporal dimension, thus preserving the data's original temporal organisation. This paper introduces a novel diachronic cross-modal embedding (DCM), where cross-modal correlations are represented in embedding space throughout the temporal dimension, preserving semantic similarity at each instant t. To achieve this, we trained a neural cross-modal architecture under a novel ranking loss strategy that, for each multimodal instance, enforces temporal alignment of neighbour instances through subspace structuring constraints based on a temporal alignment window. Experimental results show that our DCM embedding successfully organises instances over time. Quantitative experiments confirm that DCM is able to preserve semantic cross-modal correlations at each instant t while also providing better alignment capabilities. Qualitative experiments unveil new ways to browse multimodal content and hint that multimodal understanding tasks can benefit from this new embedding.
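The temporal alignment window can be pictured as a simple neighbour-selection step: only instances whose timestamps fall within a window around instant t participate in the structuring constraint. The `(id, timestamp)` representation below is an assumption for illustration:

```python
def temporal_neighbours(instances, t, window):
    """Select instances whose timestamp lies within +/- `window` of instant t.

    A minimal sketch of a temporal alignment window: the ranking loss would
    only enforce subspace structuring constraints between an instance at
    instant t and the neighbours this function returns.
    """
    return [ident for ident, ts in instances if abs(ts - t) <= window]
```

In a training loop, the returned neighbours would supply the positive pairs for the ranking loss at instant t.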
Conversational Search with Random Walks over Entity Graphs
Publication . Gonçalves, Gustavo; Magalhães, João; Callan, Jamie; NOVALincs
The entities that emerge during a conversation can be used to model topics, but not all entities are equally useful for this task. Modeling the conversation with entity graphs and predicting each entity's centrality in the conversation provides additional information that improves the retrieval of answer passages for the current question. Experiments show that using random walks to estimate entity centrality on conversation entity graphs improves top-precision answer-passage ranking over competitive Transformer-based baselines.
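Random-walk centrality on an entity graph can be sketched with a generic power-iteration PageRank over an adjacency matrix; this is a standard formulation, not the authors' exact setup:

```python
import numpy as np

def pagerank_centrality(adj, damping=0.85, iters=50):
    """Estimate node centrality by a random walk with restarts (PageRank).

    `adj` is a square adjacency matrix of a conversation entity graph;
    entities reached often by the walk receive higher centrality scores.
    """
    n = adj.shape[0]
    # Column-stochastic transition matrix: uniform choice among each node's edges.
    out = adj.sum(axis=0).astype(float)
    out[out == 0] = 1.0                      # avoid division by zero for sinks
    M = adj / out
    r = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(iters):
        # With prob. `damping` follow an edge; otherwise restart uniformly.
        r = (1 - damping) / n + damping * M @ r
    return r / r.sum()
```

On a star-shaped graph, for instance, the hub entity ends up with the highest score, matching the intuition that a conversation's central entity is the one most connected to the rest.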
Bridging Vision and Language over Time with Neural Cross-modal Embeddings
Publication . Semedo, David Fernandes; Magalhães, João
Giving computers the ability to understand multimedia content is one of the goals of Artificial Intelligence systems. While humans excel at this task, it remains a challenge, requiring bridging vision and language, which inherently have heterogeneous computational representations. Cross-modal embeddings are used to tackle this challenge, by learning a common space that unifies these representations. However, to grasp the semantics of an image, one must look beyond the pixels and consider its semantic and temporal context, defined by images' textual descriptions and time dimension, respectively. As such, external causes (e.g. emerging events) change the way humans interpret and describe the same visual element over time, leading to the evolution of visual-textual correlations.
In this thesis we investigate models that capture patterns of visual and textual interactions over time, by incorporating time in cross-modal embeddings: 1) in a relative manner, where by using pairwise temporal correlations to aid data structuring, we obtained a model that provides better visual-textual correspondences on dynamic corpora, and 2) in a diachronic manner, where the temporal dimension is fully preserved, thus capturing the evolution of visual-textual correlations under a principled approach that jointly models vision+language+time. Rich insights stemming from data evolution were extracted from a 20-year large-scale dataset. Additionally, towards improving the effectiveness of these embedding learning models, we proposed a novel loss function that increases the expressiveness of the standard triplet-loss by making it adaptive to the data at hand. With our adaptive triplet-loss, in which triplet-specific constraints are inferred and scheduled, we achieved state-of-the-art performance on the standard cross-modal retrieval task.
Knowledge-Driven Answer Generation for Conversational Search
Publication . Leite, Mariana Estríbio; Magalhães, João; Semedo, David
Conversational Information Seeking has been recognized as a major emerging research area, with the rise of a new generation of virtual personal assistants (Google Assistant, Alexa, Siri, Cortana, amongst others). These systems, however, only support limited information tasks in narrow domains. Conventional search engines, while supporting open-domain queries, provide the user with a ranked list of documents instead of straightforward answers.
In this context, we address the problem of open-domain conversation assistance supported by a corpus of text passages from Wikipedia and the Web, totaling 40 million passages. The core proposal of this thesis is a framework for generating answers by focusing them on the conversation's central named entities. With this knowledge, various strategies were researched to select the Wikipedia passages that should be summarized by three different Transformer architectures. These models were fine-tuned for the summarization task and enabled the creation of a single, more natural, knowledge-guided answer for a given conversation turn.
Our proposed pipeline was evaluated both quantitatively and qualitatively, most notably using the TREC CAsT dataset and a human evaluation experiment with over 130 participants. Results show that the goal of creating answers with better information quality was successfully met. Furthermore, the application of a modified PageRank algorithm with the BART model was shown to further enhance the system's performance by over 6%.
Funders
Funding entity
Fundação para a Ciência e a Tecnologia
Funding programme
5665-PICT
Grant number
CMUP-ERI/TIC/0046/2014
