Research project
Untitled
Funder
Authors
Publications
Cross-modal subspace learning with scheduled adaptive margin constraints
Publication . Semedo, David; Magalhães, João; NOVALincs
Cross-modal embeddings, between textual and visual modalities, aim to organise multimodal instances by their semantic correlations. State-of-the-art approaches use maximum-margin methods, based on the hinge loss, to enforce a constant margin m that separates projections of multimodal instances from different categories. In this paper, we propose a novel scheduled adaptive maximum-margin (SAM) formulation that infers triplet-specific constraints during training, therefore organising instances by adaptively enforcing inter-category and inter-modality correlations. This is supported by a scheduled adaptive margin function that is smoothly activated, replacing the static margin with an adaptively inferred one that reflects triplet-specific semantic correlations, while accounting for the incremental learning behaviour of neural networks to enforce category cluster formation. Experiments on widely used datasets show that our model improves upon state-of-the-art approaches, achieving a relative improvement of up to approximately 12.5% over the second-best method, thus confirming the effectiveness of our scheduled adaptive margin formulation.
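The idea of replacing the constant margin with a scheduled, triplet-specific one can be illustrated with a minimal sketch. The function names, the linear schedule, and the way the margin widens with category dissimilarity are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def sam_hinge_loss(anchor, positive, negative, sim_pn, epoch, total_epochs,
                   base_margin=0.2, max_margin=0.5):
    """Hinge loss with a scheduled adaptive margin (illustrative sketch).

    `sim_pn` is the semantic similarity between the positive and negative
    categories: dissimilar categories get a wider target margin. A schedule
    weight in [0, 1] smoothly activates the adaptive part as training
    progresses, starting from the static `base_margin`.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarities between the anchor and the positive/negative projections.
    s_pos, s_neg = cos(anchor, positive), cos(anchor, negative)

    # Triplet-specific target margin: wider when the categories are less similar.
    adaptive = base_margin + (max_margin - base_margin) * (1.0 - sim_pn)

    # Linear schedule: 0 at the start of training, reaching 1 at the end.
    w = epoch / total_epochs
    margin = (1.0 - w) * base_margin + w * adaptive

    # Standard hinge: penalise when the negative is not separated by `margin`.
    return max(0.0, margin - s_pos + s_neg)
```

A well-separated triplet yields zero loss, while a violating one is penalised by the (scheduled) margin plus the similarity gap.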
Diachronic cross-modal embeddings
Publication . Semedo, David; Magalhães, João; NOVALincs
Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time. Under this paradigm, a new embedding is needed that structures visual-textual interactions according to the temporal dimension, thus preserving the data's original temporal organisation. This paper introduces a novel diachronic cross-modal embedding (DCM), where cross-modal correlations are represented in embedding space throughout the temporal dimension, preserving semantic similarity at each instant t. To achieve this, we trained a neural cross-modal architecture under a novel ranking loss strategy that, for each multimodal instance, enforces temporal alignment of neighbour instances through subspace structuring constraints based on a temporal alignment window. Experimental results show that our DCM embedding successfully organises instances over time. Quantitative experiments confirm that DCM is able to preserve semantic cross-modal correlations at each instant t while also providing better alignment capabilities. Qualitative experiments unveil new ways to browse multimodal content and hint that multimodal understanding tasks can benefit from this new embedding.
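The temporal alignment window can be pictured as a simple neighbour-selection step: only instances whose timestamps fall within a window around instant t participate in the structuring constraint. The `(id, timestamp)` representation below is an assumption for illustration:

```python
def temporal_neighbours(instances, t, window):
    """Select instances whose timestamp lies within +/- `window` of instant t.

    A minimal sketch of a temporal alignment window: the ranking loss would
    only enforce subspace structuring constraints between an instance at
    instant t and the neighbours this function returns.
    """
    return [ident for ident, ts in instances if abs(ts - t) <= window]
```

In a training loop, the returned neighbours would supply the positive pairs for the ranking loss at instant t.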
Conversational Search with Random Walks over Entity Graphs
Publication . Gonçalves, Gustavo; Magalhães, João; Callan, Jamie; NOVALincs
The entities that emerge during a conversation can be used to model topics, but not all entities are equally useful for this task. Modeling the conversation with entity graphs and predicting each entity's centrality in the conversation provides additional information that improves the retrieval of answer passages for the current question. Experiments show that using random walks to estimate entity centrality on conversation entity graphs improves top-precision answer-passage ranking over competitive Transformer-based baselines.
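Random-walk centrality on an entity graph can be sketched with a generic power-iteration PageRank over an adjacency matrix; this is a standard formulation, not the authors' exact setup:

```python
import numpy as np

def pagerank_centrality(adj, damping=0.85, iters=50):
    """Estimate node centrality by a random walk with restarts (PageRank).

    `adj` is a square adjacency matrix of a conversation entity graph;
    entities reached often by the walk receive higher centrality scores.
    """
    n = adj.shape[0]
    # Column-stochastic transition matrix: uniform choice among each node's edges.
    out = adj.sum(axis=0).astype(float)
    out[out == 0] = 1.0                      # avoid division by zero for sinks
    M = adj / out
    r = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(iters):
        # With prob. `damping` follow an edge; otherwise restart uniformly.
        r = (1 - damping) / n + damping * M @ r
    return r / r.sum()
```

On a star-shaped graph, for instance, the hub entity ends up with the highest score, matching the intuition that a conversation's central entity is the one most connected to the rest.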
Bridging Vision and Language over Time with Neural Cross-modal Embeddings
Publication . Semedo, David Fernandes; Magalhães, João
Giving computers the ability to understand multimedia content is one of the goals of Artificial Intelligence systems. While humans excel at this task, it remains a challenge, requiring bridging vision and language, which inherently have heterogeneous computational representations. Cross-modal embeddings are used to tackle this challenge, by learning a common space that unifies these representations. However, to grasp the semantics of an image, one must look beyond the pixels and consider its semantic and temporal context, defined by images' textual descriptions and time dimension, respectively. As such, external causes (e.g. emerging events) change the way humans interpret and describe the same visual element over time, leading to the evolution of visual-textual correlations.
In this thesis we investigate models that capture patterns of visual and textual interactions over time, by incorporating time in cross-modal embeddings: 1) in a relative manner, where by using pairwise temporal correlations to aid data structuring, we obtained a model that provides better visual-textual correspondences on dynamic corpora, and 2) in a diachronic manner, where the temporal dimension is fully preserved, thus capturing the evolution of visual-textual correlations under a principled approach that jointly models vision+language+time. Rich insights stemming from data evolution were extracted from a 20-year large-scale dataset. Additionally, towards improving the effectiveness of these embedding learning models, we proposed a novel loss function that increases the expressiveness of the standard triplet-loss by making it adaptive to the data at hand. With our adaptive triplet-loss, in which triplet-specific constraints are inferred and scheduled, we achieved state-of-the-art performance on the standard cross-modal retrieval task.
Knowledge-Driven Answer Generation for Conversational Search
Publication . Leite, Mariana Estríbio; Magalhães, João; Semedo, David
Conversational Information Seeking has been recognized as a major emerging research area, with the rise of a new generation of virtual personal assistants (Google Assistant, Alexa, Siri, Cortana, amongst others). These systems, however, only support limited information tasks in narrow domains. Conventional search engines, while supporting open-domain queries, provide the user with a ranked list of documents instead of straightforward answers.
In this context, we address the problem of open-domain conversation assistance supported by a corpus of text passages from Wikipedia and the Web, totaling 40 million passages. The core proposal of this thesis is a framework for generating answers by focusing them on the conversation's central named entities. With this knowledge, various strategies were researched to select the Wikipedia passages that should be summarized by three different Transformer architectures. These models were fine-tuned for the summarization task and enabled the creation of a single, more natural, knowledge-guided answer for a given conversation turn.
Our proposed pipeline was evaluated both quantitatively and qualitatively, most notably using the TREC CAsT dataset and a human evaluation experiment with over 130 participants. Results show that the goal of creating answers with better information quality was successfully met. Furthermore, the application of a modified PageRank algorithm with the BART model was shown to further enhance the system's performance by over 6%.
Funders
Funding entity
Fundação para a Ciência e a Tecnologia
Funding programme
5665-PICT
Grant number
CMUP-ERI/TIC/0046/2014
