MapIntel: Enhancing Competitive Intelligence Acquisition Through Embeddings and Visual Analytics

Competitive Intelligence allows an organization to keep up with market trends and foresee business opportunities. This practice is mainly performed by analysts scanning for any piece of valuable information in a myriad of dispersed and unstructured sources. Here we present MapIntel, a system for acquiring intelligence from vast collections of text data by representing each document as a multidimensional vector that captures its semantics. The system is designed to handle complex Natural Language queries and visual exploration of the corpus, potentially aiding overburdened analysts in finding meaningful insights to support decision-making. The system's searching module uses a retriever and re-ranker engine that first finds the closest neighbors to the query embedding and then sifts the results through a cross-encoder model that identifies the most relevant documents. The browsing module also leverages the embeddings by projecting them onto two dimensions while preserving the original landscape, resulting in a map where semantically related documents form topical clusters, which we capture using topic modeling. This map aims to promote a fast overview of the corpus while allowing more detailed exploration and an interactive information encountering process. In this work, we evaluate the system and its components on the 20 newsgroups dataset and demonstrate the superiority of Transformer-based components.


Introduction
Competitive Intelligence (CI) is the process and forward-looking practices used in producing knowledge about the competitive environment to improve organizational performance [17]. CI has a fundamental role in helping businesses remain competitive, influencing a wide range of decision-making areas, and leading to substantial improvements such as increased revenue, new products or services, cost savings, time savings, profit increases, and the achievement of financial goals.
Competitive Intelligence analysts are responsible for developing the CI task through a combination of gathering data, processing it, and communicating information. The digitalization of the market and the growth of the data economy have pushed the business environment to an online realm where every action and event is public and thus potentially relevant for decision-making. This shift has produced a large volume of data about products, customers, competitors, and any aspect of the business environment that can be used to foresee opportunities and risks. Given the vastness and diversity of this data, it has become necessary to design tools that can aid analysts in the CI gathering and analysis process. Therefore, the goal is to enhance the analyst's task by providing a means to explore, organize, and visualize the environmental data present in the array of existing sources.
Traditionally, the most important sources of CI have been news providers, corporate websites, and trade publications, respectively [19]. With the advent of the internet, new sources, such as social networks [7], have emerged, while existing ones have become enriched and easily accessible. Despite the increased availability, CI resources are dispersed across a variety of websites and the underlying data is unstructured and noisy. These characteristics add to the difficulty of the analyst's task and exacerbate the need for tools to support it.
Various studies have attempted to create systems for exploring and gathering intelligence from extensive collections of textual data [7,9,14,5]. These studies have consistently applied Natural Language Processing (NLP) techniques to help users comprehend large volumes of text without requiring them to sift through every document. [7] designed a system for CI that captures data from multiple sources, cleans it, uses NLP to identify and tag the relevant content, stores it, generates consolidated reports, and produces alerts on predefined triggers.
Although the previous systems have been used successfully for dealing with large amounts of text, insufficient attention has been paid to the exploratory and serendipitous aspects of the analyst's task. Accordingly, we propose an information environment that supports analysts in having stimulating and productive information encounters [8]. This is achieved by incorporating two types of information acquisition tasks: searching, consisting of an information retrieval module that allows ad hoc queries on the entire document collection, giving the user the ability to seek information actively; and browsing, consisting of a visualization module that equips the user with tools to actively or passively acquire information through the visual exploration of the document corpus (and its thematic cohorts) in a two-dimensional map.
With the recent emergence of the Transformer architecture [28], significant improvements were made in several NLP subdomains, having reached state-of-the-art results in a wide range of tasks [28]. This new architecture is based solely on the attention mechanism, providing parallelization capabilities and thus avoiding the sequential nature of existing recurrent models. The attention mechanism incorporates information from the input sequence words into the one currently being processed, thus providing "context" to the word from the rest of the sequence. Language models like Bidirectional Encoder Representations from Transformers (BERT) [6] leverage this architecture, making up a large part of the modern NLP landscape by providing a powerful off-the-shelf way to create state-of-the-art models for a wide range of tasks. The Transformer's flexibility, together with its reduced training times and improved ability to learn long-range relationships between terms in a sequence, makes it one of the pillars of modern NLP research, and we intend to apply this architecture in our work.
This paper explores Transformer-based models for representing documents as semantic vectors. These vectorial representations are commonly called embeddings, and we intend to use them in a CI system as a mechanism for extracting information from environmental data. Furthermore, the system facilitates information encountering by incorporating searching and browsing mechanisms that leverage the document embeddings. We have named the proposed system MapIntel, which derives from (Competitive) Intelligence Map.

Related Work
The process of extracting business-related information for anticipating risks and opportunities is an important task for many companies, yet analysts are overwhelmed with extensive amounts of unstructured data. To support CI analysts, we propose an NLP system for exploring and gathering intelligence from large collections of textual data. To situate our contribution, in this section we review existing work on similar systems applied in CI as well as in other domains.
Arguably, the closest method to ours in terms of domain application is [7]. They formulated a system for acquiring competitive intelligence from different web resources, including social media, using a wide array of text mining techniques. They also showed how the system can be integrated with the business data and adopted for future decision-making. Their goal is to help the analyst in the task of reading, extracting information, and organizing the data. The paper presents an approach for labeling news articles according to CI-related topics by applying Latent Dirichlet Allocation (LDA) [4] clustering. The labeling contributes to the organization of the collection and facilitates the information extraction process.
[14] proposed a method for modeling and mapping topics from bibliometric data and built a web application based on this method. The produced map allows users to read a body of research "at a distance" while providing multiple levels of detail of the documents' topics. They also incorporated a time dimension, allowing users to understand the evolution of the topics over time. They applied Non-negative Matrix Factorization [16] to discover the underlying topics in the data and obtain vectorial representations of the documents, followed by t-distributed Stochastic Neighbor Embedding (t-SNE) [27] for visualizing the documents, resulting in a two-dimensional representation of the corpus. To allow for different detail levels, the authors produced two maps: a coarse map of 9 topics that gives a general overview of the topics within the data and a detailed map of 36 topics that captures more specific research themes. The web application consists of an interactive dashboard that allows users to explore the map of documents and easily extract information.
We based our searching module on the Vector Space Model (VSM) [26, p. 120-126], a common framework in Information Retrieval, consisting of representing a set of documents as vectors in a vector space while also allowing full-text queries to be represented in the same space. The model then ranks each document in decreasing order of its similarity with the query. The fundamental assumption of the model is that similar documents will be placed close together in the vector space, whereas dissimilar documents will be far away. An application of VSM for querying COVID-19 literature can be found in [9]. They proposed Co-Search, an Information Retrieval system that combines semantic search, question answering, and abstractive summarization. The system uses Sentence-BERT (SBERT) [24], a Transformer-based model for representing documents as semantic vectors, combining it with approximate nearest neighbors and cosine similarity to return the relevant results for a query.
A more recent work focusing on the frontier between Computer Graphics and Machine Learning is Cartolabe [5]. Cartolabe is a web-based, scalable, and efficient system for visualizing and exploring large textual corpora, relying on topic modeling algorithms like LDA [4] to represent documents as vectors of topics and on UMAP [20] to produce a 2-dimensional plane that preserves the original topology and neighborhood of the documents. Additionally, they provided an interactive high-level visualization that allows exploration of the corpus in real time by offloading most of the computations to the data pre-processing pipeline, making the system highly scalable to large collections of documents. We intend to apply the same idea of performing the pre-processing offline to improve the system's responsiveness and user interaction. Contrary to Cartolabe, we aim to explore Transformer-based embeddings instead of topic vectors due to the novelty of this architecture and the improved results it has shown in multiple benchmarks in other NLP subdomains.

MapIntel
We propose MapIntel (Figure 1), a system that supports exploring a document collection while promoting serendipity and satisfying emerging information needs by allowing full-text queries over the entire collection. The system is scalable to large amounts of data, is dynamic as it regularly integrates new data, and is fast. It is composed of three main pipelines: Indexing, Query, and Visualization, whose objectives are, respectively, to get documents and their metadata from a source to a database, retrieve the most relevant results for a user query, and produce an interactive interface for exploring the document collection.

Indexing
In this work we decided to focus on how NLP, particularly sentence embeddings, could help organize, explore, and retrieve text documents in the CI domain. Thus, we have not developed the precedent tasks of data collection and pre-processing. Nevertheless, it is essential to point out that the system's quality is extremely reliant on these steps: if we feed it non-ideal data, we will get non-ideal results.
Once new documents are fed to the system, their respective embeddings are computed. This process is the basis of our work as it allows the encoding of the semantic identity of the document onto a vector of a given dimensionality. This semantic identity describes the subject of the document and can be used to compare documents with each other, i.e., documents with the same subject will be close in the semantic space and vice-versa. We used SBERT [24], a derivative of the Transformer-based BERT model, to embed the documents using a pre-trained encoder trained on reducing the distance between queries and relevant results in the MS MARCO dataset [2]. This step produced vectors of 768 dimensions, which we then reduced to 2 dimensions using the Uniform Manifold Approximation and Projection (UMAP) [20] algorithm. This aspect is another crucial component of MapIntel as it allows the organization and localization of the entire document collection in a 2-dimensional map, which can be used to explore and interact with the data.
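To make the role of the embeddings concrete, the sketch below compares vectors by cosine similarity, the property the system relies on: documents about the same subject should score higher against each other than against unrelated ones. The 4-dimensional vectors and document names are invented stand-ins for the 768-dimensional SBERT outputs, not actual model results.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d "embeddings" standing in for 768-d SBERT vectors (illustrative only).
doc_space   = np.array([0.9, 0.1, 0.0, 0.2])   # an article about spaceflight
doc_rockets = np.array([0.8, 0.2, 0.1, 0.1])   # another spaceflight article
doc_sports  = np.array([0.0, 0.1, 0.9, 0.3])   # an unrelated sports article

# Same-subject documents lie closer in the semantic space.
same_subject  = cosine_sim(doc_space, doc_rockets)
cross_subject = cosine_sim(doc_space, doc_sports)
```

Here `same_subject` exceeds `cross_subject`, which is exactly the property the retrieval and mapping steps exploit.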
We also applied a topic modeling technique called BERTopic [10], based on the work of [1]. Topic modeling unveils the latent semantic structure of the data and, unlike some of the classical techniques such as LDA [4] and pLSA [11], BERTopic leverages the SBERT embeddings and their capacity to encode the semantic attributes of a document to find the most representative topics of a corpus. BERTopic clusters the documents to find the densest areas of the semantic space while identifying outliers. The primary assumption behind BERTopic is that each dense area in the semantic space is generated by a latent topic shared among the documents that comprise it. Finally, a class-based variant of TF-IDF (c-TF-IDF) is used to extract an importance value for each word in each cluster, which can be used to represent each topic as the set of its most important words. Another advantage of BERTopic over the classical approaches is that we can choose the number of topics by merging less representative topics.
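The class-based TF-IDF step can be sketched as follows. This is a simplified illustration of the c-TF-IDF idea (treating each cluster as one long document, so that words frequent within a cluster but rare across clusters score highest), not BERTopic's exact implementation; the vocabulary and counts are invented for the example.

```python
import numpy as np

def class_tf_idf(tf):
    """Simplified c-TF-IDF: per-class importance scores for every term.

    tf: (n_classes, n_terms) matrix of raw term counts, one row per cluster.
    """
    tf = np.asarray(tf, dtype=float)
    tf_norm = tf / tf.sum(axis=1, keepdims=True)    # term frequency within each class
    avg_words = tf.sum() / tf.shape[0]              # average word count per class
    idf = np.log(1.0 + avg_words / tf.sum(axis=0))  # terms rare across classes score higher
    return tf_norm * idf

# Two clusters over the toy vocabulary ["nasa", "orbit", "game"]
counts = [[5, 4, 0],
          [0, 1, 8]]
scores = class_tf_idf(counts)
```

For this toy input, "nasa" comes out as the most important word for the first cluster and "game" for the second, so each topic can be summarized by its highest-scoring words.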
Finally, we loaded the documents, including their metadata, SBERT embeddings, UMAP embeddings, and topics, into a database. We used Open Distro for Elasticsearch - an open-source, RESTful, distributed search and analytics engine based on Apache Lucene - to store the data, organize it in an index, and perform full-text search on it. We can think of the described approach as an Indexing Pipeline (Figure 1) that extracts new raw documents from a data source, pre-processes and manipulates them, stores the results in a database, and indexes the documents for future search tasks.

Query
Finding meaningful information within a large amount of data is a sizable part of the CI task. The ability to retrieve relevant documents from a large collection of news articles through natural language queries empowers the CI analyst with an easy and intuitive interface to scan the environment.
MapIntel provides a searching functionality that leverages the SBERT embeddings by projecting the query string onto the same vector space as the corpus and computing its k-nearest neighbors, i.e., finding the k documents whose embedding vectors are closest to the query embedding vector. Since the embedding vectors encode the semantic identity of each document, this method provides semantically relevant results for a given query. Furthermore, we employ a highly performant and scalable similarity search engine by implementing Approximate Nearest Neighbors (ANN) search based on Hierarchical Navigable Small World graphs [18]. The kNN search can also be combined with binary filters that help the user obtain focused results based on characteristics of the documents such as publication date and topic.
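The nearest-neighbor step can be illustrated with a brute-force cosine-similarity search; the system itself uses HNSW-based approximate search for scalability, but the ranking principle is identical. The random vectors below are stand-ins for real document embeddings.

```python
import numpy as np

def knn_search(query_vec, doc_vecs, k=3):
    """Exact k-nearest-neighbor search by cosine similarity.

    Returns (document index, similarity) pairs in decreasing similarity.
    A brute-force stand-in for the HNSW-based ANN search used in production.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                          # cosine similarity of each document to the query
    top = np.argsort(-sims)[:k]           # indices of the k most similar documents
    return list(zip(top.tolist(), sims[top].tolist()))

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))          # 100 toy 8-d "document embeddings"
query = docs[42] + 0.01 * rng.normal(size=8)   # a query nearly identical to doc 42
```

Searching with this query returns document 42 as the top hit, since its embedding is closest to the query embedding.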
Once again, we can think of the search functionality as a pipeline, illustrated in Figure 1, where we feed in a query string and some binary filters and obtain documents ordered by their relevance to the query. We employ a Retrieve and Re-rank pipeline based on the works of [21,13], composed of a "Retrieval Bi-Encoder + ANN" node that performs kNN semantic search and a "Re-Ranker Cross-Encoder" node consisting of a BERT model fine-tuned on the MS MARCO dataset that receives a document and query pair as input and predicts the probability of the document being relevant to the query.
The pipeline works by taking advantage of the characteristics of both nodes. The Bi-Encoder, together with ANN search, can retrieve fairly relevant candidates while dealing efficiently with a large collection of records. The Cross-Encoder is not as efficient since it has to be run independently for each document, given a query. However, since attention is performed across the query and the document, the performance is higher in the second node [12]. Therefore, we combine both nodes by retrieving a large set of candidates from the entire collection using the Bi-Encoder, then filtering the most relevant candidates with the Cross-Encoder while removing noisy results.
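A minimal sketch of this two-stage idea is shown below. A plain callable stands in for the BERT cross-encoder, and the candidate sizes are arbitrary; the point is only the structure: cheap vector similarity over the whole collection, expensive per-pair scoring over the shortlist.

```python
import numpy as np

def retrieve_and_rerank(query_vec, doc_vecs, cross_scorer, top_k=20, final_k=5):
    """Two-stage ranking: cheap bi-encoder retrieval, then expensive re-scoring.

    `cross_scorer(query_vec, doc_vec)` stands in for the BERT cross-encoder;
    any callable returning a relevance score will do for this sketch.
    """
    # Stage 1: fast vector similarity over the entire collection.
    sims = doc_vecs @ query_vec
    candidates = np.argsort(-sims)[:top_k]
    # Stage 2: expensive per-pair scoring on the shortlist only.
    rescored = sorted(candidates, key=lambda i: -cross_scorer(query_vec, doc_vecs[i]))
    return [int(i) for i in rescored[:final_k]]

rng = np.random.default_rng(1)
docs = rng.normal(size=(50, 8))
query = docs[7]
exact = lambda q, d: -float(np.linalg.norm(q - d))   # toy stand-in "cross-encoder"
results = retrieve_and_rerank(query, docs, exact, top_k=10, final_k=3)
```

The shortlist keeps the expensive second stage affordable: the stand-in scorer runs `top_k` times instead of once per document in the collection.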
With this pipeline, we can provide relevant documents to the user given a query and binary filters, ranking them according to a relevancy score. As an additional feature, we can input a document instead of a query, allowing us to search for semantically similar documents within the collection.

Visualization
We conceptualized a visual interface that organizes and displays the documents to facilitate the environment scanning task, giving the user the ability to browse the data and zoom in on particular regions of the semantic space. The interface uses the UMAP algorithm to reduce the dimensionality of the original semantic space to a 2-dimensional representation that reliably preserves the original topology.
The methodology employed to produce the interface is described in Figure 1 (Visualization Pipeline). It begins by taking the same inputs passed to the Query pipeline: a query and a set of filters. The common inputs create a connection between the two modules - when the user queries the database, the query text is projected onto the 2-dimensional map and the filters define which documents are displayed in the map. In addition to the common inputs, we require a relative sample size that defines the percentage of randomly chosen documents (after applying the filters) to be displayed in the map. This step is necessary as interaction with the map is hindered by a large number of data points, resulting in a slow and unresponsive experience. Notice that the sample size does not affect the query results, as the search is always performed on the entire collection.
The map provides a means to explore the documents and the different semantic cohorts present within the collection. We color-code the points with the documents' topics identified in the Indexing stage, allowing us to visualize the latent semantic structure of the data; when hovered, the points display their corresponding title and content attributes.

Evaluation
Our methodology addresses the issues of information dispersion and overload impacting the CI analysts' tasks. The proposed system provides searching and browsing capabilities, contributing to an easier understanding of the business environment by supporting analysts in seeking specific information while promoting undirected information encountering. In this section, we elaborate on our design choices for the MapIntel system with the results of our experiments and analyze the different components of the system individually.

Experimental Setup
We evaluate our system quantitatively using the 20 newsgroups [23] dataset and the document labels provided. This dataset consists of around 18,000 newsgroups posts on 20 topics divided into 6 main groups: "Computer," "Recreation," "Science," "Miscellaneous," "Politics," and "Religion." We opted to use this dataset because of the presence of labels that describe the semantic meaning of each document, allowing us to have a reference against which we can compare the identified topics.
Given the inherent difficulty of evaluating the system in its entirety, we decided to deal with each component separately; however, since every component of our system depends on the vector representation of the documents, we cannot guarantee an orthogonal evaluation of the components. We focus our experiments on comparing 2 of the main components of the MapIntel system: the Sentence Embeddings and the Topic Model. For the former component, we evaluated the Paragraph Vector (or Doc2Vec) [15] and the SBERT [24] models, whereas for the latter we focused on LDA [4], BERTopic [10], and the Contextualized Topic Model (CTM) [3].
We use three main metrics to guide our model comparison. To evaluate the quality of the two-dimensional projections, we use the accuracy of a kNN classifier on the 20 newsgroups labels given these embeddings [20] (we present the average over the range k = {10, 20, 40, 80, 160}). To evaluate how well the assigned topics correlate with the true labels, we use the Normalized Mutual Information (NMI) - the closer this metric's value is to 1, the better we capture the true topical nature of the documents, reflected by their labels. Finally, to measure the quality of the words that describe each topic, we apply the Topic Coherence (C_v) [25] metric, indicating whether the words that compose a given topic support each other.
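As an illustration of the NMI metric, a pure-NumPy version with arithmetic-mean normalization (a common library default; our experiments use an off-the-shelf implementation) can be written as:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -float(np.sum(p * np.log(p)))

def nmi(a, b):
    """Normalized Mutual Information between two labelings of the same items."""
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for x in np.unique(a):
        for y in np.unique(b):
            pxy = np.mean((a == x) & (b == y))   # joint probability of (x, y)
            if pxy > 0:
                px, py = np.mean(a == x), np.mean(b == y)
                mi += pxy * np.log(pxy / (px * py))
    denom = (entropy(a) + entropy(b)) / 2        # arithmetic-mean normalization
    return mi / denom if denom > 0 else 1.0
```

A topic assignment that matches the labels up to a relabeling scores 1, while a topic assignment independent of the labels scores 0.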
We performed hyperparameter tuning using a multi-objective approach to optimize the three metrics specified previously. We used the Tree-structured Parzen Estimator (TPE) algorithm [22] for sampling the hyperparameter space at each trial of the optimization process. For each trial, we evaluated the sampled hyperparameters using a 5-fold cross-validation approach where the folds preserve the percentage of samples of each class. In total, 100 trials were evaluated.

Results
Our results based on the setup described above are shown in Table 1. For each trial, we report the average results and standard deviations over the cross-validation folds. The table contains the best trials for each of the Topic/Embedding model combinations according to the average of the three objective metrics, to which we applied MinMax scaling to avoid any impact of the metrics' scale on the choice of the best model. We can see that the combination that uses BERTopic and SBERT outperforms the others with respect to both NMI and C_v while having a kNN classifier accuracy within one standard deviation of the best value. Another interesting observation is that combinations using SBERT generally have better results. To facilitate results reproduction efforts, we open-sourced the code developed for the experiments at github.com/NOVA-IMS-Innovation-and-Analytics-Lab/mapintel_research. Additionally, we present the UMAP 2-dimensional maps of the documents in the 20 newsgroups dataset. Figure 2 shows the comparison between the distribution of the original labels and the topics assigned by the best performing model according to the MinMax average score for the train data. Likewise, Figure 3 shows the same comparison for the test data and demonstrates the ability of the model to generalize to unseen samples. We can see that the identified topical cohorts mostly match the original groups, indicating that the embeddings have learned the original labels in a fully unsupervised way. Additionally, it is possible to see that semantically similar topics are located close to each other in the map. This is the case for all the computation-related topics, such as window.server.windows.motif.display and format.files.graphics.file.gif. Finally, there is also an agreement between the topic meaning given by the top 5 words describing the topics and the original label description. For example, the same points that have the label sci.space also have the topic space.launch.nasa.orbit.shuttle. An important characteristic of BERTopic is that it is able to identify noise, leading to a topic assignment where a portion of the observations are classified as outliers. This produces a cleaner map for exploring the documents at the cost of samples that are not given a topic. In Figure 2 (right), the percentage of documents classified into the aforementioned category is 51.4% - these are the light grey points scattered across the map.
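The MinMax-scaled average used for model selection can be sketched as follows; the metric table below is an invented example, not values from Table 1.

```python
import numpy as np

def minmax_average(metric_table):
    """Scale each metric column to [0, 1] across trials, then average per trial.

    metric_table: (n_trials, n_metrics) array; higher is better for every metric.
    Scaling removes the effect of each metric's range on the combined score.
    """
    m = np.asarray(metric_table, dtype=float)
    span = m.max(axis=0) - m.min(axis=0)
    span[span == 0] = 1.0                 # guard against constant columns
    scaled = (m - m.min(axis=0)) / span
    return scaled.mean(axis=1)

# Three hypothetical trials scored on two metrics with very different scales.
table = [[1.0, 10.0],
         [3.0, 30.0],
         [2.0, 20.0]]
combined = minmax_average(table)          # the second trial gets the top score
```

Without the scaling, the larger-ranged metric would dominate the plain average; after scaling, both metrics contribute equally to the choice of the best trial.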

Conclusion
In this paper, we presented MapIntel, a new system for extracting knowledge from large corpora of text documents. MapIntel differs from previous systems in that it leverages Transformer-based document embeddings to provide efficient, natural language searching of documents. The use of Transformer-based embeddings allows harnessing the semantic attributes of the documents, which can then be explored in a 2-dimensional map produced using UMAP. Additionally, MapIntel organizes the documents in topical cohorts, providing yet another framework for the user's interaction with the corpus. The system is centered around the concept of information encountering [8], providing browsing and searching capabilities to acquire information and promote serendipity. MapIntel is aimed at supporting Competitive Intelligence analysts by providing a tool that facilitates the exploration and monitoring of the competitive environment from textual data.

Table 1 .
Hyperparameter tuning best trials per topic and embedding model according to the MinMax average of the multiple objective metrics.