Digital health: predicting flu incidence based on Twitter

Rosa, Cristiana Filipa Cruz

http://hdl.handle.net/10362/60301

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TGI0185.pdf		1.63 MB	Adobe PDF

Authors

Rosa, Cristiana Filipa Cruz

Advisor(s)

Henriques, Roberto André Pereira

Bernardo, Ivo

Abstract

Social networks are part of daily life for a large share of the world's population, and users share content on a wide range of subjects. As a result, social networks have become a data repository from which valuable information can be extracted and the population's interests explored in real time (Recuero, 2005). Consider how often a story on the television news is one we had already seen on Facebook or Twitter. This suggests that some events affecting the population could be detected earlier. The objective of this work is therefore to use Twitter posts (tweets) related to the flu and determine whether they are related to the incidence of this disease, one of the greatest concerns for Portuguese public health. The topic is particularly relevant in light of the 2009 flu pandemic, which spread worldwide (Centers for Disease Control and Prevention, 2010). Had that outbreak been predicted in time, countries could have had more time to prepare, gather the resources needed to fight it, and inform the population of the procedures to follow, reducing the number of people affected and the consequent spread of the disease. The methodology is based on Data and Text Mining techniques. We began by defining flu-related terms for filtering tweets, then collected the data through an API (Application Programming Interface) and preprocessed it. To classify every record in the dataset, so that the statistical comparisons could then be carried out, several classification algorithms were tested (Random Forest, Naïve Bayes and Logistic Regression), with Random Forest giving the best results. This algorithm was then used to classify the entire dataset, with a manually classified subset of the data as the training set.
In the analysis of results, official data and Twitter data were compared in several ways, taking into account two different classification taxonomies and the time lag, that is, assuming that incidence is detected on Twitter before it appears in the official figures. The relationship was tested with linear regression, and we concluded that Twitter data can predict the flu incidence rate, with predictive capacity depending on both the time lag and the taxonomy applied.
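
As an illustration of the time-lag comparison described in the abstract, the following is a minimal sketch of a lagged linear regression, not the code used in the project. It assumes weekly aggregation, a lag measured in weeks, and a single predictor (the count of flu-classified tweets); the function names `fit_line` and `lagged_fits` and both data series are hypothetical and made up for the example.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r2)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx                    # slope
    a = my - b * mx                  # intercept
    r2 = (sxy * sxy) / (sxx * syy)   # coefficient of determination
    return a, b, r2


def lagged_fits(tweets, incidence, max_lag):
    """Regress incidence at week t + lag on tweet counts at week t."""
    results = {}
    for lag in range(max_lag + 1):
        x = tweets[: len(tweets) - lag]  # drop the last `lag` weeks of tweets
        y = incidence[lag:]              # drop the first `lag` weeks of incidence
        results[lag] = fit_line(x, y)
    return results


# Synthetic series for illustration only (NOT data from the study).
weekly_flu_tweets = [5, 9, 14, 22, 30, 26, 18, 11, 7, 4]
official_incidence = [10.0, 12.0, 20.0, 30.0, 46.0, 62.0, 54.0, 38.0, 24.0, 16.0]

fits = lagged_fits(weekly_flu_tweets, official_incidence, max_lag=3)
best_lag = max(fits, key=lambda lag: fits[lag][2])  # lag with the highest R^2

for lag, (a, b, r2) in fits.items():
    print(f"lag={lag} weeks: incidence ~ {a:.2f} + {b:.2f} * tweets (R^2={r2:.3f})")
print("best lag:", best_lag)
```

The synthetic incidence series was built to trail the tweet series by exactly one week, so the one-week lag gives the best fit here. With real data, the choice of lag would also depend on the taxonomy used to label tweets, as the abstract notes.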

Description

Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence

Keywords

Health, Flu, Twitter, Text Mining, Data Mining

URI

http://hdl.handle.net/10362/60301

Collections

NIMS - Dissertações de Mestrado em Gestão da Informação (Information Management)

CC License

CC BY