| Name | Description | Size | Format |
|---|---|---|---|
| | | 4.83 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Customers increasingly rate, review and research products online (Jansen 2010). Consequently,
websites containing consumer reviews are becoming targets of opinion spam. Nowadays, people
are paid to write fake positive reviews online in order to mislead customers and boost sales
revenue. Alternatively, people are paid to pose as customers and post fake negative reviews
with the aim of damaging competitors. These practices have become a menace on social media and
often leave customers confused.
In this study, we explore multiple aspects of deception classification. We apply four
kinds of Natural Language Processing treatment to the input, i.e., the reviews: lemmatization,
stemming, POS tagging, and a mix of lemmatization and POS tagging. We also examine how
each of these inputs responds to different machine learning models: Logistic Regression, Naïve
Bayes, Support Vector Machine, Random Forest, Extreme Gradient Boosting and Deep Learning
Neural Network.
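The experimental design described above can be read as a cross product of input treatments and models. The following is a minimal sketch of that grid only; the variable names are hypothetical and it does not reproduce the study's actual code or training procedure.

```python
from itertools import product

# The four input treatments and six model families named in the abstract.
TREATMENTS = [
    "lemmatization",
    "stemming",
    "POS tagging",
    "lemmatization + POS tagging",
]
MODELS = [
    "Logistic Regression",
    "Naive Bayes",
    "Support Vector Machine",
    "Random Forest",
    "Extreme Gradient Boosting",
    "Deep Learning Neural Network",
]

# Every (treatment, model) pair is one experimental configuration.
experiment_grid = list(product(TREATMENTS, MODELS))

for treatment, model in experiment_grid:
    print(f"{treatment:30s} x {model}")
```

Four treatments and six models give 24 configurations, assuming each treatment is paired with each model.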
We used the gold-standard hotel reviews dataset created by (Ott, Choi, et al. 2011) and (Ott,
Cardie and Hancock, Negative Deceptive Opinion Spam 2013). We also used the restaurant reviews
dataset and the doctors’ reviews dataset used by (Li, et al. 2014). We explored the usability of
these models within the same domain as well as across different domains. We trained our models on
75% of the hotel reviews dataset and checked classification accuracy on the same domain (the
remaining 25% of hotel reviews, which were unseen) and on different domains (unseen restaurant
reviews and unseen doctors’ reviews). We did this to create a robust model that can be applied
within the same domain and across different domains.
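The evaluation protocol above (train on 75% of one domain, test on the held-out 25% and on another domain) can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the small unigram Naïve Bayes classifier is a stand-in for the models listed earlier, the helper names (`split_75_25`, `UnigramNaiveBayes`, `accuracy`) are hypothetical, and the review texts are invented placeholders rather than entries from the datasets cited.

```python
import math
import random
from collections import Counter

def tokenize(text):
    # Plain whitespace unigrams; the study's NLP treatments are not applied here.
    return text.lower().split()

def split_75_25(examples, seed=0):
    # Shuffle a copy, then keep 75% for training and 25% for testing.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]

class UnigramNaiveBayes:
    """Multinomial Naive Bayes over unigrams with add-one smoothing."""

    def fit(self, examples):
        self.word_counts = {}
        self.doc_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for tok in tokenize(text):
                counts[tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def accuracy(model, examples):
    correct = sum(model.predict(text) == label for text, label in examples)
    return correct / len(examples)

# Invented placeholder reviews (NOT taken from the cited datasets).
HOTEL = [
    ("the room was clean and the staff were helpful", "truthful"),
    ("check in took a while but the bed was comfortable", "truthful"),
    ("parking was expensive and the wifi was slow", "truthful"),
    ("breakfast was fine and the location was convenient", "truthful"),
    ("my husband and i had the most amazing luxurious stay ever", "deceptive"),
    ("this hotel is absolutely the best experience of my life", "deceptive"),
    ("i will definitely recommend this amazing hotel to everyone", "deceptive"),
    ("the most wonderful luxurious hotel i have ever visited", "deceptive"),
]
RESTAURANT = [
    ("the soup was cold but the staff were helpful", "truthful"),
    ("absolutely the most amazing dinner of my life", "deceptive"),
]

train, test = split_75_25(HOTEL)
model = UnigramNaiveBayes().fit(train)
print("in-domain accuracy (hotel, held-out 25%):", accuracy(model, test))
print("cross-domain accuracy (restaurant):", accuracy(model, RESTAURANT))
```

With placeholder data this small the printed accuracies carry no meaning; the sketch only shows how a single trained model is scored separately on same-domain and cross-domain test sets.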
The best accuracy we achieved on the hotel test dataset was 91%, using the Deep Learning Neural
Network. Logistic Regression, Support Vector Machine and Random Forest produced results similar
to those of the neural network. Naïve Bayes also reached similar accuracy; however, its
cross-domain accuracy was more volatile. Extreme Gradient Boosting had the weakest accuracy of
all the models we explored.
Our results are comparable to, and at times exceed, the performance reported in other researchers’
work. Additionally, we have explored various models (Logistic Regression, Naïve Bayes, Support
Vector Machine, Random Forest, Extreme Gradient Boosting, Neural Network) vis-à-vis various input
transformation methods using Natural Language Processing (lemmatized unigrams, stemming, POS
tagging and a mix of lemmatization and POS tagging).
Description
Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Keywords
Online deception; Deep Learning; Natural Language Processing; Neural Network; Logistic Regression; Naïve Bayes; Support Vector Machine; Random Forest; Extreme Gradient Boosting
