Using PubChem’s database with data mining and machine learning algorithms for the prediction of EGFR inhibitors: a comparative study

File: TGI0131.pdf (2.54 MB, Adobe PDF)

Abstract

Data Mining and Machine Learning algorithms and methods have become increasingly important for several industries because the amount of available data has grown exponentially in recent years, creating a need for effective ways of gaining insights from that data. In this study, these methods are applied to the prediction of Epidermal Growth Factor Receptor (EGFR) inhibitors using data extracted from PubChem’s database. PubChem is a freely accessible chemical repository that contains information submitted from several different sources and comprises three databases, one of which provides information about BioAssays, that is, assays that screen numerous compounds for activity on a particular biological target. In this work, the dataset used to train and evaluate the developed models was built from the assays performed to identify inhibitors of EGFR, and the features used to characterize the compounds came from PubChem’s own chemical descriptor, the Substructure Fingerprint. The work comprises a literature review on this subject and the implementation of a methodology that tests the performance of different types of classifiers for the problem at hand, namely Naïve Bayes, Decision Tree, Logistic Regression, k-Nearest Neighbors, Support Vector Machine, Multilayer Perceptron, Random Forest, Extremely Randomized Trees, Bagging, Boosting and Voting. Considering both the evaluated quality metrics and each model’s computational burden, the Multilayer Perceptron was considered the best model, although some of the other models performed similarly. It was concluded that the methodology and the developed models were of good quality, as was PubChem’s Substructure Fingerprint as a descriptor, but that there was still room for improvement through further experimentation on different aspects of the methodology.
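
As an illustration only, the kind of task described in the abstract (labelling compounds as active or inactive from binary substructure fingerprints) can be sketched with one of the listed classifier families, Naïve Bayes. This is a minimal sketch and not the dissertation's implementation: the six-bit fingerprints, the labels, and the function names `train_bernoulli_nb`, `predict` and `accuracy` are all hypothetical, and plain accuracy stands in for whichever quality metrics the study actually evaluated. The real PubChem Substructure Fingerprint is far longer (881 bits).

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on binary fingerprint vectors.

    X: list of equal-length lists of 0/1 bits.
    y: list of 0/1 labels (1 = active, 0 = inactive).
    alpha: Laplace smoothing, so no bit probability is exactly 0 or 1.
    """
    n_bits = len(X[0])
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        # Log prior: fraction of training compounds carrying this label.
        log_prior = math.log(len(rows) / len(X))
        # Smoothed probability that each bit is set within this class.
        p_bit = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_bits)
        ]
        model[label] = (log_prior, p_bit)
    return model


def predict(model, x):
    """Return the label with the highest posterior log-probability."""
    scores = {}
    for label, (log_prior, p_bit) in model.items():
        score = log_prior
        for bit, p in zip(x, p_bit):
            score += math.log(p if bit else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, X, y):
    """Fraction of compounds whose predicted label matches the true one."""
    hits = sum(predict(model, x) == t for x, t in zip(X, y))
    return hits / len(y)


# Hypothetical toy data: each row is a made-up 6-bit fingerprint.
X_train = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],
]
y_train = [1, 1, 1, 0, 0, 0]
X_test = [[1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1]]
y_test = [1, 0]

model = train_bernoulli_nb(X_train, y_train)
test_accuracy = accuracy(model, X_test, y_test)
print(test_accuracy)
```

A comparative study of the kind described would repeat the same train-and-evaluate loop for each classifier type on a held-out portion of the assay data, and weigh the resulting metrics against training and prediction cost.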

Description

Dissertation presented as a partial requirement for obtaining a Master's degree in Information Management, with a specialization in Knowledge Management and Business Intelligence

Keywords

Data mining, Machine learning, Epidermal growth factor receptor, PubChem
