Novel Data Mining Methods for Virtual Screening of Biological Active Chemical Compounds

Handle URI:
http://hdl.handle.net/10754/621873
Title:
Novel Data Mining Methods for Virtual Screening of Biological Active Chemical Compounds
Authors:
Soufan, Othman M. ( 0000-0002-4410-1853 )
Abstract:
Drug discovery is a process that takes many years and hundreds of millions of dollars to reach a confident conclusion about a specific treatment. Part of this sophisticated process is based on preliminary investigations that suggest a set of chemical compounds as candidate drugs for the treatment. Computational resources have played a significant role in this part through a step known as virtual screening. From a data mining perspective, the availability of rich data resources is key to training prediction models. Yet the difficulties imposed by the rapid growth of data and its dimensionality are inevitable. In this thesis, I address the main challenges that arise when data mining techniques are used for virtual screening. To achieve efficient virtual screening using data mining, I start by addressing the problem of feature selection and provide an analysis of the best ways to describe a chemical compound for enhanced screening performance. High-throughput screening (HTS) assay data used for virtual screening are characterized by a great class imbalance. To handle this problem, I suggest using a novel algorithm called DRAMOTE to narrow down promising candidate chemicals aimed at interaction with specific molecular targets before they are experimentally evaluated. Existing works are mostly proposed for small-scale virtual screening that makes use of a few thousand interactions. Thus, I propose enabling large-scale (or big) virtual screening by learning from millions of interactions while exploiting any relevant dependency for better accuracy. A novel solution called DRABAL, which incorporates structure learning of a Bayesian network as a step to model dependency between the HTS assays, is shown to achieve significant improvements over existing state-of-the-art approaches.
Advisors:
Bajic, Vladimir B. ( 0000-0001-5435-4750 )
Committee Member:
Kalnis, Panos ( 0000-0002-5060-1360 ) ; Arold, Stefan T. ( 0000-0001-5278-0668 ) ; Gojobori, Takashi ( 0000-0001-7850-1743 ) ; Schonbach, Christian
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Program:
Computer Science
Issue Date:
23-Nov-2016
Type:
Dissertation
Appears in Collections:
Dissertations

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | Bajic, Vladimir B. | en
dc.contributor.author | Soufan, Othman M. | en
dc.date.accessioned | 2016-11-24T08:43:17Z | -
dc.date.available | 2016-11-24T08:43:17Z | -
dc.date.issued | 2016-11-23 | -
dc.identifier.uri | http://hdl.handle.net/10754/621873 | -
dc.description.abstract | Drug discovery is a process that takes many years and hundreds of millions of dollars to reach a confident conclusion about a specific treatment. Part of this sophisticated process is based on preliminary investigations that suggest a set of chemical compounds as candidate drugs for the treatment. Computational resources have played a significant role in this part through a step known as virtual screening. From a data mining perspective, the availability of rich data resources is key to training prediction models. Yet the difficulties imposed by the rapid growth of data and its dimensionality are inevitable. In this thesis, I address the main challenges that arise when data mining techniques are used for virtual screening. To achieve efficient virtual screening using data mining, I start by addressing the problem of feature selection and provide an analysis of the best ways to describe a chemical compound for enhanced screening performance. High-throughput screening (HTS) assay data used for virtual screening are characterized by a great class imbalance. To handle this problem, I suggest using a novel algorithm called DRAMOTE to narrow down promising candidate chemicals aimed at interaction with specific molecular targets before they are experimentally evaluated. Existing works are mostly proposed for small-scale virtual screening that makes use of a few thousand interactions. Thus, I propose enabling large-scale (or big) virtual screening by learning from millions of interactions while exploiting any relevant dependency for better accuracy. A novel solution called DRABAL, which incorporates structure learning of a Bayesian network as a step to model dependency between the HTS assays, is shown to achieve significant improvements over existing state-of-the-art approaches. | en
dc.language.iso | en | en
dc.subject | high-throughput screening | en
dc.subject | Data Mining | en
dc.subject | virtual screening | en
dc.subject | Feature Selection | en
dc.subject | multilabel learning | en
dc.title | Novel Data Mining Methods for Virtual Screening of Biological Active Chemical Compounds | en
dc.type | Dissertation | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
thesis.degree.grantor | King Abdullah University of Science and Technology | en_GB
dc.contributor.committeemember | Kalnis, Panos | en
dc.contributor.committeemember | Arold, Stefan T. | en
dc.contributor.committeemember | Gojobori, Takashi | en
dc.contributor.committeemember | Schonbach, Christian | en
thesis.degree.discipline | Computer Science | en
thesis.degree.name | Doctor of Philosophy | en
dc.person.id | 113152 | en