Speaker: Paul Kowalczyk



Predictive Models for Neglected Disease Drug Discovery: Built with R / Shared using Shiny.


We present our efforts to build multiple predictive machine learning models (e.g., random forests, k-nearest neighbors, support vector machines, self-organizing maps, naïve Bayes) in support of drug discovery for neglected tropical diseases. Screening data retrieved from ChEMBL-NTD were used to construct and validate the models. Programs written in the R software environment were used for data retrieval, curation, visualization, analysis, mining, and reporting. The models are made freely and publicly available using Shiny, a web application framework for R. We will demonstrate the full spectrum of steps in the model-building process. Each model has been collected into a compendium, a “container” for all the elements that make up a model and its associated description: the primary data, the annotated computational code, figures, tables, and derived data, together with textual documentation and conclusions. Each compendium is meant to stand as an instance of reproducible research. All compendia will be freely available. We will also present our work toward publishing these models using Shiny. These Shiny applications make the functionality of the R scripts (i.e., the predictive models) available to interested parties regardless of their knowledge of R. Free and open access to predictive models supporting neglected disease drug discovery is meant to complement the research activities of all investigators, in particular those with limited access to computational tools and algorithms.


PhD in Physical Chemistry from Rensselaer Polytechnic Institute; postdoctoral fellowship with IBM Data Systems Division; computational chemist (QSAR, QSPR, ligand-based and structure-based pharmacophore development, cheminformatics) at Sterling Winthrop Research Institute, Procept, Pfizer, and Scynexis; currently a data scientist at Syngenta Biotechnology, Inc., using data visualization, analysis, and mining tools to build descriptive, predictive, and prescriptive models.