Screening for linearly and nonlinearly related variables in predictive cheminformatic models

EarlyView Article

  • Published: Feb 7, 2018
  • Authors: Bahram Hemmateenejad, Knut Baumann

Abstract

Feature selection has long been an active topic in the statistical literature and has become increasingly important in many research fields. Feature screening methods based on marginal correlation have known weaknesses. One is the shading effect, in which a highly influential variable masks variables of lower importance and hinders their selection. Feature selection becomes even more difficult in the presence of nonlinear relations. To overcome these limitations, a method for selecting both linearly and nonlinearly correlated variables is presented. It couples nonparametric variable ranking methods with nonparametric regression methods through an iterative regression on residuals. Here, the maximal information coefficient and distance correlation are used to rank the variables. The algorithm starts by modeling the relationship between the response and the top-ranking variable with the multivariate adaptive regression splines method. In each subsequent iteration, the top-ranking variable is selected based on its relationship with the current residuals. The method is validated on 2 nonlinear simulated data sets and on 2 real cheminformatic data sets: the toxicity of 1571 industrial chemicals and the aqueous solubility of a diverse set of 1708 organic molecules.
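
The iterative scheme described in the abstract (rank the variables, fit the top-ranked one, then re-rank against the residuals) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes distance correlation as the only ranking measure, replaces multivariate adaptive regression splines with a much simpler stand-in (equal-count bin means), excludes already selected variables from later rounds, and stops after a fixed number of iterations. None of these choices is stated in the abstract. All function names and the simulated data are hypothetical.

```python
import math
import random


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]      # the matrix is symmetric, so
    grand = sum(row) / n               # row means equal column means
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length sequences."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


def binned_mean_fit(x, target, n_bins=10):
    """Stand-in for MARS (assumption): fit target on x by equal-count bin means."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    fitted = [0.0] * n
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if idx:
            mean = sum(target[i] for i in idx) / len(idx)
            for i in idx:
                fitted[i] = mean
    return fitted


def residual_screening(columns, y, n_select=2, n_bins=10):
    """Select variables one at a time by ranking against current residuals."""
    residual = list(y)
    selected = []
    for _ in range(n_select):
        # Assumption: a variable is not reconsidered once selected.
        candidates = [j for j in range(len(columns)) if j not in selected]
        best = max(candidates,
                   key=lambda j: distance_correlation(columns[j], residual))
        selected.append(best)
        fitted = binned_mean_fit(columns[best], residual, n_bins)
        residual = [r - f for r, f in zip(residual, fitted)]
    return selected


def make_demo_data(seed=0, n=150, n_vars=5):
    """Hypothetical data: strong quadratic effect of x0, weak linear effect of x1."""
    rng = random.Random(seed)
    columns = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n_vars)]
    y = [4.0 * columns[0][i] ** 2 + 0.5 * columns[1][i] for i in range(n)]
    return columns, y


if __name__ == "__main__":
    cols, y = make_demo_data()
    print("selected variable indices:", residual_screening(cols, y, n_select=2))
```

In the simulated data, the quadratic term has no linear correlation with the response, and the weaker linear term is partly masked until the dominant variable has been fitted and removed. Ranking against residuals rather than against the raw response is what the abstract proposes as the remedy for this shading effect.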

Copyright © 2018 John Wiley & Sons, Inc. All Rights Reserved