Correlation and redundancy on machine learning performance for chemical databases

EarlyView Article

  • Published: Feb 19, 2018
  • Authors: Hongzhi Li, Wenze Li, Xuefeng Pan, Jiaqi Huang, Ting Gao, LiHong Hu, Hui Li, Yinghua Lu


Variable reduction is an essential step in establishing a robust, accurate, and generalizable machine learning model. Variable correlation and redundancy/total correlation are the primary considerations in many variable reduction methods because they directly affect model performance. However, their effects vary from one class of databases to another. To clarify their effects on regression models built from small chemical databases, a series of calculations is performed. Regression models are built on features with various correlation coefficients and redundancies using four machine learning methods: random forest, support vector machine, extreme learning machine, and multiple linear regression. The results suggest that correlation is, as expected, closely related to prediction accuracy; i.e., features with large correlation coefficients with respect to the response variable generally yield better regression models than those with smaller ones. For redundancy, however, no trend in the performance of the regression models is observed. This may indicate that, for these chemical molecular databases, redundancy is not a primary concern.
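
The kind of comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic (not the chemical databases of the study), only multiple linear regression is fitted (via NumPy least squares), and redundancy is approximated by the mean absolute pairwise feature correlation, a simple proxy rather than the total correlation used by the authors. The helper names are hypothetical.

```python
import numpy as np


def feature_response_correlation(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc / denom)


def mean_pairwise_redundancy(X):
    """Mean absolute off-diagonal feature-feature correlation.

    Assumption: a simple stand-in for redundancy, not total correlation.
    """
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diag).mean())


def mlr_test_rmse(X_train, y_train, X_test, y_test):
    """Fit multiple linear regression by least squares; return test RMSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = A_test @ coef - y_test
    return float(np.sqrt(np.mean(resid ** 2)))


# Synthetic data (assumption): three informative features, three pure-noise ones.
rng = np.random.default_rng(0)
n_samples = 200
informative = rng.normal(size=(n_samples, 3))
noise = rng.normal(size=(n_samples, 3))
X = np.hstack([informative, noise])
y = informative @ np.array([2.0, -1.5, 1.0]) + 0.3 * rng.normal(size=n_samples)

train, test = slice(0, 150), slice(150, None)

# Rank features by correlation with the response on the training split only.
r = feature_response_correlation(X[train], y[train])
order = np.argsort(r)[::-1]
top, bottom = order[:3], order[3:]

rmse_top = mlr_test_rmse(X[train][:, top], y[train], X[test][:, top], y[test])
rmse_bottom = mlr_test_rmse(X[train][:, bottom], y[train], X[test][:, bottom], y[test])
redundancy_top = mean_pairwise_redundancy(X[train][:, top])

print("test RMSE, high-correlation features:", rmse_top)
print("test RMSE, low-correlation features: ", rmse_bottom)
print("redundancy proxy, high-correlation features:", redundancy_top)
```

On data of this kind, the subset ranked highest by feature-response correlation should give the lower test error, which mirrors the correlation trend reported above. The sketch says nothing about the redundancy finding, which would require feature subsets with controlled redundancy and the other three learners.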

Copyright Information
Copyright © 2018 John Wiley & Sons, Inc. All Rights Reserved