Abstract
Two datasets are compared to illustrate a basic problem frequently encountered in QSAR/QSPR research. The first is a large dataset with high structural diversity; the second is a small dataset of structurally similar molecules, selected by molecular structural similarity and therefore of low diversity. For the first dataset, a global model could not fully describe the relationship between the structural features of the molecules and their properties because the data are so dispersed, and the regression result was poor (test-set Q^2 = 0.68, root mean square error of prediction (RMSEP) = 40.65). Local lazy regression (LLR), a recently proposed method that predicts a query molecule from its local neighborhood rather than from the whole dataset, was then applied in an attempt to improve the prediction. However, the local model performed even worse (Q^2 = 0.60, RMSEP = 45.05). For the second dataset, which is less diverse and relatively homogeneous, the LLR local model (Q^2 = 0.90, RMSEP = 24.66) clearly outperformed the global model (Q^2 = 0.86, RMSEP = 29.37).
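The local lazy regression idea described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the Euclidean distance in descriptor space, the neighborhood size `k`, the ordinary least-squares local fit, and the function names `local_lazy_predict`, `rmsep`, and `q2` are all assumptions made for the example. The `q2` function uses one common definition (1 minus the ratio of the prediction error sum of squares to the total sum of squares of the observed test values), which may differ from the definition used in the paper.

```python
import numpy as np


def local_lazy_predict(X_train, y_train, x_query, k=5):
    """Predict one query point from a linear model fitted only on its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Assumption: similarity is measured by Euclidean distance between descriptor vectors.
    dist = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dist)[:k]

    # Local design matrix with an intercept column, fitted by least squares.
    A = np.column_stack([np.ones(len(idx)), X_train[idx]])
    coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)

    # Evaluate the local model at the query point.
    return float(np.concatenate(([1.0], x_query)) @ coef)


def rmsep(y_true, y_pred):
    """Root mean square error of prediction on a test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def q2(y_true, y_pred):
    """Assumed definition: 1 - PRESS / SS, with SS taken about the mean of the observed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    press = np.sum((y_true - y_pred) ** 2)
    ss = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - press / ss)
```

Because a separate model is fitted for every query, the prediction depends entirely on how well the nearest neighbours represent the query molecule, which is consistent with the abstract's observation that LLR helped on the structurally homogeneous dataset and hurt on the highly diverse one.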
Source
《计算机与应用化学》
CAS
CSCD
PKU Core (Peking University core journal list)
2007, Issue 1, pp. 83-86 (4 pages)
Computers and Applied Chemistry
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 20475066, 20235020)