摘要
在数据挖掘问题中,一个基本假设是训练集样本与测试集样本的数据分布一致,但随着数据量逐渐增加,如何在海量数据中找出具有代表意义的数据也变得尤为困难。对现有的数据选择方法研究发现,传统的简单随机抽样和渐进抽样等数据选择方法,由于没有和数据挖掘工具进行结合,采样结果具有偶然性和不确定性,抽样数据很难保证数据挖掘的基本假设,这也使得最终模型的泛化误差较大。为了解决数据采样过程中类间的不平衡问题,提出一种基于双决策树的结构化数据采样方法。首先通过C4.5算法生成一棵决策树,借助决策树在数据源中选择适合的数据和数据采集点,同时通过使用另一棵决策树对选择出的数据集的质量进行评估来达到高效率和高质量的数据采样。实验表明,与简单随机抽样相比,新采样数据下训练的模型准确率有明显提高。
In data mining, a basic assumption is that the data distribution of training set samples are consistent with that of test set samples. But as data volumes increase, how to find out representative data in huge amounts of data becomes particularly difficult. By studying existing data selection methods, we find that it is difficult to evaluate their sampling effect because they are not integrated with the data mining tool, such as simple random sampling and progressive sampling. Due to contingency factors and uncertainty, it is difficult to guarantee the basic assumptions of data mining, which also makes the generalization error of the model larger. In order to solve these problems, we put forward a structured data sampling method based on double decision tree. Firstly, we generate a decision tree with the C4.5 algorithm, which is used to select appropriate data and data collection points in the data source. Then, we generate another decision tree to evaluate the quality of the selected data set and achieve data sampling of high efficiency and high quality. Experiments show that compared with random sampling, the accuracy of the model based on our sampling is improved obviously.
作者
陈力
费洪晓
丁海伦
成琳
翟纪宇
CHEN Li;FEI Hong-xiao;DING Hai-lun;CHENG Lin;ZHAI Ji-yu(School of Geosciences and Info-Physics,Central South University,Changsha 410075;School of Software,Central South University,Changsha 410075,China)
出处
《计算机工程与科学》
CSCD
北大核心
2019年第1期130-135,共6页
Computer Engineering & Science
基金
国家自然科学基金(61602525)
中南大学2017年本科生自由探索项目(201710533267
ZY20170769)
关键词
决策树
数据采样
机器学习
decision tree
data sampling
machine learning