Abstract
Machine learning, an important tool in data mining, not only explores the human cognitive learning process but also encompasses the analysis and processing of data. Faced with the challenge of massive data, some researchers currently focus on improving and developing machine learning algorithms, while others work on selecting sample data and reducing data sets; these two lines of research proceed in parallel. Training sample selection is a research hotspot in machine learning: by selecting sample data effectively, more informative samples are extracted and redundant samples and noisy data are eliminated, which improves the quality of the training samples and in turn yields better learning performance. This paper reviews existing sample data selection methods, summarizing and comparing them under four headings: sampling-based methods, clustering-based methods, nearest neighbor classification rule-based methods, and other related data selection methods. It also summarizes the open problems of training sample selection methods and offers an outlook on future research directions.
Authors
周玉
任钦差
牛会宾
ZHOU Yu; REN Qin-chai; NIU Hui-bin (School of Electric Power, North China University of Water Resources and Electric Power, Zhengzhou 450011, China)
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2020, Issue S02, pp. 402-408 (7 pages)
Computer Science
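Funding
Henan Province Training Program for Young Backbone Teachers in Higher Education Institutions (2018GGJS079)
National Natural Science Foundation of China (U1504622, 31671580).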
Keywords
Training sample
Data selection
Machine learning
Neural networks
Support vector machines