
Learning sample extraction method based on convex boundary (cited by 2)

Abstract: The quality and quantity of learning samples are crucial for intelligent data classification systems, yet there is no general, well-established method for discovering meaningful samples. Motivated by this, the concept of the convex boundary of a dataset is proposed, together with a fast method for discovering a meaningful sample set. Firstly, abnormal and incomplete samples in the learning sample set are cleaned with a box-plot function. Secondly, the concept of the data cone is introduced to divide the normalized learning samples into cones. Finally, each cone-shaped sample subset is centralized, and the samples whose distance to the convex boundary is minimal are extracted to form the convex boundary sample set. Experiments were carried out on 12 UCI datasets with six classical data classification algorithms: Gaussian Naive Bayes (GNB), Classification And Regression Tree (CART), Linear Discriminant Analysis (LDA), Adaptive Boosting (AdaBoost), Random Forest (RF) and Logistic Regression (LR). The results show that convex boundary sample sets significantly shorten the training time of each algorithm while maintaining classification performance. In particular, for datasets containing much noisy data, such as the caesarian section, electrical grid stability and car evaluation datasets, the convex boundary sample set improves classification performance. To better evaluate the efficiency of the convex boundary sample set, the sample cleaning efficiency is defined as the ratio of the sample size change rate to the classification performance change rate, and this index is used to objectively assess the significance of convex boundary samples. A cleaning efficiency greater than 1 indicates that the method is effective, and the higher the value, the better the effect of using convex boundary samples as learning samples. On the HTRU2 pulsar dataset, the cleaning efficiency of the proposed method for the GNB algorithm exceeds 68, demonstrating its strong performance.
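The abstract describes the pipeline only in prose (box-plot cleaning, cone partitioning, convex-boundary extraction). The following is a minimal sketch of how such a pipeline could look, not the authors' implementation: the function names, the 1.5*IQR whisker rule, the angular binning in the first two feature dimensions used to form the "data cones", and the keep_ratio parameter are all illustrative assumptions.

```python
import numpy as np

def boxplot_clean(X):
    """Drop rows with missing values or with any feature outside the 1.5*IQR whiskers."""
    X = X[~np.isnan(X).any(axis=1)]                     # remove incomplete samples
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = ((X >= lo) & (X <= hi)).all(axis=1)          # keep rows inside all whiskers
    return X[mask]

def convex_boundary_subset(X, n_cones=36, keep_ratio=0.2):
    """Normalize and centralize the data, split it into angular cones, and keep the
    outermost samples of each cone as an approximation of the convex boundary."""
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    C = (X - X.min(axis=0)) / span                      # normalize to [0, 1]
    C = C - C.mean(axis=0)                              # centralize
    # assumption: cones are indexed by the angle in the first two feature dimensions
    angles = np.arctan2(C[:, 1], C[:, 0])
    radii = np.linalg.norm(C, axis=1)
    cone_id = np.digitize(angles, np.linspace(-np.pi, np.pi, n_cones + 1))
    keep = []
    for c in np.unique(cone_id):
        idx = np.where(cone_id == c)[0]
        k = max(1, int(keep_ratio * len(idx)))
        keep.extend(idx[np.argsort(radii[idx])[-k:]])   # farthest points lie near the boundary
    return X[np.sort(np.array(keep))]

# illustrative usage on random data
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    X_boundary = convex_boundary_subset(boxplot_clean(X))
    print(X.shape, X_boundary.shape)
```

Under the cleaning-efficiency definition given in the abstract (sample size change rate divided by classification performance change rate), a subset that, for example, removes 60% of the samples while accuracy changes by only 2% would score 0.60 / 0.02 = 30; values above 1 mean the sample set shrinks faster than performance degrades.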
Authors: GU Yiyi; TAN Xuntao; YUAN Yubo (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
Source: Journal of Computer Applications (《计算机应用》, CSCD, Peking University Core Journal), 2019, No. 8, pp. 2281-2287 (7 pages)
Funding: Zhejiang Provincial Key Research and Development Program (2019C03004)
Keywords: machine learning; data classification; sample selection; convex cone; boundary sample
