A novel support vector machine active learning strategy (cited by: 10)


Abstract: This paper proposes a new active learning strategy for support vector machines (SVM), called Dix_SVMactive. A new confidence measure on the data is defined to select the most valuable samples for manual labeling, and the balance of the training set is adjusted in each iteration to obtain better generalization ability. Test results on UCI standard datasets show that, compared with SVMactive based on random sampling and the traditional SVMactive (Tong SVMactive), the proposed algorithm not only improves classification accuracy but also reduces the manual labeling workload.

Generally, the shorter the distance between a sample and the hyperplane, the more uncertain the sample is and the more information it carries, so the more valuable it is. Active learning is an iterative process, so convergence speed must also be considered. In this paper, a new confidence measure is defined for samples, and the most valuable samples are selected for manual labeling. The confidence of an unlabeled sample, which can be regarded as the value of that sample, is defined as the ratio of the mean distance between the sample and the labeled samples to the distance between the sample and the hyperplane. The mean distance to the labeled samples measures how redundant the sample is with respect to the labeled set, and the distance to the hyperplane expresses the uncertainty of the sample. In general, the larger the former and the smaller the latter, the higher the confidence of the sample.

Additionally, the labeled set obtained after each loop may be unbalanced, which means the hyperplane may lie somewhat far from one class of samples and closer to the other. In this situation, the proposed selection rule picks more samples from the class close to the hyperplane than from the class far from it, which may lead to poor generalization performance.
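The confidence ratio described above can be illustrated with a minimal sketch. It assumes a linear hyperplane `w·x + b = 0` and Euclidean distances in the input space; the abstract does not state which distance or kernel the paper uses, so these choices, the `eps` guard, and all function names are hypothetical.

```python
import math


def hyperplane_distance(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w


def confidence(x, labeled, w, b, eps=1e-12):
    """Mean distance from x to the labeled samples, divided by the
    distance from x to the hyperplane (eps is an assumed guard against
    division by zero, not part of the source definition)."""
    mean_dist = sum(math.dist(x, z) for z in labeled) / len(labeled)
    return mean_dist / (hyperplane_distance(x, w, b) + eps)


def select_top_k(unlabeled, labeled, w, b, k):
    """Return the k unlabeled samples with the highest confidence."""
    ranked = sorted(
        unlabeled,
        key=lambda x: confidence(x, labeled, w, b),
        reverse=True,
    )
    return ranked[:k]


# Toy data: two labeled points and a hypothetical hyperplane x1 = 1.
labeled = [(0.0, 0.0), (2.0, 0.0)]
w, b = (1.0, 0.0), -1.0
pool = [(3.0, 3.0), (1.1, 0.0), (1.1, 3.0)]
queries = select_top_k(pool, labeled, w, b, k=2)
```

In this toy pool, the two points near the hyperplane rank ahead of the distant one, and among those two the point farther from the labeled samples ranks first, matching the "larger numerator, smaller denominator" reading of the definition.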
To avoid an unbalanced dataset, after each loop the proposed algorithm tests the balance degree of the dataset, defined as the ratio of the number of minority-class samples to the number of majority-class samples. When the ratio is not greater than a given threshold e, the dataset is regarded as unbalanced. In that case, some majority-class samples are removed by a strategy such as clustering so that the two classes contain equal numbers of samples. In every iteration the balance degree of the selected dataset is adjusted in this way to obtain good generalization ability. In summary, the confidence of each sample is computed first, and the first few samples in descending order of confidence are added to the training dataset. Finally, the balance of the training dataset is adjusted in each loop. Experimental results on University of California Irvine (UCI) benchmark datasets demonstrate that the proposed approach not only improves classification precision but also reduces the workload of manually labeling samples, compared with commonly used approaches such as SVMactive based on random sampling and the Tong SVMactive approach.
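The balance check and adjustment can be sketched as follows. The abstract names clustering only as one possible removal strategy and gives no details, so this sketch swaps in plain random undersampling of the majority class as a stand-in; the function names, the `seed` parameter, and the two-class restriction are assumptions made for illustration.

```python
import random
from collections import Counter


def balance_degree(labels):
    """Ratio of minority-class count to majority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0  # a single class is treated as fully unbalanced
    return min(counts.values()) / max(counts.values())


def rebalance(samples, labels, threshold, seed=0):
    """If the balance degree is not greater than threshold, drop
    majority-class samples until both classes have equal size.
    Random undersampling is an assumed stand-in for the clustering
    strategy mentioned in the abstract."""
    counts = Counter(labels)
    if len(counts) < 2 or balance_degree(labels) > threshold:
        return list(samples), list(labels)
    majority = max(counts, key=counts.get)
    n_keep = min(counts.values())
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(random.Random(seed).sample(majority_idx, n_keep))
    idx = [i for i, y in enumerate(labels) if y != majority or i in kept]
    return [samples[i] for i in idx], [labels[i] for i in idx]


# Toy labeled set: six positives and two negatives.
samples = list(range(8))
labels = [1] * 6 + [-1] * 2
balanced_samples, balanced_labels = rebalance(samples, labels, threshold=0.5)
```

With a threshold of 0.5 the toy set (balance degree 1/3) is trimmed to two samples per class; with a threshold below 1/3 it would be returned unchanged.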

Authors: Bai Longfei, Wang Wenjian, Guo Husheng

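Affiliations: School of Computer and Information Technology, Shanxi University; Key Laboratory of Computational Intelligence and Chinese Information Processing of the Ministry of Education, Shanxi University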

Source: Journal of Nanjing University (Natural Science), CSCD, Peking University Core Journal, 2012, No. 2, pp. 182-189 (8 pages)

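Funding: National Natural Science Foundation of China (60975035); Program for New Century Excellent Talents, Ministry of Education (NCET-07-0525); Doctoral Program Foundation of the Ministry of Education (20091401110003); Natural Science Foundation of Shanxi Province (2009011017-2); Shanxi Province Graduate Innovation Project (20103021)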

Keywords: support vector machine, active learning, confidence

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]


References (16)

1. Simon H A, Lea G. Problem solving and rule education: A unified view knowledge and organization[J]. Erlbaum, 1974, (02): 63-73.
2. Han Guang, Zhao Chunxia, Hu Xuelei. A new SVM active learning algorithm and its application in obstacle detection[J]. Journal of Computer Research and Development, 2009, 46(11): 1934-1941. Cited by: 14.
3. Dagan I, Engelson S. Committee-based sampling for training probabilistic classifiers[A]. Tahoe City: Morgan Kaufmann, 1995. 150-157.
4. Lewis W, Gale A. A sequential algorithm for training text classifiers (uncertainty sampling)[A]. London: Springer-Verlag, 1994. 3-12.
5. Tong S, Koller D. Support vector machine active learning with applications to text classification[J]. Journal of Machine Learning Research, 2001. 45-66.
6. Schohn G, Cohn D. Less is more: Active learning with support vector machines[A]. San Francisco: Morgan Kaufmann Publishers, 2000. 45-66.
7. Seung H S, Opper M, Sompolinsky H. Query by committee[A]. University of California: Association for Computing Machinery, 1992. 287-294.
8. Freund Y, Seung H S, Samir E. Selective sampling using the query by committee algorithm[J]. Machine Learning, 1997, (23): 133-168.
9. Vladimir N V. The nature of statistical learning theory[M]. New York: Springer-Verlag, 2000. 1-334.
10. Vapnik V. Statistical learning theory[M]. New York: Wiley, 1998. 11-23.

