A Semi-Supervised Learning Algorithm from Imbalanced Data Based on KL Divergence (Cited by: 11)
Abstract  In many real applications, such as Web search, medical diagnosis, and earthquake identification, it is often difficult or expensive to obtain labeled negative examples for learning. This makes traditional classification techniques ineffective, because their precondition that every class has its own labeled instances is not met. Semi-supervised learning from positive and unlabeled data has therefore become a hot topic in the literature. Researchers have proposed many methods in recent years, but these cannot cope well with the imbalanced classification problem, especially when the number of hidden negative examples in the unlabeled set is relatively small or the examples in the training set are distributed quite unevenly. This paper proposes a novel KL divergence-based semi-supervised classification algorithm, named LiKL (semi-supervised learning algorithm from imbalanced data based on KL divergence), to tackle this problem. The approach first finds likely positive examples in the unlabeled set, then likely negative ones, and finally applies an enhanced logistic regression classifier to the remaining unlabeled examples. Experiments show that, compared with previous work, the approach not only improves precision and recall but is also very robust.
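The pipeline the abstract describes — profile the positive class, score each unlabeled example by its KL divergence from that profile, treat the closest examples as likely positives and the farthest as likely negatives, then train a final classifier on the enlarged seed sets — can be sketched roughly as below. This is a minimal illustration, not the paper's actual LiKL procedure: the mean term-frequency profile, the function names, and the toy data are assumptions for the sketch, and the final logistic-regression step is omitted.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for discrete distributions, e.g. normalized term frequencies.

    A small eps avoids log(0) when a feature is absent from one distribution.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def rank_unlabeled(positives, unlabeled):
    """Rank unlabeled examples by KL divergence from a positive-class profile.

    Small divergence -> likely positive; large divergence -> likely negative.
    Here the profile is simply the mean feature vector of the labeled
    positives (an illustrative heuristic, not the paper's exact criterion).
    """
    profile = np.asarray(positives, dtype=float).mean(axis=0)
    scores = [kl_divergence(x, profile) for x in unlabeled]
    order = np.argsort(scores)  # indices sorted by ascending divergence
    return order, scores

# Toy demo with 3-dimensional term-frequency vectors.
P = np.array([[8, 1, 1], [7, 2, 1]])             # labeled positives
U = np.array([[9, 1, 0], [1, 1, 8], [2, 7, 1]])  # unlabeled pool
order, scores = rank_unlabeled(P, U)
likely_pos = order[0]   # most positive-like unlabeled example
likely_neg = order[-1]  # least positive-like unlabeled example
```

The examples selected this way would then seed the two classes for the enhanced logistic regression classifier that labels the rest of the unlabeled set.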
Source: Journal of Computer Research and Development (《计算机研究与发展》; EI, CSCD, Peking University Core), 2010, No. 1: 81-87 (7 pages)
Funding: National Natural Science Foundation of China (60673137, 60773075); National High-Tech Research and Development Program of China (863 Program) (2009AA01Z149); Shanghai Municipal Education Commission Science and Technology Innovation Project (10ZZ33)
Keywords: semi-supervised learning; imbalance; KL divergence; naive Bayes; logistic regression

References (11)

  • 1Manevitz L M, Yousef M. One-class SVMs for document classification [J]. Journal of Machine Learning Research, 2001, 2: 139-154.
  • 2Yu H, Han J, Chang K. PEBL: Positive examples based learning for Web page classification using SVM [C]//Proc of the 8th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2002: 239-248.
  • 3Li X, Liu B, Ng S. Learning to identify unexpected instances in the test set [C]//Proc of the 20th IJCAI. San Francisco: Morgan Kaufmann, 2007:2802-2807.
  • 4Sha C, Xu Z, Wang X, et al. Directly identify unexpected instances in the test set by entropy maximization [C]//Proc of APWEB/WAIM 2009. Berlin: Springer, 2009: 659-664.
  • 5Li Ming, Zhou Zhihua. An online semi-supervised learning method based on multi-kernel ensembles [J]. Journal of Computer Research and Development, 2008, 45(12): 2060-2068. (in Chinese) (Cited by: 12)
  • 6Manning C D, Raghavan P, Schutze H. An Introduction to Information Retrieval [M]. Cambridge: Cambridge University Press, 2007: 117-119.
  • 7Maimon O, Rokach L. The Data Mining and Knowledge Discovery Handbook [M]. Berlin: Springer, 2005:853-867.
  • 8Gyorfi L, Gyorfi Z, Vajda I. Bayesian decision with rejection [J]. Problems of Control and Information Theory, 1979, 8 (5) : 445-452.
  • 9McCallum A, Nigam K. A comparison of event models for naive Bayes text classification [C]//Proc of AAAI-98 Workshop on Learning for Text Categorization. Menlo Park, CA: AAAI, 1998:41-48.
  • 10Landwehr N, Hall M, Frank E. Logistic model trees [C]// Proc of the 14th European Conf on Machine Learning. Berlin: Springer, 2003:241-252.

Secondary References (30)

  • 1Scholkopf B, Herbrich R, Smola A J. A generalized representer theorem [C] //Proc of the 14th Annual Conf on Learning Theory. Berlin: Springer, 2001: 416-426.
  • 2Blake C, Keogh E, Merz C J. UCI repository of machine learning databases [OL]. [2008-11-10]. http://www.ics.uci.edu/~mlearn/MLRepository.html.
  • 3Bay S D. UCI KDD archive [OL]. [2008-11-10]. http://kdd.ics.uci.edu/.
  • 4Crammer K, Dekel O, Shalev-Shwartz S, et al. Online passive-aggressive algorithms [C] //Thrun S, Saul L K, Scholkopf B, eds. Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press, 2006.
  • 5Kivinen J, Smola A J, Williamson R C. Online learning with kernels [J]. IEEE Trans on Signal Processing, 2004, 52(8):2165-2176.
  • 6Herbster M, Pontil M. Prediction on a graph with a perceptron [C] //Scholkopf B, Platt J C, Hoffman T, eds. Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press, 2007:577-584.
  • 7Cheng L, Vishwanathan S V N, Schuurmans D, et al. Implicit online learning with kernels [C]//Scholkopf B, Platt J C, Hoffman T, eds. Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press, 2007: 249-256.
  • 8McDonald R, Crammer K, Pereira F. Online large-margin training of dependency parsers [C] //Proc of the 43rd Annual Meeting of the Association for Computational Linguistics. Morristown, NJ: ACL Press, 2005: 91-98.
  • 9McDonald R. Discriminative sentence compression with soft syntactic constraints [C] //Proc of the 11th Conf of the European Chapter of the Association for Computational Linguistics. Morristown, NJ: ACL Press, 2006: 297-304.
  • 10Ciaramita M, Murdock V, Plachouras V. Online learning from click data for sponsored search [C] //Proc of the 17th Int Conf on World Wide Web. New York: ACM, 2008: 227-236.

Co-citing Literature (11)

Co-cited Literature (99)

  • 1Jia Ziyan, He Qing, Zhang Haijun, Li Jiayou, Shi Zhongzhi. An event detection and tracking algorithm based on a dynamic evolution model [J]. Journal of Computer Research and Development, 2004, 41(7): 1273-1280. (in Chinese) (Cited by: 58)
  • 2Yu Hongkui, Zhang Huaping, Liu Qun, Lü Xueqiang, Shi Shuicai. Chinese named entity recognition based on cascaded hidden Markov models [J]. Journal on Communications, 2006, 27(2): 87-94. (in Chinese) (Cited by: 153)
  • 3Song Dan, Wang Weidong, Chen Ying. Topic detection and tracking based on an improved vector space model [J]. Computer Technology and Development, 2006, 16(9): 62-64. (in Chinese) (Cited by: 23)
  • 4Pan S J, Yang Q. A survey on transfer learning [J]. IEEE Trans on Knowledge and Data Engineering, 2010, 22(10): 1345-1359.
  • 5Vapnik V. An overview of statistical learning theory [J]. IEEE Trans on Neural Networks, 1999, 10(5): 988-999.
  • 6Shi Y, Lan Z, Liu W, et al. Extending semi-supervised learning methods for inductive transfer learning [C] //Proc of the 9th IEEE Int Conf on Data Mining. Los Alamitos: IEEE Computer Society, 2009:483-492.
  • 7Burges C J C. A tutorial on support vector machines for pattern recognition [J]. Data Mining and Knowledge Discovery, 1998, 2(2): 121-167.
  • 8Dai W, Yang Q, Xue G, et al. Boosting for transfer learning [C] //Proc of the 24th Int Conf on Machine Learning. New York: ACM, 2007: 193-200.
  • 9Pan S J, Kwok J T, Yang Q. Transfer learning via dimensionality reduction [C] //Proc of AAAI. Menlo Park, CA: AAAI, 2008: 677-682.
  • 10Xie S, Fan W, Peng J, et al. Latent space domain transfer between high dimensional overlapping distributions [C] // Proc of the 18th Int Conf on World Wide Web. New York: ACM, 2009:91-100.

Citing Literature (11)

Secondary Citing Literature (47)
