Approach to Evaluate the Imbalanced Degree of the Training Datasets Using Community Structure (基于网络社区结构的训练集非均衡程度度量方法)

Abstract: In practical applications of machine learning and data mining, training sets for classification are usually chosen so that the number of samples in each class is as "balanced" as possible, on the assumption that class size is the only factor in class imbalance. Drawing on empirical studies of how imbalanced training sets affect classifier learning, this paper proposes two factors that determine the degree of training-set imbalance: equality and cohesion. Equality describes between-class imbalance, that is, how unevenly the samples are divided among the classes. Cohesion describes within-class imbalance, that is, the degree of linear correlation in the spatial distribution of the classes in the training set. The correlations between training samples are used to build a network over the training set, and the network's community structure, a topological indicator that reflects the cohesion of the training set, serves as the measure. On this basis the paper proposes a method for measuring training-set imbalance based on network community (modular) structure and argues that high equality and high cohesion are the key factors in selecting a "good" classification training set. Experiments on standard UCI datasets support the method's validity as a criterion for selecting training sets.
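The two factors named in the abstract can be illustrated with a small sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: equality is approximated by the normalized Shannon entropy of the class counts, the training-set network links two samples when their Pearson correlation reaches a hypothetical `threshold`, and cohesion is approximated by Newman's modularity Q of the partition induced by the class labels. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def equality(labels):
    """Assumed equality measure: normalized entropy of class counts (1.0 = balanced)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0  # a single class carries no between-class balance information
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


def pearson(x, y):
    """Pearson correlation of two equal-length feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: correlation undefined, treat as unrelated
    return sxy / math.sqrt(sxx * syy)


def correlation_graph(samples, threshold=0.9):
    """Assumed network construction: link samples whose correlation >= threshold."""
    edges = set()
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if pearson(samples[i], samples[j]) >= threshold:
                edges.add((i, j))
    return edges


def modularity(edges, labels):
    """Newman modularity Q, treating each class label as one community."""
    m = len(edges)
    if m == 0:
        return 0.0
    intra = Counter()    # edges with both endpoints in the same class
    degree = Counter()   # total degree accumulated per class
    for i, j in edges:
        degree[labels[i]] += 1
        degree[labels[j]] += 1
        if labels[i] == labels[j]:
            intra[labels[i]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in set(labels))


# Toy data (not from the paper): two classes with opposite feature trends.
samples = [[1, 2, 3], [2, 4, 6.1], [1, 2, 2.9],
           [3, 2, 1], [6, 4, 2.2], [3, 2.1, 1]]
labels = ["a", "a", "a", "b", "b", "b"]
edges = correlation_graph(samples)
print(equality(labels), modularity(edges, labels))
```

Under these assumptions, a training set whose classes are similar in size and form well-separated communities in the correlation network scores high on both measures. A set dominated by one class, or one whose classes blur into each other in the network, scores lower on one or both.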

Authors: Yue Xun (岳训), Chi Zhongxian (迟忠先), Ge Pingju (葛平俱), Mo Hongwei (莫宏伟), Hao Yanyou (郝艳友)

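Affiliations: Department of Computer Science and Engineering, Dalian University of Technology; College of Information Science and Engineering, Shandong Agricultural University; College of Automation, Harbin Engineering University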

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 8, pp. 1427-1433 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (60305007)

Keywords: class imbalance problem; complex network; community structure; equality; cohesion

Classification code: TP314 [Automation and Computer Technology: Computer Software and Theory]
