Text feature selection method based on information entropy and dynamic clustering (cited by: 3)

Abstract: Based on the structural characteristics of scientific literature, a four-layer mining model is built and a text feature selection method for scientific literature classification is proposed. The method first divides a scientific document into four layers according to its structure, then extracts feature words layer by layer from the first three layers using K-means clustering, and finally uses the Apriori algorithm to find the maximal frequent itemsets of the fourth layer, which serve as that layer's feature word set. Because K-means is sensitive to the choice of initial centers, the method first weights the clustering objects by information entropy to correct the distance function between objects, and then uses the weighting function values of the initial clustering to select more suitable initial cluster centers. In addition, a standard value is set for the K-means termination condition to reduce the number of iterations and hence the learning time, and redundant information produced by dynamic changes in the data is deleted to reduce interference during dynamic clustering, so that clustering becomes more accurate and more efficient. These measures allow the method to find feature words in a literature corpus more accurately than previous methods, and it is particularly well suited to scientific literature. Experimental results show that, when the data volume is large, the method combined with the improved K-means algorithm achieves high performance in scientific literature classification.
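The abstract does not give the weighting formula, the rule for choosing initial centers, or the termination value, so the following is only a minimal sketch under stated assumptions: weights come from the standard entropy weight method applied to feature columns, seeding uses a simple farthest-first heuristic as a stand-in for the paper's selection rule, and the loop stops once the drop in the weighted cost falls below a hypothetical threshold `tol`. All function names are illustrative and are not taken from the paper.

```python
import math

def entropy_weights(rows):
    """Entropy weight method (assumed): one weight per non-negative feature column."""
    n, m = len(rows), len(rows[0])  # assumes at least two rows
    diversities = []
    for j in range(m):
        total = sum(r[j] for r in rows)
        entropy = 1.0  # an all-zero column is treated as carrying no information
        if total > 0:
            entropy = 0.0
            for r in rows:
                p = r[j] / total
                if p > 0:
                    entropy -= p * math.log(p)
            entropy /= math.log(n)  # normalise to [0, 1]
        diversities.append(max(0.0, 1.0 - entropy))
    s = sum(diversities)
    if s == 0:
        return [1.0 / m] * m  # fall back to uniform weights
    return [d / s for d in diversities]

def weighted_distance(a, b, w):
    """Euclidean distance with each squared difference scaled by its feature weight."""
    return math.sqrt(sum(wj * (x - y) ** 2 for wj, x, y in zip(w, a, b)))

def pick_initial_centers(rows, k, w):
    """Farthest-first seeding: a stand-in heuristic, not the paper's rule."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(
            rows,
            key=lambda r: min(weighted_distance(r, c, w) for c in centers),
        ))
    return [list(c) for c in centers]

def weighted_kmeans(rows, k, tol=1e-4, max_iter=100):
    """K-means under the entropy-weighted distance, stopping early on small cost change."""
    w = entropy_weights(rows)
    centers = pick_initial_centers(rows, k, w)
    prev_cost = float("inf")
    for _ in range(max_iter):
        # Assignment step: nearest center under the weighted distance.
        labels = [
            min(range(k), key=lambda c: weighted_distance(r, centers[c], w))
            for r in rows
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        cost = sum(weighted_distance(r, centers[l], w) ** 2
                   for r, l in zip(rows, labels))
        if prev_cost - cost < tol:  # termination threshold (assumed form)
            break
        prev_cost = cost
    return labels, centers, w
```

Calling `weighted_kmeans(rows, k=2)` on a small list of term-frequency rows returns the cluster label of each row, the final centers, and the feature weights. A column that is identical across all rows receives zero weight, so it no longer influences the distance.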

Author: Tang Lili (唐立力)

Affiliation: Rongzhi College of Chongqing Technology and Business University (重庆工商大学融智学院)

Source: Computer Engineering and Applications (《计算机工程与应用》), 2015, No. 19, pp. 152-157 (6 pages). Indexed in CSCD and the Peking University Core Journals list.

Keywords: K-means algorithm; dynamic clustering; feature selection; information entropy

Classification number: TP301 [Automation and Computer Technology: Computer System Architecture]
