Improvement of Feature Weight Formula in Text Clustering
Cited by: 2
Abstract: This paper proposes ETFC, a new method for computing feature-term weights from term frequency and document frequency. First, a new exponential function is constructed to measure how well a feature term discriminates between categories, which strengthens the discriminating power of terms that appear in only a few documents. Then the k-means algorithm is used to run clustering experiments. The results show that ETFC achieves considerably higher clustering purity and algorithmic stability than the existing TFIDF and TFC weighting schemes, which indicates that the improvement strategy is feasible.
Source: Chinese Journal of Engineering Mathematics (《工程数学学报》), CSCD, Peking University Core Journal, 2012, No. 4, pp. 523-528 (6 pages)
Funding: Fundamental Research Funds for the Central Universities (xjj2009068)
Keywords: text clustering; feature selection; weight; k-means
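
The abstract compares ETFC against two baseline weighting schemes, TFIDF and TFC, using k-means and clustering purity. This record does not give the ETFC formula, so the sketch below covers only the baselines and the evaluation step. It assumes the commonly used definitions: TFIDF as tf × log(N / df), and TFC as the same weight scaled so each document vector has unit Euclidean length. The toy documents, class labels, seed indices, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def tfc_weights(docs):
    """TFC (assumed definition): TF-IDF scaled to unit Euclidean length per document."""
    weighted = []
    for vec in tfidf_weights(docs):
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in vec.items()})
    return weighted


def to_dense(vectors):
    """Turn sparse {term: weight} dicts into equal-length lists over a shared vocabulary."""
    vocab = sorted({t for v in vectors for t in v})
    return [[v.get(t, 0.0) for t in vocab] for v in vectors]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, seeds, iterations=20):
    """Plain Lloyd k-means; `seeds` are indices of points used as initial centroids."""
    centroids = [list(points[i]) for i in seeds]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of documents that fall in the majority true class of their cluster."""
    matched = 0
    for c in set(labels):
        classes = [t for lab, t in zip(labels, truth) if lab == c]
        matched += Counter(classes).most_common(1)[0][1]
    return matched / len(labels)


# Hypothetical toy corpus: two sports documents and two finance documents.
docs = [
    ["goal", "match", "team", "goal"],
    ["team", "coach", "match"],
    ["stock", "market", "price"],
    ["market", "price", "bank", "stock"],
]
truth = ["sports", "sports", "finance", "finance"]

points = to_dense(tfc_weights(docs))
labels = kmeans(points, seeds=[0, 2])
print(purity(labels, truth))  # prints 1.0
```

Under these assumptions, a replacement weighting scheme could be evaluated the same way by substituting another function for `tfc_weights` and comparing the resulting purity over several seed choices, which is one way to probe the stability the abstract mentions.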

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部