
A Clustering Algorithm for Web Documents Based on Theme (cited by: 1)
Abstract: This paper presents HTBC, a theme-based clustering method for Web documents. HTBC first extracts a theme-word vector for each document from its title and body. It then generates word clusters from a training document set and assigns each theme-word vector to the word cluster it belongs to. Documents whose theme-word vectors fall into the same word cluster are merged into one class, which is named after that word cluster. The algorithm has four steps: preprocessing, building theme vectors, generating word clusters, and theme clustering. HTBC is compared with the STC, AHC, and KMC (K-Means) algorithms in terms of clustering precision and recall; the experimental results show that HTBC achieves better precision than STC, AHC, and KMC.
Author: Yuan Xiaofeng (袁晓峰)
Source: Journal of Chengdu University (Natural Science Edition), 2010, No. 3, pp. 249-252 (4 pages)
Keywords: HTBC algorithm; Web document clustering; theme; search engine; mutual information
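
The four steps named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the weighting scheme, the word-clustering criterion, or the naming rule, so everything below is an assumption made for illustration. In particular, the title boost (`title_weight`), the use of document-level pointwise mutual information with a threshold (`min_pmi`, suggested only by the keyword "mutual information"), the union-find grouping, and naming each word cluster after its alphabetically smallest word are all hypothetical choices, as are the function names.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

# Tiny illustrative stopword list (assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "to", "is", "with"}


def preprocess(text):
    """Step 1 (preprocessing): lowercase, tokenise, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def theme_vector(title, body, top_k=3, title_weight=3.0):
    """Step 2 (theme vector): count terms, boost title terms, keep the top_k."""
    scores = Counter()
    for tok in preprocess(body):
        scores[tok] += 1.0
    for tok in preprocess(title):
        scores[tok] += title_weight  # assumed boost for title words
    # Sort by descending score, then alphabetically, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def word_clusters(training_docs, min_pmi=0.5):
    """Step 3 (word clustering): link words whose document-level PMI exceeds min_pmi.

    Returns a dict mapping each word to the name of its cluster.
    """
    doc_sets = [set(preprocess(d)) for d in training_docs]
    n = len(doc_sets)
    df = Counter(w for s in doc_sets for w in s)  # document frequency
    co = Counter()                                # co-occurrence counts
    for s in doc_sets:
        for a, b in combinations(sorted(s), 2):
            co[(a, b)] += 1

    parent = {w: w for w in df}  # union-find forest

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for (a, b), c in co.items():
        pmi = math.log((c * n) / (df[a] * df[b]))
        if pmi > min_pmi:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the alphabetically smallest word as the cluster name.
                parent[max(ra, rb)] = min(ra, rb)
    return {w: find(w) for w in df}


def theme_clustering(docs, word_to_cluster):
    """Step 4 (theme clustering): put each document in the word cluster that
    most of its theme words belong to; documents with no match are set aside."""
    clusters = defaultdict(list)
    for doc_id, (title, body) in docs.items():
        votes = Counter(
            word_to_cluster[t] for t in theme_vector(title, body) if t in word_to_cluster
        )
        label = min(votes, key=lambda k: (-votes[k], k)) if votes else "unclustered"
        clusters[label].append(doc_id)
    return dict(clusters)
```

To use the sketch, call `word_clusters` on a list of training texts, then pass the result together with a `{doc_id: (title, body)}` dictionary to `theme_clustering`. The returned dictionary maps each cluster name to the list of document ids assigned to it.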

