An STC-Based Chinese Text Clustering Method

Cited by: 2
Abstract: This article proposes a document clustering method that groups similar documents in a user's search results and presents them as a directory structure, helping the user browse the results. It first analyzes existing text clustering methods and discusses their advantages and disadvantages. It then proposes a suffix-tree-based Chinese text clustering algorithm and describes in detail the principle of the algorithm and how the tree is constructed and used. Finally, it discusses the key problems encountered while implementing the algorithm and the corresponding solutions.
Source: Journal of Shanghai Normal University (Natural Sciences), 2006, No. 5, pp. 21-26 (6 pages)
Keywords: suffix tree clustering; text clustering; text processing