基于STC的中文文本聚类算法 (STC-based Chinese text clustering algorithm). Cited by: 2

An STC-based Chinese text clustering method


Abstract: This article proposes a document clustering method that clusters similar documents in a user's search results and presents them as a directory structure to help the user browse the results. It first analyzes existing text clustering methods and discusses their advantages and disadvantages. It then proposes a suffix-tree-based Chinese text clustering algorithm and describes in detail its principle and how the suffix tree is constructed and used. Finally, it discusses the key problems encountered while implementing the algorithm and the corresponding solutions.

Authors: Wang Guoqiang (王国强), Zheng Haiqing (郑海清), Niu Junyu (牛军钰)

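Affiliations: Shanghai Yangpu District Spare-Time University (上海市杨浦区业余大学); Department of Computer Science and Engineering, Fudan University (复旦大学计算机科学与工程系)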

Source: Journal of Shanghai Normal University (Natural Sciences) (《上海师范大学学报（自然科学版）》), 2006, No. 5, pp. 21-26 (6 pages)

Keywords: suffix tree clustering (后缀树); text clustering (文本聚类); text processing (文本处理)

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

Citation network

References (11)

1. MOTRO H. Infoseek CEO[R]. CNBC, May 7, 1998.
2. ZAMIR O E. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results[D]. PhD Thesis, University of Washington, 1999.
3. SALVADOR S, CHAN P. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms[A]. ICTAI[C]. 2004, 576-584.
4. VAN RIJSBERGEN C J. Information Retrieval[M]. London: Butterworths, 1979.
5. RICARDO BAEZA-YATES, BERTHIER RIBEIRO-NETO. Modern Information Retrieval[M]. Addison Wesley Longman, 2001.
6. MURTAGH F. A Survey of Recent Advances in Hierarchical Clustering Algorithms[J]. Computer Journal, 1983, 26(4): 354-359.
7. CALIFF M E, MOONEY R J. Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction[J]. Journal of Machine Learning Research, 2003, 4: 177-210.
8. HUYNH N, HON W, LAM T, SUNG W. Approximate string matching using compressed suffix arrays[A]. Proceedings of the 15th Symposium on Combinatorial Pattern Matching[C]. 2004, 157-169.
9. EHRENFEUCHT A, HAUSSLER D. A new distance metric on strings computable in linear time[M]. Discrete Applied Math, 1988, 40.
10. GUSFIELD D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology[M]. Cambridge University Press, 1997.

Co-cited documents (15)

1. Liu Quanfeng, Lu Bei, Wang Xiaohua. A comparative study of clustering algorithms in text mining[J]. Computer Era (计算机时代), 2005(6): 7-8. Cited by: 8
2. Guo Li, Zhang Ji, Tan Jianlong. Research and implementation of a real-time text classification system based on the suffix tree model[J]. Journal of Chinese Information Processing (中文信息学报), 2005, 19(5): 16-23. Cited by: 12
3. Su Jinshu, Zhang Bofeng, Xu Xin. Advances in machine-learning-based text classification techniques[J]. Journal of Software (软件学报), 2006, 17(9): 1848-1859. Cited by: 387
4. Li Jiangbo, Zhou Qiang, Chen Zushun. A study of fast lookup algorithms for Chinese dictionaries[J]. Journal of Chinese Information Processing (中文信息学报), 2006, 20(5): 31-39. Cited by: 25
5. Yang Jian-wu. Chinese web page clustering algorithm based on the suffix tree[D]. WUJNS, 2004.
6. Li Yanjun. High performance text document clustering[D]. UMI, 2007.
7. Zhang Hua-Ping, Liu Qun, Cheng Xue-Qi, et al. Chinese lexical analysis using hierarchical hidden Markov model[C]. Sapporo, Japan: Second SIGHAN Workshop Affiliated with 41st ACL, 2003.
8. Yang Jian-wu. A Chinese web page clustering algorithm based on the suffix tree[D]. WUJNS, 2004.
9. Li Yanjun. Text document clustering based on frequent word meaning sequences[J]. Data & Knowledge Engineering, 2008, 64: 381-404.
10. Doucet A, Ahonen-Myka H. Non-contiguous word sequences for information retrieval[C]. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004) Workshop on Multiword Expressions and Integrating Processing, 2004: 88-95.

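Citing documents (2)

1. Chen Shuang, Chen Fu, Du Tiancang. Design and implementation of a heuristic web information collection system[J]. Journal of Beijing Institute of Petrochemical Technology (北京石油化工学院学报), 2007, 15(4): 38-42.
2. Lin Qing, Yuan Xiaofeng, Wu Min. Research on clustering algorithms for Chinese Web documents[J]. Computer Engineering and Design (计算机工程与设计), 2009, 30(20): 4759-4761. Cited by: 3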

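Second-level citing documents (3)

1. Maimaitiyiming Hasimu (买买提依明·哈斯木), Weinila Mushajiang (维尼拉·木沙江). Research and implementation of a suffix-tree-based Uyghur web page clustering algorithm[J]. Computer Knowledge and Technology (电脑知识与技术), 2010, 6(9): 7072-7073.
2. Zou Zhihua, Tian Shengwei, Yu Long, Feng Guanjun. Improved suffix tree clustering for Uyghur Web text[J]. Journal of Chinese Information Processing (中文信息学报), 2013, 27(2): 118-126. Cited by: 1
3. Luo Shaoye. An improved STC algorithm based on user interest[J]. Journal of Jiangnan University (Natural Science Edition) (江南大学学报（自然科学版）), 2015, 14(1): 85-89.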

