
Research on Web Document Clustering Based on Sentential Maximum Frequent Word Sets
Abstract: Web document clustering is an important research direction in Web mining. The frequent patterns produced by existing mining algorithms are high-dimensional and do not reflect the semantic information expressed by a document well. To obtain more accurate clustering results, this paper proposes a sentence-level maximum frequent word set mining method for extracting document feature terms. On this basis, documents are first clustered preliminarily; clusters are then merged or split according to thresholds on inter-cluster distance and intra-cluster link strength, which yields the final clustering. Throughout this process, the variable precision rough set model is used to compute the feature vector of each cluster. Experimental results indicate that the proposed algorithm outperforms traditional document clustering algorithms.
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2007, No. 7, pp. 154-157, 164 (5 pages)
Keywords: Web document clustering, rough set, association rules, maximum frequent word set
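The abstract does not spell out the mining procedure, so the following is only a minimal sketch of what sentence-level frequent word set mining could look like, not the authors' algorithm. It assumes that each sentence is one transaction, that support is the number of sentences containing a word set, and that "maximum" means a frequent set with no frequent proper superset. The function name `maximal_frequent_word_sets`, the whitespace tokenizer, and the sample sentences are all hypothetical.

```python
from itertools import combinations


def maximal_frequent_word_sets(sentences, min_support):
    """Return the maximal frequent word sets over sentence-level transactions.

    sentences: list of strings, each treated as one transaction.
    min_support: minimum number of sentences that must contain a word set.
    """
    # Assumption: plain whitespace tokenization, no stemming or stop words.
    transactions = [frozenset(s.lower().split()) for s in sentences]

    def support(itemset):
        # Number of sentences that contain every word of the set.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single words.
    vocabulary = set().union(*transactions)
    current = {frozenset([w]) for w in vocabulary
               if support(frozenset([w])) >= min_support}
    frequent = set(current)

    # Level-wise (Apriori-style) growth from size k to size k + 1.
    while current:
        size = len(next(iter(current))) + 1
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Apriori pruning: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current

    # Keep only the sets that have no frequent proper superset.
    return {f for f in frequent if not any(f < g for g in frequent)}


# Hypothetical sample data: four short "sentences".
sentences = [
    "web document clustering",
    "web document mining",
    "rough set model",
    "web document clustering model",
]
result = maximal_frequent_word_sets(sentences, min_support=2)
```

Restricting transactions to sentences rather than whole documents keeps co-occurring words close together in the text, which is one plausible reading of how the method captures more semantic information with fewer feature dimensions.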
