Abstract
Web document clustering is an important research direction in Web mining. The frequent patterns produced by existing mining algorithms are high-dimensional and do not reflect the semantic information expressed by documents well. To obtain more accurate clustering results, this paper proposes a sentence-level maximum frequent word set mining method for extracting document feature terms. On this basis, documents are first clustered preliminarily, and clusters are then merged or split according to thresholds on inter-cluster distance and intra-cluster link strength, which yields the final document clustering. Throughout this process, a variable precision rough set model is used to compute the feature vector of each cluster. Experimental results show that the proposed algorithm outperforms traditional document clustering algorithms.
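The abstract does not spell out the mining procedure, so the following is only a minimal sketch of what sentence-level mining of maximum frequent word sets could look like. It assumes each sentence is treated as one transaction, support is the number of sentences containing a word set, and a frequent set is kept only if no frequent superset exists. The function name `maximal_frequent_word_sets`, the `min_support` parameter, and the Apriori-style level-wise search are illustrative assumptions, not the paper's algorithm.

```python
from itertools import combinations


def maximal_frequent_word_sets(sentences, min_support):
    """Return word sets that are frequent across sentences and have no frequent superset.

    sentences: list of word lists, one list per sentence (hypothetical input format).
    min_support: minimum number of sentences that must contain a word set.
    """
    transactions = [frozenset(s) for s in sentences]

    def support(itemset):
        # Number of sentences that contain every word of the set.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single words.
    words = {w for t in transactions for w in t}
    current = {frozenset([w]) for w in words if support(frozenset([w])) >= min_support}
    frequent = set(current)

    # Level-wise growth: join two k-word sets that differ in one word,
    # then keep only the (k+1)-word candidates that are still frequent.
    while current:
        candidates = {
            a | b for a, b in combinations(current, 2) if len(a | b) == len(a) + 1
        }
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current

    # Keep only sets that are not a proper subset of another frequent set.
    return {s for s in frequent if not any(s < other for other in frequent)}


example_sentences = [
    ["web", "document", "clustering"],
    ["web", "document", "mining"],
    ["rough", "set", "model"],
    ["web", "document", "rough", "set"],
]
result = maximal_frequent_word_sets(example_sentences, min_support=2)
```

In this toy input, `result` holds the two word sets {web, document} and {rough, set}. Using such sets as feature terms instead of all frequent words is one way to reduce dimensionality, which is the motivation the abstract gives.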
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2007, Issue 7, pp. 154-157, 164 (5 pages)
Keywords
Web document clustering, Rough set, Association rules, Maximum frequent word set