Abstract
This paper proposes a new weighted suffix tree clustering method, WSTC, that exploits the structure and characteristics of Web documents. First, each Web document is divided, according to its HTML tags, into segments with different levels of importance; each segment is split into sentences, and each sentence into words. Second, the suffix tree is built from sentences rather than whole documents, and each segment's importance level is stored in the tree nodes as a structure weight, yielding a weighted suffix tree model of the document set. Finally, base clusters are selected and merged using the document count, sentence count, phrase length, and structure weight of each node. Simulation experiments indicate that WSTC achieves better clustering results than the traditional STC algorithm.
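The three steps in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract gives neither the weight levels nor the scoring formula, so the `TAG_WEIGHTS` table, the `score` function, and all function names here are assumptions made for illustration. For brevity, the sketch enumerates word n-grams of each sentence instead of building a real suffix tree; each n-gram stands in for one tree node and carries the same four statistics. The merging of base clusters is omitted.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag-to-weight table; the abstract does not list the real levels.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5}
DEFAULT_WEIGHT = 1.0


class SegmentParser(HTMLParser):
    """Collects (text, weight) segments, weighting text by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # tags currently open
        self.segments = []  # list of (text, weight)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                         default=DEFAULT_WEIGHT)
            self.segments.append((text, weight))


def weighted_sentences(html):
    """Step 1: segments -> sentences -> words, each sentence keeping its weight."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    result = []
    for text, weight in parser.segments:
        for sentence in re.split(r"[.!?;]+", text):
            words = re.findall(r"\w+", sentence.lower())
            if words:
                result.append((words, weight))
    return result


def build_nodes(docs, max_len=4):
    """Step 2 (naive stand-in for the weighted suffix tree): phrase -> node stats."""
    nodes = defaultdict(lambda: {"docs": set(), "sentences": 0, "weight": 0.0})
    for doc_id, html in docs.items():
        for words, weight in weighted_sentences(html):
            phrases = set()
            for i in range(len(words)):
                for j in range(i + 1, min(i + max_len, len(words)) + 1):
                    phrases.add(tuple(words[i:j]))
            for phrase in phrases:  # count each phrase once per sentence
                node = nodes[phrase]
                node["docs"].add(doc_id)
                node["sentences"] += 1
                node["weight"] += weight
    return nodes


def score(phrase, node):
    """Assumed score: document count * capped phrase length * mean structure weight."""
    mean_weight = node["weight"] / node["sentences"]
    return len(node["docs"]) * min(len(phrase), 6) * mean_weight


def select_base_clusters(docs, min_docs=2, top_k=5):
    """Step 3 (selection only): rank phrases shared by at least min_docs documents."""
    nodes = build_nodes(docs)
    ranked = sorted(((score(p, n), p) for p, n in nodes.items()
                     if len(n["docs"]) >= min_docs), reverse=True)
    return ranked[:top_k]
```

Calling `select_base_clusters` on a dictionary that maps document IDs to HTML strings returns `(score, phrase)` pairs, highest score first. Under the assumed weights, a phrase that occurs in titles and headings outranks an equally frequent phrase that occurs only in body text.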
Source
《系统仿真学报》
CAS
CSCD
Peking University Core Journals (北大核心)
2011, Issue 3, pp. 474-479 (6 pages)
Journal of System Simulation
Funding
National Key Technology R&D Program of China (2007BAH08B04)
Chongqing Science and Technology Support Program (2008AC20084)
Keywords
suffix tree
suffix tree clustering
web document clustering
web document structure
weight computing