Abstract
High-frequency features reflect a document's topic only partially, so clustering on them does not yield high-quality results. To address this, and to obtain an accurate description of the information each cluster covers, lexical chains are used to represent the topic information of each document, and documents are clustered according to the similarity between their lexical chains. After clustering, lexical chains that belong to the same cluster and describe the same topic information are combined. By analyzing how the topic information described by each lexical chain is distributed across clusters, keyword sets that fully reflect the topic of each cluster are extracted. Experimental results show that this method clusters better than clustering on high-frequency features and that, because the distribution of topic information within each cluster is analyzed, the extracted keyword sets capture the information each cluster emphasizes.
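The clustering step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the similarity measure or the linkage rule, so the Jaccard overlap in `chain_sim`, the best-match averaging in `doc_sim`, the average-link merging in `cluster`, and the `threshold` value are all hypothetical choices. The paper's keywords suggest that a HowNet-based similarity is used instead of plain word overlap.

```python
from itertools import combinations

def chain_sim(a, b):
    """Jaccard overlap between two lexical chains (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def doc_sim(chains_a, chains_b):
    """Symmetric average of each chain's best match in the other document."""
    if not chains_a or not chains_b:
        return 0.0
    best_a = [max(chain_sim(c, d) for d in chains_b) for c in chains_a]
    best_b = [max(chain_sim(c, d) for d in chains_a) for c in chains_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))

def cluster(docs, threshold=0.3):
    """Average-link agglomerative clustering over document indices.

    Merging stops when the closest pair of clusters falls below
    `threshold` (a hypothetical stopping rule).
    """
    clusters = [[i] for i in range(len(docs))]

    def link(x, y):
        total = sum(doc_sim(docs[i], docs[j]) for i in x for j in y)
        return total / (len(x) * len(y))

    while len(clusters) > 1:
        # combinations yields pairs with p < q, so deleting q is safe.
        p, q = max(combinations(range(len(clusters)), 2),
                   key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]))
        if link(clusters[p], clusters[q]) < threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Toy input: each document is a list of lexical chains (sets of words).
docs = [
    [{"stock", "market", "share"}, {"bank", "loan"}],
    [{"stock", "market", "trade"}, {"bank", "credit"}],
    [{"football", "goal", "match"}],
]
print(cluster(docs))  # → [[0, 1], [2]]
```

In this toy input the two finance documents share chain vocabulary and merge, while the sports document stays apart. Combining chains within a cluster and extracting keywords from their distribution across clusters would follow as further steps on top of this grouping.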
Source
Journal of Harbin Institute of Technology
EI
CAS
CSCD
PKU Core
2009, No. 3, pp. 53-57 (5 pages)
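Funding
Key Program of the National Natural Science Foundation of China (60435020)
National High Technology Research and Development Program of China (2006AA01Z197, 2007AA01Z172)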
Keywords
HowNet
combination of lexical chains
topic-based hierarchical clustering