Abstract
FTC (frequent term-based clustering) ignores the semantic relations between words and produces only a hard partition of the documents. To address both shortcomings, an improved FTC text clustering algorithm that incorporates semantics, called SFTC, is proposed. First, SFTC uses HowNet to map the keyword set of each document to a concept set, then applies FP-Growth to mine frequent term sets at the concept level; the cover of each frequent term set is taken as a candidate cluster. Second, because a text can address several topics, a formula for the similarity between clusters is defined. While the final clusters are generated, this similarity is measured to decide whether two clusters should overlap, which yields a soft partition to a certain degree. Experimental results show that SFTC achieves higher clustering accuracy and better running efficiency.
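The pipeline outlined in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the `CONCEPTS` dictionary is a hypothetical stand-in for a HowNet lookup, frequent concept sets are found by brute-force enumeration rather than FP-Growth, Jaccard similarity replaces the paper's cluster similarity formula (which the abstract does not give), and the greedy "largest cover first" selection with a `threshold` parameter is a simplified guess at how overlap might be decided.

```python
from itertools import combinations

# Hypothetical keyword -> concept table; a real system would query HowNet.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "engine": "machine", "motor": "machine",
    "bank": "finance", "loan": "finance", "credit": "finance",
}


def to_concepts(keywords):
    """Map a document's keywords to a concept set (unknown words are kept)."""
    return frozenset(CONCEPTS.get(word, word) for word in keywords)


def frequent_sets(docs, min_support, max_size=2):
    """Return {concept set: cover}, where cover is the set of document
    indices containing every concept in the set.
    Brute-force enumeration, used here in place of FP-Growth."""
    items = sorted(set().union(*docs))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            cover = frozenset(
                i for i, doc in enumerate(docs) if doc.issuperset(combo)
            )
            if len(cover) >= min_support:
                result[frozenset(combo)] = cover
    return result


def jaccard(a, b):
    """Assumed cluster similarity: shared documents / all documents."""
    return len(a & b) / len(a | b)


def select_clusters(candidates, n_docs, threshold):
    """Greedily pick candidate clusters until every document is covered.
    Assumption: if a candidate is similar enough to an already chosen
    cluster, the shared documents stay in both (soft partition);
    otherwise they are removed from the newer cluster (hard partition)."""
    chosen = []
    uncovered = set(range(n_docs))
    # Largest covers first; ties broken by concept names for determinism.
    order = sorted(candidates.items(),
                   key=lambda kv: (-len(kv[1]), sorted(kv[0])))
    for concepts, cover in order:
        if not uncovered:
            break
        if not (cover & uncovered):
            continue  # this candidate adds no new document
        members = set(cover)
        for _, other in chosen:
            if jaccard(cover, other) < threshold:
                members -= other
        chosen.append((concepts, members))
        uncovered -= members
    return chosen


keyword_sets = [
    ["car", "engine"], ["automobile", "motor"], ["truck", "loan"],
    ["bank", "credit"], ["loan", "bank"],
]
docs = [to_concepts(k) for k in keyword_sets]
candidates = frequent_sets(docs, min_support=2)
for concepts, members in select_clusters(candidates, len(docs), threshold=0.15):
    print(sorted(concepts), sorted(members))
```

In this toy data, "car", "automobile" and "truck" share no surface term but fall into one candidate cluster once mapped to the same concept, which is the effect the concept-level mining is meant to achieve. Raising `threshold` makes overlap rarer and pushes the result toward a hard partition.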
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journal (北大核心)
2014, Issue 2, pp. 515-519 (5 pages)
Funding
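Shanxi Provincial Science and Technology Basic Condition Platform Fund Project (2011091002-0102)
Shanxi Datong University Youth Research Fund Project (2010Q13)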
Keywords
text clustering
frequent term set
HowNet
cluster similarity
soft partition