
Improvement on FTC text clustering algorithm combined with semantics (cited by: 5)
Abstract: The FTC text clustering algorithm ignores semantic relations between words and produces only a hard partition. To address both shortcomings, an improved FTC text clustering algorithm combined with semantics, called SFTC, is proposed. SFTC uses HowNet to map the keyword set of each document to a concept set, then applies the FP-Growth algorithm to mine frequent itemsets at the concept level; the cover of each frequent itemset is treated as a candidate cluster. Because a document may cover several topics, a formula for the similarity between clusters is defined, and while the final clusters are generated this similarity determines whether two clusters should overlap. This gives a soft partition of the documents to a certain degree. Experimental results show that SFTC achieves higher clustering accuracy and better running efficiency.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2014, No. 2, pp. 515-519 (5 pages)
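Funding: Shanxi Province Science and Technology Basic Conditions Platform Fund Project (2011091002-0102); Shanxi Datong University Youth Research Fund Project (2010Q13)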
Keywords: text clustering; frequent term set; HowNet; cluster similarity; soft partitioning
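
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions: `CONCEPT_OF` is a hypothetical hand-written table standing in for a HowNet lookup, a brute-force enumeration replaces FP-Growth, Jaccard similarity replaces the paper's cluster similarity formula (which the abstract does not give), and the overlap rule in `select_clusters` is only one plausible reading of "similarity decides whether clusters overlap".

```python
from itertools import combinations

# Hypothetical keyword-to-concept table; a stand-in for a real HowNet lookup.
CONCEPT_OF = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "engine": "machine", "motor": "machine",
    "stock": "finance", "share": "finance", "bank": "finance",
}

def to_concepts(keywords):
    """Map keywords to a concept set; unknown words map to themselves."""
    return frozenset(CONCEPT_OF.get(w, w) for w in keywords)

def frequent_concept_sets(docs, min_support, max_size=2):
    """Brute-force frequent itemset search (assumed stand-in for FP-Growth).

    Returns {concept_set: cover}, where the cover is the set of document
    ids whose concept set contains every concept in concept_set.
    """
    vocab = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for items in combinations(vocab, size):
            itemset = frozenset(items)
            cover = frozenset(d for d, c in docs.items() if itemset <= c)
            if len(cover) >= min_support:
                result[itemset] = cover
    return result

def jaccard(a, b):
    """Assumed cluster similarity: shared documents over all documents."""
    return len(a & b) / len(a | b)

def select_clusters(candidates, threshold):
    """Greedily pick clusters, largest cover first (assumed ordering).

    Assumed overlap rule: documents shared with an already selected cluster
    stay in both clusters only if the similarity reaches the threshold;
    otherwise they are removed from the new cluster (hard partition).
    """
    ordered = sorted(candidates.items(),
                     key=lambda kv: (-len(kv[1]), sorted(kv[0])))
    selected, covered = [], set()
    for itemset, cover in ordered:
        cluster = set(cover)
        for _, members in selected:
            shared = cluster & members
            if shared and jaccard(cover, members) < threshold:
                cluster -= shared
        if cluster - covered:  # keep only clusters that add a new document
            selected.append((itemset, cluster))
            covered |= cluster
    return selected

# Toy corpus: document id -> extracted keywords.
raw = {
    "d1": ["car", "engine"],
    "d2": ["automobile", "motor"],
    "d3": ["truck", "stock"],
    "d4": ["bank", "share"],
}
docs = {doc_id: to_concepts(words) for doc_id, words in raw.items()}
candidates = frequent_concept_sets(docs, min_support=2)

for label, clusters in (("hard", select_clusters(candidates, 0.5)),
                        ("soft", select_clusters(candidates, 0.2))):
    for itemset, members in clusters:
        print(label, sorted(itemset), sorted(members))
```

On this toy corpus, "car", "automobile" and "truck" collapse into the single concept "vehicle", so d1, d2 and d3 share a frequent concept even though they share no keyword. With the higher threshold d3 stays only in the "vehicle" cluster, while the lower threshold lets it belong to both "vehicle" and "finance", which is the kind of soft partition the abstract describes.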









