Abstract
Short-text clustering has long been an active topic in information extraction. Large-scale short-text data exhibit a "long tail phenomenon": when traditional algorithms cluster such data, they face high feature dimensionality and lose the information carried by small classes. To address these problems, this paper proposes a Frequent Itemsets collaborative Pruning iteration Clustering framework (FIPC). The algorithm combines an iterative clustering framework with the K-medoids algorithm and applies a collaborative pruning strategy to cluster small-class texts. Experimental results show that the algorithm effectively improves clustering accuracy for small-class short texts and avoids overlapping clusters.
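The abstract names K-medoids as one building block but gives no implementation details. As a rough illustration only, the sketch below runs a plain K-medoids loop over short texts represented as token sets. It is not the FIPC algorithm: frequent itemset mining and collaborative pruning are not shown, and the Jaccard distance, the function names, the fixed initial medoids, and the toy documents are all assumptions made for this example.

```python
from typing import List, Sequence, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def k_medoids(docs: Sequence[Set[str]], init_medoids: Sequence[int],
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Alternate assignment and medoid update until the medoids stop changing.

    Returns (medoid indices, cluster label per document).
    """
    n = len(docs)
    # Precompute all pairwise distances (fine for a small illustration).
    dist = [[jaccard_distance(docs[i], docs[j]) for j in range(n)]
            for i in range(n)]
    medoids = list(init_medoids)
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each document joins its nearest medoid.
        labels = [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep medoid of an empty cluster
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical toy corpus: three travel texts and two agriculture texts.
docs = [
    {"cheap", "flight", "ticket"},
    {"flight", "ticket", "deal"},
    {"cheap", "flight", "deal"},
    {"rice", "pest", "control"},
    {"rice", "pest", "spray"},
]
medoids, labels = k_medoids(docs, init_medoids=[0, 3])
```

Because a medoid is always an actual document, each cluster can be summarized by a real short text, which is one common reason to prefer K-medoids over K-means for sparse text data. The initial medoids are passed in explicitly here to keep the run deterministic.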
Authors
宋中山
张广凯
尹帆
帖军
SONG Zhong-Shan; ZHANG Guang-Kai; YIN Fan; TIE Jun (School of Computer Science, South-Central University for Nationalities, Wuhan 430074, China)
Source
《计算机系统应用》
2019, No. 4, pp. 139-144 (6 pages)
Computer Systems & Applications
Funding
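Sub-project of the National Key Technology R&D Program of China (2015BAD29B01)
Soft Science Research Project of the Ministry of Agriculture (D201721)
Fundamental Research Funds for the Central Universities (CZY18016)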
Keywords
text clustering
long tail phenomenon
frequent pattern
K-medoids algorithm