Abstract
This paper proposes a text-distance-oriented cluster reorganization strategy that operates independently of the clustering process. It introduces the concept of the nearest domain, elaborates the nearest domain rules, and designs a Gaussian-weighted nearest domain algorithm. A Gaussian function assigns each sample a weight according to its distance from the cluster center, and these weights are used to compute the distance between clusters. The text clustering result is then reorganized on the basis of the nearest domain weights: a split operator splits sparse clusters and reassigns abnormal samples, and a merge operator merges similar clusters. Experiments show that the reorganization mechanism effectively improves clustering accuracy and recall and increases cluster density, yielding a more reasonable clustering result.
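The Gaussian weighting step can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the weight `exp(-d^2 / (2 * sigma^2))`, the Euclidean distance, the `sigma` parameter, and the definition of inter-cluster distance as the weighted mean distance from the samples of one cluster to the center of the other. The function names are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def gaussian_weights(points, center, sigma=1.0):
    """Assumed weighting: samples closer to the center get weights nearer 1."""
    return [
        math.exp(-euclidean(p, center) ** 2 / (2 * sigma ** 2))
        for p in points
    ]


def weighted_cluster_distance(cluster_a, cluster_b, sigma=1.0):
    """Assumed inter-cluster distance: weighted mean distance from the
    samples of cluster_a to the center of cluster_b, so that samples far
    from their own center (likely outliers) contribute less."""
    center_a = centroid(cluster_a)
    center_b = centroid(cluster_b)
    weights = gaussian_weights(cluster_a, center_a, sigma)
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed to zero; fall back to center distance.
        return euclidean(center_a, center_b)
    return sum(
        w * euclidean(p, center_b) for w, p in zip(weights, cluster_a)
    ) / total
```

Under these assumptions, a merge operator could compare `weighted_cluster_distance` against a threshold to decide whether two clusters are similar enough to merge. The threshold and the exact criterion would have to come from the paper itself.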
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journal (北大核心)
2016, Issue 2, pp. 189-195 (7 pages)
Funding
National Natural Science Foundation of China (61362028)
Keywords
text clustering
cluster reorganization
nearest domain rule
Gaussian weighting