Abstract
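The FIHC text clustering algorithm clusters documents by frequent word sets and does not consider latent semantic relationships among words. To address this shortcoming, this paper improves FIHC by incorporating a HowNet-based semantic similarity measure into FIHC's Score function, which mitigates the limitation of a Score function that relies purely on the vector space model. Experiments show that the improved FIHC algorithm noticeably improves clustering quality.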
Because FIHC clusters documents based on frequent itemsets, it drastically reduces the dimensionality of the document set. However, because it does not consider potential semantic relationships among words, its clustering precision cannot be improved further. In this paper, we incorporate a HowNet-based word semantic similarity computation into the Score function of FIHC, which remedies this defect of the Score function. Experimental results show that the improved FIHC achieves better cluster quality.
Source
《山西大同大学学报(自然科学版)》
2014, No. 1, pp. 4-7 (4 pages)
Journal of Shanxi Datong University (Natural Science Edition)
Funding
Shanxi Province Science and Technology Infrastructure Platform Project [2011091002-0102]
Shanxi Datong University Youth Research Fund Project [2010Q13]