Abstract
To address the shortcomings of the sequential information bottleneck (sIB) algorithm in document clustering, such as its tendency to become trapped in local optima and its low efficiency, an improved sequential document clustering algorithm based on simulated annealing, SA-isIB, is proposed. Following a reasonable annealing sequence, the algorithm randomly selects a certain proportion of documents from the initial clustering solution produced by the basic sIB algorithm, randomly modifies their cluster labels, and re-optimizes the solution. After the annealing process, it obtains document clustering results of higher accuracy than those of sIB. Experimental results on document datasets show that SA-isIB effectively improves the accuracy of the sIB algorithm for document clustering.
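The perturb-and-re-optimize loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the annealing sequence, the acceptance rule, or the objective, so the following are assumptions: the annealing sequence is a decreasing tuple of perturbation ratios (`ratios`), a perturbed solution is kept only if it improves the mutual information I(T;Y) between clusters and words, and `sequential_optimize` is a simplified brute-force stand-in for sIB. All function names are hypothetical.

```python
import math
import random


def mutual_information(docs, labels, k):
    """I(T;Y) between cluster labels T and words Y, from raw word counts."""
    n_words = len(docs[0])
    cluster_counts = [[0.0] * n_words for _ in range(k)]
    for doc, t in zip(docs, labels):
        for y, c in enumerate(doc):
            cluster_counts[t][y] += c
    total = sum(sum(row) for row in cluster_counts)
    word_totals = [sum(row[y] for row in cluster_counts) for y in range(n_words)]
    mi = 0.0
    for row in cluster_counts:
        p_t = sum(row) / total
        for y, c in enumerate(row):
            if c > 0:
                p_ty = c / total
                mi += p_ty * math.log(p_ty / (p_t * word_totals[y] / total))
    return mi


def sequential_optimize(docs, labels, k, max_sweeps=20):
    """Simplified stand-in for sIB (assumption): draw one document at a time
    and move it to the cluster that gives the highest I(T;Y)."""
    labels = list(labels)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(docs)):
            old = labels[i]
            best_t, best_score = old, mutual_information(docs, labels, k)
            for t in range(k):
                if t == old:
                    continue
                labels[i] = t
                score = mutual_information(docs, labels, k)
                if score > best_score + 1e-12:
                    best_t, best_score = t, score
            labels[i] = best_t
            changed = changed or best_t != old
        if not changed:
            break
    return labels


def sa_isib_sketch(docs, k, ratios=(0.4, 0.2, 0.1), seed=0):
    """Perturb a fraction of labels per annealing step, then re-optimize.
    The ratio schedule and keep-if-better rule are assumptions."""
    rng = random.Random(seed)
    initial = [rng.randrange(k) for _ in docs]
    best = sequential_optimize(docs, initial, k)
    best_score = mutual_information(docs, best, k)
    for ratio in ratios:
        candidate = list(best)
        n_perturb = max(1, int(ratio * len(docs)))
        for i in rng.sample(range(len(docs)), n_perturb):
            candidate[i] = rng.randrange(k)  # random relabeling
        candidate = sequential_optimize(docs, candidate, k)
        score = mutual_information(docs, candidate, k)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy word-count vectors for six documents over four words (made-up data).
docs = [[5, 4, 0, 0], [4, 5, 0, 0], [6, 3, 0, 0],
        [0, 0, 5, 4], [0, 0, 4, 6], [0, 0, 3, 5]]
labels, score = sa_isib_sketch(docs, k=2)
```

Because a candidate replaces the current solution only when it scores higher, the result of this sketch is never worse than the initial sequentially optimized solution. A real sIB implementation would use an incremental merge cost rather than recomputing I(T;Y) for every candidate move.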
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
EI
CSCD
Peking University Core Journals (北大核心)
2008, No. 3, pp. 417-423 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (Nos. 60674001, 60773048)
Keywords
Document Clustering
Information Bottleneck (IB) Theory
Simulated Annealing
Simulated Annealing-Iterative Sequential Information Bottleneck (SA-isIB) Algorithm