摘要
随着信息检索技术的迅猛发展,针对检索系统的改进已逐渐成为研究的热点.聚类是一种有效的改进策略,通过对检索结果进行聚类,可以使用户快速地定位到自己感兴趣的检索信息所在的类别.然而,传统的检索聚类算法要么运行效率低下,要么类别划分能力不强,使它们无法真正地用于检索系统中.针对此问题,提出了一种新颖的检索聚类算法,该算法首先通过极大极小值理论从检索返回的文档集中抽取多个聚点,并依此形成初始文档类划分结果.在此基础上,算法对初始文档类的特征集合进行细化调整以使类别的划分更加精确;同时对不满足收敛条件的文档类进行层次分裂以解决信息的分层描述问题.实验表明:此算法的时间复杂度与现有的检索聚类技术相差不多,并且由于对特征集合进行迭代调整使得类别的划分更加准确合理.
Due to the fast advance of information retrieval technique, information overload has become a headache problem to Internet users. In order to alleviate user's inconvenience to distinguish useful information from massive junk information, the research for improving retrieval system has gradually become hotter and hotter. Up to now, many techniques have been proposed for automatically categorizing and organizing Web information for users. Among them, clustering is one of the most extensively employed tools. Through clustering retrieval information, Internet users can quickly find out where their interesting retrieval results locate. Unfortunately, traditional clustering algorithms are either ineffective or inefficient for this task. As a result, a novel algorithm specially designed for clustering retrieval information is proposed. This algorithm applies maximum-minimum principle to extract accumulation points to form initial clusters at first. Experiment results show that, this initial cluster partitioning is approximate to the optimal partitioning and only needs small iterative adjustment steps to get convergence. After that, it iteratively adjusts feature set of each cluster to let cluster partitioning more and more precise. Simultaneously, it hierarchically separates the clusters, which don^t meet convergence condition, into some sub-clusters to possess the merit of hierarchically representing information. Experiment results also demonstrate that time complexity of this algorithm is close to the recent techniques for clustering retrieval information. Besides, because of iteratively adjusting feature sets, it enables clustering results to be more precise and reasonable.
出处
《计算机研究与发展》
EI
CSCD
北大核心
2013年第7期1452-1463,共12页
Journal of Computer Research and Development
基金
国家自然科学基金面上项目(61073127)
中央高校基本科研业务费专项基金项目(HIT.NSRIF.2013066)
中国博士后科学基金面上项目(2013M530156)
教育部-微软语言语音重点实验室开放基金项目
关键词
信息检索聚类技术
权值调整
极大极小理论
快速聚类
自组织映射
clustering technique for information retrieval
weight adjustment
maximum-minimum principle
fast clustering self-organizing-mapping (SOM)