期刊文献+

面向信息检索的快速聚类算法 被引量:12

A Fast Clustering Algorithm for Information Retrieval
下载PDF
导出
摘要 随着信息检索技术的迅猛发展,针对检索系统的改进已逐渐成为研究的热点.聚类是一种有效的改进策略,通过对检索结果进行聚类,可以使用户快速地定位到自己感兴趣的检索信息所在的类别.然而,传统的检索聚类算法要么运行效率低下,要么类别划分能力不强,使它们无法真正地用于检索系统中.针对此问题,提出了一种新颖的检索聚类算法,该算法首先通过极大极小值理论从检索返回的文档集中抽取多个聚点,并依此形成初始文档类划分结果.在此基础上,算法对初始文档类的特征集合进行细化调整以使类别的划分更加精确;同时对不满足收敛条件的文档类进行层次分裂以解决信息的分层描述问题.实验表明:此算法的时间复杂度与现有的检索聚类技术相差不多,并且由于对特征集合进行迭代调整使得类别的划分更加准确合理. Due to the fast advance of information retrieval technique, information overload has become a headache problem to Internet users. In order to alleviate user's inconvenience to distinguish useful information from massive junk information, the research for improving retrieval system has gradually become hotter and hotter. Up to now, many techniques have been proposed for automatically categorizing and organizing Web information for users. Among them, clustering is one of the most extensively employed tools. Through clustering retrieval information, Internet users can quickly find out where their interesting retrieval results locate. Unfortunately, traditional clustering algorithms are either ineffective or inefficient for this task. As a result, a novel algorithm specially designed for clustering retrieval information is proposed. This algorithm applies maximum-minimum principle to extract accumulation points to form initial clusters at first. Experiment results show that, this initial cluster partitioning is approximate to the optimal partitioning and only needs small iterative adjustment steps to get convergence. After that, it iteratively adjusts feature set of each cluster to let cluster partitioning more and more precise. Simultaneously, it hierarchically separates the clusters, which don^t meet convergence condition, into some sub-clusters to possess the merit of hierarchically representing information. Experiment results also demonstrate that time complexity of this algorithm is close to the recent techniques for clustering retrieval information. Besides, because of iteratively adjusting feature sets, it enables clustering results to be more precise and reasonable.
出处 《计算机研究与发展》 EI CSCD 北大核心 2013年第7期1452-1463,共12页 Journal of Computer Research and Development
基金 国家自然科学基金面上项目(61073127) 中央高校基本科研业务费专项基金项目(HIT.NSRIF.2013066) 中国博士后科学基金面上项目(2013M530156) 教育部-微软语言语音重点实验室开放基金项目
关键词 信息检索聚类技术 权值调整 极大极小理论 快速聚类 自组织映射 clustering technique for information retrieval weight adjustment maximum-minimum principle fast clustering self-organizing-mapping (SOM)
  • 相关文献

参考文献23

  • 1Azcarraga A P,Yap T N Jr,Tan J,et al. Evaluatingkeyword selection methods for WEBSOM text archives [J].IEEE Trans on Knowledge and Data Engineering,2004,16(3):380-383.
  • 2Peter P,Patricia S,Marti H,et al. Scatter/gather browsingcommunicates the topic structure of a very large textcollection [C] //Proc of ACM S1GCHI Conf on HumanFactors in Computing Systems. New York:ACM,1996:213-220.
  • 3Beil F,Ester M,Xu X. Frequent term-based text clustering[C] //Proc of ACM SIGKDD Conf on Knowledge Discoveryand Data Mining. New York:ACM,2002:436-442.
  • 4Zamir O,Etzioni O. Web document clustering:A feasibilitydemonstration [C] //Proc of ACM SIGIR Conf on Researchand Development in Information Retrieval. New York:ACM,1998:46-54.
  • 5Valdes P R,Pericliev V,Pereira F. Concise,intelligible,and approximate profiling of multiple classes [J].International Journal of Human Computer Systems,2000,53(3); 411-436.
  • 6Li Yuanhong,Dong Ming,Hua Jing. Localized featureselection for clustering [J]. Pattern Recognition Letters,2008,29(1):10-18.
  • 7Xu Yongdong,Xu Zhiming,Wang Xiaolong,et al. Usingmultiple features and statistical model to calculate text unitssimilarity [C] //Proc of Int Conf on Machine Learning andCybernetics. Piscataway,NJ:IEEE,2005:3834-3839.
  • 8Tsutsumi K,Nakajima K. Maximum/minimum detection bya module-based neural network with redundant architecture[C] //Proc of Int Joint Conf on Neural Networks.Piscataway. NJ:IEEE,1999:558-561.
  • 9Kohonen T. Self-Organizing Maps [ M]. 2nd ed. Berlin:Springer,1997.
  • 10Kohonen T,Kaski S,Lagus K,et al. Self organization of amassive document collection [J]. IEEE Trans on NeuralNetworks,2000,11(3):574-585.

二级参考文献14

  • 1吴郢.结构自适应自组织神经网络的研究与应用.清华大学自动化硕士论文[M].,1997..
  • 2吴郢,硕士学位论文,1997年
  • 3Zheng Y,IEEE Trans Neural Networks,1996年,7卷,1期,87页
  • 4Ester M, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. the 2nd Int'l Conf. Knowledge Discovering in Databases and Data Mining(KDD 96). Menlo Park, CA: AAA I Press, 1996.
  • 5Zhan W, et al. STING: A statistical information grid approach to spatial data mining. In: Proc. the 23rd VLDB Conf. Athens. San Francicso: Morgan Kaufmann, 1997. 186~ 195.
  • 6K. Beyer, J. Goldstein, R. Ramakhrisnan, et al. Nearest neighbor' meaningful. In: Proc. the 7th Int'l Conf. Database Theory ( ICDT' 99), http://citeseer.ist.psu.edu/605885.html,1999.
  • 7A. Hinneburg, C. C. Aggarwal, D. A. Keim. What is the neareast neighbor in high dimensional spaces. In: Proc. the 26th Int'l Conf. Very Large Data Bases, San Francisco, 2000.
  • 8Maria Halkidi, Michalis Vazirgiannis. Clustering validity assessment: Finding the optimal partitioning of a data set. IEEE Int'l Conf. Data Mining, California, USA, 2001.
  • 9Zhang T, et al. Birch: An efficient data clustering method for very large databases. In: Proc. ACM SIGMOD Int'l Conf.Management of Data, Montreal. New York: ACM Press, 1996.73 ~ 84.
  • 10Guha S, Rastogi R, Shin K. CURE: An efficient clustering algorithm for large databases. In: Proc. ACM SIGMOD Int'l Conf. Management of Data, Seattle. New York: ACM Press,1998. 73~84.

共引文献30

同被引文献104

引证文献12

二级引证文献43

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部