
Search Result Clustering Method Based on SOM and LSI (cited by: 4)
Abstract: As the Internet grows and the volume of data keeps increasing, search engines play an increasingly important role, and users rely on them more and more to find the information they need. To cluster search results more effectively, and so make it easier to locate information among the original unstructured results, the authors propose LSOM, a text clustering algorithm based on the self-organizing map (SOM) and latent semantic indexing (LSI). LSOM uses a SOM network to cluster the retrieved texts; it requires no predefined number of clusters and offers flexible, high-precision clustering. LSI is used to build the vector space model, which introduces semantic relationships into the term weights. For high-dimensional text feature spaces, LSI finds a new low-dimensional semantic space in which the semantic relationships between features are strengthened and the noisy features of the original space are weakened or eliminated; the resulting dimension reduction also makes clustering more efficient. LSOM also includes a new cluster label extraction method. The extracted labels are further used to resolve the cluster boundary detection problem, which is non-trivial when a SOM is applied to text clustering. Experimental results show that LSOM outperforms existing algorithms on both cluster label quality and F-measure.
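Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2009, No. 7, pp. 1176-1183 (8 pages)
Funding: National Natural Science Foundation of China (60675034); National High Technology Research and Development Program of China ("863" Program) (2008AA01Z144)
Keywords: search result clustering; latent semantic indexing (LSI); self-organizing map (SOM); label; boundary detection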
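
The two building blocks named in the abstract, LSI for dimension reduction and a SOM for clustering, can be illustrated with a minimal sketch. This is not the authors' LSOM: label extraction and boundary detection are omitted, and the function names (`lsi_project`, `train_som`, `assign_units`), the one-dimensional map, the linear decay schedules, the unit-length normalization, and the toy term-document matrix are all assumptions made for illustration.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Map documents into a k-dimensional latent space via truncated SVD.

    term_doc has shape (n_terms, n_docs). Returns an (n_docs, k) array
    whose rows are scaled to unit length (an assumption, so that
    distances behave like cosine distances).
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = (vt[:k, :] * s[:k, None]).T          # document coordinates
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    return docs / np.maximum(norms, 1e-12)      # guard against zero rows

def train_som(data, n_units=4, epochs=50, lr0=0.5, seed=0):
    """Train a one-dimensional SOM and return its (n_units, dim) weights."""
    rng = np.random.default_rng(seed)
    # Initialise each unit from a randomly chosen data point.
    weights = data[rng.choice(len(data), size=n_units)].astype(float)
    positions = np.arange(n_units)              # unit coordinates on the map
    sigma0 = max(n_units / 2.0, 1.0)
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            frac = step / total
            lr = lr0 * (1.0 - frac)                   # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3      # shrinking neighbourhood
            # Best-matching unit: the unit closest to the current sample.
            bmu = np.argmin(np.linalg.norm(weights - data[i], axis=1))
            # Gaussian neighbourhood around the BMU on the map.
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (data[i] - weights)
            step += 1
    return weights

def assign_units(data, weights):
    """Return the index of the best-matching unit for each row of data."""
    return np.array(
        [np.argmin(np.linalg.norm(weights - x, axis=1)) for x in data]
    )

# Toy term-document matrix (rows: terms, columns: documents).
# Documents 0-2 share one vocabulary, documents 3-5 another.
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 1, 2],
], dtype=float)

docs = lsi_project(X, k=2)          # 6 documents in a 2-D latent space
som = train_som(docs, n_units=4)
labels = assign_units(docs, som)
print(labels)
```

In this toy case the map has more units than there are topics, so the number of clusters is not fixed in advance; deciding which neighbouring units belong to the same cluster is the boundary detection problem the abstract refers to, and it is left out of the sketch.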


