Search Result Clustering Method Based on SOM and LSI (基于潜在语义索引和自组织映射网的检索结果聚类方法)

Cited by: 4

Abstract: As the Internet grows and the volume of data increases, search engines play an increasingly important role, and users rely on them more and more to find the information they need. To cluster search results more effectively, and so help users locate information within an otherwise unstructured result list, the authors propose LSOM, a text clustering algorithm based on latent semantic indexing (LSI) and the self-organizing map (SOM) neural network. LSOM uses an SOM network to cluster the retrieved documents, so the number of clusters does not have to be specified in advance; the authors describe the resulting clustering as flexible and accurate. The vector space model is built with LSI, which introduces semantic relationships into the term weights: the high-dimensional text feature vectors are mapped into a low-dimensional semantic space in which semantic relationships between features are strengthened and noise in the original term matrix is weakened or eliminated. The reduced dimensionality also speeds up clustering. LSOM further introduces a new cluster label extraction method and uses the extracted labels to address the cluster boundary detection problem, which is non-trivial when SOM is applied to text clustering. Experimental results show that LSOM outperforms existing algorithms on both cluster label quality and F-measure.
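The abstract only outlines the pipeline, so the following is a minimal sketch of its two main stages rather than the authors' LSOM implementation: LSI as a truncated SVD of a term-document matrix, followed by a small SOM trained on the reduced document vectors. The function names, the toy matrix, the fixed 1 x 2 grid, and the decay schedules are all assumptions made for illustration. The label extraction and boundary detection steps are not reproduced, and unlike LSOM the sketch fixes the map size up front.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Map documents to a k-dimensional latent space with a truncated SVD."""
    # term_doc has shape (n_terms, n_docs); A ~= U_k S_k V_k^T.
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row j of the result is document j, i.e. column j of S_k V_k^T.
    return (s[:k, None] * vt[:k]).T

def unit_rows(x):
    """Scale each row to unit length so that distance reflects direction only."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM with online updates; returns the unit weights."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise every unit from a randomly chosen document vector.
    weights = data[rng.integers(0, len(data), size=rows * cols)].copy()
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)               # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-3  # neighbourhood radius shrinks
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: the unit whose weight vector is closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbours toward the sample.
            weights += lr * h[:, None] * (x - weights)
    return weights

def som_assign(data, weights):
    """Return, for each document, the index of its nearest SOM unit."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Hypothetical toy data: 6 terms x 6 documents, two topics with disjoint vocabularies.
term_doc = np.array([
    [2, 4, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 2, 2],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

latent = unit_rows(lsi_project(term_doc, k=2))
weights = train_som(latent, rows=1, cols=2)
labels = som_assign(latent, weights)
print(labels)
```

On this toy matrix the first three documents and the last three should land on different map units; which unit index each group receives depends on the random initialisation. Real search results would need tokenisation, term weighting, and a larger map, none of which the abstract specifies.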

Authors: Chen Yiheng (陈毅恒), Qin Bing (秦兵), Liu Ting (刘挺), Wang Ping (王平), Li Sheng (李生)

Affiliation: Information Retrieval Laboratory, School of Computer Science, Harbin Institute of Technology

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2009, No. 7, pp. 1176-1183 (8 pages)

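Funding: National Natural Science Foundation of China (60675034); National High Technology Research and Development Program of China ("863" Program) (2008AA01Z144)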

Keywords: search result clustering; latent semantic indexing (LSI); self-organizing map (SOM); label; boundary detection

Classification: TP391.2 [Automation and Computer Technology: Computer Application Technology]

