An Estimation of Query Language Model Based on Statistical Semantic Clustering

Original title: 一种基于统计语义聚类的查询语言模型估计 (cited by: 3)

Abstract: How to generate document clusters effectively and use cluster information to improve retrieval performance is an important research topic in information retrieval. If a document is assumed to contain several hidden, independent topics, it can be viewed as the result of those topics interacting with noise. Based on this assumption, a semantic clustering technique using independent component analysis (ICA) is proposed. It relies on ICA's good topic-separation capability to cluster a set of documents in semantic space according to their hidden topics. Within the language modeling framework, semantic topic clusters are activated by the user's initial query according to a certain measure. A feedback semantic topic model is estimated from the information in the activated semantic cluster and combined with the initial query model to form a new query model. Experiments on five TREC data sets show that the query model estimated from statistical semantic clustering significantly improves retrieval performance over traditional query models and other cluster-based language models. The improvement comes mainly from estimating the query model on the semantic cluster most similar to the user's query.
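The query-model estimation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the semantic clusters have already been produced (for example by ICA, which is not shown), uses query likelihood under each cluster's unigram model as the activation measure, and combines the two models by linear interpolation with a weight `alpha`. The abstract specifies neither the measure nor the combination rule, so the function names, the smoothing floor, and the toy data are all hypothetical.

```python
from collections import Counter
import math


def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def query_log_likelihood(query_tokens, model, floor=1e-6):
    """Log-likelihood of the query under a model.

    Unseen words receive a small floor probability (an assumed,
    deliberately crude form of smoothing).
    """
    return sum(math.log(model.get(word, floor)) for word in query_tokens)


def activate_cluster(query_tokens, clusters):
    """Return the id and unigram model of the cluster that best explains the query.

    `clusters` maps a cluster id to a list of tokenized documents.
    Query likelihood is an assumed activation measure.
    """
    models = {
        cluster_id: unigram_model([tok for doc in docs for tok in doc])
        for cluster_id, docs in clusters.items()
    }
    best_id = max(
        models, key=lambda cid: query_log_likelihood(query_tokens, models[cid])
    )
    return best_id, models[best_id]


def interpolate(query_model, feedback_model, alpha=0.5):
    """Mix the initial query model with the feedback topic model.

    Linear interpolation with weight `alpha` is an assumption here.
    """
    vocab = set(query_model) | set(feedback_model)
    return {
        word: (1 - alpha) * query_model.get(word, 0.0)
        + alpha * feedback_model.get(word, 0.0)
        for word in vocab
    }


# Toy clusters standing in for the output of the semantic clustering step.
clusters = {
    "space": [["rocket", "launch", "orbit"], ["orbit", "satellite", "launch"]],
    "cooking": [["recipe", "oven", "bake"], ["bake", "bread", "flour"]],
}
query = ["satellite", "launch"]

cluster_id, feedback_model = activate_cluster(query, clusters)
new_query_model = interpolate(unigram_model(query), feedback_model, alpha=0.5)

print(cluster_id)
print(sorted(new_query_model.items(), key=lambda kv: -kv[1]))
```

In this toy run the query activates the "space" cluster, and the new query model gives nonzero probability to terms such as "orbit" and "rocket" that the initial query did not contain. This is the expansion effect the abstract credits for the improved retrieval performance.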

Authors: Pu Qiang (蒲强), He Daqing (何大庆), Yang Guowei (杨国纬)

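Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; School of Information Sciences, University of Pittsburgh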

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2011, No. 2, pp. 224-231 (8 pages)

Funding: China State Scholarship Fund; U.S. National Science Foundation (NSF/IIS0704628)

Keywords: semantic clustering; independent component analysis; query model; relevance model; language model; pseudo relevance feedback

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
