
Research on Biological Literature Information Retrieval Based on Cluster Language Model (基于聚类语言模型的生物文献检索技术研究)
Abstract Recent research shows that topic language models improve information retrieval performance, but several difficult problems remain unsolved: data sparseness, synonymy, polysemy, and the smoothing of terms seen and unseen in a document. These problems are especially important in domain-specific literature retrieval, such as large-scale biological literature retrieval. This paper proposes a new cluster-based topic language model for biological literature retrieval. The work has two main parts. First, documents are represented by concepts from an ontology and clustered on that representation with fuzzy clustering (Fuzzy C-Means); the resulting clusters are treated as the topics of the collection, and the probability that a document belongs to a topic is determined by the fuzzy similarity between the document and the cluster. Second, the probability that a topic generates a term is estimated with the Expectation-Maximization (EM) algorithm. Integrating both parts into the aspect model yields the proposed language model. The model can accurately describe the distribution of terms over topics and the probability that a document belongs to each topic. Using ontology concepts partly addresses synonymy, and because a term can be generated by different topics, the model also partly addresses polysemy. The method was evaluated on the TREC 2004/05 Genomics Track collections, where it improved retrieval performance to some extent over the simple language model and existing topic language models.
Authors Wen Jian (文健), Li Zhoujun (李舟军)
Source Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2008, No. 1, pp. 61-66, 122 (7 pages in total)
Funding National Natural Science Foundation of China (Grant No. 60573057)
Keywords computer application; Chinese information processing; topic language model; information retrieval; clustering
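The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the code assumes the standard Fuzzy C-Means membership formula for P(z|d), a PLSA-style EM update for P(w|z) with P(z|d) held fixed, and a linear interpolation weight `lam` between the document model and the topic model. The toy concept vectors, centroids, word counts, and all function names are hypothetical.

```python
import math

def fuzzy_memberships(doc_vecs, centroids, m=2.0):
    """Standard FCM membership of each document in each cluster, used here as P(z|d)."""
    rows = []
    for vec in doc_vecs:
        dists = [math.dist(vec, c) for c in centroids]
        if any(d == 0.0 for d in dists):
            # Document sits exactly on a centroid: all mass goes to that cluster.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            exp = 2.0 / (m - 1.0)
            row = [1.0 / sum((dk / dj) ** exp for dj in dists) for dk in dists]
        total = sum(row)
        rows.append([r / total for r in row])
    return rows

def em_word_given_topic(counts, p_z_d, vocab_size, iters=50):
    """Estimate P(w|z) by EM while keeping P(z|d) fixed (assumed PLSA-style update)."""
    k = len(p_z_d[0])
    p_w_z = [[1.0 / vocab_size] * vocab_size for _ in range(k)]
    for _ in range(iters):
        acc = [[0.0] * vocab_size for _ in range(k)]
        for d, doc in enumerate(counts):
            for w, n in doc.items():
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(k)]
                norm = sum(joint)
                if norm == 0.0:
                    continue
                for z in range(k):
                    acc[z][w] += n * joint[z] / norm
        # M-step: renormalize expected counts per topic.
        for z in range(k):
            total = sum(acc[z])
            if total > 0.0:
                p_w_z[z] = [a / total for a in acc[z]]
    return p_w_z

def query_log_likelihood(query, doc, p_z_d_row, p_w_z, lam=0.5):
    """log P(q|d) with P(w|d) = lam * P_ml(w|d) + (1 - lam) * sum_z P(w|z) P(z|d)."""
    doc_len = sum(doc.values())
    score = 0.0
    for w in query:
        p_ml = doc.get(w, 0) / doc_len
        p_topic = sum(p_w_z[z][w] * p_z_d_row[z] for z in range(len(p_z_d_row)))
        p = lam * p_ml + (1.0 - lam) * p_topic
        score += math.log(p) if p > 0.0 else float("-inf")
    return score

# Hypothetical toy data: 4 documents, 2-D concept vectors, 4-word vocabulary.
doc_vecs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]
centroids = [(1.0, 0.0), (0.0, 1.0)]  # assumed to come from a prior FCM run
counts = [{0: 3, 1: 2}, {0: 1, 1: 4}, {2: 3, 3: 3}, {0: 1, 2: 2, 3: 1}]

p_z_d = fuzzy_memberships(doc_vecs, centroids)
p_w_z = em_word_given_topic(counts, p_z_d, vocab_size=4)
scores = [query_log_likelihood([0, 1], counts[d], p_z_d[d], p_w_z) for d in range(4)]
```

In this sketch a document that contains none of the query terms still receives a finite score, because the topic component assigns nonzero probability to terms generated by the document's clusters. That is the smoothing effect for unseen terms that the abstract attributes to the topic model.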









