
Word Similarity Measurement Based on BaiduBaike (基于百度百科的词语相似度计算)

Cited by: 20
Abstract (translated from the Chinese) Word similarity computation is one of the key technologies of natural language processing and a widely studied fundamental problem. Traditional word similarity measures are mostly based either on semantic knowledge or on corpus statistics; that is, they require a hierarchically organized semantic dictionary or a large-scale corpus. This paper proposes a new word similarity measure based on Baidu Baike. By analyzing the information in Baidu Baike entries, it assesses the similarity of entries from the explanatory content that characterizes them, defines a formula for the similarity between entries, and obtains the overall similarity from the similarities between their parts. Experimental results show that the proposed algorithm is more effective and reasonable than existing similarity measures.

Abstract (English) Research on word similarity measurement has been popular not only in natural language processing but also in other basic research. Traditional word similarity measurements use semantic lexicons or large-scale corpora. We first discussed the background of the applications of word similarity measurement, such as information retrieval, information extraction, text classification, and example-based machine translation. Then two strategies of word similarity measurement were summarized: one is based on an ontology or a semantic taxonomy, the other is based on large collocations of words in a corpus. BaiduBaike, an online open encyclopedia, can be used not only as a corpus but also as a knowledge resource with rich semantic information. Based on BaiduBaike, with its rich semantic information and category graph, we proposed a new method to analyze and compute Chinese word similarity from four dimensions: the baike card, the content of the word, the open classification of the word, and the correlation words. We used a language network to choose the top key terms of the content of a word. Based on vector space model (VSM) theory, we calculated the similarity between the parts of words. We presented a new "multi-path searching" algorithm on the BaiduBaike category graph. A comprehensive similarity measuring method based on the four parts was proposed. Experimental results show that the method performs well.
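Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2013, No. 6, pp. 199-202 (4 pages)
Funding Supported by the National Natural Science Foundation of China (70871115)
Keywords Word similarity, Language network, BaiduBaike, Vector space model (VSM)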
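
The abstract describes computing VSM-based similarities between the parts of two entries and combining them into one score. The following is only a minimal sketch of that idea, not the paper's formula: the part names, the equal weights, and the function names `cosine_similarity` and `entry_similarity` are assumptions made for illustration. It also applies cosine similarity to all four parts, whereas the paper uses a "multi-path searching" algorithm on the category graph for the open classification.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two term-frequency vectors (VSM)."""
    va, vb = Counter(terms_a), Counter(terms_b)
    # Dot product over the terms the two vectors share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty part contributes no similarity
    return dot / (norm_a * norm_b)


# Hypothetical equal weights for the four parts named in the abstract;
# the weights actually used in the paper are not given here.
PART_WEIGHTS = {
    "card": 0.25,        # baike card
    "content": 0.25,     # key terms of the entry content
    "categories": 0.25,  # open classification
    "related": 0.25,     # correlation words
}


def entry_similarity(entry_a, entry_b, weights=PART_WEIGHTS):
    """Weighted average of per-part similarities between two entries.

    Each entry is a dict mapping a part name to a list of terms.
    """
    total = sum(weights.values())
    score = sum(
        w * cosine_similarity(entry_a.get(part, []), entry_b.get(part, []))
        for part, w in weights.items()
    )
    return score / total
```

Each entry is passed as a dictionary of term lists, for example `{"card": [...], "content": [...], "categories": [...], "related": [...]}`; a missing part is treated as empty and adds nothing to the score.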

