
New word detection based on a background bigram model (cited by: 10)
Abstract: This paper proposes a new word detection method based on a background bigram model. Candidate bigrams are first selected from the target foreground corpus by the likelihood ratio between their foreground and background probabilities. The candidates are then filtered and extended according to statistics computed on the foreground corpus, including frequency, rigidity, and conditional probability, in order to fix the word boundaries. The extracted items therefore both carry the characteristics of new words, as judged against the background knowledge, and form well-bounded words. The method makes full use of existing raw background corpora without requiring annotation such as word segmentation, and it does not depend on word lists, word segmentation models, or rules, so it extends well to other data. To improve detection quality, strategies for choosing the threshold of each statistic and for removing noise bigrams are also discussed. The method was validated on a corpus of online novels.
Source: Journal of Tsinghua University (Science and Technology), 2011, Issue 9, pp. 1317-1320 (4 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.
Keywords: new word detection; bigram; background model; likelihood ratio
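
The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact likelihood-ratio formula, so this example assumes a log ratio of add-alpha smoothed relative frequencies over character bigrams. The function names `char_bigrams` and `candidate_bigrams` and the parameters `min_freq`, `min_log_ratio`, and `alpha` are hypothetical, and the later rigidity and conditional-probability filtering is not covered.

```python
from collections import Counter
import math


def char_bigrams(text):
    """Return every pair of adjacent characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def candidate_bigrams(foreground, background,
                      min_freq=2, min_log_ratio=1.0, alpha=1.0):
    """Score foreground bigrams against a background corpus.

    Assumption: the score is log(P_fg / P_bg) with add-alpha smoothing.
    The paper's actual formula and thresholds may differ.
    """
    fg = Counter(char_bigrams(foreground))
    bg = Counter(char_bigrams(background))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    # Shared vocabulary size used for smoothing both distributions.
    vocab = len(set(fg) | set(bg))

    scored = {}
    for bigram, count in fg.items():
        if count < min_freq:
            continue  # drop bigrams that are too rare in the foreground
        p_fg = (count + alpha) / (fg_total + alpha * vocab)
        p_bg = (bg[bigram] + alpha) / (bg_total + alpha * vocab)
        log_ratio = math.log(p_fg / p_bg)
        if log_ratio >= min_log_ratio:
            scored[bigram] = log_ratio
    return scored


if __name__ == "__main__":
    # Toy corpora: "xy" is frequent in the foreground and absent
    # from the background, while "ab" is common in both.
    result = candidate_bigrams("abxyabxyabxy", "ababababcdcd")
    for bigram, score in sorted(result.items(), key=lambda kv: -kv[1]):
        print(bigram, round(score, 3))
```

Only raw, unsegmented text is needed for both corpora, which matches the abstract's claim that the method requires no word segmentation or dictionary. In practice the thresholds would have to be tuned, as the abstract notes.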
