
Word Segmentation Based on Adaptive Hidden Markov Model in Oilfield (基于自适应隐马尔可夫模型的石油领域文档分词)

Cited by: 9
Abstract: Chinese word segmentation converts a string of Chinese characters that has no delimiters into a sequence of words consistent with actual language use, and it is the first step in constructing a petroleum-domain ontology. Petroleum-domain documents have distinctive characteristics that make segmentation harder, and no effective segmentation algorithm exists for them so far. By introducing a terminology set on top of the hidden Markov segmentation model, this paper proposes a word segmentation algorithm based on an adaptive hidden Markov model. The algorithm combines a domain dictionary with mutual information and calibrates the segmentation with semantic constraints and word-sense constraints, so that petroleum-specific technical terms and compound words are identified accurately. A comparison with the NLPIR Chinese word segmentation system of the Chinese Academy of Sciences shows that the proposed algorithm significantly improves both accuracy and recall.
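For readers unfamiliar with HMM-based segmentation, the following is a minimal sketch of the general technique only, not the paper's adaptive algorithm, whose details the abstract does not give. It tags each character as B/M/E/S (begin, middle, end, single) and decodes with Viterbi. Everything here is an assumption made for illustration: the function names, the toy corpus `CORPUS`, the term list `TERMS`, the uniform transition probabilities, and the `floor` log-probability for unseen characters. Mutual information and the semantic constraints mentioned in the abstract are not modelled.

```python
import math
from collections import Counter

STATES = "BMES"  # Begin / Middle / End of a multi-character word, Single-character word
ALLOWED = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}  # legal tag successors

# Toy data invented for this sketch (not taken from the paper).
CORPUS = [["油井", "产量", "下降"], ["密度", "上升"]]
TERMS = ["钻井液"]  # hypothetical domain term set

def tag_word(word):
    """BMES tags for one word, e.g. a 3-character word gives 'BME'."""
    return "S" if len(word) == 1 else "B" + "M" * (len(word) - 2) + "E"

def estimate_emissions(segmented_sentences):
    """Log emission probabilities P(char | tag) from segmented text."""
    counts = {s: Counter() for s in STATES}
    for sentence in segmented_sentences:
        for word in sentence:
            for ch, tag in zip(word, tag_word(word)):
                counts[tag][ch] += 1
    return {s: {ch: math.log(n / sum(c.values())) for ch, n in c.items()}
            for s, c in counts.items()}

def viterbi_bmes(text, emit_logp, floor=-20.0):
    """Most likely BMES tag string; transitions are uniform over legal moves."""
    if not text:
        return ""
    step = math.log(0.5)  # each tag has exactly two legal successors
    start = {"B": step, "S": step, "M": -math.inf, "E": -math.inf}
    # best[t][s] = (score of the best path ending in tag s at position t, previous tag)
    best = [{s: (start[s] + emit_logp[s].get(text[0], floor), None) for s in STATES}]
    for ch in text[1:]:
        column = {}
        for s in STATES:
            score, prev = max((best[-1][p][0] + step, p)
                              for p in STATES if s in ALLOWED[p])
            column[s] = (score + emit_logp[s].get(ch, floor), prev)
        best.append(column)
    tags = [max("ES", key=lambda s: best[-1][s][0])]  # a word must end at the last character
    for t in range(len(text) - 1, 0, -1):
        tags.append(best[t][tags[-1]][1])
    return "".join(reversed(tags))

def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    return words

# Adding the term set to the training data is one simple way to expose domain terms to the model.
emissions = estimate_emissions(CORPUS + [[term] for term in TERMS])
sentence = "钻井液密度上升"
print(tags_to_words(sentence, viterbi_bmes(sentence, emissions)))
```

In this toy setup the domain term only affects the emission counts; a realistic system would also estimate transition probabilities from data and smooth unseen characters less crudely than a fixed floor.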
Authors: GONG Fa-ming (宫法明), ZHU Peng-hai (朱朋海) (College of Computer & Communication Engineering, China University of Petroleum, Qingdao, Shandong 266580, China)
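Source: Computer Science (《计算机科学》), indexed in CSCD and the PKU Core list (北大核心), 2018, Issue B06, pp. 97-100 (4 pages)
Funding: Supported by the Ministry of Science and Technology Innovation Method Program project "Research and Application Demonstration of Innovative Methods for Oil and Gas Exploitation in a Big Data Environment" (2015IM010300)
Keywords: Chinese word segmentation; hidden Markov model; compound words; petroleum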
Related literature

References: 2

Secondary references: 26


Co-citing documents: 39

Co-cited documents: 82

Citing documents: 9

Secondary citing documents: 22
