Abstract
Chinese word segmentation converts a string of Chinese characters, which carries no delimiters, into a sequence of words that matches how the language is actually used. It is the first step in constructing a petroleum-domain ontology. Petroleum-domain documents have distinctive characteristics that make segmentation harder, and no effective segmentation algorithm exists for them so far. By introducing a terminology set, this paper proposes a word segmentation algorithm based on an adaptive hidden Markov model (HMM), built on the standard HMM segmentation model. The algorithm combines the adaptive HMM with a domain dictionary and mutual information, and calibrates the segmentation with semantic constraints and word-meaning constraints, so that petroleum-specific technical terms and compound words are identified accurately. A comparison with the NLPIR Chinese word segmentation system developed by the Chinese Academy of Sciences shows that the proposed algorithm achieves markedly higher accuracy and recall.
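For readers unfamiliar with HMM-based segmentation, the following is a minimal sketch of the baseline idea only, not the authors' adaptive model. It tags each character as B, M, E, or S (begin, middle, end of a word, or single-character word) and decodes the best tag sequence with the Viterbi algorithm. Everything specific here is an assumption made for illustration: the function names, the uniform transition probabilities, the tiny lexicon, and the way emission probabilities are derived from that lexicon with add-one smoothing. The mutual-information step and the semantic and word-meaning constraints described in the abstract are not reproduced.

```python
import math

STATES = "BMES"  # Begin / Middle / End of a multi-character word, Single-character word
NEG_INF = float("-inf")
# Structural constraint of the BMES scheme: which tag may follow which.
ALLOWED_NEXT = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}


def make_toy_emitter(lexicon, smooth=1.0):
    """Hypothetical helper: estimate log P(char | tag) from a small domain lexicon."""
    counts = {s: {} for s in STATES}
    for word in lexicon:
        tags = "S" if len(word) == 1 else "B" + "M" * (len(word) - 2) + "E"
        for ch, tag in zip(word, tags):
            counts[tag][ch] = counts[tag].get(ch, 0) + 1
    vocab = {ch for word in lexicon for ch in word}

    def emit_logp(state, ch):
        # Add-one smoothing, with one extra slot reserved for unseen characters.
        total = sum(counts[state].values()) + smooth * (len(vocab) + 1)
        return math.log((counts[state].get(ch, 0) + smooth) / total)

    return emit_logp


def viterbi_bmes(text, emit_logp, trans_logp, start_logp):
    """Return the most likely BMES tag list for `text` under a first-order HMM."""
    if not text:
        return []
    # score[s] = best log-probability of any tag path ending in state s
    score = {s: start_logp.get(s, NEG_INF) + emit_logp(s, text[0]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for ch in text[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev, best = None, NEG_INF
            for prev in STATES:
                if cur not in ALLOWED_NEXT[prev]:
                    continue
                cand = score[prev] + trans_logp[prev][cur]
                if cand > best:
                    best_prev, best = prev, cand
            pointers[cur] = best_prev
            new_score[cur] = best + emit_logp(cur, ch) if best_prev is not None else NEG_INF
        score = new_score
        back.append(pointers)
    # A valid segmentation must close a word on the last character.
    last = max("ES", key=lambda s: score[s])
    tags = [last]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    return tags[::-1]


def tags_to_words(text, tags):
    """Cut `text` after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    return words


# Toy parameters (assumptions, not values from the paper).
lexicon = ["石油", "钻井", "钻井液", "的", "密度"]
emit = make_toy_emitter(lexicon)
TRANS = {prev: {cur: math.log(0.5) for cur in ALLOWED_NEXT[prev]} for prev in STATES}
START = {"B": math.log(0.5), "S": math.log(0.5)}

text = "钻井液的密度"
tags = viterbi_bmes(text, emit, TRANS, START)
print(tags_to_words(text, tags))
```

In a real system the transition and emission probabilities would be estimated from a segmented training corpus rather than from a word list. The lexicon-derived emitter above only hints at how domain terms could bias the model toward keeping technical terms intact.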
Authors
GONG Fa-ming (宫法明)
ZHU Peng-hai (朱朋海)
College of Computer & Communication Engineering, China University of Petroleum, Qingdao, Shandong 266580, China
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2018, Issue B06, pp. 97-100 (4 pages)
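Funding
Supported by the Innovation Method Program of the Ministry of Science and Technology of China: Research and Application Demonstration of Innovation Methods for Oil and Gas Exploitation in a Big Data Environment (2015IM010300)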
Keywords
Chinese word segmentation (中文分词)
Hidden Markov model (隐马尔可夫模型)
Compound word (组合词)
Petroleum (石油)