Abstract
Chinese word segmentation is an unavoidable first step in intelligent knowledge mining from geological big data. Statistical segmentation methods depend on the training corpus and adapt poorly across domains. Dictionary-based methods can use a domain dictionary directly, but they cannot recognize unlisted (out-of-vocabulary) words. To improve segmentation accuracy and the recognition rate of unlisted words in geological text when domain corpora are scarce, this paper proposes a statistics-based method for recognizing Chinese geological words. The method first builds a basic geological dictionary based on the idea of prime strings, which improves the adaptability of statistical segmentation to geological text. A candidate set of geological words is then obtained by repeated-string search and filtered using context adjacency analysis and position word probability (a dictionary of position-based word-formation probabilities), yielding the recognized geological words. Experimental results show that the method recognizes geological terms with an accuracy of 81.6%, nearly 60% higher than a general-purpose statistical segmentation method. The method can identify unlisted words in geological text while maintaining segmentation accuracy, and can be applied to geological text segmentation.
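The candidate-extraction and adjacency-filtering steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`split_clauses`, `candidate_stats`, `recognize`), the thresholds, and the use of "number of distinct neighbouring characters" as the adjacency measure are all assumptions made for illustration. The prime-string base dictionary and the position word probability filter are omitted because the abstract does not define them.

```python
import re
from collections import Counter, defaultdict


def split_clauses(text):
    """Split text on common Chinese/ASCII punctuation and whitespace (assumed delimiter set)."""
    return [c for c in re.split(r"[,。;、!?:,.;!?:\s]+", text) if c]


def candidate_stats(clauses, min_len=2, max_len=4):
    """Collect frequency and left/right neighbour sets for every substring of the given lengths."""
    freq = Counter()
    left = defaultdict(set)
    right = defaultdict(set)
    for clause in clauses:
        for n in range(min_len, max_len + 1):
            for i in range(len(clause) - n + 1):
                s = clause[i:i + n]
                freq[s] += 1
                # A clause boundary is treated as one special neighbour (an assumption).
                left[s].add(clause[i - 1] if i > 0 else "<s>")
                right[s].add(clause[i + n] if i + n < len(clause) else "</s>")
    return freq, left, right


def recognize(text, min_freq=2, min_variety=2, min_len=2, max_len=4):
    """Keep repeated strings whose left and right contexts both vary enough.

    min_freq and min_variety are hypothetical thresholds, not values from the paper.
    """
    freq, left, right = candidate_stats(split_clauses(text), min_len, max_len)
    return sorted(
        s for s, c in freq.items()
        if c >= min_freq
        and len(left[s]) >= min_variety
        and len(right[s]) >= min_variety
    )


sample = "花岗岩出露于北部。区内花岗岩发育。见花岗岩脉。"
print(recognize(sample))  # ['花岗岩']
```

In this toy sample, the fragments 花岗 and 岗岩 repeat as often as 花岗岩, but each always has the same character on one side, so the adjacency check discards them and keeps only the full term.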
Authors
WANG Hong; ZHU Xue-li; ZENG Tao; QIAO Dong-yu; GUO Jia-teng
(Henan Institute of Geological Survey; Henan Key Laboratory for Metalogenetic Process of Metal Mineral Resource and Resource Utilization, Zhengzhou 450000, China; School of Resources and Civil Engineering, Northeastern University, Shenyang 110000, China)
Source
Software Guide (《软件导刊》)
2020, No. 4, pp. 211-218 (8 pages in total)
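Funding
National Natural Science Foundation of China (41671404)
Fundamental Research Funds for the Central Universities (N170104019)
China Geological Survey project on construction of the Intelligent Geological Survey Support Platform (DD20160355)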
Keywords
geologic text
Chinese word segmentation
prime string
repeated string
context adjacency analysis
position word probability