Abstract
Chinese word segmentation is one of the key foundational technologies in Chinese information processing, and segmentation methods based on deep learning models have attracted wide attention. However, deep learning models need large-scale training data to perform well, while existing Chinese word segmentation corpora are relatively scarce and follow inconsistent standards. This paper proposes a simple and effective method for handling heterogeneous data: two manually defined identifiers are added to the data from each corpus, and the processed data is then used to jointly train a Chinese word segmentation model that combines a bidirectional long short-term memory network with a conditional random field (Bi-LSTM-CRF). Experimental results show that the Bi-LSTM-CRF model jointly trained on heterogeneous data achieves better segmentation performance than models trained on a single corpus.
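The identifier step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say where the two identifiers are placed or what they look like, so this sketch assumes one corpus-specific marker at the start and one at the end of each sentence. The corpus names `pku` and `msr`, the marker strings, the BMES tag scheme, and the helper names `to_bmes`, `tag_sentence`, and `build_joint_dataset` are all hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical corpus names and marker tokens (assumptions, not from the paper).
CORPUS_MARKERS: Dict[str, Tuple[str, str]] = {
    "pku": ("<pku>", "</pku>"),
    "msr": ("<msr>", "</msr>"),
}


def to_bmes(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character BMES labels."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # B for the first character, M for middle ones, E for the last
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tag_sentence(words: List[str], corpus: str) -> Tuple[List[str], List[str]]:
    """Wrap one sentence in the two markers assumed for its corpus."""
    open_tok, close_tok = CORPUS_MARKERS[corpus]
    chars = [ch for word in words for ch in word]
    tokens = [open_tok] + chars + [close_tok]
    # Assumption: each marker is labeled as a single-token word ("S").
    labels = ["S"] + to_bmes(words) + ["S"]
    return tokens, labels


def build_joint_dataset(
    corpora: Dict[str, List[List[str]]]
) -> List[Tuple[List[str], List[str]]]:
    """Merge several segmented corpora into one marker-tagged training set."""
    dataset = []
    for corpus, sentences in corpora.items():
        for words in sentences:
            dataset.append(tag_sentence(words, corpus))
    return dataset


if __name__ == "__main__":
    data = build_joint_dataset({
        "pku": [["中文", "分词", "很", "重要"]],
        "msr": [["深度", "学习"]],
    })
    for tokens, labels in data:
        print(list(zip(tokens, labels)))
```

Under these assumptions, a single sequence labeler sees sentences from every corpus in one training set, and the markers tell it which segmentation standard a given sentence follows. The sketch covers only the data preparation, not the Bi-LSTM-CRF model itself.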
Authors
姜猛
王子牛
高建瓴
JIANG Meng; WANG Ziniu; GAO Jianling (School of Big Data & Information Engineering, Guizhou University, Guiyang 550025, China; Network and Information Management Center, Guizhou University, Guiyang 550025, China)
Source
《电子科技》
2019, No. 4, pp. 29-32, 59 (5 pages in total)
Electronic Science and Technology
Funding
Guizhou Provincial Science and Technology Fund (黔科合J字[2015]2045)
Guizhou University Graduate Innovation Fund (研理工2017016)