Chinese Word Segmentation Based on Joint Training of Heterogeneous Data (cited by: 6)
Abstract: Chinese word segmentation is one of the key foundational technologies in Chinese information processing, and segmentation methods based on deep learning models have attracted wide attention. However, deep learning models need large-scale training data to perform well, while existing Chinese word segmentation corpora are relatively scarce and follow inconsistent annotation standards. This paper proposes a simple and effective method for handling heterogeneous data: two manually defined identifiers are added to the data from different corpora, and the processed data are then used for joint training of a Chinese word segmentation model that combines a bidirectional long short-term memory network with a conditional random field (Bi-LSTM-CRF). Experimental results show that the Bi-LSTM-CRF model jointly trained on heterogeneous data achieves better segmentation performance than models trained on a single corpus.
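The identifier step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say what the two identifiers look like, where they are placed, or which corpora are used, so everything concrete here is an assumption: the marker strings, the corpus names `pku` and `msr`, the choice to put one marker at each end of the sentence, the BMES tag scheme, and the `S` label given to each marker. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical marker tokens per corpus; the abstract does not give the real ones.
CORPUS_MARKERS: Dict[str, Tuple[str, str]] = {
    "pku": ("<pku>", "</pku>"),
    "msr": ("<msr>", "</msr>"),
}


def words_to_bmes(words: List[str]) -> Tuple[List[str], List[str]]:
    """Convert a segmented sentence into characters and BMES tags."""
    chars: List[str] = []
    tags: List[str] = []
    for word in words:
        chars.extend(word)  # a string extends the list character by character
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return chars, tags


def add_corpus_markers(
    chars: List[str], tags: List[str], corpus: str
) -> Tuple[List[str], List[str]]:
    """Wrap one sentence in the two markers of its source corpus.

    Assumption: each marker is treated as a single-token word (tag "S").
    """
    open_tok, close_tok = CORPUS_MARKERS[corpus]
    return [open_tok] + chars + [close_tok], ["S"] + tags + ["S"]


def build_joint_training_set(
    corpora: Dict[str, List[List[str]]]
) -> List[Tuple[List[str], List[str]]]:
    """Merge sentences from several corpora into one marked training set."""
    merged = []
    for corpus, sentences in corpora.items():
        for words in sentences:
            chars, tags = words_to_bmes(words)
            merged.append(add_corpus_markers(chars, tags, corpus))
    return merged
```

Under these assumptions, a single Bi-LSTM-CRF tagger would see sentences from every corpus in one training set, and the markers would be the only signal telling it which annotation standard applies to a given sentence.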
Authors: JIANG Meng; WANG Ziniu; GAO Jianling (School of Big Data & Information Engineering, Guizhou University, Guiyang 550025, China; Network and Information Management Center, Guizhou University, Guiyang 550025, China)
Source: Electronic Science and Technology (《电子科技》), 2019, No. 4, pp. 29-32, 59 (5 pages)
Funding: Guizhou Provincial Science and Technology Fund (黔科合J字[2015]2045); Guizhou University Graduate Innovation Fund (研理工2017016)
Keywords: Chinese word segmentation; deep learning; Bi-LSTM-CRF; heterogeneous data; joint training; corpus