An Exploration of Chinese Word Segmentation Algorithms Based on Representation Learning (基于表示学习的中文分词算法探索) Cited by: 34

Chinese Word Segment Based on Character Representation Learning
Abstract: Word segmentation is a key foundational technology in Chinese natural language processing. The current mainstream approach uses character-based statistical machine learning methods to learn to identify word boundaries. However, conventional machine learning methods rely heavily on manually designed features, and verifying the effectiveness of those features requires repeated trial and revision, which is time-consuming and labor-intensive. The rise of neural-network-based representation learning has made it possible to learn features automatically. This paper explores a Chinese word segmentation method based on representation learning. We first learn semantic vectors (embeddings) for Chinese characters from a large-scale corpus in an unsupervised manner, and then apply these character vectors to supervised, neural-network-based Chinese word segmentation. Experiments show that representation learning is an effective approach to Chinese word segmentation. However, we also find that, owing to limitations such as corpus size, representation learning cannot yet fully replace conventional supervised machine learning methods based on manually designed features.
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2013, Issue 5, pp. 8-14 (7 pages)
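Funding: National Natural Science Foundation of China (61070106, 61272332, 61202329); National High Technology Research and Development Program of China (863 Program) (2012AA011102); National Basic Research Program of China (973 Program) (2012CB316300); Open Project of the Beijing Key Laboratory of Internet Culture and Digital Dissemination (ICDD201201)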
Keywords: representation learning; Chinese word segmentation
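
The abstract describes segmentation as a character-level decision about word boundaries. A minimal sketch of that framing follows, assuming a four-tag B/M/E/S scheme (begin, middle, end, single-character word). The abstract does not state which tag set the paper uses, so the tag set and the helper names `words_to_tags` and `tags_to_words` are illustrative assumptions, not the authors' implementation. In a full system, a classifier fed with character embeddings would predict the tags; here they are derived from a gold segmentation only to show the encoding.

```python
def words_to_tags(words):
    """Convert a segmented sentence into per-character B/M/E/S tags.

    Assumed (hypothetical) tag scheme: B = word begin, M = word middle,
    E = word end, S = single-character word.
    """
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary follows this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


gold = ["中文", "分词", "是", "基础技术"]
tags = words_to_tags(gold)
restored = tags_to_words("".join(gold), tags)
```

Round-tripping a segmentation through the tags recovers the original words, which is why predicting one tag per character is sufficient to segment a sentence.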
