
Unsupervised Chinese Word Segmentation Combining HDP and Mutual Information (original title: HDP与互信息相结合的中文无指导分词). Cited by: 2
Abstract: This paper studies Chinese word segmentation without training data (unsupervised segmentation), which helps in building robust, language-independent segmentation systems. Mutual information and the Hierarchical Dirichlet Process (HDP) are both widely used models for unsupervised segmentation. We combine the two models and improve the sampling algorithm. With punctuation excluded, the F-scores on two test corpora of different sizes are 0.693 and 0.741, improvements of 5.8% and 3.9%, respectively, over the HDP baseline. Finally, we apply the model to semi-supervised word segmentation, where the F-score is 2.6% higher than that of a commonly used supervised CRF model.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 6, pp. 1-5, 44 (6 pages in total)
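Funding: National Natural Science Foundation of China (61273278); National Social Science Foundation of China (12&ZD227); National Key Technology R&D Program, sub-project (2011BAH10B04-03); National 863 Program (2012AA011101)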
Keywords: HDP; mutual information; unsupervised word segmentation
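The F-scores quoted in the abstract are computed with punctuation excluded. The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of the standard word-level F-score for segmentation, under the assumption that a predicted word counts as correct when its character span matches a gold word exactly. The names `word_spans`, `segmentation_f_score`, and the `skip` parameter (tokens left out of scoring, such as punctuation) are hypothetical and not taken from the paper.

```python
def word_spans(words, skip=frozenset()):
    """Map a segmented sentence to the set of (start, end) character spans of its words.

    Tokens listed in `skip` (e.g. punctuation) still advance the position
    but are not added to the span set.
    """
    spans = set()
    pos = 0
    for word in words:
        end = pos + len(word)
        if word not in skip:
            spans.add((pos, end))
        pos = end
    return spans


def segmentation_f_score(gold_sentences, pred_sentences, skip=frozenset()):
    """Corpus-level word F-score: counts are summed over all sentences first."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans = word_spans(gold, skip)
        pred_spans = word_spans(pred, skip)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)


# Illustrative data only (not from the paper's corpora).
gold = [["我们", "喜欢", "自然", "语言", "。"]]
pred = [["我们", "喜", "欢", "自然语言", "。"]]
score = segmentation_f_score(gold, pred, skip=frozenset({"。"}))
```

In this toy case only one of the four scored gold words is recovered, and the prediction also contains four scored words, so precision and recall both equal 0.25.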
  • Related literature

References (12)

  • 1. Sproat R, Shih C. A statistical method for finding word boundaries in Chinese text [J]. Computer Processing of Chinese and Oriental Languages, 1990, 4: 336-351.
  • 2. 黄萱菁, 吴立德, 王文欣, 叶丹瑾. A word segmentation system based on machine learning that requires no manually compiled dictionary [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 1996, 9(4): 297-303. Cited by: 24
  • 3. Sun M, Shen D, Tsou B K. Chinese word segmentation without using lexicon and hand-crafted training data [C]//Proceedings of the 17th International Conference on Computational Linguistics, Volume 2, 1998.
  • 4. 刘挺, 吴岩, 王开铸. A Chinese automatic word segmentation system combining string frequency statistics and word-form matching [J]. Journal of Chinese Information Processing (中文信息学报), 1998, 12(1): 17-25. Cited by: 65
  • 5. 孙茂松, 肖明, 邹嘉彦. Chinese automatic word segmentation without a lexicon, based on an unsupervised learning strategy [J]. Chinese Journal of Computers (计算机学报), 2004, 27(6): 736-742. Cited by: 37
  • 6. Pitman J, Yor M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator [J]. The Annals of Probability, 1997, 25(2): 855-900.
  • 7. Goldwater S, Griffiths T L, Johnson M. Contextual dependencies in unsupervised word segmentation [C]//Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006.
  • 8. Goldwater S, Griffiths T L, Johnson M. A Bayesian framework for word segmentation: Exploring the effects of context [J]. Cognition, 2009, 112(1): 21-54.
  • 9. Teh Y W. A hierarchical Bayesian language model based on Pitman-Yor processes [C]//Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006.
  • 10. Wood F, Teh Y W. A hierarchical, hierarchical Pitman-Yor process language model [C]//Proceedings of the ICML 2008 Workshop on Nonparametric Bayes, 2008.

Secondary references (19)

Co-citing literature (108)

Co-cited literature (27)

  • 1. 李双龙, 刘群, 王成耀. A Chinese word segmentation system based on conditional random fields [J]. Microcomputer Information (微计算机信息), 2006, 22(10S): 178-180. Cited by: 15
  • 2. Huijnen P, Laan F, Rijke M, et al. A digital humanities approach to the history of science [J]. Social Informatics, Lecture Notes in Computer Science, 2014, 83(59): 71-85.
  • 3. Zhao Hai, Huang Chang-Ning, Li Mu, et al. A unified character-based tagging method of Chinese word segmentation via conditional random field modeling [J]. ACM Transactions on Asian Language Information Processing, 2010, 9(2): 1-32.
  • 4. Hanji Electronic Texts (汉籍电子文献) [EB/OL]. [2015-05-07]. http://hanji.sinica.edu.tw/index.html.
  • 5. CHANT database (汉达文库) [EB/OL]. [2015-04-13]. http://www.chant.org/.
  • 6. Lafferty J, McCallum A, Pereira F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data [C]//The International Machine Learning Society. Proceedings of the Eighteenth International Conference on Machine Learning. Williamstown: Williams College, 2001: 282-289.
  • 7. CRF++ [EB/OL]. [2015-05-07]. http://sourceforge.net/projects/crfpp/.
  • 8. 邱冰, 皇甫娟. Research on word segmentation of Ancient Chinese based on Chinese information processing [J]. Microcomputer Information (微计算机信息), 2008, 24(24): 100-102. Cited by: 31
  • 9. 刘挺, 吴岩, 王开铸. A Chinese automatic word segmentation system combining string frequency statistics and word-form matching [J]. Journal of Chinese Information Processing (中文信息学报), 1998, 12(1): 17-25. Cited by: 65
  • 10. 宋彦, 蔡东风, 张桂平, 赵海. A Chinese word segmentation method based on joint character-word decoding [J]. Journal of Software (软件学报), 2009, 20(9): 2366-2375. Cited by: 42

Citing literature (2)

Secondary citing literature (25)
