
An Improved Unsupervised Approach to Word Segmentation

Abstract: ESA is an unsupervised approach to word segmentation previously proposed by Wang. It is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In this article, we propose Ex ESA, an extension of ESA. In Ex ESA, the original approach is extended to a 2-pass process, and the ratio of different word lengths is introduced as a third type of information, combined with cohesion and separation. A maximum strategy is adopted to determine the best segmentation of a character sequence in the Selection phase. In the Adjustment phase, Ex ESA re-evaluates separation information and individual information to overcome the overestimation of frequencies. Additionally, a smoothing algorithm is applied to alleviate sparseness. The experimental results show that Ex ESA further improves performance and saves time by properly utilizing more information from un-annotated corpora. Moreover, the parameters of Ex ESA can be predicted by a set of empirical formulae or combined with the minimum description length principle.
Source: China Communications (中国通信(英文版)), SCIE, CSCD, 2015, No. 7, pp. 82-95 (14 pages)
Funding: Supported in part by the National Science Foundation of China under Grants No. 61303105 and 61402304; the Humanity & Social Science general project of the Ministry of Education under Grant No. 14YJAZH046; the Beijing Natural Science Foundation under Grant No. 4154065; the Beijing Educational Committee Science and Technology Development Plan under Grant No. KM201410028017; and Beijing Key Disciplines of Computer Application Technology.
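Keywords: word segmentation; character sequence; smoothing algorithm; maximum strategy; word segmentation method; supervision; individual information; minimum description length; ESA; proper utilization; empirical formulae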

