

Chinese comma classification based on segmentation and part-of-speech tagging
Abstract In recent years, punctuation, as an important part of discourse, has attracted growing attention from researchers. Research on the Chinese comma, however, has only just begun, and most existing methods build on syntactic analysis; no prior work classifies commas automatically using only the surface information of Chinese sentences. This paper proposes a method for automatic Chinese comma classification based on word segmentation and part-of-speech tagging information, using two supervised machine learning classifiers: a maximum entropy classifier and a CRF classifier. Experiments on the CTB 6.0 corpus show that the CRF gives better overall results than maximum entropy, and that the accuracy of both classifiers is very close to that of methods based on syntactic analysis. This indicates that classifying commas from words and parts of speech is feasible.
Source Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2015, Issue 18, pp. 120-125 (6 pages)
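Funding National Natural Science Foundation of China, Young Scientists Fund (No. 61202162); Doctoral Program Fund of the Ministry of Education (No. 20123201120011)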
Keywords Chinese comma classification; maximum entropy; Conditional Random Field (CRF)
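The abstract describes classifying commas from segmentation and part-of-speech information alone, without a parse tree. A minimal sketch of what such surface feature extraction might look like is given below. The feature templates (a window of surrounding words and tags), the names `comma_features` and `extract_all`, the `<PAD>` symbol, and the sample sentence are all assumptions made for illustration; the record does not state the paper's actual feature set or comma label inventory.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag), e.g. from a segmenter and tagger

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def comma_features(sent: List[Token], i: int, window: int = 2) -> Dict[str, str]:
    """Collect word/POS features around the comma at index i (assumed templates)."""
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        if off == 0:
            continue  # skip the comma token itself
        j = i + off
        word, pos = sent[j] if 0 <= j < len(sent) else (PAD, PAD)
        feats[f"w[{off}]"] = word
        feats[f"p[{off}]"] = pos
    return feats


def extract_all(sent: List[Token], comma: str = "，") -> List[Tuple[int, Dict[str, str]]]:
    """Return (index, features) for every comma token in a tagged sentence."""
    return [(i, comma_features(sent, i)) for i, (w, _) in enumerate(sent) if w == comma]


# Illustrative sentence with made-up but plausible tags.
sentence = [("他", "PN"), ("来", "VV"), ("了", "AS"), ("，", "PU"),
            ("我", "PN"), ("走", "VV"), ("了", "AS"), ("。", "PU")]
instances = extract_all(sentence)
```

Each feature dictionary could then be paired with a gold comma label and passed to a maximum entropy or CRF toolkit. For the CRF case, the instances of one sentence would be kept together as a sequence, so that the labels of neighbouring commas can influence one another.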








