Abstract
In recent years, punctuation has attracted growing attention from researchers as an important component of discourse. Research on the Chinese comma, however, has only just begun, and most existing methods rely on syntactic analysis; no prior work classifies commas automatically from the surface information of Chinese sentences. This paper proposes a method for automatic Chinese comma classification based on word segmentation and part-of-speech tagging, using two supervised machine learning classifiers: a maximum entropy classifier and a conditional random field (CRF) classifier. Experiments on the CTB 6.0 corpus show that the CRF classifier performs better overall than the maximum entropy classifier, and that the accuracy of both classifiers is very close to that of methods based on syntactic analysis. This indicates that comma classification based on words and part-of-speech tags is feasible.
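To make the idea of classifying commas from surface information concrete, the following is a minimal sketch of how word and part-of-speech features around each comma might be collected before being passed to a maximum entropy or CRF classifier. The abstract does not give the paper's feature templates, so the window size, the feature names, the `<PAD>` symbol, and the helper functions `comma_features` and `extract_comma_instances` are all assumptions made for illustration. The example sentence uses CTB-style tags.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag), as produced by a segmenter and tagger

COMMA = "，"  # full-width Chinese comma
PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def comma_features(sent: List[Token], i: int, window: int = 2) -> Dict[str, str]:
    """Collect words and POS tags in a window around the comma at index i.

    This template is an assumption for illustration, not the paper's own.
    """
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        if off == 0:
            continue  # skip the comma itself
        j = i + off
        word, pos = sent[j] if 0 <= j < len(sent) else (PAD, PAD)
        feats[f"w[{off}]"] = word
        feats[f"p[{off}]"] = pos
    # Conjunction of the POS tags immediately left and right of the comma.
    feats["p[-1]|p[1]"] = feats["p[-1]"] + "|" + feats["p[1]"]
    return feats


def extract_comma_instances(sent: List[Token]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (index, features) for every comma token in the sentence."""
    return [(i, comma_features(sent, i)) for i, (w, _) in enumerate(sent) if w == COMMA]


if __name__ == "__main__":
    sentence = [("他", "PN"), ("来", "VV"), ("了", "AS"), ("，", "PU"),
                ("我", "PN"), ("很", "AD"), ("高兴", "VA"), ("。", "PU")]
    for index, feats in extract_comma_instances(sentence):
        print(index, feats)
```

Each feature dictionary could serve as one training instance for a maximum entropy classifier, or as one position in a sequence for a CRF; the comma category labels themselves would come from the annotated corpus.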
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
PKU Core Journal (北大核心)
2015, No. 18, pp. 120-125 (6 pages)
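Funding
National Natural Science Foundation of China, Young Scientists Fund (No. 61202162)
Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China (No. 20123201120011)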
Keywords
Chinese comma classification
maximum entropy
Conditional Random Field (CRF)