Abstract
This paper applies co-training, a semi-supervised machine learning method, to Chinese text chunking. We first define the Chinese chunk and give a formal definition of the co-training algorithm. We then propose a consistency-based example selection method for co-training that combines two classifiers, a Transductive HMM and the transformation-based learner fnTBL, into a single classification system. The system is trained on a small labeled Chinese treebank together with a large unlabeled Chinese corpus. It is compared with self-training and with a baseline in which a single classifier (Transductive HMM or fnTBL) is trained on the small treebank alone. Co-training improves on the baseline: the F-values of the two classifiers reach 85.34% and 83.41%, gains of 2.13 and 7.21 points respectively.
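The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one plausible reading of consistency-based selection: in each round both taggers label the unlabeled pool, and a sentence is added to the training set only when the two tag sequences agree on every token. The names `make_lexicon_trainer`, `select_agreed`, and `co_train` are hypothetical, and the toy lexicon taggers merely stand in for the Transductive HMM and fnTBL.

```python
from collections import Counter


def make_lexicon_trainer(fallback=None):
    """Return a toy trainer (a stand-in for a real chunker, not the paper's model).

    The trained tagger assigns each known token its most frequent tag.
    Unknown tokens get `fallback`, or the overall most frequent tag if
    `fallback` is None.
    """
    def train(labeled):
        per_token, overall = {}, Counter()
        for sent, tags in labeled:
            for tok, tag in zip(sent, tags):
                per_token.setdefault(tok, Counter())[tag] += 1
                overall[tag] += 1
        default = fallback or overall.most_common(1)[0][0]
        lexicon = {tok: c.most_common(1)[0][0] for tok, c in per_token.items()}
        return lambda sent: [lexicon.get(tok, default) for tok in sent]
    return train


def select_agreed(unlabeled, tag_a, tag_b):
    """Split the pool into (agreed, remaining).

    Assumption: "consistent" means both taggers output the same tag
    for every token of the sentence.
    """
    agreed, remaining = [], []
    for sent in unlabeled:
        tags_a, tags_b = tag_a(sent), tag_b(sent)
        if tags_a == tags_b:
            agreed.append((sent, tags_a))
        else:
            remaining.append(sent)
    return agreed, remaining


def co_train(train_a, train_b, labeled, unlabeled, rounds=5):
    """Grow the labeled set with sentences on which both taggers agree."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        tag_a, tag_b = train_a(labeled), train_b(labeled)
        agreed, pool = select_agreed(pool, tag_a, tag_b)
        if not agreed:  # nothing new to learn from; stop early
            break
        labeled.extend(agreed)
    return train_a(labeled), train_b(labeled), labeled
```

Calling `co_train(make_lexicon_trainer(fallback="O"), make_lexicon_trainer(), seed, pool)` with a small labeled `seed` list and an unlabeled `pool` returns the two retrained taggers and the enlarged training set. A self-training baseline would instead let a single tagger add its own predictions, without the agreement check.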
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2005, Issue 3, pp. 73-79 (7 pages)
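Funding
Key Science and Technology Research Project of the Ministry of Education of China (104065)
Joint project of the National Natural Science Foundation of China and Microsoft Research Asia (60203019)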