Abstract
This paper proposes a Chinese word segmentation (CWS) method that combines unsupervised and supervised learning by incorporating Accessor Variety (AV) into a segmentation system based on Conditional Random Fields (CRFs). Because AV is unreliable when training data are limited, a normalized variant is introduced to dampen the fluctuations that arise when AV values are computed. Word segmentation experiments on the Bakeoff-4 data show that normalized AV improves performance in both the closed and the open tracks.
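For readers unfamiliar with the statistic, the following is a minimal sketch of how raw Accessor Variety is commonly computed from unlabeled text, assuming the usual definition: the AV of a string is the smaller of its number of distinct left-context characters and its number of distinct right-context characters, with each sentence-initial or sentence-final occurrence counted as a separate context. The function name `accessor_variety`, the `max_len` parameter, and the toy corpus are illustrative assumptions, not taken from the paper. The normalization step and the way AV values are encoded as CRF features are not specified in the abstract, so they are not shown.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {substring: AV} for every substring of up to max_len characters.

    Hypothetical helper for illustration only; it assumes
    AV(s) = min(distinct left contexts, distinct right contexts).
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for idx, sent in enumerate(sentences):
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                # A sentence boundary is tagged with the sentence index, so
                # every boundary occurrence counts as its own context.
                left[s].add(sent[i - 1] if i > 0 else ("<S>", idx))
                right[s].add(sent[j] if j < n else ("</S>", idx))
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy unlabeled corpus (an assumption, not data from the paper).
corpus = ["门把手坏了", "请把手拿开", "这把手枪", "把手洗干净"]
av = accessor_variety(corpus)
print(av["把手"])  # 4: four distinct left contexts and four distinct right contexts
```

A string that appears in many different contexts receives a high AV and is more likely to be a word. With a small corpus these counts are low and unstable, which is the weakness the normalized variant is meant to address.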
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2010, No. 1, pp. 15-19 (5 pages)
Funding
Program of Introducing Talents of Discipline to Universities (B08004)
National Key Technology R&D Program (2007BAH05B02-04)