摘要
汉语依存树库的建设相对其他语言如英语,在规模和质量上还有一些差距。树库标注需要付出很大的人力物力,并且保证树库质量也比较困难。该文尝试通过规则和统计相结合的方法,将宾州汉语短语树库PennChinese Treebank转化为哈工大依存树库HIT-IR-CDT的体系结构,从而增大现有依存树库的规模。将转化后的树库加入HIT-IR-CDT,训练和测试依存句法分析器的性能。实验表明,加入少量经转化后的树库后,依存句法分析器的性能有所提高;但加入大量树库后,性能反而下降。经过细致分析,作为一种利用多种树库提高依存句法分析器性能的方法,短语转依存还存在很多需要深入研究的方面。
The progress of Chinese dependency treebank construction has fallen behind other languages, such as English, in terms of scale and quality. Building a large scale treebank needs a lot of human and material resources. Meanwhile, it is very difficult to guarantee the quality of the treebank. In this paper, we explore a new method which combines rule based method and statistical-based method to convert a constituent treebank named Penn Chinese Treebank to a dependency treebank which follows the annatation standard of HIT Chinese Dependency Treebank (HIT-IR-CDT). We increase the size of training data by adding converted treebank into HIT-IR CDT and retrain the dependency parser. Experiments show that small addition of converted treebank can improve the performance of dependency parser, while large addition will bring it down. Through detailed analysis, we believe that convertion of constituent to dependency treebank still needs in depth research as a method of improving performance of dependency parser by utilizing different treebanks.
出处
《中文信息学报》
CSCD
北大核心
2008年第6期14-19,共6页
Journal of Chinese Information Processing
基金
自然科学基金资助项目(60675034
60575042)
国家863计划资助项目(2006AA01Z145)
关键词
计算机应用
中文信息处理
短语结构树库
依存结构树库
依存句法分析
computer application
Chinese information processing
constituent-based treebank
dependency treebank
dependency parsing