Abstract
Knowledge acquisition has long been regarded as a bottleneck in natural language processing tasks such as machine translation and information extraction, and treebank-based statistical parsing is no exception. The linguistic knowledge latent in a treebank is very rich, but it cannot be acquired directly; specific strategies are usually needed to incorporate it into a model. Our Chinese statistical parsing model incorporates this knowledge in three ways. First, non-recursive noun phrases and non-recursive verb phrases are re-annotated in the Penn Chinese Treebank, because their boundaries are strongly marked. Second, a new head percolation table is designed based on Xia's table. Third, a context configuration frame is introduced to give a more specific representation of bilexical dependency structures. Together, these three kinds of linguistic knowledge improve the F1 measure by 2.37% and the complete match ratio by 5.36%.
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2005, Issue 3, pp. 61-66 (6 pages)
Journal of Chinese Information Processing
Funding
National 863 Program of China (grant nos. 2003AA111010, 2001AA114010)
Keywords
artificial intelligence
natural language processing
statistical parsing
non-recursive NPs
head percolation table
context configuration frame