Abstract
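Part-of-speech (POS) multi-category ambiguity is an important class of ambiguity that natural language understanding must resolve, and handling the POS ambiguity of unknown words is especially difficult. Based on the hidden Markov model (HMM), this paper proposes a new method for handling unknown words in POS tagging by converting the problem of tagging an unknown word into that of computing its word emission probability. Apart from an annotated monolingual corpus, the method uses no other resources (such as grammar dictionaries or grammar rules). Its accuracy reaches about 97% in the closed test and about 95% in the open test, which is essentially adequate for practical use. The paper also reports comparative test results against other HMM-based POS tagging methods, and the results show that the proposed method achieves a considerably higher tagging accuracy.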
Part-of-speech (POS) ambiguity is an important ambiguity phenomenon in natural language processing that urgently needs to be resolved. Moreover, disambiguating the POS of new words is very difficult. In this paper, a new HMM-based approach is proposed to solve this problem by converting the POS tagging problem into the problem of calculating a word's emission probability. The approach uses nothing more than a tagged corpus (e.g. no grammar dictionaries and no grammar rules), and the results show that the accuracy reaches 97% in the closed test and 92% in the open test.
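To make the idea of treating unknown-word tagging as an emission-probability problem concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the estimator proposed in the paper, which the abstract does not specify: the add-one smoothing used for unseen words, the function names (`train`, `emission_logp`, `viterbi`), and the toy corpus are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count tag transitions and word emissions from a tagged corpus."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def emission_logp(emit, vocab_size, tag, word):
    """log P(word | tag) with add-one smoothing.

    ASSUMPTION: add-one smoothing is only a placeholder that gives an
    unseen word a non-zero emission probability under every tag; the
    paper's own estimator is not described in the abstract.
    """
    total = sum(emit[tag].values())
    return math.log((emit[tag][word] + 1) / (total + vocab_size + 1))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    vocab_size = len({w for t in tags for w in emit[t]})

    def trans_logp(prev, tag):
        # Add-one smoothed log P(tag | prev).
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + 1) / (total + len(tags)))

    # Each column maps tag -> (best log score, backpointer to previous tag).
    columns = [{
        t: (trans_logp("<s>", t) + emission_logp(emit, vocab_size, t, words[0]), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            e = emission_logp(emit, vocab_size, t, word)
            column[t] = max((columns[-1][p][0] + trans_logp(p, t) + e, p) for p in tags)
        columns.append(column)

    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VB")],
]
trans, emit = train(corpus)
print(viterbi(["the", "wug", "barks"], trans, emit))  # "wug" is unseen
```

Because the smoothed emission probability of the unseen token "wug" is the same under every tag in this sketch, its tag is decided entirely by the transition probabilities of the surrounding context. A better emission estimate for unknown words is exactly the component that a method of the kind described in the abstract would replace.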
Source
《中文信息学报》
CSCD
PKU Core Journal
2003, No. 5, pp. 1-5 (5 pages)
Journal of Chinese Information Processing
Funding
Supported by the National Natural Science Foundation of China (60272088)