Abstract
Multi-category word ambiguity directly affects part-of-speech (POS) tagging accuracy. This paper proposes a disambiguation method for Vietnamese multi-category words that incorporates the linguistic characteristics of the language. First, a Vietnamese multi-category word dictionary and a multi-category word corpus are built, and an effective feature set is selected by analyzing the linguistic features of Vietnamese and the characteristics of its multi-category words. Second, exploiting the ability of conditional random fields (CRFs) to incorporate arbitrary features, the method adds syntactic-constituent and indicator-word features to the word and POS context features to obtain the disambiguation model. Finally, experiments on the multi-category word corpus yield an accuracy of 87.23%. The results show that the proposed method is effective and feasible and can improve POS tagging accuracy.
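The abstract names four feature groups (word context, POS context, syntactic constituent, indicator word) but does not give the feature templates. The following is a minimal sketch of how such features might be assembled for one ambiguous token, not the authors' implementation. The function name `token_features`, the window size of one, the `INDICATOR_WORDS` set, and the tags and chunk labels in the example sentence are all illustrative assumptions.

```python
from typing import Dict, List, Optional, Set

# Hypothetical indicator words: function words whose presence next to an
# ambiguous word hints at its category. This small set is an assumption.
INDICATOR_WORDS: Set[str] = {"những", "các", "đã", "đang", "sẽ", "rất"}


def token_features(
    words: List[str],
    pos_tags: List[Optional[str]],
    chunks: List[str],
    i: int,
) -> Dict[str, str]:
    """Build a feature dict for the multi-category word at position i.

    pos_tags holds the tags of the context words; the entry for the
    ambiguous word itself may be None and is never read.
    """
    has_prev = i > 0
    has_next = i + 1 < len(words)
    prev_word = words[i - 1].lower() if has_prev else "<BOS>"
    next_word = words[i + 1].lower() if has_next else "<EOS>"
    prev_pos = (pos_tags[i - 1] or "<UNK>") if has_prev else "<BOS>"
    next_pos = (pos_tags[i + 1] or "<UNK>") if has_next else "<EOS>"
    return {
        # Word context features
        "w0": words[i].lower(),
        "w-1": prev_word,
        "w+1": next_word,
        # POS context features
        "p-1": prev_pos,
        "p+1": next_pos,
        # Syntactic-constituent feature: label of the chunk containing i
        "chunk": chunks[i],
        # Indicator-word features: is a neighbour a known indicator word?
        "ind-1": str(prev_word in INDICATOR_WORDS),
        "ind+1": str(next_word in INDICATOR_WORDS),
    }


# Illustrative sentence: "khó_khăn" can be an adjective or a noun; the
# preceding plural marker "những" suggests the noun reading.
words = ["Chúng_tôi", "đã", "vượt_qua", "những", "khó_khăn"]
pos_tags = ["P", "R", "V", "L", None]        # assumed tags, target unknown
chunks = ["NP", "VP", "VP", "NP", "NP"]      # assumed chunk labels
feats = token_features(words, pos_tags, chunks, 4)
print(feats)
```

Feature dicts of this shape, one per token, are the usual input format for CRF toolkits; the label predicted for the ambiguous position would then be its disambiguated POS tag.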
Authors
Guo Jianyi (郭剑毅)
Zhao Chen (赵晨)
Liu Yanchao (刘艳超)
Mao Cunli (毛存礼)
Yu Zhengtao (余正涛)
Affiliations: School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
Source
Journal of Data Acquisition and Processing (《数据采集与处理》)
CSCD
Peking University Core Journals (北大核心)
2019, Issue 4, pp. 577-584 (8 pages)
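Funding
Supported by the National Natural Science Foundation of China (61262041, 61562052, 61662041) and the Key Program of the National Natural Science Foundation of China (61732005)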
Keywords
multi-category word disambiguation (兼类词消歧)
multi-category word dictionary (兼类词词典)
multi-category word corpus (兼类词语料库)
linguistic characteristics (语言特征)
conditional random fields model (条件随机场模型)
Vietnamese (越南语)