Abstract
To select the optimal context window for an ambiguous word, the paper applies cross-validation and takes the candidate window with the lowest error rate as the optimal one. Applied to the SemEval-2007 Chinese data set, the method yields an optimal context window of [-2, +2]. To verify this result, a disambiguation test is run on the SemEval-2007 test set; it shows that the window chosen by cross-validation improves word sense disambiguation (WSD) accuracy to some extent. The optimal context windows for ambiguous words of different parts of speech are also discussed.
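The window-selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the paper's implementation. The abstract does not name the classifier, the number of folds, or the candidate set, so the sketch assumes a small multinomial Naive Bayes model with add-one smoothing, 5-fold cross-validation, and candidate windows [-l, +r] with l and r up to 3. All function names (`window_features`, `cv_error`, `best_window`) are hypothetical.

```python
from collections import Counter, defaultdict
import math


def window_features(tokens, target_idx, left, right):
    """Words inside the window [-left, +right] around the target, target excluded."""
    lo = max(0, target_idx - left)
    hi = min(len(tokens), target_idx + right + 1)
    return [tokens[i] for i in range(lo, hi) if i != target_idx]


def train_nb(samples):
    """Collect counts for a multinomial Naive Bayes model (an assumed stand-in classifier)."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in samples:
        sense_counts[sense] += 1
        word_counts[sense].update(feats)
        vocab.update(feats)
    return sense_counts, word_counts, vocab


def predict_nb(model, feats):
    """Return the sense with the highest smoothed log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    v = len(vocab) + 1  # +1 reserves probability mass for unseen words
    best_sense, best_score = None, -math.inf
    for sense in sorted(sense_counts):  # sorted for deterministic tie-breaking
        n = sum(word_counts[sense].values())
        score = math.log(sense_counts[sense] / total)
        for w in feats:
            score += math.log((word_counts[sense][w] + 1) / (n + v))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


def cv_error(instances, left, right, k=5):
    """k-fold cross-validation error rate for one candidate window.

    instances: list of (tokens, target_idx, sense) triples.
    """
    data = [(window_features(t, i, left, right), s) for t, i, s in instances]
    errors = 0
    for fold in range(k):
        test = data[fold::k]
        train = [d for j, d in enumerate(data) if j % k != fold]
        model = train_nb(train)
        errors += sum(predict_nb(model, f) != s for f, s in test)
    return errors / len(data)


def best_window(instances, max_width=3, k=5):
    """Pick the candidate window with the lowest CV error (narrower window on ties)."""
    candidates = [(l, r)
                  for l in range(max_width + 1)
                  for r in range(max_width + 1)
                  if l + r > 0]
    errors = {c: cv_error(instances, c[0], c[1], k=k) for c in candidates}
    best = min(candidates, key=lambda c: (errors[c], c[0] + c[1]))
    return best, errors
```

Calling `best_window(instances)` on a list of `(tokens, target_idx, sense)` triples returns the selected `(left, right)` pair together with the error rate of every candidate, so the full error curve can be inspected. Breaking ties in favour of the narrower window is a design choice of this sketch, not something stated in the source.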
Source
New Technology of Library and Information Service (《现代图书情报技术》)
CSSCI
Peking University Core Journal (北大核心)
2009, No. 7, pp. 49-53 (5 pages)
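Funding
One of the research outcomes of the National Natural Science Foundation of China project "Research on Feature Extraction Methods for Text Collections and Their Applications" (Grant No. 70673070)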
Keywords
Word sense disambiguation
Context window
Feature selection
Chinese