Abstract
Word sense disambiguation (WSD), one of the most challenging tasks in natural language processing, is currently impeded by the knowledge acquisition bottleneck. Directory label disambiguation, a new application area for WSD, is an essential step in lightweight ontology learning. This article explores a WSD method that draws on Web knowledge, and is therefore not constrained by the knowledge acquisition bottleneck, and applies it to directory label disambiguation. In the proposed approach, the context c of the target word t and its n candidate senses s1…sn are first expanded into vectors using external resources such as Web knowledge (a Web search engine) and WordNet; the vector components are computed with a variant of tf-idf (conditional tf-idf) proposed in the paper. A novel hybrid disambiguation model is then proposed, which combines two factors: the relatedness between each candidate sense and the context of the target word, and the prior distribution of the candidate senses. To our knowledge, no similar approach has previously appeared in Web-based WSD. Experiments were performed on a subset of the DMOZ Web directory containing 1100 target words. The system achieved a precision of 83.40% at 100% recall, about 10 percentage points above the baseline precision of 73.37% obtained by disambiguating purely on the prior distribution of word senses.
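The two-factor scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for conditional tf-idf or for how the hybrid model combines its two factors, so everything below is an assumption made for illustration: `tfidf_vector` uses plain smoothed tf-idf as a stand-in for the conditional variant, relatedness is taken to be cosine similarity, the combination is a linear interpolation with a hypothetical weight `alpha`, and the sense labels, vectors and priors are invented toy values.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Plain smoothed tf-idf weights (a stand-in, NOT the paper's conditional tf-idf)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        term: (count / total) * math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))
        for term, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(context_vec, sense_vecs, priors, alpha=0.5):
    """Score each candidate sense by mixing relatedness with its prior.

    alpha is a hypothetical interpolation weight: alpha=0 reduces to the
    prior-only baseline, alpha=1 uses relatedness to the context only.
    """
    scores = {
        sense: alpha * cosine(context_vec, vec) + (1 - alpha) * priors.get(sense, 0.0)
        for sense, vec in sense_vecs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (invented): the label "java" under a programming-related directory.
context = {"programming": 1.0, "software": 1.0}
senses = {
    "java#language": {"programming": 1.0, "code": 1.0},
    "java#island": {"indonesia": 1.0, "island": 1.0},
}
priors = {"java#language": 0.4, "java#island": 0.6}

best, scores = disambiguate(context, senses, priors)
print(best)  # java#language
```

With these toy values the prior alone would pick "java#island", while the mixed score picks "java#language" because the context vector overlaps only with that sense. This mirrors, in miniature, why a model using both factors can outperform a prior-only baseline.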
Source
《计算机应用与软件》
CSCD
2010, No. 9, pp. 224-227, 282 (5 pages)
Computer Applications and Software
Keywords
Word sense disambiguation
Web-based knowledge
Unsupervised
Lightweight ontology