Abstract
To address the vocabulary mismatch ("word mismatch") problem that data sparsity causes in current sentence retrieval methods, this paper proposes a sentence retrieval method that combines WordNet with word embeddings. First, the personalized PageRank algorithm is run over the WordNet graph of concepts and semantic relations to find the synonym sets most related to the query terms; expanding the query with them partially alleviates the sparsity of the query. Second, the query and the sentences are represented with word embeddings obtained by training a neural network language model, the Continuous Skip-gram Model, on a large-scale corpus. Finally, Word Mover's Distance (WMD) is applied to compute the semantic similarity between the query and each sentence, which uses semantic information to further reduce the impact of vocabulary mismatch; the sentences, ranked by similarity from highest to lowest, form the retrieval result. Evaluation on the TREC 2003 and TREC 2004 tasks shows that the proposed method is significantly superior to the baseline sentence retrieval method: MAP and R-Precision are 13.29% and 13.54% higher, respectively, than the second-best result.
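The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the graph `toy_graph` and the 2-D vectors in `toy_vectors` are hypothetical stand-ins for WordNet and trained Skip-gram embeddings, all function names are invented for this illustration, and `relaxed_wmd` computes the relaxed lower bound of WMD (each word moves all of its mass to its nearest counterpart) rather than solving the full transport problem.

```python
import math


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration whose restart mass is concentrated on the seed nodes."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for node, neighbours in graph.items():
            # Every node in the toy graph has at least one neighbour.
            share = damping * rank[node] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        rank = nxt
    return rank


def expand_query(graph, query_terms, top_k=2):
    """Append the top_k highest-ranked non-query nodes to the query."""
    rank = personalized_pagerank(graph, set(query_terms))
    candidates = [n for n in graph if n not in query_terms]
    candidates.sort(key=lambda n: rank[n], reverse=True)
    return list(query_terms) + candidates[:top_k]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(words_a, words_b, vectors):
    """Relaxed WMD: a lower bound on WMD with uniform word weights."""
    def one_way(src, dst):
        total = sum(
            min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src
        )
        return total / len(src)
    return max(one_way(words_a, words_b), one_way(words_b, words_a))


def rank_sentences(query_terms, sentences, vectors):
    """Sort sentences by increasing distance, i.e. decreasing similarity."""
    return sorted(sentences, key=lambda s: relaxed_wmd(query_terms, s, vectors))


# Hypothetical symmetric relation graph standing in for WordNet.
toy_graph = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car", "vehicle"],
    "vehicle": ["car", "automobile", "truck"],
    "truck": ["vehicle"],
    "apple": ["fruit"],
    "fruit": ["apple"],
}

# Hypothetical 2-D embeddings standing in for trained word vectors.
toy_vectors = {
    "car": (1.0, 0.0),
    "automobile": (0.9, 0.1),
    "vehicle": (0.8, 0.2),
    "truck": (0.7, 0.3),
    "apple": (0.0, 1.0),
    "fruit": (0.1, 0.9),
}

expanded = expand_query(toy_graph, ["car"], top_k=2)
sentences = [["apple", "fruit"], ["automobile", "truck"]]
ranked = rank_sentences(expanded, sentences, toy_vectors)
print(expanded)
print(ranked)
```

In a real setting the graph would presumably be built from WordNet synsets and relations, and the exact WMD would need an optimal-transport solver; the relaxed variant is used here only to keep the sketch dependency-free.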
Source
Journal of Information Engineering University (《信息工程大学学报》)
2017, No. 4, pp. 486-491 (6 pages)
Funding
Supported by the National Social Science Fund of China (14BXW028)