Abstract
Similarity comparison between texts relies mainly on keyword matching and lacks an in-depth analysis of how keywords combine. Addressing this combination property, the paper observes that the more keywords appear together in order, the more they contribute to the similarity between texts, and it proposes a non-linear semantic relevance function based on the number of keywords in a combination. All keyword combination blocks in the texts are extracted on the basis of LCS. The method is applied to short-text similarity comparison on an open benchmark dataset, the Microsoft Research Paraphrase Corpus (MSRP). Experimental results show that, compared with traditional algorithms, both accuracy and F1 improve for short-text similarity computation, with the F1 gain being the more pronounced.
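The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual relevance function or block-extraction procedure, so everything below is an assumption made for illustration: LCS is read as longest common subsequence, a "combination block" is taken to be a run of LCS matches that are adjacent in both texts, and the non-linear weight is a simple power `k ** alpha` with a hypothetical exponent `alpha > 1`. The names `lcs_pairs`, `combination_blocks`, and `block_similarity` are hypothetical as well.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def combination_blocks(a, b):
    """Group LCS matches into blocks that are contiguous in both token lists."""
    blocks, prev = [], None
    for i, j in lcs_pairs(a, b):
        if prev is not None and i == prev[0] + 1 and j == prev[1] + 1:
            blocks[-1].append(a[i])   # extends the current ordered combination
        else:
            blocks.append([a[i]])     # starts a new block
        prev = (i, j)
    return blocks


def block_similarity(a, b, alpha=2.0):
    """Assumed non-linear score: each block of k keywords contributes k ** alpha.

    Dividing by max(len(a), len(b)) ** alpha keeps the value in [0, 1]
    when alpha >= 1; this normalisation is an illustrative choice.
    """
    if not a or not b:
        return 0.0
    score = sum(len(block) ** alpha for block in combination_blocks(a, b))
    return score / max(len(a), len(b)) ** alpha


if __name__ == "__main__":
    contiguous = block_similarity(list("abcxy"), list("abcpq"))
    scattered = block_similarity(list("axbyc"), list("apbqc"))
    print(contiguous, scattered)
```

Both pairs in the usage example share three matching tokens, but the pair in which they form a single ordered block scores higher than the pair in which they are scattered, which is the behaviour the abstract attributes to a non-linear, combination-aware function.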
Authors
周丽杰
于伟海
郭成
ZHOU Lijie; YU Weihai; GUO Cheng (Electronic Teaching Center, Yantai Vocational College, Yantai, Shandong 264670, China; Yantai Normal Language Teaching Center, Yantai, Shandong 264670, China; School of Software Technology, Dalian University of Technology, Dalian, Liaoning 116620, China)
Source
《计算机工程与应用》
CSCD
Peking University Core Journal
2016, No. 19, pp. 90-93 (4 pages)
Computer Engineering and Applications
Funding
National Natural Science Foundation of China (No. 61401060, No. 61272173)
Shandong Province Higher Education Science and Technology Program (No. J12LN73)
Keywords
combination of keywords
non-linear semantic relevance
semantic relevance function
text similarity