Abstract
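To address the brevity and feature sparsity of short texts, a short-text similarity calculation method that combines co-occurrence distance and discrimination is proposed. On the one hand, the method uses the distance between two co-occurring words across the entire short-text corpus to compute their co-occurrence distance correlation. On the other hand, it computes co-occurrence discrimination to improve the accuracy of the distance correlation, then applies relevance weighting to the terms in each text, and finally computes the similarity of two texts from the term weights and the co-occurrence distance correlation between terms. Experimental results show that the proposed method improves the accuracy of short-text similarity calculation.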
To address the severe sparseness and high dimensionality typical of short texts, we propose a short-text similarity measure based on co-occurrence distance and discrimination. On the one hand, the method leverages the co-occurrence distance between terms in each document to determine their co-occurrence distance correlation. On the other hand, we calculate the co-occurrence discrimination to improve the accuracy of the co-occurrence distance correlation, and then calculate the relevance weight of each term in the text. The similarity between two short texts is then calculated from the term weights and the co-occurrence distance between terms. Experimental results show that the proposed method outperforms the baseline algorithm in terms of performance and efficiency in similarity calculation.
Authors
刘文
马慧芳
脱婷
陈海波
LIU Wen; MA Hui-fang; TUO Ting; CHEN Hai-bo (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China)
Source
《计算机工程与科学》
CSCD
PKU Core (Peking University Core Journals)
2018, Issue 7, pp. 1281-1286 (6 pages)
Computer Engineering & Science
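Funding
National Natural Science Foundation of China (61762078, 61363058)
Research Project of the Guangxi Key Laboratory of Trusted Software (KX201705)
Northwest Normal University Student Innovation Ability Program (CX2018Y054)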
Keywords
short text
co-occurrence distance correlation
co-occurrence discrimination
term weighting
similarity calculation