Abstract
Short text similarity computation is an active research topic in natural language processing. Traditional term-based text similarity algorithms consider only the terms themselves and ignore the effect of word order on short text similarity. This paper proposes a short text similarity method based on common chunks that takes both terms and word order into account: it combines an overlap similarity based on shared terms with a word-order similarity computed between common chunks, and obtains the final similarity value through an adaptively weighted combination of the two. Experimental results show that, compared with traditional algorithms, the proposed algorithm performs better in terms of stability and F-measure (the harmonic mean of precision and recall).
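The abstract names three ingredients (a term-overlap similarity, a word-order similarity over common chunks, and an adaptive weighting) without giving formulas. The following is a minimal Python sketch of how such a combination might look, not the authors' algorithm. Everything in it is an assumption made for illustration: the whitespace tokenizer (Chinese text would need a word segmenter), the greedy longest-first chunk extraction, the inversion-based order score, and the weight `w`, which here grows with the share of tokens covered by common chunks. All function names are hypothetical.

```python
from typing import List, Tuple

Chunk = Tuple[int, int, int]  # (start in a, start in b, length)


def overlap_similarity(a: List[str], b: List[str]) -> float:
    """Overlap coefficient on token sets: |A & B| / min(|A|, |B|)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def common_chunks(a: List[str], b: List[str]) -> List[Chunk]:
    """Greedily extract non-overlapping common chunks, longest first (assumed scheme)."""
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    chunks: List[Chunk] = []
    while True:
        best_len, best_i, best_j = 0, 0, 0
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match while tokens agree and are still unused.
                while (i + k < len(a) and j + k < len(b)
                       and not used_a[i + k] and not used_b[j + k]
                       and a[i + k] == b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return chunks
        for k in range(best_len):
            used_a[best_i + k] = True
            used_b[best_j + k] = True
        chunks.append((best_i, best_j, best_len))


def order_similarity(chunks: List[Chunk]) -> float:
    """1 minus the fraction of chunk pairs whose relative order differs (assumed score)."""
    if not chunks:
        return 0.0
    if len(chunks) == 1:
        return 1.0
    # Order chunks by position in a, then count inversions of their positions in b.
    b_starts = [j for _, j, _ in sorted(chunks)]
    n = len(b_starts)
    inversions = sum(
        1 for x in range(n) for y in range(x + 1, n) if b_starts[x] > b_starts[y]
    )
    return 1.0 - inversions / (n * (n - 1) / 2)


def short_text_similarity(text_a: str, text_b: str) -> float:
    """Weighted combination of overlap and order similarity (weight is an assumption)."""
    a, b = text_a.split(), text_b.split()
    if not a or not b:
        return 0.0
    chunks = common_chunks(a, b)
    covered = sum(length for _, _, length in chunks)
    # Assumed adaptive weight: word order matters more as more tokens are shared.
    w = covered / (len(a) + len(b))
    return (1.0 - w) * overlap_similarity(a, b) + w * order_similarity(chunks)
```

Under these assumptions, two texts with the same words in a different chunk order score lower than identical texts, which is the effect a purely term-based measure cannot capture.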
Source
《重庆理工大学学报(自然科学)》
CAS
2015, No. 8, pp. 88-93 (6 pages)
Journal of Chongqing University of Technology: Natural Science
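Funding
National Natural Science Foundation of China (61173184)
Science and Technology Program of the Chongqing Municipal Education Commission (KJ100821)
Chongqing University of Technology Graduate Innovation Fund (YCX2014227)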
Keywords
short text
word order
common chunks
similarity algorithm