Abstract
TF-IDF (term frequency-inverse document frequency) is a traditional statistics-based method for calculating text similarity. Because it ignores the semantic information of words, it cannot accurately reflect the similarity between texts. To address this problem, this paper proposes a method for calculating the similarity of science and technology projects that combines semantic understanding with TF-IDF. After word segmentation of the project texts, HowNet (《知网》) is used to calculate the semantic similarity between the feature terms of two projects, and TF-IDF is used to calculate the weight of each feature term. The feature terms whose weight exceeds a given threshold are then weighted to obtain the project similarity value. Experimental results show that the method outperforms both pure TF-IDF and a purely semantic-understanding approach.
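The pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's actual formulas: the abstract does not give the exact weighting or aggregation scheme, so the smoothed IDF, the best-match aggregation, and the symmetric averaging below are all assumptions. The function `toy_word_sim` is a hypothetical stand-in for a HowNet-based word similarity, and the inputs are assumed to be already segmented into tokens.

```python
import math
from collections import Counter

# Hypothetical stand-in for a HowNet-based word similarity (assumption):
# identical words score 1.0, listed pairs get a hand-set score, all else 0.0.
TOY_SYNONYMS = {frozenset(("similarity", "resemblance")): 0.9}


def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SYNONYMS.get(frozenset((a, b)), 0.0)


def tf_idf_weights(doc_tokens, corpus):
    """TF-IDF weight of every term in doc_tokens, relative to corpus."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF; the exact variant used in the paper is not stated.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def project_similarity(tokens_a, tokens_b, corpus, word_sim, threshold=0.0):
    """Weighted semantic similarity of two segmented project texts."""
    # Keep only feature terms whose TF-IDF weight exceeds the threshold.
    wa = {t: w for t, w in tf_idf_weights(tokens_a, corpus).items() if w > threshold}
    wb = {t: w for t, w in tf_idf_weights(tokens_b, corpus).items() if w > threshold}
    if not wa or not wb:
        return 0.0

    def directed(src, dst):
        # Each source term is matched to its most similar target term,
        # and the matches are averaged using the TF-IDF weights.
        total = sum(src.values())
        return sum(w * max(word_sim(t, u) for u in dst) for t, w in src.items()) / total

    # Average both directions so the measure is symmetric (an assumption).
    return 0.5 * (directed(wa, wb) + directed(wb, wa))


if __name__ == "__main__":
    a = ["semantic", "similarity"]
    b = ["semantic", "resemblance"]
    corpus = [a, b, ["database", "index"]]
    print(project_similarity(a, b, corpus, toy_word_sim))
```

With a real HowNet similarity plugged in as `word_sim`, near-synonymous feature terms would contribute to the score even when the two texts share no identical words, which is the gap in pure TF-IDF that the abstract points to.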
Source
Computer Era (《计算机时代》)
2015, No. 5, pp. 1-3, 6 (4 pages in total)
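Funding
2013 Zhejiang Province Public Welfare Technology Application Research Project "Research and Implementation of Semantics-Based Duplicate Checking for Science and Technology Projects" (2013C33G2040027), 2013-2014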
Keywords
semantic understanding
HowNet (《知网》)
weight of feature terms
similarity calculation
TF-IDF