Abstract
Computing the semantic similarity of words is a foundation of natural language processing research. To address the uneven density that commonly affects path-based methods, a method that fuses path distance with information content is proposed: a single smoothing parameter combines the two to adjust the semantic distance between concepts, so that the similarity values computed by the path-based approach become more reasonable. The method has few parameters, which avoids the overfitting that other methods incur by introducing too many parameters, and it generalizes well. Experiments show that the Pearson correlation coefficient between the similarity values computed by the method and the human judgments in an international standard test set reaches 0.8523, indicating good performance. Analysis of the experimental results further shows that the results are only slightly affected by the algorithm's parameters, indicating that the proposed algorithm is robust.
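The abstract does not give the fusion formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy is-a taxonomy with made-up counts, a hypothetical blending weight `alpha`, and a hypothetical edge length of (1 - alpha) + alpha * (IC(child) - IC(parent)). The `pearson` helper shows the kind of correlation used to compare computed similarities with human judgments.

```python
import math

# Toy is-a taxonomy (child -> parent) and made-up corpus counts.
# Everything here is hypothetical and only serves the illustration.
PARENT = {"entity": None, "animal": "entity", "vehicle": "entity",
          "dog": "animal", "cat": "animal", "car": "vehicle"}
COUNT = {"entity": 0, "animal": 5, "vehicle": 5, "dog": 30, "cat": 20, "car": 40}


def ancestors(concept):
    """Return the chain from a concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def subtree_count(concept):
    """Count of a concept plus the counts of all of its descendants."""
    children = [c for c, p in PARENT.items() if p == concept]
    return COUNT[concept] + sum(subtree_count(c) for c in children)


TOTAL = subtree_count("entity")


def ic(concept):
    """Information content: -log of the cumulative relative frequency."""
    return -math.log(subtree_count(concept) / TOTAL)


def blended_distance(a, b, alpha=0.5):
    """Sum assumed edge lengths along the path a -> common ancestor -> b.

    Assumed edge length: (1 - alpha) * 1 + alpha * (IC(child) - IC(parent)).
    alpha = 0 gives plain edge counting; alpha = 1 gives the IC-only
    distance IC(a) + IC(b) - 2 * IC(common ancestor).
    """
    chain_a, chain_b = ancestors(a), ancestors(b)
    common = next(c for c in chain_a if c in chain_b)
    dist = 0.0
    for chain in (chain_a, chain_b):
        for node in chain[:chain.index(common)]:
            dist += (1 - alpha) + alpha * (ic(node) - ic(PARENT[node]))
    return dist


def similarity(a, b, alpha=0.5):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + blended_distance(a, b, alpha))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In this toy setup, `similarity("dog", "cat")` is higher than `similarity("dog", "car")`, and sweeping `alpha` from 0 to 1 moves the measure from pure edge counting toward a purely information-content-based distance. A robustness check like the one the abstract reports would repeat the Pearson evaluation over a range of `alpha` values.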
Authors
郭承湘
唐忠
石怀明
GUO Cheng-xiang; TANG Zhong; SHI Huai-ming (Guangxi Food and Drug Security Center for Information and Monitoring, Nanning 530029, China; School of Information Management, Guangxi Medical University, Nanning 530021, China)
Source
《广西大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2018, No. 3, pp. 1074-1081 (8 pages)
Journal of Guangxi University(Natural Science Edition)
Funding
National Key Research and Development Program of China (2017YFC1602000)
Keywords
semantic similarity
semantic distance
information content
inhomogeneity
robustness