Abstract
Chinese word similarity detection algorithms that rely on either the sound-character code or HowNet alone cannot consider the sound, shape, and meaning of Chinese characters at the same time. This leads to inaccurate results and makes the algorithms hard to apply in complex scenarios. This paper proposes a Chinese word similarity detection algorithm that combines an improved sound-character code with HowNet. First, the Gray-code-based sound code is improved to handle cases in which an initial or a final is missing. Second, the four-corner number encoding used for the shape code is replaced with a stroke-order encoding, which describes Chinese characters in finer detail. Third, the string matching method is improved on the basis of the weighted edit distance. Finally, the improved sound-character code is combined with HowNet. Experimental results show that the algorithm achieves higher accuracy in Chinese word similarity detection, whether similarity is judged by sound and shape or by meaning.
Authors
WANG Hua-min (王华敏)
HUANG Meng-xing (黄梦醒)
FENG Wen-long (冯文龙)
FENG Si-ling (冯思玲)
Hainan University, Haikou, Hainan 570228, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal (北大核心)
2022, Issue 8, pp. 460-465, 472 (7 pages in total)
Funding
National Key Research and Development Program of China (2018YFB1404400).
Keywords
Chinese word similarity
Chinese character similarity
HowNet
Improved sound-character code (IScc)
Edit distance