Abstract
New word identification is a basic task in natural language processing and supports applications such as building Chinese dictionaries and analyzing the sentiment tendency of words. However, current new word identification methods do not specifically target homophonic neologisms, so their precision on such words is low. To address this problem, a Chinese homophonic neologism discovery method based on Pinyin similarity is proposed, which improves identification precision by comparing the Pinyin of new and old words. First, the text is preprocessed, Average Mutual Information (AMI) is calculated to measure the internal cohesion of candidate words, and an improved branch entropy is used to determine the boundaries of candidate new words. Then, each retained word is converted into Pinyin with similar pronunciation and compared for similarity with the Pinyin of old words in a Chinese dictionary, and the most similar comparison result is kept. Finally, if that result exceeds a threshold, the new word in it is taken as a homophonic neologism and the corresponding old word as its original word. Experimental results on a self-built Weibo dataset show that, compared with BNshCNs (Blended Numeric and symbolic homophony Chinese Neologisms) and DSSCNN (a similarity computing model based on Dependency Syntax and Semantics), the proposed method improves precision by 0.51 and 5.27 percentage points, recall by 2.91 and 6.31 percentage points, and F1 score by 1.75 and 5.81 percentage points, respectively, indicating better identification of Chinese homophonic neologisms.
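The Pinyin comparison stage described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so a normalized edit distance over toneless Pinyin strings is used here purely as an assumption. The tiny PINYIN table, the function names, and the threshold value 0.8 are all hypothetical and not taken from the paper. Candidate extraction with AMI and branch entropy is assumed to have happened already.

```python
# Hypothetical toy table of toneless Pinyin readings; a real system would
# need a full lexicon and would have to handle polyphonic characters.
PINYIN = {
    "杯": "bei", "具": "ju", "悲": "bei", "剧": "ju",
    "神": "shen", "马": "ma", "什": "shen", "么": "me",
    "电": "dian", "脑": "nao",
}


def to_pinyin(word):
    """Convert a word to a space-separated Pinyin string via the toy table."""
    return " ".join(PINYIN[ch] for ch in word)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pinyin_similarity(new_word, old_word):
    """Similarity in [0, 1]; 1.0 means identical toneless Pinyin (assumed measure)."""
    a, b = to_pinyin(new_word), to_pinyin(old_word)
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def best_match(candidate, dictionary, threshold=0.8):
    """Keep the most similar old word; accept it only if it reaches the threshold."""
    scored = [(pinyin_similarity(candidate, w), w)
              for w in dictionary if w != candidate]
    score, word = max(scored, default=(0.0, None))
    return (word, score) if word is not None and score >= threshold else None


if __name__ == "__main__":
    old_words = ["悲剧", "什么"]
    for cand in ["杯具", "神马", "电脑"]:
        print(cand, best_match(cand, old_words))
```

In this sketch, a candidate whose best score stays below the threshold is simply not reported as a homophonic neologism. A production system would likely weight initials and finals differently instead of treating every letter edit equally.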
Authors
李瀚臣
张顺香
朱广丽
王腾科
LI Hanchen; ZHANG Shunxiang; ZHU Guangli; WANG Tengke (School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan Anhui 232001, China; Institute of Artificial Intelligence Research, Hefei Comprehensive National Science Center, Hefei Anhui 230088, China)
Source
《计算机应用》
CSCD
Peking University Core Journals
2023, Issue 9, pp. 2715-2720 (6 pages)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (62076006)
University Synergy Innovation Program of Anhui Province (GXXT-2021-008)
Keywords
homophonic neologism
new word identification
Pinyin similarity
Average Mutual Information (AMI)
branch entropy