
Chinese homophonic neologism discovery method based on Pinyin similarity (基于拼音相似度的中文谐音新词发现方法)
Abstract: New word identification, one of the basic tasks of natural language processing, supports applications such as building Chinese dictionaries and analyzing the sentiment tendency of words. However, current new word identification methods do not specifically target homophonic neologisms, so the precision of homophonic neologism identification is low. To solve this problem, a Chinese homophonic neologism discovery method based on Pinyin similarity is proposed, which improves the precision of homophonic neologism identification by comparing the Pinyin of new and old words. Firstly, the text is preprocessed, the Average Mutual Information (AMI) is calculated to determine the internal cohesion of candidate words, and an improved branch entropy is used to determine the boundaries of candidate new words. Then, the retained words are converted into Chinese Pinyin with similar pronunciations and compared for similarity with the Pinyin of old words in a Chinese dictionary, and the most similar comparison result is retained. Finally, if the comparison result exceeds a threshold, the new word in the result is taken as a homophonic neologism, and the corresponding old word is taken as its original word. Experimental results on a self-built Weibo dataset show that, compared with BNshCNs (Blended Numeric and symbolic homophony Chinese Neologisms) and DSSCNN (similarity computing model based on Dependency Syntax and Semantics), the proposed method improves precision by 0.51 and 5.27 percentage points, recall by 2.91 and 6.31 percentage points, and F1 score by 1.75 and 5.81 percentage points, respectively, indicating that the proposed method identifies Chinese homophonic neologisms more effectively.
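The three steps named in the abstract (cohesion filtering, boundary detection, and Pinyin comparison against a dictionary) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas for AMI, the improved branch entropy, or the similarity measure, so plain pointwise mutual information, plain Shannon entropy over neighboring characters, and a per-syllable normalized edit distance are used here as stand-ins. The `TOY_PINYIN` table, the function names, the default threshold of 0.8, and the example words are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy table of toneless pinyin (an assumption for illustration);
# a real system would need a full pinyin lexicon.
TOY_PINYIN = {
    "杯": "bei", "具": "ju", "悲": "bei", "剧": "ju",
    "神": "shen", "马": "ma", "什": "shen", "么": "me",
    "天": "tian", "气": "qi",
}

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw counts (stand-in for AMI)."""
    return math.log2((count_xy / total) / ((count_x / total) * (count_y / total)))

def branch_entropy(neighbors):
    """Shannon entropy of the characters seen next to a candidate word
    (stand-in for the paper's improved branch entropy)."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def to_pinyin(word):
    """Map each character to its toneless pinyin syllable via the toy table."""
    return [TOY_PINYIN[ch] for ch in word]

def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def pinyin_similarity(new_word, old_word):
    """Mean per-syllable similarity; 0.0 when syllable counts differ."""
    p_new, p_old = to_pinyin(new_word), to_pinyin(old_word)
    if len(p_new) != len(p_old):
        return 0.0
    scores = [1 - edit_distance(x, y) / max(len(x), len(y))
              for x, y in zip(p_new, p_old)]
    return sum(scores) / len(scores)

def find_original(new_word, dictionary, threshold=0.8):
    """Keep the most similar dictionary word; accept it only if its
    similarity exceeds the threshold, otherwise return None."""
    best = max(dictionary, key=lambda w: pinyin_similarity(new_word, w))
    score = pinyin_similarity(new_word, best)
    return (best, score) if score > threshold else None
```

With the toy table, `find_original("杯具", ["悲剧", "什么", "天气"])` pairs the candidate 杯具 with the dictionary word 悲剧, because their pinyin is identical. The candidate 神马 scores 0.75 against 什么 and is therefore rejected at the default threshold but accepted at `threshold=0.7`, which shows how strongly the outcome depends on the threshold.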
Authors: LI Hanchen (李瀚臣); ZHANG Shunxiang (张顺香); ZHU Guangli (朱广丽); WANG Tengke (王腾科). Affiliations: School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, Anhui 232001, China; Institute of Artificial Intelligence Research, Hefei Comprehensive National Science Center, Hefei, Anhui 230088, China
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2023, No. 9, pp. 2715-2720 (6 pages)
Funding: National Natural Science Foundation of China (62076006); University Synergy Innovation Program of Anhui Province (GXXT-2021-008).
Keywords: homophonic neologism; new word identification; Pinyin similarity; Average Mutual Information (AMI); branch entropy