Chinese homophonic neologism discovery method based on Pinyin similarity (基于拼音相似度的中文谐音新词发现方法)

Cited by: 2


Abstract: As one of the basic tasks of natural language processing, new word identification supports building Chinese dictionaries and analyzing the sentiment tendency of words. However, current new word identification methods do not specifically target homophonic neologisms, so their precision on homophonic neologisms is low. To solve this problem, a Chinese homophonic neologism discovery method based on Pinyin similarity is proposed, which improves the precision of homophonic neologism identification by comparing the Pinyin of new and old words. First, the text is preprocessed, the Average Mutual Information (AMI) is calculated to determine the internal cohesion of candidate words, and an improved branch entropy is used to determine the boundaries of candidate new words. Then, the retained words are converted into Chinese Pinyin with similar pronunciations and compared for similarity with the Pinyin of old words in a Chinese dictionary, and the most similar comparison result is retained. Finally, if the comparison result exceeds a threshold, the new word in the result is taken as a homophonic neologism, and the corresponding old word is taken as its original word. Experimental results on a self-built Weibo dataset show that, compared with BNshCNs (Blended Numeric and symbolic homophony Chinese Neologisms) and DSSCNN (a similarity computing model based on Dependency Syntax and Semantics), the proposed method improves precision by 0.51 and 5.27 percentage points, recall by 2.91 and 6.31 percentage points, and F1 score by 1.75 and 5.81 percentage points, respectively, indicating that it identifies Chinese homophonic neologisms more effectively.
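The final comparison step described in the abstract (convert a candidate word to Pinyin, compare it with the Pinyin of dictionary words, keep the best match, and accept it only above a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, so a normalized Levenshtein distance per syllable is swapped in here as an assumption. The tiny `TOY_PINYIN` table, the function names, and the threshold value of 0.8 are also hypothetical, chosen only for illustration.

```python
# Hypothetical toy character-to-Pinyin table (toneless). A real system
# would need a full Pinyin dictionary; this one covers only the demo words.
TOY_PINYIN = {
    "杯": "bei", "具": "ju", "悲": "bei", "剧": "ju",
    "围": "wei", "脖": "bo", "微": "wei", "博": "bo",
    "喜": "xi", "欢": "huan",
}


def to_pinyin(word):
    """Map each character of a word to its toneless Pinyin syllable."""
    return [TOY_PINYIN[ch] for ch in word]


def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_similarity(s1, s2):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))


def pinyin_similarity(new_word, old_word):
    """Average syllable similarity; words of different length score 0."""
    p_new, p_old = to_pinyin(new_word), to_pinyin(old_word)
    if len(p_new) != len(p_old):
        return 0.0
    scores = [syllable_similarity(a, b) for a, b in zip(p_new, p_old)]
    return sum(scores) / len(scores)


def find_original(candidate, dictionary, threshold=0.8):
    """Return (old_word, score) for the most similar dictionary word,
    or None if the best score does not exceed the assumed threshold."""
    best_word, best_score = None, 0.0
    for old in dictionary:
        if old == candidate:
            continue  # a word already in the dictionary is not a neologism
        score = pinyin_similarity(candidate, old)
        if score > best_score:
            best_word, best_score = old, score
    if best_score > threshold:
        return best_word, best_score
    return None
```

With this sketch, `find_original("杯具", ["悲剧", "微博", "喜欢"])` returns `("悲剧", 1.0)`, because both words map to the syllables `bei ju`, while `find_original("喜欢", ["悲剧"])` returns `None` because the best score stays below the threshold. Ignoring tones is a deliberate simplification here, since homophonic neologisms often differ from their original words only in tone.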

Authors: LI Hanchen (李瀚臣); ZHANG Shunxiang (张顺香); ZHU Guangli (朱广丽); WANG Tengke (王腾科) (School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, Anhui 232001, China; Institute of Artificial Intelligence Research, Hefei Comprehensive National Science Center, Hefei, Anhui 230088, China)

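Affiliations: School of Computer Science and Engineering, Anhui University of Science and Technology; Institute of Artificial Intelligence Research, Hefei Comprehensive National Science Center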

Source: Journal of Computer Applications (《计算机应用》), CSCD, PKU Core (北大核心), 2023, No. 9, pp. 2715-2720 (6 pages)

Funding: National Natural Science Foundation of China (62076006); University Synergy Innovation Program of Anhui Province (GXXT-2021-008).

Keywords: homophonic neologism; new word identification; Pinyin similarity; Average Mutual Information (AMI); branch entropy

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]



