Chinese-Vietnamese Cross-Lingual Word Embedding Combined with Word Cluster Constraints


Abstract: Traditional cross-lingual word-embedding methods align poorly on low-resource language pairs that differ greatly, such as Chinese and Vietnamese. To address this, the paper proposes a Chinese-Vietnamese cross-lingual word-embedding method that incorporates word cluster alignment constraints. First, Chinese and Vietnamese monolingual word embeddings are trained on independent monolingual corpora. Next, three types of association (synonyms, same-category words, and same-topic words) are used to mine word cluster alignment information from the bilingual dictionary, and this information is integrated into the training of the mapping matrix, so that the matrix further learns the common features and mapping relationships shared by similar words across the two languages. The monolingual embeddings of both languages are then mapped into a shared space through the cross-lingual mapping, so that Chinese and Vietnamese words with the same meaning lie close to each other. Finally, cosine similarity is used to find the Vietnamese translation of each unlabeled Chinese word in the space, yielding Chinese-Vietnamese aligned word pairs and realizing the cross-lingual word embedding. Experimental results show that, compared with the traditional supervised and unsupervised cross-lingual word-embedding methods Multi_w2v, Orthogonal, VecMap, and Muse, the proposed method effectively improves the generalization of the mapping matrix to unlabeled words and mitigates the poor alignment of models in the Chinese-Vietnamese low-resource setting. On the Chinese-Vietnamese bilingual dictionary induction task, its P@1 and P@5 alignment accuracy is 2.2 percentage points higher than that of the best baseline model.
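The mapping-and-retrieval pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the standard orthogonal Procrustes solution for the mapping matrix, and it approximates the word cluster constraint by appending cluster centroid pairs as extra training rows, which is an assumption made purely for illustration. The synthetic embeddings, the `clusters` grouping, and all function names are hypothetical.

```python
import numpy as np


def normalize_rows(m):
    """Scale every row to unit length so dot products equal cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def learn_orthogonal_map(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def cluster_centroid_pairs(src, tgt, clusters):
    """Hypothetical cluster constraint: one centroid pair per dictionary cluster.

    `clusters` is a list of index lists into the seed dictionary, e.g. groups
    of synonyms, same-category words, or same-topic words.
    """
    src_c = np.stack([src[idx].mean(axis=0) for idx in clusters])
    tgt_c = np.stack([tgt[idx].mean(axis=0) for idx in clusters])
    return src_c, tgt_c


def precision_at_k(src, tgt_vocab, gold, w, k):
    """Fraction of source words whose gold translation is in the top-k by cosine."""
    sims = normalize_rows(src @ w) @ normalize_rows(tgt_vocab).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [gold[i] in topk[i] for i in range(len(gold))]
    return float(np.mean(hits))


# Synthetic stand-ins for monolingual embeddings (assumption: the "Vietnamese"
# space is an exact rotation of the "Chinese" space, which real data is not).
rng = np.random.default_rng(0)
dim, n_words, n_train = 8, 50, 30
zh = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
vi = zh @ rotation

# Seed dictionary = first n_train pairs; two made-up clusters over its entries.
clusters = [[0, 1, 2], [3, 4, 5, 6]]
zh_c, vi_c = cluster_centroid_pairs(zh[:n_train], vi[:n_train], clusters)

# Train the mapping on word pairs plus cluster centroid pairs.
w = learn_orthogonal_map(np.vstack([zh[:n_train], zh_c]),
                         np.vstack([vi[:n_train], vi_c]))

# Evaluate on held-out (unlabeled) words against the full target vocabulary.
gold = np.arange(n_train, n_words)
p1 = precision_at_k(zh[n_train:], vi, gold, w, k=1)
p5 = precision_at_k(zh[n_train:], vi, gold, w, k=5)
print(f"P@1={p1:.2f}  P@5={p5:.2f}")
```

On real Chinese and Vietnamese embeddings the two spaces are not exact rotations of each other, so held-out precision would be far from perfect; the sketch only shows where extra cluster-level pairs could enter the training of the mapping matrix and how P@1 and P@5 are computed.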

Authors: WU Zhaoyuan; YU Zhengtao; HUANG Yuxin (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming 650500, China)

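Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology; Yunnan Key Laboratory of Artificial Intelligence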

Source: Computer Engineering (indexed in CAS, CSCD, and the Peking University Core Journals list), 2023, No. 1, pp. 82-91 (10 pages)

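Funding: National Natural Science Foundation of China (61732005, U21B2027, 61972186, 61866020, 61866019); Yunnan Major Science and Technology Special Project (202002AD080001, 202103AA080015); Yunnan High-Tech Industry Special Project (201606).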

Keywords: Chinese-Vietnamese bilingual; low-resource language; cross-lingual word embedding; word cluster alignment; multi-granularity constraints

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

