Construction of Khmer-Chinese Bilingual Word Embedding Based on Multiple CCA Algorithms

Original title: 多重CCA算法的柬汉双语词向量构建方法 (cited by: 2)

Abstract: Existing methods for learning bilingual word embeddings require large amounts of parallel bilingual text, and for the Khmer-Chinese pair such text is scarce. Because English is a lingua franca, English-Chinese and English-Khmer parallel texts are more plentiful and easier to obtain. This paper therefore extends the canonical correlation analysis (CCA) cross-lingual word embedding model and proposes a method for constructing Khmer-Chinese bilingual word embeddings based on multiple CCA, with English as the intermediate language. First, the English and Chinese word embeddings are projected into a Chinese-English embedding space and the English and Khmer word embeddings into a Khmer-English embedding space, and CCA yields English-Chinese and English-Khmer bilingual word embeddings respectively. Next, using English as the intermediate word, together with part of a Khmer-Chinese bilingual electronic dictionary built by the authors' laboratory, the English-Khmer and English-Chinese embeddings from the previous step are projected into a common third embedding space, and CCA is applied again to obtain the projection matrices for Khmer and Chinese in the new space. The result is a Khmer-English-Chinese multilingual word embedding, which contains the Khmer-Chinese bilingual word embedding. Compared with traditional methods, this approach addresses the scarcity of initial Khmer-Chinese parallel text faced by other models and obtains higher-quality Khmer-Chinese bilingual word embeddings.
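The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function `fit_cca`, the toy embedding generator, the dimensions, and the use of row-aligned "seed" pairs in place of real English-pivot and dictionary entries are all assumptions made for the example.

```python
import numpy as np


def fit_cca(X, Y, k, reg=1e-6):
    """Fit CCA on row-aligned matrices X (n, dx) and Y (n, dy).

    Returns projection matrices A (dx, k) and B (dy, k) plus the
    top-k canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    # Whiten both views with Cholesky factors, then take an SVD.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :k])
    B = np.linalg.solve(Ly.T, Vt[:k].T)
    return A, B, s[:k]


# --- Synthetic stand-ins for monolingual embeddings (assumption) ---
rng = np.random.default_rng(0)
n_words, latent, k = 500, 10, 10
Z = rng.normal(size=(n_words, latent))   # shared "meaning" of each word
seed = slice(0, 400)                     # rows treated as known translation pairs
held = slice(400, 500)                   # rows held out for evaluation


def toy_embeddings(dim):
    """Noisy linear view of Z, centered on the seed rows."""
    W = rng.normal(size=(latent, dim))
    M = Z @ W + 0.05 * rng.normal(size=(n_words, dim))
    return M - M[seed].mean(axis=0)


en, zh, km = toy_embeddings(20), toy_embeddings(24), toy_embeddings(16)

# Stage 1: English-Chinese CCA and English-Khmer CCA, fitted separately.
_, A_zh, _ = fit_cca(en[seed], zh[seed], k)
_, A_km, _ = fit_cca(en[seed], km[seed], k)
zh1, km1 = zh @ A_zh, km @ A_km

# Stage 2: a second CCA maps the stage-1 Khmer and Chinese vectors into
# one common space; seed rows play the role of pivot/dictionary pairs.
B_km, B_zh, corr = fit_cca(km1[seed], zh1[seed], k)
km2, zh2 = km1 @ B_km, zh1 @ B_zh


def unit(M):
    """Normalize rows to unit length for cosine similarity."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)


# Held-out check: does each Khmer vector retrieve its Chinese counterpart?
sims = unit(km2[held]) @ unit(zh2[held]).T
accuracy = float(np.mean(sims.argmax(axis=1) == np.arange(sims.shape[0])))
print(f"min canonical correlation: {corr.min():.3f}, held-out accuracy: {accuracy:.2f}")
```

In this sketch the Khmer and Chinese vectors never need a directly parallel Khmer-Chinese corpus for stage 1; only stage 2 uses paired Khmer-Chinese rows, which is where a small dictionary would enter in a real setting.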

Authors: JIANG Yafang; YAN Xin; LI Siyuan; XU Guangyi; ZHOU Feng (School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650051, China)

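Affiliations: School of Information Engineering and Automation, Kunming University of Science and Technology; Yunnan Nantian Electronic Information Industry Co., Ltd.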

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2020, No. 17, pp. 167-172 (6 pages)

Funding: National Natural Science Foundation of China (No. 61462055, No. 61562049).

Keywords: bilingual word embedding; Canonical Correlation Analysis (CCA); Khmer-Chinese bilingual; multiple Canonical Correlation Analysis (CCA) algorithm

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
