
Construction of Khmer-Chinese Bilingual Word Embedding Based on Multiple CCA Algorithms (cited by: 2)
Abstract Existing methods for learning bilingual word embeddings require large amounts of bilingual parallel text, and for the Khmer-Chinese pair such text is scarce. English, as a general-purpose language, has comparatively abundant and easily obtained English-Chinese and English-Khmer parallel text. This work therefore extends the canonical correlation analysis (CCA) cross-lingual word embedding model and proposes a multiple-CCA method that constructs Khmer-Chinese bilingual word embeddings with English as the intermediate language. First, English and Chinese word embeddings are projected into a Chinese-English vector space, and English and Khmer word embeddings are projected into a Khmer-English vector space; applying CCA to each pair gives English-Chinese and English-Khmer bilingual word embeddings. Second, with English words as the pivot, supplemented by part of a Khmer-Chinese electronic dictionary built in the authors' laboratory, the English-Khmer and English-Chinese embeddings from the first step are projected into a common third vector space, and CCA is applied again to obtain the projection matrices for Khmer and Chinese in that new space. The result is a Khmer-English-Chinese multilingual word embedding, which contains the Khmer-Chinese bilingual word embedding. Compared with traditional methods, this approach addresses the scarcity of initial Khmer-Chinese parallel text faced by other models and yields higher-quality Khmer-Chinese bilingual word embeddings.
Authors JIANG Yafang; YAN Xin; LI Siyuan; XU Guangyi; ZHOU Feng (School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650051, China)
Source Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2020, No. 17, pp. 167-172 (6 pages)
Funding National Natural Science Foundation of China (No. 61462055, No. 61562049).
Keywords bilingual word embedding; canonical correlation analysis (CCA); Khmer-Chinese bilingual; multiple canonical correlation analysis (CCA) algorithm
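
The two-stage procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (fit_cca, pivot_cca, synthetic_demo), the regularization constant, the way stage-2 training pairs are assembled (every English word as a pivot row, plus a few Khmer-Chinese dictionary pairs), and the synthetic data are all hypothetical. CCA is computed here by whitening each view with a Cholesky factor and taking an SVD of the whitened cross-covariance.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-8):
    """Fit CCA on row-paired matrices X (n, dx) and Y (n, dy).

    Returns (mean_x, A, mean_y, B, corrs). (X - mean_x) @ A and
    (Y - mean_y) @ B are the k most correlated coordinates.
    """
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    n = X.shape[0]
    # Regularized covariances (reg is an assumed small ridge term).
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    # Whitened cross-covariance: T = Lx^{-1} Cxy Ly^{-T}.
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T)
    A = np.linalg.solve(Lx.T, U[:, :k])   # Lx^{-T} U_k
    B = np.linalg.solve(Ly.T, Vt[:k].T)   # Ly^{-T} V_k
    return mean_x, A, mean_y, B, s[:k]

def pivot_cca(en, zh, km, en_zh_pairs, en_km_pairs, km_zh_pairs, k):
    """Khmer-Chinese embeddings via English; *_pairs are index-array tuples."""
    # Stage 1a: English-Chinese CCA from English-Chinese dictionary pairs.
    e_i, z_i = en_zh_pairs
    m_e1, A1, m_z, B1, _ = fit_cca(en[e_i], zh[z_i], k)
    en_in_zh, zh_1 = (en - m_e1) @ A1, (zh - m_z) @ B1
    # Stage 1b: English-Khmer CCA from English-Khmer dictionary pairs.
    e_j, k_j = en_km_pairs
    m_e2, A2, m_k, B2, _ = fit_cca(en[e_j], km[k_j], k)
    en_in_km, km_1 = (en - m_e2) @ A2, (km - m_k) @ B2
    # Stage 2 (assumed pairing): each English word links the two spaces,
    # supplemented by a small Khmer-Chinese seed dictionary.
    k_s, z_s = km_zh_pairs
    left = np.vstack([en_in_km, km_1[k_s]])
    right = np.vstack([en_in_zh, zh_1[z_s]])
    m_l, A3, m_r, B3, corrs = fit_cca(left, right, k)
    return (km_1 - m_l) @ A3, (zh_1 - m_r) @ B3, corrs

def synthetic_demo(seed=0, n=200, d=5, n_train=150, n_seed=20):
    """Toy check: three 'languages' are rotations of one latent space."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    rot = lambda: np.linalg.qr(rng.standard_normal((d, d)))[0]
    en, zh, km = Z @ rot(), Z @ rot(), Z @ rot()
    train, seed_idx = np.arange(n_train), np.arange(n_seed)
    km_f, zh_f, corrs = pivot_cca(
        en, zh, km, (train, train), (train, train), (seed_idx, seed_idx), d)
    # Nearest-neighbour Khmer -> Chinese retrieval on held-out words.
    dists = ((km_f[:, None, :] - zh_f[None, :, :]) ** 2).sum(axis=-1)
    held_out = np.arange(n_train, n)
    acc = (dists.argmin(axis=1)[held_out] == held_out).mean()
    return acc, corrs
```

Calling synthetic_demo() returns the held-out retrieval accuracy and the stage-2 canonical correlations on noiseless toy data. Real monolingual embeddings are noisy and not related by exact rotations, so results on actual Khmer, English and Chinese vectors would depend on dictionary size, the chosen dimension k, and the regularization.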