A Chinese-Burmese Bilingual Vocabulary Extraction Method Fusing Topic and Context Features (cited by: 1)

Method of Chinese Burmese Bilingual Vocabulary Extraction Based on Subject and Context Features
Abstract Burmese is a low-resource language. Collecting large-scale Chinese-Burmese bilingual vocabulary from the web can, to some extent, alleviate the shortage of sentence-level aligned corpora in Chinese-Burmese machine translation. To this end, this paper proposes a Chinese-Burmese bilingual vocabulary extraction method that fuses topic and context features. First, an LDA topic model is used to obtain the topic distributions of Chinese and Burmese documents; the cross-lingual topic vectors are mapped into a shared semantic space through bilingual word vector representations, and words with high similarity under the same topic are extracted as candidate Chinese-Burmese bilingual vocabulary. Next, BERT is used to obtain semantic representations of the words in each candidate's context, from which a context vector is built. Finally, the candidates are weighted by the similarity of their context vectors to obtain higher-quality Chinese-Burmese translation pairs. Experimental results show that, compared with the bilingual-dictionary-based method and the bilingual LDA+CBW method, the proposed method improves accuracy by 11.07% and 3.82%, respectively.
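The final weighting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the linear interpolation below, the `alpha` parameter, the function names, and the toy vectors are all assumptions made for illustration, not the authors' implementation. In practice the context vectors would come from BERT and the topic similarity from the LDA-based candidate extraction stage.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_candidates(candidates, alpha=0.5):
    """Re-rank candidate word pairs by mixing topic and context similarity.

    candidates: list of (zh_word, my_word, topic_sim, zh_ctx_vec, my_ctx_vec).
    alpha: hypothetical interpolation weight (assumption, not from the paper).
    Returns (zh_word, my_word, score) tuples sorted by descending score.
    """
    scored = []
    for zh_word, my_word, topic_sim, zh_ctx, my_ctx in candidates:
        ctx_sim = cosine(zh_ctx, my_ctx)
        score = alpha * topic_sim + (1.0 - alpha) * ctx_sim
        scored.append((zh_word, my_word, score))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


# Toy data: placeholder word IDs and 2-dimensional stand-ins for context vectors.
toy_candidates = [
    ("zh_1", "my_a", 0.8, [1.0, 0.0], [1.0, 0.0]),  # contexts agree
    ("zh_1", "my_b", 0.9, [1.0, 0.0], [0.0, 1.0]),  # contexts disagree
]
ranked = rerank_candidates(toy_candidates, alpha=0.5)
```

In this toy case the pair with agreeing contexts overtakes the pair with the higher topic similarity, which is the effect the context-based weighting is meant to have.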
Authors LI Yue; MAO Cun-li; YU Zheng-tao; GAO Sheng-xiang; WANG Zhen-han; ZHANG Ya-fei (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)
Source Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2021, No. 1, pp. 91-95 (5 pages)
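Funding Supported by the Key Program of the National Natural Science Foundation of China (61732005); the National Natural Science Foundation of China (61662041, 61761026, 6186019, 61972186); the Yunnan Reserve Talents Program for Young and Middle-aged Academic and Technical Leaders (2019HB006); and the Key Program of the Yunnan Natural Science Foundation (2019FA023).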
Keywords Chinese-Burmese bilingual vocabulary; topic features; context features; BERT; bilingual word vectors