
Chinese-Vietnamese Cross-Lingual Word Embedding Combined with Word Cluster Constraints
(Original title: 融合词簇约束的汉越跨语言词嵌入)
Abstract Traditional cross-lingual word embedding methods align poorly on low-resource language pairs that differ substantially, such as Chinese and Vietnamese. To address this, the paper proposes a Chinese-Vietnamese cross-lingual word embedding method that incorporates word cluster alignment constraints. First, Chinese and Vietnamese monolingual word embeddings are trained on independent monolingual corpora. Second, three types of association (synonyms, same-category words, and same-topic words) are used to mine word cluster alignment information from the bilingual dictionary, and this information is integrated into the training of the mapping matrix, so that the matrix also learns the features and mapping relationships shared by related words across the two languages. Third, the learned cross-lingual mapping projects both sets of monolingual embeddings into a shared space, where Chinese and Vietnamese words with the same meaning lie close together. Finally, cosine similarity is used to retrieve the Vietnamese translation of each unlabeled Chinese word in that space, yielding aligned Chinese-Vietnamese word pairs and thus the cross-lingual word embedding. Experimental results show that, compared with the traditional supervised and unsupervised cross-lingual word embedding methods Multi_w2v, Orthogonal, VecMap, and Muse, the proposed method effectively improves the generalization of the mapping matrix to unlabeled words and mitigates the poor alignment of models in the low-resource Chinese-Vietnamese setting. On the Chinese-Vietnamese bilingual dictionary induction task, its alignment accuracy at P@1 and P@5 is 2.2 percentage points higher than that of the best baseline model.
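The pipeline described in the abstract (learn a mapping matrix from dictionary pairs, project into a shared space, retrieve translations by cosine similarity, score with P@k) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the training objective, so the sketch assumes an orthogonal Procrustes mapping, and it stands in for the word cluster constraint by appending cluster centroid pairs as extra alignment rows. The names `fit_mapping`, `precision_at_k`, `clusters`, and `cluster_weight` are hypothetical, and the data is synthetic.

```python
import numpy as np


def unit_rows(M):
    """L2-normalize each row so dot products become cosine similarities."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)


def fit_mapping(X, Y, clusters=(), cluster_weight=1.0):
    """Orthogonal Procrustes: argmin_W ||X W - Y||_F subject to W^T W = I.

    X, Y: (n, d) source/target embeddings of dictionary pairs (row i <-> row i).
    clusters: iterable of (src_rows, tgt_rows) index lists into X and Y.
    ASSUMPTION: each cluster contributes one extra row pair (its centroids);
    this is an illustrative stand-in, not the paper's constraint.
    """
    src_rows, tgt_rows = [X], [Y]
    for s_idx, t_idx in clusters:
        src_rows.append(cluster_weight * X[s_idx].mean(axis=0, keepdims=True))
        tgt_rows.append(cluster_weight * Y[t_idx].mean(axis=0, keepdims=True))
    A, B = np.vstack(src_rows), np.vstack(tgt_rows)
    # Closed-form solution: W = U V^T where U S V^T = SVD(A^T B).
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


def precision_at_k(src, tgt, W, gold, k):
    """Fraction of source words whose gold target index is among the
    k nearest target words by cosine similarity after mapping."""
    sims = unit_rows(src @ W) @ unit_rows(tgt).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))


# Synthetic demo: "Vietnamese" vectors are an exact rotation of "Chinese" ones.
rng = np.random.default_rng(0)
d, n, n_train = 8, 50, 20
zh = rng.standard_normal((n, d))
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]  # random orthogonal matrix
vi = zh @ Q

# Hypothetical word clusters among the training dictionary entries.
clusters = [([0, 1, 2], [0, 1, 2]), ([3, 4], [3, 4])]
W = fit_mapping(zh[:n_train], vi[:n_train], clusters)

# Evaluate only on held-out (unlabeled) words, searching all target words.
gold = np.arange(n_train, n)
p1 = precision_at_k(zh[n_train:], vi, W, gold, k=1)
p5 = precision_at_k(zh[n_train:], vi, W, gold, k=5)
print(f"P@1={p1:.2f}  P@5={p5:.2f}")
```

Because the synthetic target space is an exact rotation of the source space, the mapping is recovered from the training pairs alone; on real embeddings the two spaces are only approximately isomorphic, which is the setting where additional cluster-level signal is meant to help generalization to words outside the dictionary.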
Authors WU Zhaoyuan (武照渊); YU Zhengtao (余正涛); HUANG Yuxin (黄于欣) (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming 650500, China)
Source Computer Engineering (《计算机工程》), 2023, No. 1, pp. 82-91 (10 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding National Natural Science Foundation of China (61732005, U21B2027, 61972186, 61866020, 61866019); Yunnan Provincial Major Science and Technology Special Project (202002AD080001, 202103AA080015); Yunnan Provincial High-Tech Industry Special Project (201606).
Keywords Chinese-Vietnamese bilingual; low-resource language; cross-lingual word embedding; word cluster alignment; multi-granularity constraints