Research on Domain Adaption in Machine Translation Combining Domain Knowledge and Deep Learning (融合领域知识与深度学习的机器翻译领域自适应研究)

Cited by: 6
Abstract [Purpose/significance] In both statistical machine translation (SMT) and neural machine translation (NMT), training data usually come from mixed sources, cover many topics, and vary in genre, so they cannot be guaranteed to match the domain of the text to be translated. This gives rise to the domain adaptation problem. Most current domain adaptation methods for machine translation obtain topic information from a topic model and coarsely divide the data into in-domain and out-of-domain, without more explicit domain labels. [Method/process] This study uses Chinese Library Classification (CLC) numbers as domain labels and labels the domain of Chinese sentences automatically with two methods: a labeling method based on a domain knowledge base built from knowledge organization resources such as paper keywords and the Chinese Scientific and Technical Vocabulary System, and a deep learning labeling method that trains a convolutional neural network. A deep fusion neural network model combines the two methods into a more accurate domain labeler. The set of domain labels obtained from the machine translation test set is then used to filter the training data. [Result/conclusion] In tests on an NMT system with two domain-specific test sets, using only part of the training data yielded translations scoring about 1.3 BLEU points higher (5.4% relative) than using the original training data, which demonstrates the effectiveness and feasibility of the method.
Authors DING Liang (丁亮), HE Yan-qing (何彦青) (Institute of Scientific and Technical Information of China, Beijing 100038, China)
Source Information Science (《情报科学》), CSSCI, Peking University Core Journal, 2017, No. 10, pp. 125-132 (8 pages)
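Funding National Natural Science Foundation of China (61303152, 71503240, 71403257); Key Work Project of the Institute of Scientific and Technical Information of China (ZD2017-4)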
Keywords neural machine translation; training data selection; domain adaption; neural network; deep fusion model
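The data selection step described in the abstract (label the test set, collect its domain labels, keep only the training pairs that share a label) can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword table `TOY_KNOWLEDGE_BASE`, the function names, and the example sentences are all hypothetical, and a simple substring lookup stands in for the fused knowledge-base and CNN labeler, which the abstract does not specify in detail.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical toy knowledge base mapping a keyword to a top-level CLC class.
# The entries are illustrative only and are not taken from the paper.
TOY_KNOWLEDGE_BASE = {
    "神经网络": "TP",  # TP: automation and computer technology
    "翻译": "H",       # H: language and linguistics
    "肿瘤": "R",       # R: medicine and health
}

Labeler = Callable[[str], Set[str]]


def label_sentence(sentence: str) -> Set[str]:
    """Return every CLC label whose keyword occurs in the sentence."""
    return {label for kw, label in TOY_KNOWLEDGE_BASE.items() if kw in sentence}


def collect_test_labels(test_sources: Iterable[str], labeler: Labeler) -> Set[str]:
    """Union of the domain labels assigned to the test-set source sentences."""
    labels: Set[str] = set()
    for sentence in test_sources:
        labels |= labeler(sentence)
    return labels


def select_training_pairs(
    train_pairs: Iterable[Tuple[str, str]],
    target_labels: Set[str],
    labeler: Labeler,
) -> List[Tuple[str, str]]:
    """Keep pairs whose source sentence shares a label with the test set."""
    return [(src, tgt) for src, tgt in train_pairs if labeler(src) & target_labels]


# Made-up example data: one test sentence and three training pairs.
test_sources = ["神经网络用于机器翻译"]
train_pairs = [
    ("肿瘤的早期诊断", "Early diagnosis of tumors"),
    ("卷积神经网络的训练", "Training convolutional neural networks"),
    ("今天天气很好", "The weather is nice today"),
]

target_labels = collect_test_labels(test_sources, label_sentence)
selected = select_training_pairs(train_pairs, target_labels, label_sentence)
print(sorted(target_labels))
print(selected)
```

In this sketch a training pair with no label at all is dropped. Whether unlabeled sentences should be kept or discarded is a design choice the abstract does not address.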