Abstract
Neural machine translation (NMT) is inherently data-driven: large-scale, high-quality corpus resources are the foundation of a high-performance multilingual NMT system, so corpus construction is crucial. Motivated by the need to expand the training corpora of existing patent machine translation engines and to build patent corpus resources for specific language directions, this paper studies three methods for constructing multilingual patent parallel corpora, based respectively on the standard BLEU4 algorithm, on pseudo-data construction, and on patent family data. It analyzes and summarizes the strengths, weaknesses, and applicable scenarios of each method, with the aim of identifying reliable schemes for multilingual patent parallel corpus construction and effectively expanding existing patent corpus resources.
Authors
曹竟成
邬小倩
王倩
孙小宇
邓汇娟
CAO Jingcheng; WU Xiaoqian; WANG Qian; SUN Xiaoyu; DENG Huijuan (China Patent Information Center, Beijing 100044)
Source
《中国发明与专利》
2022, No. 6, pp. 70-75, 80 (7 pages in total)
China Invention & Patent
Keywords
multilingual parallel corpus construction
intermediate language-based matching
standard BLEU4 algorithm
pseudo-data construction
family patents