Abstract
Bilingual terminology banks are essential resources in natural language processing and are of great significance for multilingual applications such as cross-lingual information retrieval and machine translation. Bilingual term pairs are usually obtained either by human translation or by automatic extraction from a bilingual parallel corpus. However, human translation requires professional knowledge and is time-consuming and labor-intensive, and large bilingual parallel corpora are hard to come by in a specific domain. By contrast, monolingual terminology banks for the various languages in the same domain are relatively easy to obtain. This paper therefore proposes a method that builds a bilingual terminology table by automatically aligning terms from the monolingual terminology banks of two languages. First, multiple online machine translation engines generate a target-side "pseudo" term through a voting mechanism. Second, the pseudo term is used to retrieve a candidate set of target terms from the target terminology bank. Finally, an mBERT-based semantic matching model re-ranks the candidate set to obtain the final bilingual term pair. Experiments on Chinese-English terminology alignment in three domains (computer science, civil engineering, and medicine) show that the proposed method improves the accuracy of bilingual terminology extraction.
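The three stages named in the abstract (voting, retrieval, re-ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and every name in it is hypothetical: the "engines" are toy functions standing in for online machine translation services, retrieval uses plain string similarity from `difflib` because the abstract does not specify the retrieval method, and the re-ranking scorer is a lookup table standing in for the mBERT-based semantic matching model.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple

Translator = Callable[[str], str]      # source term -> candidate translation
Scorer = Callable[[str, str], float]   # (source term, target term) -> score


def vote_pseudo_term(source_term: str, engines: Sequence[Translator]) -> str:
    """Stage 1: pick the translation proposed by the most engines.

    Ties go to the translation that was seen first.
    """
    outputs = [engine(source_term).strip().lower() for engine in engines]
    return Counter(outputs).most_common(1)[0][0]


def surface_similarity(a: str, b: str) -> float:
    """Assumed retrieval score: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_candidates(pseudo_term: str, target_bank: Sequence[str],
                        top_k: int = 3) -> List[str]:
    """Stage 2: fetch the top_k target terms closest to the pseudo term."""
    ranked = sorted(target_bank,
                    key=lambda t: surface_similarity(pseudo_term, t),
                    reverse=True)
    return ranked[:top_k]


def align_term(source_term: str, engines: Sequence[Translator],
               target_bank: Sequence[str], scorer: Scorer,
               top_k: int = 3) -> Tuple[str, str]:
    """Stage 3: re-rank the candidates with a semantic scorer, keep the best."""
    pseudo = vote_pseudo_term(source_term, engines)
    candidates = retrieve_candidates(pseudo, target_bank, top_k)
    best = max(candidates, key=lambda t: scorer(source_term, t))
    return source_term, best


# Toy stand-ins (hypothetical): three "engines" and a tiny target term bank.
toy_engines: List[Translator] = [
    lambda s: "neural network",
    lambda s: "neural networks",
    lambda s: "Neural Network",
]
toy_bank = ["neural network", "network layer", "natural language", "neuron"]

# Stand-in for the semantic matching model: a fixed score table.
toy_scores = {("神经网络", "neural network"): 0.9,
              ("神经网络", "neuron"): 0.4}


def toy_scorer(source_term: str, target_term: str) -> float:
    return toy_scores.get((source_term, target_term), 0.0)


if __name__ == "__main__":
    print(align_term("神经网络", toy_engines, toy_bank, toy_scorer))
```

In a realistic setting, `toy_scorer` would be replaced by a model that scores cross-lingual term pairs. The staged design keeps that comparatively expensive scorer restricted to the `top_k` retrieved candidates, so it never has to run over the whole target terminology bank.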
Authors
XIANG Lu (向露)
ZHOU Yu (周玉)
ZONG Chengqing (宗成庆)
Source
China Terminology (《中国科技术语》)
2022, No. 1, pp. 14-25 (12 pages)
Keywords
bilingual terminology
monolingual terminological bank
terminology alignment
semantic matching