Abstract
In recent years, text generation evaluation methods based on pre-trained language models have attracted wide attention. They assess the quality of generated text by computing sub-word-level similarity between two sentences. However, for languages that contain many adhesive (bound) morphemes, such as Vietnamese and Thai, a single syllable or sub-word cannot stand alone as a word that carries meaning, so matching at the sub-word level alone cannot fully capture the semantic similarity between two sentences. To address this, the paper proposes a text generation evaluation method based on multi-granularity features: sub-words, syllables, and phrases. Text representations are first obtained from the MBERT model; similarities between coarser-grained semantic units such as syllables and phrases are then introduced to enhance the sub-word-level similarity model. Experimental results on machine translation, cross-language summarization, and cross-language data screening show that the proposed multi-granularity metric outperforms statistical metrics such as ROUGE and BLEU as well as semantic-similarity metrics such as BERTScore, and correlates more strongly with human judgments.
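The idea of combining similarity scores at several granularities can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how coarse units are represented or how the granularities are combined, so mean pooling over sub-word vectors, greedy cosine matching (in the style of BERTScore), and the weighted average are all assumptions. The function names, the span format, and the toy vectors standing in for MBERT embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_f1(cand, ref):
    """Greedy-matching F1: each unit is paired with its most similar counterpart."""
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def mean_pool(vectors, spans):
    """Average sub-word vectors over [start, end) spans to form coarser units.

    Assumption: a syllable or phrase is represented by the mean of its sub-words.
    """
    pooled = []
    for start, end in spans:
        chunk = vectors[start:end]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def multi_granularity_score(cand, ref, cand_spans, ref_spans, weights):
    """Weighted average of greedy F1 at sub-word level and each coarser level.

    Assumption: the levels are combined linearly with hand-chosen weights.
    """
    scores = {"subword": greedy_f1(cand, ref)}
    for level in cand_spans:
        scores[level] = greedy_f1(
            mean_pool(cand, cand_spans[level]),
            mean_pool(ref, ref_spans[level]),
        )
    total_weight = sum(weights[level] for level in scores)
    return sum(weights[level] * scores[level] for level in scores) / total_weight


if __name__ == "__main__":
    # Toy 2-d vectors standing in for MBERT sub-word embeddings.
    reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    candidate = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    # Hypothetical segmentation: which sub-word ranges form syllables and phrases.
    spans = {"syllable": [(0, 2), (2, 3)], "phrase": [(0, 3)]}
    weights = {"subword": 0.5, "syllable": 0.3, "phrase": 0.2}
    print(multi_granularity_score(candidate, reference, spans, spans, weights))
```

In practice the toy vectors would be replaced by contextual embeddings from a multilingual BERT model, and the spans by the output of a syllable and phrase segmenter for the target language.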
Authors
LAI Hua; GAO Yumeng; HUANG Yuxin; YU Zhengtao; ZHANG Yongbing
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China
Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650504, China
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core (Peking University Core Journals list)
2022, Issue 3, pp. 45-53, 63 (10 pages)
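Funding
National Natural Science Foundation of China (61732005, 61972186, 61762056, 61761026)
Yunnan Provincial Major Science and Technology Special Plan Project (202002AD080001-5)
Yunnan Provincial Major Science and Technology Special Plan Project (202103AA080015)
Yunnan Provincial High-Tech Industry Special Project (201606)
Yunnan Provincial Basic Research Program (202001AT070047, 2018FB104)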
Keywords
text generation
evaluation method
adhesive morphemes
multi-granularity feature
MBERT