Abstract
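Lexically constrained text generation is an important research task in natural language processing. Given an ordered set of words, it aims to generate fluent text that contains all of them, and it is widely used in language teaching, text generation, information retrieval, and other fields. Existing generation methods suffer from slow generation and from failing to include every constraint word, which makes them hard to apply in practice. This paper proposes an end-to-end lexically constrained text generation method based on segment prediction. It treats the task as predicting the text segments between constraint words: a pre-trained language model with two-dimensional position encoding predicts all segments, which are then filled back into the corresponding positions around the constraint words, ensuring both generation speed and satisfaction of the lexical constraints. Part-of-speech tagging is used to construct multi-reference data for data augmentation, which further improves generation quality. To validate the method, experiments are conducted on a public English dataset and on a Chinese dataset built from international Chinese-language textbooks. The results show that the proposed LCTG-SP method satisfies all lexical constraints, generates text quickly, and achieves better fluency and diversity. The model code and data are open-sourced on GitHub.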
Lexically constrained text generation aims to generate fluent text that contains a given set of ordered words. It is widely used in language teaching, text generation, information retrieval, and other fields. This paper proposes an end-to-end lexically constrained text generation method based on segment prediction, which treats the task as an end-to-end prediction of the text segments between constraint words. It uses two-dimensional position encoding to learn semantic relationships between segments and within segments, thereby speeding up text generation while ensuring generation quality and lexical constraints. In addition, part-of-speech tagging is used to construct multi-reference data for data augmentation. Experiments are conducted on a publicly available English dataset and on a Chinese dataset constructed in this paper from international Chinese textbooks. The experimental results show that the proposed method significantly improves generation speed, fluency, and diversity (code and data available at https://github.com/blcuicall/LCTG-SP).
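The two ideas in the abstract, filling predicted segments back around the constraint words and indexing tokens with a two-dimensional position, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `fill_segments` and `two_d_positions` are hypothetical, the sketch assumes one segment before the first constraint word, one between each adjacent pair, and one after the last, and it assumes a simple position scheme in which the first coordinate is the segment index and the second is the token offset inside that segment. The segments here are hand-written stand-ins for model predictions.

```python
from typing import List, Tuple


def fill_segments(constraints: List[str], segments: List[str]) -> str:
    """Interleave predicted segments with the ordered constraint words.

    Assumption (not from the paper): there is exactly one more segment than
    constraint words, and a segment may be the empty string.
    """
    if len(segments) != len(constraints) + 1:
        raise ValueError("need exactly one more segment than constraint words")
    pieces: List[str] = []
    for segment, word in zip(segments, constraints):
        if segment:  # skip empty segments
            pieces.append(segment)
        pieces.append(word)  # every constraint word is always emitted
    if segments[-1]:
        pieces.append(segments[-1])
    return " ".join(pieces)


def two_d_positions(segment_lengths: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inter-segment ids, intra-segment ids) for every segment token.

    Assumed scheme: the first id is the index of the segment, the second is
    the token's offset within that segment.
    """
    inter: List[int] = []
    intra: List[int] = []
    for seg_idx, length in enumerate(segment_lengths):
        for offset in range(length):
            inter.append(seg_idx)
            intra.append(offset)
    return inter, intra


if __name__ == "__main__":
    words = ["students", "library"]
    predicted = ["The", "study in the", "every day ."]  # stand-in predictions
    print(fill_segments(words, predicted))
    print(two_d_positions([len(s.split()) for s in predicted]))
```

Because the constraint words are copied into the output rather than generated, every constraint is satisfied by construction in this sketch, whatever the segments contain.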
Authors
聂锦燃
杨麟儿
杨尔弘
NIE Jinran; YANG Lin’er; YANG Erhong (National Language Resources Monitoring and Research Center for Print Media, Beijing Language and Culture University, Beijing 100083, China; School of Information Science, Beijing Language and Culture University, Beijing 100083, China)
Source
《中文信息学报》
CSCD
PKU Core Journal (北大核心)
2023, No. 8, pp. 150-158 (9 pages)
Journal of Chinese Information Processing
Keywords
词汇约束
片段预测
文本生成
数据增强
lexical constraints
segment prediction
text generation
data augmentation