Abstract
Lexical simplification (LS) aims to replace complex words in a given sentence with simpler alternatives of equivalent meaning, so as to simplify the sentence. Existing lexical simplification approaches rely only on the complex word itself, ignoring the context of the given sentence, to generate candidate substitutions, which inevitably produces a large number of spurious candidates. Therefore, we present BERT-LS, a lexical simplification approach based on the pretrained language representation model BERT, which exploits BERT for both substitute generation and substitute ranking. In the substitute generation step, BERT-LS requires neither a linguistic database nor a parallel corpus, and it fully considers both the complex word and the given sentence when generating candidate substitutions. In the substitute ranking step, BERT-LS employs five efficient features: in addition to the word frequency and word similarity features commonly used in other LS methods, it introduces three new features, namely BERT's prediction ranking, a BERT-based language model probability of the context, and the paraphrase database PPDB. Experimental results on three well-known benchmarks show that BERT-LS obtains a clear improvement over the baselines, outperforming the state-of-the-art by 29.8% in accuracy on average.
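To make the substitute generation step concrete, below is a minimal sketch of context-aware candidate generation with a masked language model, assuming the HuggingFace `transformers` and `torch` packages are available. The model name, `top_k` value, and the sentence-pair masking trick are illustrative assumptions, not a verbatim reproduction of BERT-LS.

```python
# Sketch: generate substitution candidates for a complex word using a
# pretrained BERT masked language model (assumed: transformers, torch).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def generate_candidates(sentence: str, complex_word: str, top_k: int = 10):
    # Feed BERT the original sentence paired with a masked copy, so the
    # prediction for [MASK] is conditioned on both the surrounding context
    # and the complex word itself, rather than on the word alone.
    masked = sentence.replace(complex_word, tokenizer.mask_token, 1)
    inputs = tokenizer(sentence, masked, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos[0]].topk(top_k).indices
    candidates = [tokenizer.decode([int(i)]).strip() for i in top_ids]
    # Filter out the complex word itself and non-word sub-word pieces.
    return [c for c in candidates if c.isalpha() and c.lower() != complex_word.lower()]

print(generate_candidates("John composed these verses.", "composed"))
```

Note that no dictionary or parallel corpus is consulted here: the candidate list comes entirely from the pretrained model's predictions at the masked position.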
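For the substitute ranking step, a common way to combine heterogeneous features is to let each feature rank the candidates and then sort candidates by their average rank. The sketch below illustrates that aggregation idea only; the feature names and scores are placeholders standing in for the five features listed in the abstract, and the exact combination scheme of BERT-LS may differ.

```python
# Sketch: rank-averaging aggregation over multiple ranking features.
def rank_candidates(candidates, feature_scores):
    """feature_scores maps feature name -> {candidate: score};
    higher scores are assumed better for every feature."""
    avg_rank = {c: 0.0 for c in candidates}
    for scores in feature_scores.values():
        ordered = sorted(candidates, key=lambda c: scores[c], reverse=True)
        for rank, cand in enumerate(ordered, start=1):
            avg_rank[cand] += rank / len(feature_scores)
    # Lowest average rank wins.
    return sorted(candidates, key=lambda c: avg_rank[c])

# Toy usage with made-up scores for two of the five features.
cands = ["wrote", "made", "penned"]
scores = {
    "frequency":  {"wrote": 0.9, "made": 0.95, "penned": 0.2},
    "similarity": {"wrote": 0.8, "made": 0.4,  "penned": 0.7},
}
print(rank_candidates(cands, scores))  # -> ['wrote', 'made', 'penned']
```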
Authors
QIANG Ji-Peng, QIAN Zhen-Yu, LI Yun, YUAN Yun-Hao, ZHU Yi (School of Information Engineering, Yangzhou University, Yangzhou 225127)
Source
《自动化学报》(Acta Automatica Sinica), 2022, No. 8, pp. 2075-2087 (13 pages)
Indexed in: EI, CAS, CSCD, PKU Core (北大核心)
Funding
Supported by the National Natural Science Foundation of China (62076217, 61906060, 61703362) and the Natural Science Foundation of Jiangsu Province (BK20170513).
Keywords
Lexical simplification
substitution generation
substitution ranking
bidirectional encoder representations from transformers