Abstract
Deep learning has achieved state-of-the-art performance on many natural language processing tasks, but it typically requires large amounts of annotated data. This paper addresses corpus annotation for question intent recognition, combining deep learning with active learning to reduce annotation cost. Because active learning requires repeated retraining across iterations, its computational cost is high; to speed up this process, the paper proposes a lightweight architecture suited to question intent recognition, a deep learning model built on a two-layer CNN structure. To better assess the value of each sample, a multi-criteria active learning method is designed that combines the informativeness, representativeness, and diversity of samples. Experiments on a civil aviation customer service corpus show that the method reduces annotation workload by about 50%, and experiments on the public TREC question classification dataset verify its generality.
Authors
付煜文
马志柔
刘杰
白琳
薄满辉
叶丹
FU Yuwen; MA Zhirou; LIU Jie; BAI Lin; BO Manhui; YE Dan (Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; TravelSky Mobile Technology Limited, Beijing 100029, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal
2021, No. 4, pp. 92-99, 109 (9 pages)
Journal of Chinese Information Processing
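Funding
National Key R&D Program of China (2017YFB1002303)
National Natural Science Foundation of China (61802381, 61972386)
Civil Aviation Science and Technology Major Project (MHRD20160109)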
Keywords
active learning
text annotation
intent recognition