Chat Dialogue Summary Model Based on Multi-granularity Contrastive Learning
Abstract The development of social networks brings convenience but also generates massive amounts of chat data, and filtering key information out of chat conversations has become a major difficulty. Chat summarization is an effective tool for this problem: it lets users obtain the important content quickly without repeatedly browsing lengthy chat records. Pre-trained models are now widely applied to many types of text, including unstructured, semi-structured, and structured text. For chat dialogue text, however, common pre-trained models struggle to capture its unique structural features, so further exploration and improvement are still needed. To address this, the paper proposes MGCSum, a chat summarization model based on multi-granularity contrastive learning. The model requires no manually annotated dataset, which makes it easy to train and transfer. First, a stop word list for chat text is constructed from document frequency, term frequency, and information entropy to remove interfering information from chats. Second, self-supervised contrastive learning is performed at two granularities, words and topics, to identify the structural information in a conversation and to uncover its keywords and distinct topics. Experiments on the public chat summarization dataset SAMSum and the financial fraud dialogue dataset FINSum show that, compared with current mainstream chat summarization methods, the algorithm significantly improves summary coherence, informativeness, and ROUGE scores.
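The stop-word step described in the abstract can be illustrated with a minimal sketch. The abstract names the three signals (document frequency, term frequency, and information entropy) but not how they are combined, so the multiplicative score below, the function name `stopword_candidates`, and the `top_k` parameter are all assumptions made for illustration, not the paper's formula.

```python
import math
from collections import Counter


def stopword_candidates(chats, top_k=2):
    """Rank words as stop-word candidates (illustrative scoring only).

    chats: list of chats, each given as a list of tokens.
    A word scores high when it is frequent overall (term frequency),
    occurs in many chats (document frequency), and is spread evenly
    across chats (high entropy of its per-chat distribution).
    The product of the three normalized signals is an assumption.
    """
    n_chats = len(chats)
    per_chat = [Counter(tokens) for tokens in chats]

    tf = Counter()  # total occurrences of each word
    df = Counter()  # number of chats containing each word
    for counts in per_chat:
        tf.update(counts)
        df.update(counts.keys())

    total_tokens = sum(tf.values())
    # Entropy is at most log(n_chats); guard the single-chat case.
    max_entropy = math.log(n_chats) if n_chats > 1 else 1.0

    scores = {}
    for word, freq in tf.items():
        # Distribution of this word's occurrences over the chats it appears in.
        probs = [c[word] / freq for c in per_chat if word in c]
        entropy = -sum(p * math.log(p) for p in probs)
        scores[word] = (
            (freq / total_tokens) * (df[word] / n_chats) * (entropy / max_entropy)
        )

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k], scores
```

Under this scoring, a filler token that appears in every chat ranks above content words that occur in only one chat, whose entropy (and therefore score) is zero. A real pipeline would also need a tokenizer suited to chat text and a threshold chosen on held-out data.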
Authors KANG Mengyao; LIU Yang; HUANG Junheng; WANG Bailing; LIU Shulong (School of Computer Science and Technology, Harbin Institute of Technology (Weihai), Weihai, Shandong 264209, China; Research Institute of Cyberspace Security, Harbin Institute of Technology (Weihai), Weihai, Shandong 264209, China)
Source Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2023, Issue 11, pp. 192-200 (9 pages)
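Funding National Key Research and Development Program of China (2020YFB2009502); National Natural Science Foundation of China (62272129); Fundamental Research Funds for the Central Universities (HIT.NSRIF.2020098).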
Keywords Chat summary; Contrastive learning; Pre-trained models; Keyword detection; Topic segmentation