Abstract
[Objective] This paper addresses the single-bias problem in patent text summarization caused by the single input structure of patent texts, as well as repetitive generation, lack of conciseness and fluency, and loss of original information in the generated abstracts, with the goal of improving the quality of patent text summarization. [Methods] We design IMHAM (Improved Multi-Head Attention Mechanism), a patent text summarization model. First, to address the single-structure problem, we design two cosine-similarity-based algorithms built on the logical structure of patent texts to select the most important patent document. Second, we build a sequence-to-sequence model with a multi-head attention mechanism to better learn feature representations of patent text. We also add self-attention layers to the encoder and decoder and modify the attention function to alleviate repetitive generation. Finally, we add an improved pointer network structure to mitigate the loss of original information. [Results] On a public patent text dataset, the proposed model outperforms the MedWriter baseline by 3.3%, 2.4%, and 5.5% on Rouge-1, Rouge-2, and Rouge-L, respectively. [Limitations] The proposed model is better suited to documents with multiple structural parts, such as patents; for single-structure documents, the most-important-document selection algorithm provides no benefit. [Conclusions] For texts with similar multi-document structures, the proposed model generalizes well in improving the quality of summary generation, and the generated summaries are reasonably fluent.
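As a rough illustration of the first step, the sketch below (not from the paper; the section names, TF-IDF weighting, and scoring rule are illustrative assumptions) scores each logical section of a patent by its average cosine similarity to the other sections and keeps the most central one as the "most important" document.

```python
# Minimal sketch of cosine-similarity-based selection of the most important
# patent section. This is an assumed illustration, not the authors' code:
# it uses TF-IDF vectors and picks the section with the highest average
# similarity to all other sections.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_most_important(sections: dict) -> str:
    """Return the name of the section most similar, on average, to the others."""
    names = list(sections)
    tfidf = TfidfVectorizer().fit_transform([sections[n] for n in names])
    sim = cosine_similarity(tfidf)                     # pairwise cosine similarities
    avg = (sim.sum(axis=1) - 1.0) / (len(names) - 1)   # exclude self-similarity
    return names[avg.argmax()]


# Hypothetical patent with three logical sections.
patent = {
    "abstract": "A battery management system monitors cell voltage and temperature.",
    "claims": "1. A battery management system comprising a voltage sensor and a controller.",
    "description": "The battery management system uses a controller to balance cell voltage.",
}
print(select_most_important(patent))   # prints the most central section name
```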
Authors
Shi Guoliang
Zhou Shu
Wang Yunfeng
Shi Chunjiang
Liu Liang
Shi Guoliang; Zhou Shu; Wang Yunfeng; Shi Chunjiang; Liu Liang (Business School, Hohai University, Nanjing 211100, China; Bank of Jiangsu, Nanjing 210006, China)
Source
《数据分析与知识发现》
CSCD
Peking University Core Journals (北大核心)
2023, No. 6, pp. 61-72 (12 pages)
Data Analysis and Knowledge Discovery
Funding
Supported by the Fundamental Research Funds for the Central Universities (Grant No. B200207036).
Keywords
Patent Text
Abstract Generation
Multi-head Attention
Pointer Network