Abstract
Relation extraction can be used for threat intelligence mining and analysis, providing key information support for cybersecurity defense. However, relation extraction in the cybersecurity domain suffers from a shortage of datasets. In recent years, large language models have demonstrated strong text generation capabilities, offering powerful technical support for data augmentation tasks. To address the shortcomings of traditional data augmentation methods in accuracy and diversity, this paper proposes MGDA, a large language model based data augmentation method for cybersecurity relation extraction. MGDA uses a large language model to augment the original data at four granularities (word, phrase, grammar, and semantics), improving diversity while preserving accuracy. Experimental results show that the proposed method effectively improves both the effectiveness of cybersecurity relation extraction and the diversity of the generated data.
Authors
李娇
张玉清
吴亚飚
LI Jiao; ZHANG Yuqing; WU Yabiao (Topsec Technologies Inc., Beijing 100193, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China)
Source
《信息网络安全》
CSCD
PKU Core Journal
2024, Issue 10, pp. 1477-1483 (7 pages)
Netinfo Security
Keywords
cybersecurity
relation extraction
data augmentation
large language model