
面向大语言模型的越狱攻击综述 (Cited by: 1)

Jailbreak Attack for Large Language Models:A Survey
Abstract: In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across many fields. However, jailbreak attacks are emerging as a new threat to LLMs: they bypass the security mechanisms of LLMs, weaken the effect of safety alignment, and induce aligned LLMs to produce harmful outputs. The abuse, hijacking, and leakage caused by jailbreak attacks have posed serious threats to LLM-based dialogue systems and applications. This paper presents a systematic review of recent jailbreak attack research and, based on the underlying attack mechanism, categorizes the attacks into three types: manually designed attacks, LLM-generated attacks, and optimization-based attacks. It summarizes in detail the core principles, implementation methods, and findings of the relevant studies, and comprehensively reviews the evolution of jailbreak attacks on LLMs, providing a useful reference for future research. It also gives a brief overview of existing security measures, introducing, from the perspectives of internal defense and external defense, techniques that can mitigate jailbreak attacks and improve the safety of LLM-generated content, and compares the advantages and disadvantages of the different methods. Building on this work, the paper discusses the open problems and frontier directions in the field of LLM jailbreak attacks, and offers research prospects in combination with multimodal approaches, model editing, and multi-agent methods.
Authors: 李南, 丁益东, 江浩宇, 牛佳飞, 易平 (Li Nan; Ding Yidong; Jiang Haoyu; Niu Jiafei; Yi Ping), School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240
Published in: Journal of Computer Research and Development (《计算机研究与发展》), 2024, No. 5, pp. 1156-1181 (26 pages). Indexed in EI, CSCD, and the Peking University Core Journals list (北大核心).
Funding: National Natural Science Foundation of China (61831007); National Key R&D Program of China (2020YFB1807504).
Keywords: generative artificial intelligence; jailbreak attack; large language model (LLM); natural language processing (NLP); cyber security