Data Ruling Large Language Models: The Values, Myths, and Challenges of Training Data in Future Digital Communications

Cited by: 4

Abstract: With the advent and popularity of ChatGPT, the implications and impacts of generative artificial intelligence have rapidly become focal points of attention in both academic and industrial circles. Within this wave of unsupervised deep learning led by large language models, a central issue is training data. The pursuit of scale and quality in training data epitomizes the dictum of "data as king" amid the "model war". Behind the values, functions, and misconceptions of training data lie a rewriting of the concept of data, a superstition regarding data affordance, and a struggle for data ownership. The specific structure and internal mechanisms of training data have triggered the reconstruction of the intelligent communication ecosystem and the formation of a new information production order. This transformation also harbors a digital crisis caused by large language models, manifested in the reproduction of biases under distilled communications, the concretization of information under filtered communications, and the dissipation of meaning under stochastic communications. Both training data and large language models urgently need to dispel the myth of scale and focus more on how to integrate data effectively into social-technical systems.

Authors: HU Yong; LIU Chun-yi (School of Journalism and Communication, Peking University, Beijing 100871, PRC)

Affiliation: School of Journalism and Communication, Peking University

Source: Journal of Northwest Normal University (Social Sciences), PKU Core Journal, 2024, No. 3, pp. 43-54 (12 pages)

Keywords: large language model; training data; generative AI; ChatGPT; intelligent communications

Classification code: G206 [Cultural Sciences: Communication Studies]
