FAIR Enough: Develop and Assess a FAIR-Compliant Dataset for Large Language Model Training?


Abstract: The rapid evolution of Large Language Models (LLMs) highlights the need for ethical considerations and data integrity in AI development, and in particular the role of the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data stewardship, their specific application to LLM training data remains under-explored. Our study addresses this research gap. It begins with an examination of existing literature to underline the importance of FAIR principles in managing data for LLM training. Building on this, we propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle. A contribution of our work is a comprehensive checklist intended to guide researchers and developers in applying FAIR data principles consistently across the model development process. The utility and effectiveness of our framework are validated through a case study on creating a FAIR-compliant dataset aimed at detecting and mitigating biases in LLMs. We present this framework to the community as a tool to foster the creation of technologically advanced, ethically grounded, and socially responsible AI models.

Authors: Shaina Raza, Shardul Ghuge, Chen Ding, Elham Dolatabadi, Deval Pandya

Affiliations: Vector Institute for Artificial Intelligence; Toronto Metropolitan University; York University

Source: Data Intelligence (EI), 2024, Issue 2, pp. 559-585 (27 pages)

Keywords: Responsible AI; Large language models; FAIR data principles; Ethical AI; Biases

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]


