A Bootstrapping-based Method to Automatically Identify Data-usage Statements in Publications (cited by: 2)

Abstract:

Purpose: Our study proposes a bootstrapping-based method to automatically extract data-usage statements from academic texts.

Design/methodology/approach: The method for data-usage statements extraction starts with seed entities and iteratively learns patterns and data-usage statements from unlabeled text. In each iteration, new patterns are constructed and added to the pattern list based on their calculated score. Three seed-selection strategies are also proposed in this paper.

Findings: The performance of the method is verified by means of experiments on real data collected from computer science journals. The results show that the method can achieve satisfactory performance regarding precision of extraction and extensibility of obtained patterns.

Research limitations: While the triple representation of sentences is effective and efficient for extracting data-usage statements, it is unable to handle complex sentences. Additional features that can address complex sentences should thus be explored in the future.

Practical implications: Data-usage statements extraction is beneficial for data-repository construction and facilitates research on data-usage tracking, dataset-based scholar search, and dataset evaluation.

Originality/value: To the best of our knowledge, this paper is among the first to address the important task of automatically extracting data-usage statements from real data.

Authors: Qiuzi Zhang, Qikai Cheng, Yong Huang, Wei Lu

Affiliation: School of Information Management

Source: Journal of Data and Information Science (数据与情报科学学报, English edition), 2016, Issue 1, pp. 69-85 (17 pages)

Funding: Supported by the National Natural Science Foundation of China (Grant No. 71473183)

Keywords: Data-usage statements extraction; Information extraction; Bootstrapping; Unsupervised learning; Academic text-mining

Classification number: G254 [Culture and Science: Library Science]
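
The iterative loop described in the abstract (start from seed entities, score candidate patterns, add the good ones, harvest new entities, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the pattern form (a subject/predicate pair with the object slot treated as the dataset), the scoring function (the share of a pattern's extractions that are already known entities), the threshold `min_score`, and the toy triples are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy corpus: one hypothetical (subject, predicate, object) triple per sentence.
TRIPLES = [
    ("we", "evaluate on", "MNIST"),
    ("we", "evaluate on", "CIFAR-10"),
    ("we", "train on", "CIFAR-10"),
    ("we", "train on", "ImageNet"),
    ("we", "propose", "a new model"),
    ("authors", "thank", "reviewers"),
]


def bootstrap(triples, seeds, iterations=5, min_score=0.5):
    """Grow a pattern list and an entity set from a few seed entities.

    Assumed pattern form: (subject, predicate); the object is the candidate
    dataset name. Assumed score: known objects / all objects the pattern
    extracts. Neither is taken from the paper.
    """
    # Group the objects that each candidate pattern would extract.
    extracted = defaultdict(set)
    for subj, pred, obj in triples:
        extracted[(subj, pred)].add(obj)

    entities = set(seeds)
    patterns = set()
    for _ in range(iterations):
        new_patterns = set()
        for pattern, objects in extracted.items():
            if pattern in patterns:
                continue
            hits = len(objects & entities)
            score = hits / len(objects)
            # Accept a pattern only if it finds a known entity and scores well.
            if hits > 0 and score >= min_score:
                new_patterns.add(pattern)
        if not new_patterns:
            break  # nothing new was learned, so stop iterating
        patterns |= new_patterns
        # Everything an accepted pattern extracts becomes a known entity.
        for pattern in new_patterns:
            entities |= extracted[pattern]

    # Sentences matched by an accepted pattern are the data-usage statements.
    statements = [t for t in triples if (t[0], t[1]) in patterns]
    return patterns, entities, statements


patterns, entities, statements = bootstrap(TRIPLES, {"MNIST"})
print(sorted(patterns))
print(sorted(entities))
```

With the single seed `MNIST`, the first pass accepts the "evaluate on" pattern, which brings in `CIFAR-10`; that in turn lets the second pass accept "train on". A stricter `min_score` would trade recall for precision, which is the kind of trade-off a pattern-scoring step has to manage.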

