Abstract
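Combining a feature vector space of scientific and technical literature with machine learning models to automatically identify and recommend potential “treasures” in massive literature can enhance the scientific impact of that literature and its role in promoting scientific and technological development. We design and implement machine-learning-based classifiers and a model framework for identifying potential “treasures” in scientific and technical literature, measure the full-text and citation features of papers from high-impact international journals and from domestic journals in library, information, and archives management, and use feature engineering to construct a feature vector space of scientific papers. We then apply traditional machine learning models, such as support vector machine and naive Bayes, and deep learning models, such as deep belief networks and multilayer perceptron, to identify potential “treasures” automatically, and we build an indicator system for evaluating recognition performance based on the receiver operating characteristic (ROC) curve and the confusion matrix. The results show the following. ① Deep learning models perform poorly at identifying potential “treasures”, whereas traditional machine learning models perform better; among them, random forest and support vector machine perform best, decision tree ranks next, and naive Bayes performs poorly and is unstable. ② The higher a journal's impact factor, the better the recognition performance; for both high-impact international journals in the natural sciences and domestic social science journals in library, information, and archives management, all identified “treasure” papers are highly cited and the share of review papers is low, with only one review paper among the “treasure” papers from domestic journals. ③ The bibliometric feature values of “treasure” papers differ markedly from those of the overall sample: “treasure” papers have a shorter first-response time, are supported by funding, have more references, keywords, and citations, have longer abstracts and greater length, and tend to be multi-author papers. The empirical results indicate that machine learning models can accurately identify potential “treasures” in scientific and technical literature and raise the degree of automation of that identification, providing a theoretical reference and methodological support for the automatic identification, dissemination, and use of potential “treasures” in massive literature.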
Constructing a feature vector space of massive literature and using machine learning models to accurately and automatically identify and utilize potential “treasures” from a vast body of literature can enhance their scientific influence and facilitate advancements in science and technology. This study designs and implements machine learning models and a model framework for identifying potential “treasures” among scientific and technological papers. As samples, we collected papers (and their citation data) published in high-impact international journals (from Web of Science) and in domestic journals in library, information, and archives management. Subsequently, we measured the bibliometric characteristics of all these papers and constructed a feature vector space of the literature. Thereafter, traditional machine learning models, such as support vector machine and naive Bayes, and deep learning models, such as deep belief networks and multilayer perceptron, were used to identify potential “high-quality” papers. A receiver operating characteristic (ROC) curve and a confusion matrix were used to evaluate the recognition performance of the machine learning algorithms. The results show that the deep learning models identify potential “treasures” poorly, whereas the traditional machine learning models identify them well, both in high-impact international journals and in domestic journals in library, information, and archives management. Random forest and support vector machine show the best recognition performance, while the decision tree and naive Bayes models perform relatively poorly. Moreover, the more influential a journal is, the better the recognition performance. For both high-impact international journals in the natural sciences and domestic journals in the social sciences, all identified excellent papers have a high citation frequency, and very few of them are review papers. Furthermore, comparing the bibliometric features of the identified papers with those of all papers analyzed, we find that most identified excellent papers are multi-author articles supported by science foundation grants, with a shorter first-citation time, more references and keywords, a higher citation frequency, and longer abstracts. The empirical results show that machine learning models can accurately identify potential “high-quality” articles in massive scientific and technological literature and increase the degree of automation of that identification. This can also provide a theoretical reference and methodological support for the automatic recognition, dissemination, and utilization of potential “high-quality” papers from massive literature.
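The abstract evaluates the classifiers with a ROC curve and a confusion matrix. As a rough illustration only, the following sketch computes both from binary labels and classifier scores using the Python standard library. The function names, the toy labels and scores, and the 0.5 threshold are assumptions made for this example; they are not taken from the paper, whose actual indicator system is not specified here.

```python
from typing import List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels, where 1 marks a "treasure" paper."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, fp, tn, fn


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return ROC points (FPR, TPR) obtained by lowering the decision threshold.

    Assumes y_true contains at least one positive and one negative label.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    pairs = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Papers with identical scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical labels and scores, for illustration only.
    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
    predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
    print("TP, FP, TN, FN:", confusion_counts(labels, predicted))
    print("AUC:", auc(roc_points(labels, scores)))
```

Precision, recall, and similar indicators follow directly from the four counts, so one plausible use is to compute them per model and compare them alongside the AUC values.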
Authors
胡泽文
任萍
崔静静
Hu Zewen; Ren Ping; Cui Jingjing (School of Management Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044)
Source
《情报学报》
CSCD
Peking University Core Journals
2023, Issue 2, pp. 189-202 (14 pages)
Journal of the China Society for Scientific and Technical Information
Funding
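National Social Science Fund of China project “Research on Methods and Applications for Identifying Potential ‘Treasures’ in Massive Scientific and Technical Literature” (20CTQ031).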
Keywords
machine learning
deep learning
excellent literature
feature engineering
random forest
support vector machine
naive Bayes model
deep belief networks