Abstract
Automated, forward-looking prediction based on large-scale patent data and patent feature indicators has gradually become a research focus in emerging technologies identification. The introduction of machine learning has also drawn attention to the fact that the emergence of a new technology from a mass of patented inventions is a low-probability event, which makes its identification a typical imbalanced classification problem. This study aims to reduce the bias toward the majority class that imbalanced datasets cause in emerging technologies identification by optimizing the classification strategy. It proposes an imbalanced classification optimization framework that integrates the data, algorithm, and evaluation levels, and tests it empirically on a binary classification task: predicting whether patents in the cancer drug field have the potential to become emerging technologies, that is, to be approved as new drugs by the Food and Drug Administration. The specific improvements are as follows: at the data level, progressive resampling is adopted; at the algorithm level, a cost-sensitive random forest is constructed; and at the evaluation level, cost-sensitive learning is introduced, and three ways of setting the cost matrix in the absence of expert experience are examined and validated. The results show that a cost-sensitive random forest built with random undersampling at a 1:2 balance ratio and a cost matrix derived from the ROC (receiver operating characteristic)-Youden index threshold correctly predicts 82.8% of emerging technologies and 81.6% of common technologies, which is significantly better than the control groups in this study and existing related results. The findings offer a reference for further investigation of the nature of the imbalanced classification problem in emerging technologies identification.
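The three ingredients named in the abstract (random undersampling to a 1:2 ratio, a Youden index threshold read off the ROC curve, and a cost matrix tied to that threshold) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the scores stand in for the predicted probabilities of any classifier such as a random forest, and the mapping from threshold to cost ratio uses the standard cost-sensitive decision rule (threshold = C_FP / (C_FP + C_FN)), which may differ from how the paper builds its cost matrix.

```python
import random


def undersample(labels, ratio=2, seed=0):
    """Keep every minority sample (label 1) and at most `ratio` times as many
    randomly chosen majority samples (label 0). Returns sorted indices."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(neg), ratio * len(pos))
    return sorted(pos + rng.sample(neg, keep))


def youden_threshold(labels, scores):
    """Return (threshold, J) maximising Youden's J = TPR - FPR, where a sample
    is predicted positive when its score is >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def cost_matrix_from_threshold(t):
    """ASSUMPTION: invert the rule t = C_FP / (C_FP + C_FN) with C_FP fixed
    at 1, so that C_FN = (1 - t) / t. Correct predictions cost nothing."""
    if not 0.0 < t < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return {"fp": 1.0, "fn": (1.0 - t) / t}


def predict_positive(p, costs):
    """Minimum-expected-cost decision for a predicted positive probability p."""
    return p * costs["fn"] >= (1.0 - p) * costs["fp"]
```

In use, one would undersample the training patents, fit a classifier on the retained indices, compute `youden_threshold` on held-out scores, and then apply `predict_positive` with the derived costs. A threshold below 0.5 yields a false-negative cost above 1, which is the direction one would expect when missing an emerging technology is the costlier error.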
Authors
卢小宾
张杨燚
杨冠灿
行佳鑫
Lu Xiaobin; Zhang Yangyi; Yang Guancan; Xing Jiaxin (School of Information Resource Management, Renmin University of China, Beijing 100872)
Source
《情报学报》
CSSCI
CSCD
PKU Core Journals (北大核心)
2022, Issue 10, pp. 1059-1070 (12 pages)
Journal of the China Society for Scientific and Technical Information
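Funding
Key Project of the National Social Science Fund of China, "Research on the Methodological System of Industrial Technology Intelligence Analysis in the New Era" (21ATQ008).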
Keywords
emerging technologies identification
imbalanced classification
cost-sensitive
random forest
progressive resampling