
Imbalanced Classification of Emerging Technologies Identification: Based on Cost-sensitive Random Forest (新兴技术识别中的不均衡分类研究--基于代价敏感的随机森林算法)
Cited by: 9
Abstract: Automated, forward-looking prediction based on large-scale patent data and patent feature indicators has gradually become a research focus in emerging technologies identification. The introduction of machine learning has also drawn attention to the fact that the emergence of a new technology from a mass of inventions is a low-probability event, which makes identification a typical imbalanced classification problem. This study aims to reduce the bias toward the majority class that imbalanced datasets cause in emerging technologies identification by optimizing the classification strategy. It proposes an optimization framework for imbalanced classification that integrates three levels (data, algorithm, and evaluation) and validates it empirically on a binary classification task: predicting whether patents in the cancer drug field have the potential to become emerging technologies, operationalized as whether they lead to new drugs approved by the Food and Drug Administration. The specific improvements are as follows: at the data level, progressive resampling is adopted; at the algorithm level, a cost-sensitive random forest is constructed; at the evaluation level, cost-sensitive thinking is introduced and three methods of setting the cost matrix are examined for settings that lack expert experience. The results show that a cost-sensitive random forest built with random undersampling at a 1:2 balance ratio and a cost matrix derived from the ROC (receiver operating characteristic) Youden index threshold correctly predicts 82.8% of the emerging technologies and 81.6% of the ordinary technologies, significantly outperforming the control groups in this paper and existing related results. The findings offer a reference for further study of the nature of the imbalanced classification problem in emerging technologies identification.
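The abstract names three ingredients: random undersampling to a 1:2 ratio, a decision threshold chosen by the Youden index on the ROC curve, and a cost matrix derived from that threshold. The following is a minimal sketch of those ideas in plain Python, not the authors' implementation. The function names, the toy scores and labels, and the mapping from threshold to cost ratio (the standard minimum-expected-cost rule with zero cost for correct decisions) are all assumptions made for illustration; the abstract does not specify how the paper constructs its cost matrix.

```python
import random

def undersample(labels, ratio=2, seed=0):
    """Keep every minority (1) sample and at most `ratio` majority (0)
    samples per minority sample; returns the kept indices, sorted."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    return sorted(pos + keep_neg)

def youden_threshold(scores, labels):
    """Return (t, J) where t maximizes J = TPR - FPR, predicting the
    positive class when score >= t. Ties keep the lowest threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

def cost_ratio_from_threshold(t):
    """Cost ratio C_FN / C_FP implied by threshold t.
    Assumption: predicting positive when p >= C_FP / (C_FP + C_FN)
    minimizes expected cost, so C_FN / C_FP = (1 - t) / t."""
    return (1.0 - t) / t

# Hypothetical toy data: 3 "emerging" (1) and 7 "ordinary" (0) patents.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.25, 0.2, 0.2, 0.15, 0.1, 0.1, 0.05]

kept = undersample(labels, ratio=2)           # indices of a 1:2 subsample
t_star, j_star = youden_threshold(scores, labels)
ratio = cost_ratio_from_threshold(t_star)     # relative cost of a missed positive
print(len(kept), t_star, round(j_star, 3), ratio)
```

In practice the resulting ratio could be supplied as class weights to a random forest implementation (for example the `class_weight` parameter of scikit-learn's `RandomForestClassifier`); whether the paper does this is not stated in the abstract.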
Authors: Lu Xiaobin (卢小宾); Zhang Yangyi (张杨燚); Yang Guancan (杨冠灿); Xing Jiaxin (行佳鑫) (School of Information Resource Management, Renmin University of China, Beijing 100872)
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI, CSCD, and the Peking University Core Journals list; 2022, No. 10, pp. 1059-1070 (12 pages)
Funding: Key Project of the National Social Science Fund of China, "Research on the Methodological System of Industrial Technology Intelligence Analysis in the New Era" (21ATQ008).
Keywords: emerging technologies identification; imbalanced classification; cost-sensitive; random forest; progressive resampling

