Abstract
Research output is growing explosively and papers are increasingly distributed across disciplines, so accurately finding the papers one needs takes considerable time and effort. This paper proposes an automatic paper classification method based on random forest for classifying large volumes of papers, and a topic mining method based on the LDA model that extracts paper keywords and displays them as a word cloud. The experimental data consist of 1710 papers on nine topics, crawled from CNKI with Selenium. The experimental results show that the classification method improves precision, recall, and F-measure, and that it effectively mines the subject terms of each discipline, providing effective support for subsequent citation analysis, text mining, and knowledge graph construction.
Authors
Yang Xiuzhang (杨秀璋)
Yu Xiaomin (于小民)
Li Na (李娜)
Xia Huan (夏换)
School of Information, Guizhou University of Finance and Economics, Guiyang 550025; Guizhou Key Laboratory of Economics System Simulation, Guizhou University of Finance and Economics; Systems Engineering Research Institute
Source
Computer Era (《计算机时代》)
2018, No. 11, pp. 14-18, 23 (6 pages in total)
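Funding
Young Science and Technology Talent Growth Project of the Guizhou Provincial Department of Education, "Research and Implementation of Entity and Attribute Alignment Methods" (黔教合KY字[2016]172)
Young Science and Technology Talent Growth Project of the Guizhou Provincial Department of Education, "Research on Mesh Gateway Load Balancing in Wireless Campus Network Construction" (黔教合KY字[2016]178)
Support Program for Top Science and Technology Talents in Guizhou Regular Higher Education Institutions, "Big Data Analysis and Evaluation System for Remote Real-Time Monitoring of Directional Drilling Rigs" (黔教合KY字[2016]068)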