System Integration and Construction of Science and Technology Policy Database (cited by: 3)
Abstract To meet the needs of science and technology policy research, the China Association for Science and Technology designed and implemented a science and technology policy database system. This paper first introduces the overall design and the system workflow of the database, then describes the system components in detail. The system consists of three subsystems: data acquisition, data cleaning, and data analysis. The data acquisition subsystem uses the Scrapy web crawling framework to provide manageable crawlers for a large number of heterogeneous sites, and uses ABBYY FineReader (document recognition software published by the Russian software company ABBYY) to run OCR (Optical Character Recognition) on historical documents and load them into the database. The data cleaning subsystem uses machine learning algorithms to implement data deduplication, identification of irrelevant data, and identification of defects in data attributes. The data analysis subsystem then performs text classification, association analysis, and full-text search on the policies that were validly stored. Since going online in October 2018, the system has collected 564,749 records from 226 data sources, of which 404,083 were stored after data cleaning, providing strong support for science and technology policy research.
Authors WU Hong (武虹); YANG Bao-Long (杨宝龙); DU Zhi-Gao (杜治高); LI Han-Lu (李涵露) (National Academy of Innovation Strategy, Beijing 100086, China; Beihang University, Beijing 100083, China)
Source Computer Systems & Applications (《计算机系统应用》), 2019, No. 7, pp. 58-64 (7 pages)
Keywords science and technology policy database; Web crawler; data cleaning; machine learning; natural language processing
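
The data cleaning stage named in the abstract (deduplication plus detection of attribute defects) can be illustrated with a minimal sketch. The abstract says the real subsystem relies on machine learning algorithms and gives no further detail, so this example swaps in a much simpler technique: exact-duplicate detection by hashing whitespace-normalized text, and a check for empty required fields. The `PolicyRecord` schema, the field names, and the function names are all hypothetical and are not taken from the paper.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical schema: the paper's actual record layout is not given.
REQUIRED_FIELDS = ("title", "body", "source_url")


@dataclass
class PolicyRecord:
    title: str
    body: str
    source_url: str


def fingerprint(record: PolicyRecord) -> str:
    """Hash of title + body with all whitespace removed and case folded."""
    text = re.sub(r"\s+", "", record.title + record.body).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def missing_fields(record: PolicyRecord) -> List[str]:
    """Names of required fields that are empty or whitespace-only."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name).strip()]


def clean(
    records: Iterable[PolicyRecord],
) -> Tuple[List[PolicyRecord], List[Tuple[PolicyRecord, str]]]:
    """Split records into kept ones and (record, reason) rejections."""
    seen = set()
    kept: List[PolicyRecord] = []
    rejected: List[Tuple[PolicyRecord, str]] = []
    for record in records:
        gaps = missing_fields(record)
        if gaps:
            # Attribute defect: at least one required field is empty.
            rejected.append((record, "missing: " + ", ".join(gaps)))
            continue
        fp = fingerprint(record)
        if fp in seen:
            # Same normalized text was already kept from another source.
            rejected.append((record, "duplicate"))
            continue
        seen.add(fp)
        kept.append(record)
    return kept, rejected
```

Calling `clean(records)` returns the records that would be stored and the rejected ones with a reason for each. For scale, the figures in the abstract imply that 404,083 of the 564,749 collected records (about 71.6%) survived cleaning; a hash-based check like this one would only catch exact duplicates, which is presumably why the authors use learned models instead.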