Abstract
The Hebei Province Science and Technology Innovation Big Data Public Platform is a network information platform built on massive data resources using data warehouse and data mining technology. It provides decision-support services to management departments and information services to the public. During construction of the data warehouse, however, a variety of data quality problems arise and ultimately lead to erroneous analysis results, so data must be cleaned before it enters the warehouse to ensure the quality of what is loaded. Based on the Hebei Province science and technology research project "Science and Technology Big Data Standardization Processing and Application System", this paper proposes a data cleaning framework for science and technology innovation big data. On the basis of this framework, cleaning rules are defined, the cleaning algorithm is improved, and experiments are conducted on real data sets. The work resolves the quality problems of data entering the data warehouse, thereby ensuring the consistency and correctness of the data in the warehouse and providing a solid data foundation for subsequent analysis and processing.
Authors
ZHAO Yue-qin (赵月琴)
FAN Tong-rang (范通让)
School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei 050043, China
Source
Journal of the Hebei Academy of Sciences (《河北省科学院学报》)
CAS
2018, No. 2, pp. 35-42 (8 pages)
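Funding
National Natural Science Foundation of China, "Analysis of the Behavioral Characteristics of Information Flow on the Internet" (No. 61373160)
Hebei Provincial Department of Science and Technology, "Research and Development of a Science and Technology Big Data Standardization Processing and Application System" project (17210113D)
"Comprehensive Service Platform for Science and Technology Innovation Big Data" project (344008)
"Survey and Statistical Analysis of Science and Technology Infrastructure Resources and Development of an Innovation Platform Annual Report System" project (179676334D)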
Keywords
Science and technology innovation big data
Data quality
Data cleaning
Data cleaning framework