Abstract
With the rapid development of information technology, data generation has grown exponentially. "Big data" has become one of the most frequently used technical terms in computing, and it has gradually turned into the name of a commodity as well. Whether viewed from academic research or from the needs of data trading, evaluating the usability of big data sets is a new problem. This paper proposes a big data usability evaluation model to serve as a reference for academia and for data circulation. Building on the 4V characteristics of big data (Volume, Variety, Velocity, Value), the 4V characteristics of sample data are tallied segment by segment, yielding a probability model of big data characteristics based on the piecewise distribution, together with a weighted usability evaluation model. The paper also proposes an algorithm for block sampling of big data and an algorithm for estimating the weighting coefficient of each characteristic in the evaluation model. The application of the proposed model and algorithms is demonstrated on the usability evaluation requirements of video big data. The model can be used to evaluate data for data science experiments and to price data sets in big data trading markets. For practical evaluation work, solutions are given for operational issues such as standardizing (commoditizing) data sets and determining evaluation benchmarks. The application case supports the proposed model and further tests its feasibility.
Authors
ZHAO Hui-qun (赵会群)
WU Kai-feng (吴凯锋)
Affiliations: College of Computer Science and Technology, North China University of Technology, Beijing 100144, China; Beijing Key Laboratory of Large-scale Stream Data Integration and Analysis Technology, North China University of Technology, Beijing 100144, China
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journals (北大核心)
2020, Issue 9, pp. 110-116 (7 pages)
Funding
National Natural Science Foundation of China (61672041).
Keywords
Big data availability evaluation
Probability model
Big data blocking algorithm
Video big data