
A Data Generation Method for Streaming Big Data System Testing (cited by: 3)
Abstract: When testing a streaming big data system, the more realistic the test data set, the more credible the resulting test report. Large volumes of real streaming data are hard to obtain, however, so a method is needed that can generate large amounts of data exhibiting the characteristics of real scenarios. These characteristics include correlation between data attributes, temporal correlation, and variation in the stream's flow rate; in a streaming big data environment, temporal correlation and flow-rate variation are especially important. This paper proposes a data generation method for streaming big data system testing that uses a data set from a real scenario as seed data. The maximal information coefficient (MIC) describes the correlation between attributes of the seed data, and a modified Prim algorithm groups the attribute columns, improving generation efficiency while keeping the attributes within each group as strongly correlated as possible. A temporal model selection strategy then chooses a temporal sequence model according to the characteristics of each attribute group, so that the generated data preserve temporal correlation, and a double-layer sliding window method controls the degree of parallelism and the output rate of the data stream. Finally, the paper compares the generation efficiency of the proposed method with that of other streaming data generation methods.
Source: Computing Technology and Automation (《计算技术与自动化》), 2017, No. 3, pp. 139-145 (7 pages)
Keywords: streaming big data generation; nonlinear correlation; temporal sequence correlation; velocity control
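The abstract does not spell out how Prim's algorithm is modified, so the following is only a rough sketch of the general idea of grouping attribute columns by pairwise dependence. It assumes a Prim-style maximum spanning tree whose weak edges are cut at a threshold, and it substitutes absolute Pearson correlation for MIC to stay dependency-free. The names `abs_pearson`, `group_attributes`, and `threshold` are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation; a stand-in for MIC in this sketch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / sqrt(vx * vy)


def group_attributes(columns, threshold=0.8, score=abs_pearson):
    """Group column names via a Prim-style maximum spanning tree.

    columns: dict mapping column name -> list of numeric values.
    Tree edges weaker than `threshold` are cut; each remaining
    connected component becomes one attribute group.
    (Assumed procedure, not the paper's modified Prim algorithm.)
    """
    names = list(columns)
    if not names:
        return []
    weight = {}
    for a, b in combinations(names, 2):
        weight[a, b] = weight[b, a] = score(columns[a], columns[b])

    # Prim: grow the tree from the first column, always adding the
    # outside column with the strongest link to the tree.
    in_tree = {names[0]}
    kept_edges = []
    while len(in_tree) < len(names):
        u, v = max(
            ((u, v) for u in in_tree for v in names if v not in in_tree),
            key=lambda e: weight[e],
        )
        in_tree.add(v)
        if weight[u, v] >= threshold:
            kept_edges.append((u, v))

    # Connected components over the kept (strong) edges.
    group_of = {name: {name} for name in names}
    for u, v in kept_edges:
        merged = group_of[u] | group_of[v]
        for name in merged:
            group_of[name] = merged
    groups = []
    for name in names:
        if group_of[name] not in groups:
            groups.append(group_of[name])
    return [sorted(g) for g in groups]
```

Passing a different `score` function, for example an MIC estimator, would leave the grouping logic unchanged. Each resulting group could then be handled by its own temporal model, which is the motivation the abstract gives for grouping.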
