
A Method for Generating Data Sets to Test Data Mining Algorithms

Cited by: 7
Abstract: Confidentiality, time constraints, and the diversity of data make test data sets hard to obtain, which has long hampered research on data mining algorithms. This paper therefore proposes a method for simulating test data sets based on the genetic algorithm (GA) and entropy. The method exploits the inheritance property of the GA to expand and simulate a small amount of collected real data, uses entropy to measure how similar the generated data are to the real data, and finally selects the most similar large data set among the generated ones as the test data set. A generation algorithm for descriptive data is also given. The resulting test data set has the same attributes, the same attribute value ranges and value distributions, and similar correlations among attributes as the real data set, which can speed up research on data mining algorithms.
Source: Journal of Northeastern University (Natural Science), 2008, No. 3, pp. 328-331 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (60773218).
Keywords: data mining; algorithm testing; simulated data set generation; genetic algorithm; entropy
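
The abstract does not give the generation algorithm itself, so the following is only a minimal sketch of the general idea it describes: expand a small real sample with GA-style crossover and mutation, then keep the candidate data set whose entropy is closest to that of the real data. Every function name, parameter, and default value here is hypothetical, as is the choice of per-attribute Shannon entropy as the similarity measure. It also assumes categorical records stored as tuples with at least two attributes.

```python
import math
import random
from collections import Counter


def attribute_entropies(records):
    """Shannon entropy (in bits) of each attribute's value distribution."""
    n = len(records)
    result = []
    for column in zip(*records):
        counts = Counter(column)
        result.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    return result


def entropy_gap(candidate, real):
    """Sum of absolute per-attribute entropy differences (smaller = more similar)."""
    pairs = zip(attribute_entropies(candidate), attribute_entropies(real))
    return sum(abs(a - b) for a, b in pairs)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: the child inherits a prefix from one parent."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(record, domains, rate, rng):
    """Replace each value, with probability `rate`, by one observed in the real data."""
    return tuple(
        rng.choice(domains[i]) if rng.random() < rate else value
        for i, value in enumerate(record)
    )


def generate(real, size, generations=10, candidates=5, rate=0.05, seed=0):
    """Return the generated data set of `size` records with the smallest entropy gap."""
    rng = random.Random(seed)
    # Restricting mutation to observed values keeps attribute value ranges unchanged.
    domains = [sorted(set(column)) for column in zip(*real)]
    parents = list(real)
    best, best_gap = None, float("inf")
    for _ in range(generations):
        for _ in range(candidates):
            candidate = [
                mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                       domains, rate, rng)
                for _ in range(size)
            ]
            gap = entropy_gap(candidate, real)
            if gap < best_gap:
                best, best_gap = candidate, gap
        # The next generation inherits from the real sample plus the best set so far.
        parents = list(real) + best
    return best


real = [
    ("sunny", "hot", "no"),
    ("rainy", "mild", "yes"),
    ("overcast", "cool", "yes"),
    ("sunny", "mild", "no"),
]
synthetic = generate(real, size=50)
```

Per-attribute entropy only compares value distributions one attribute at a time. Capturing the attribute correlations mentioned in the abstract would need a joint measure, for example the entropy of attribute pairs, which this sketch leaves out.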

