Abstract
Two-phase sampling algorithms draw sample data from massive data sets for use in data mining. Their efficiency is low when the data set is very large, and their sampling accuracy is low when the data set is both very large and sparse. This paper improves on the traditional two-phase sampling algorithm and proposes a new sampling algorithm, EAFAST, which adjusts its parameters adaptively and makes full use of historical information to guide a heuristic search. Experiments show that EAFAST improves efficiency and sampling accuracy at the same time, remedying the shortcomings of the traditional algorithm.
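The abstract does not spell out the steps of either the traditional algorithm or EAFAST. As a rough illustration only, a generic two-phase sampler for frequent item set mining might look like the sketch below: phase one draws a simple random sample and estimates item supports from it, and phase two greedily trims transactions so that the smaller sample keeps item supports close to those estimates. The function names, the L1 distance measure, and the greedy trimming rule are all assumptions made for this sketch; they are not taken from the paper, and the adaptive parameter tuning and history-guided search of EAFAST are not modeled.

```python
import random
from collections import Counter


def item_supports(transactions):
    """Return the fraction of transactions that contain each item."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}


def distance(sample, reference):
    """L1 distance between a sample's item supports and reference supports.

    The choice of L1 distance is an assumption of this sketch.
    """
    sup = item_supports(sample) if sample else {}
    items = set(sup) | set(reference)
    return sum(abs(sup.get(i, 0.0) - reference.get(i, 0.0)) for i in items)


def two_phase_sample(dataset, initial_size, final_size, seed=0):
    """Hypothetical two-phase sampler (not the paper's EAFAST)."""
    rng = random.Random(seed)

    # Phase 1: simple random sample, used to estimate item supports.
    initial = rng.sample(dataset, initial_size)
    reference = item_supports(initial)

    # Phase 2: greedy trimming. Repeatedly drop the transaction whose
    # removal leaves the sample closest to the phase-1 supports.
    sample = list(initial)
    while len(sample) > final_size:
        best = min(
            range(len(sample)),
            key=lambda i: distance(sample[:i] + sample[i + 1:], reference),
        )
        sample.pop(best)
    return sample
```

Each trimming step here rescans the whole sample, so the cost grows quickly with the initial sample size; that kind of overhead on very large data sets is the efficiency problem the abstract refers to.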
Source
《计算机工程与科学》
CSCD
2007, No. 7, pp. 64-66, 70 (4 pages)
Computer Engineering & Science
Funding
Supported by the Natural Science Foundation of Hubei Province (2006ABA082)
Keywords
sampling
two-phase
frequent item set
pruning
accuracy