期刊文献+

多段支持度数据挖掘算法研究 被引量:23

A Data Mining Algorithm Based on Calculating Multi Segment Support
下载PDF
导出
摘要 在基于相联规则的数据挖掘算法中 ,Apriori等算法最为著名 .它分为两个主要步骤 :(1)通过多趟扫描数据库求解出频繁项集 ;(2 )利用频繁项集生成规则 .随后的许多算法都沿用 Apriori中“频繁项集的子集必为频繁项集”的思想 ,在频繁项集 Lk- 1 上进行 JOIN运算构成潜在 k项集 Ck.由于数据库和 Ck 的规模较大 ,需要相当大的计算量才能生成频繁项集 .Apriori Tid算法给每个事务增加了一个唯一标识 Tid ,其特点是只扫描一趟数据库 ,其余趟扫描 (如第 k趟扫描 )均在相应的数据集 Ck上进行 .由于数据规模改变不大 ,各算法的效率差别并不明显 .该文提出分段计算支持度的思想 ,是把一个项集的支持度分段计算 ,每一个段记录该项集在相应规模事务中出现的频度 ,从而构成一个支持度向量 .由于有了项集的多段支持度 ,可以推测出该项集能否包含在更大规模的频繁项集中 ,采用这种算法既提高了在扫描数据库过程中的信息获取率 ,又能及时剔除超集不是频繁项集的项集 ,进一步缩减了潜在项集的规模 .在数据集扫描过程中 ,按文中定理 1的思想调整数据集 。 Among the studies of KDD, R.Agrawal had presented a theory of association rules based on the basket data, which is the famous algorithm Apriori for data mining. The algorithm is executed in two steps, the large itemsets are generated at first and the set of rules generated afterwards. The algorithms presented by others thereafter still use the ideas of Apriori, that is any subset of a large itemset must also be large. Extending the large ( k-1 ) itemsets L k-1 using JOIN operation generates the set of candidate k itemsets C k . The generation of the large itemsets takes up a large amount of calculation, because scale of database is large and also the C k . The algorithm AprioriTid has set an identifier Tid for each transaction and the database is scanned only once, other scans (e.g. kth scan) are executed at corresponding data set C k . But the efficiency increased by the algorithm AprioriTid is not evidently because the difference of scale of database and data set C k is small. A new algorithm using multi segment for support is presented in this paper. The support of an itemset is divided into a lot of segments, and the counting for the different scale of transactions are recorded in corresponding segments. We can predict whether an itemset may be contained in a large itemset of lager scale, because the algorithm can calculate the multi segment support for the itemsets in one scan. This algorithm not only enhances the information gain ratio in database scanning, but also can find that some supersets are not the members of the large itemsets in advance, and to reduce the size of candidate itemsets by deleting these itemsets. In order to increase the efficiency of generating the large itemsets, the reduction of the scale of data set in each scan is according to the theorem 1 in this paper. A performance comparison of this algorithm and Apriori is given at the end of the paper.
出处 《计算机学报》 EI CSCD 北大核心 2001年第6期661-665,共5页 Chinese Journal of Computers
基金 国家自然科学基金!(6 98730 19) 吉林省科委应用基础基金! (19990 5 2 8)资助
关键词 数据挖掘 相联规则 算法 频繁项集 多段支持度 数据库 data mining, association rule, algorithm, large itemset, mult segments support
  • 相关文献

参考文献5

  • 1[1]Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proc ACM SIGMOD Conference on Management of Data, Washington D C, 1993. 207-216
  • 2[2]Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proc 20th VLDB Conference Santiago,Chile, 1994. 487-499
  • 3[3]Houtsma M, Swami A. Set-oriented mining of association rules. IBM Almaden Reserch Center,Research Report RJ 9567, 1993
  • 4[4]Bayardo R, Agrawal R. Mining the most interesting rules. In: Proc KDD-99, San Diego, 1999. 122-131
  • 5[5]Agrawal R, Mannila H, Toivonen H, Verkamo A. Fast discovery of association rules. In: Fayyad U,Piatetsky-Shapiro G Symth P, Uthurusamy R eds. Advances in Knowledge Discovery and Data Mining. New York: AAAI/MIT Press, 1996. 307-328

同被引文献150

引证文献23

二级引证文献87

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部