Research on a Multi-Segment Support Data Mining Algorithm (cited by: 23)

A Data Mining Algorithm Based on Calculating Multi Segment Support


Abstract: Among data mining algorithms based on association rules, Apriori, which R. Agrawal introduced for basket data, is the best known. It proceeds in two main steps: (1) find the frequent (large) itemsets through multiple scans of the database; (2) generate rules from the frequent itemsets. Many later algorithms retain Apriori's central idea that every subset of a frequent itemset must itself be frequent, and they build the set of candidate k-itemsets C_k by a JOIN operation on the frequent (k-1)-itemsets L_{k-1}. Because both the database and C_k are large, generating the frequent itemsets takes a great deal of computation. The AprioriTid algorithm attaches a unique identifier, Tid, to each transaction and scans the database only once; every later pass (for example, the k-th) runs over the corresponding data set C_k instead. Since that data set is not much smaller than the database, the resulting efficiency gain is not pronounced. This paper proposes computing support in segments: the support of an itemset is split into segments, each recording how often the itemset occurs in transactions of the corresponding size, so that the segments together form a support vector. With the multi-segment support of an itemset available from a single scan, one can predict whether the itemset can be contained in a larger frequent itemset. The algorithm therefore extracts more information from each database scan and promptly discards itemsets whose supersets cannot be frequent, which further shrinks the candidate sets. To speed up the generation of frequent itemsets, the data set is also reduced during each scan according to Theorem 1 of the paper. A performance comparison with Apriori is given at the end of the paper.
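The segmented-support idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the paper's Theorem 1 is not reproduced here. The function names (`segmented_support`, `support_at_least`, `can_extend`), the toy transactions, and the pruning test are all assumptions made for illustration. The test relies only on a simple observation: an itemset can be part of a frequent itemset of size m only if it occurs often enough in transactions that hold at least m items.

```python
from collections import defaultdict
from itertools import combinations


def segmented_support(transactions, k):
    """Count every k-itemset, split by the size of the containing transaction.

    Returns {itemset: {transaction_size: count}}; each inner dict plays the
    role of the "support vector" described in the abstract (assumed layout).
    """
    vectors = defaultdict(lambda: defaultdict(int))
    for t in transactions:
        items = sorted(set(t))
        for combo in combinations(items, k):
            vectors[frozenset(combo)][len(items)] += 1
    return vectors


def support_at_least(vector, min_size):
    """Occurrences of an itemset in transactions with at least min_size items."""
    return sum(count for size, count in vector.items() if size >= min_size)


def can_extend(vector, target_size, min_count):
    """Assumed pruning test (necessary, not sufficient): the itemset can lie
    inside a frequent itemset of target_size only if enough transactions of
    that size or larger contain it."""
    return support_at_least(vector, target_size) >= min_count


# Hypothetical toy database of five transactions.
transactions = [
    {"a", "b"},
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "c", "d"},
    {"b", "c"},
]
vectors = segmented_support(transactions, 2)
ab = vectors[frozenset({"a", "b"})]
ac = vectors[frozenset({"a", "c"})]

print(dict(ab))              # segments of {a, b}, keyed by transaction size
print(can_extend(ab, 2, 2))  # is {a, b} itself frequent at min_count = 2?
print(can_extend(ab, 3, 2))  # could {a, b} sit inside a frequent 3-itemset?
print(can_extend(ac, 3, 2))  # could {a, c}?
```

In this toy data, {a, b} is frequent as a 2-itemset but appears in only one transaction of size 3 or more, so a single scan already shows that no frequent 3-itemset can contain it, and it can be dropped before the JOIN step. Passing the test does not guarantee that a frequent superset exists; it only rules out itemsets that cannot have one.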

Authors: Li Xiongfei (李雄飞), Yuan Senmiao (苑森淼), Dong Liyan (董立岩), Quan Bo (全勃)

Affiliation: College of Computer Science, Jilin University

Source: Chinese Journal of Computers (《计算机学报》), 2001, No. 6, pp. 661-665 (5 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

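Funding: Supported by the National Natural Science Foundation of China (69873019) and the Applied Basic Research Fund of the Jilin Provincial Science and Technology Commission (19990528).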

Keywords: data mining, association rule, algorithm, frequent (large) itemset, multi-segment support, database

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]


References (5)

[1] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proc ACM SIGMOD Conference on Management of Data, Washington D C, 1993. 207-216
[2] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proc 20th VLDB Conference, Santiago, Chile, 1994. 487-499
[3] Houtsma M, Swami A. Set-oriented mining of association rules. IBM Almaden Research Center, Research Report RJ 9567, 1993
[4] Bayardo R, Agrawal R. Mining the most interesting rules. In: Proc KDD-99, San Diego, 1999. 122-131
[5] Agrawal R, Mannila H, Toivonen H, Verkamo A. Fast discovery of association rules. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R eds. Advances in Knowledge Discovery and Data Mining. New York: AAAI/MIT Press, 1996. 307-328

