Unified Computing Framework for Data Cubes and Frequent Itemsets (数据立方体与频繁项集的统一计算框架研究)

Abstract: Data cubes and frequent itemset mining are important techniques in data warehousing and data mining, respectively. Both have been studied extensively, and substantial progress has been made. The two build similar algebraic lattice structures over their respective data cells and itemsets, and the upper-bound cells of a data cube's equivalence classes correspond to the closed itemsets of frequent itemset mining. If the unity of the two can be established, each field can offer the other a broader range of research ideas, and the two techniques can advance each other. For example, iceberg cube computation can implement frequent itemset mining inside the database and thereby avoid data migration, and frequent itemset mining algorithms can be used to optimize data cube computation. Previous work has neither studied the two together systematically nor established a reasonably complete connection between them. After an in-depth study of data cube computation and the frequent itemset mining process, this paper combines the two, proposes a unified computation framework, and gives the mapping between the many computational properties and methods of the two lattice structures. High-performance lattice computation methods and application algorithms from the two fields can therefore be integrated, improving the performance of lattice structure usage and the efficiency and accuracy of data analysis. On this basis, the related concepts are generalized. Specifically, correspondences are established between the computation of the main data cubes, namely the iceberg cube, the condensed cube, and the quotient cube, and the corresponding frequent itemset mining methods. The effectiveness of unified computation is further demonstrated through algorithms and experiments. First, the transaction datasets used for frequent itemset mining are imported into a relational database and mined with the iceberg cube computation method, so that frequent itemset mining over relational tables can be carried out in the database with standard or extended SQL. Second, the unity of the condensed cube and frequent itemset mining is verified experimentally, and the computation efficiency is compared. Third, the base table is converted into a transaction dataset, and LCM, an efficient frequent itemset mining algorithm, is introduced to compute the quotient cube and thereby improve the efficiency of data cube computation. An example is also given to illustrate the unified algorithm. The correctness of the combined, unified computation is verified on publicly available real datasets and on synthetic datasets, and its effectiveness is compared by varying the number of tuples, the number of dimensions, and the skewness. The experiments show that time efficiency can be improved by up to 92% on large datasets.
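The correspondence described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(dimension, value)` item encoding, and the toy table are assumptions made for illustration, and brute-force counting stands in for the real algorithms (such as LCM). The sketch encodes each base-table tuple as a transaction, checks that the iceberg cells (apex cell excluded) coincide with the frequent itemsets at the same threshold, and extracts the closed itemsets, which play the role of the upper-bound cells of the cube's equivalence classes.

```python
from collections import Counter
from itertools import combinations

def table_to_transactions(rows, dims):
    """Encode each tuple as a transaction of (dimension, value) items."""
    return [frozenset(zip(dims, row)) for row in rows]

def iceberg_cells(rows, dims, min_count):
    """Count every non-apex group-by cell and keep those with
    COUNT(*) >= min_count (GROUP BY each non-empty subset of dims)."""
    counts = Counter()
    for row in rows:
        for k in range(1, len(dims) + 1):
            for idx in combinations(range(len(dims)), k):
                counts[frozenset((dims[i], row[i]) for i in idx)] += 1
    return {cell: c for cell, c in counts.items() if c >= min_count}

def frequent_itemsets(transactions, min_support):
    """Brute-force support counting over all non-empty subsets of each transaction."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

def closed_only(itemsets):
    """Keep itemsets that have no proper superset with the same support."""
    return {s: c for s, c in itemsets.items()
            if not any(s < t and c == itemsets[t] for t in itemsets)}

# Hypothetical toy base table with three dimensions.
dims = ("store", "product", "season")
rows = [
    ("S1", "P1", "spring"),
    ("S1", "P2", "spring"),
    ("S2", "P1", "fall"),
    ("S1", "P1", "spring"),
]

transactions = table_to_transactions(rows, dims)
cells = iceberg_cells(rows, dims, min_count=2)
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_only(frequent)
```

On this toy table the iceberg cells and the frequent itemsets are the same collection, and only three of them are closed. Counting every subset is exponential in the number of dimensions, so the sketch is suitable only for showing the mapping, not for measuring efficiency.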

Authors: XU Jing-Wen; YOU Jin-Guo; WANG Quan-Kun; HUANG Xing-Rui; JIA Lian-Yin (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Yunnan Key Laboratory of Artificial Intelligence, Kunming 650500)

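Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology; Yunnan Key Laboratory of Artificial Intelligence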

Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2023, No. 4, pp. 780-802 (23 pages)

Funding: Supported by the National Natural Science Foundation of China (No. 62062046, No. 61462050).

Keywords: data cube; frequent itemset mining; lattice structure; unified computation method; computation efficiency

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
