期刊文献+

数据立方体与频繁项集的统一计算框架研究

Unified Computing Framework for Data Cubes and Frequent Itemsets
下载PDF
导出
摘要 数据立方体和频繁项集挖掘分别是数据仓库和数据挖掘领域的重要技术,已开展了大量的相关研究工作,取得了较好的进展.数据立方体和频繁项集挖掘依据各自的数据单元和项集构造了类似的代数格(Lattice)结构;数据立方体的等价类上界单元与频繁项集挖掘的闭项集也是相对应的.如果能够论证二者的统一性,则可以为彼此提供更广泛的研究思路,有利于两种技术的相互促进,如:在数据库中利用冰山立方体计算实现频繁项集挖掘来避免数据迁移、利用频繁项集挖掘算法优化数据立方体计算等.之前的工作没有将二者系统地结合起来研究,也没有建立二者之间较为完整的联系.本文在深入研究数据立方体的计算和频繁项集挖掘的过程后,将二者有效地结合在一起,提出了统一的计算框架,给出了二者众多计算性质和方法之间的映射关系,进行了相关概念泛化,具体地建立了冰山立方体、浓缩立方体和商立方体等主要数据立方体计算与相应频繁项集挖掘方法的对应关系.通过算法和实验进一步论证统一计算的有效性:(1)将频繁项集挖掘事务集导入关系数据库,用冰山立方体计算方式进行频繁项集挖掘,从而在数据库中用标准的或扩展的SQL可以实现对关系表进行频繁项集挖掘;(2)验证了浓缩立方体与频繁项集挖掘的统一性并对比了计算效率;(3)将基本表转换为频繁项集挖掘事务集,引入高效的频繁项集挖掘算法LCM计算商立方体,以提升数据立方体计算效率.在公开的真实数据集和人工合成的数据集上验证二者结合、统一计算的正确性,通过改变元组数、维数和倾斜度进行对比验证有效性.实验发现,在大数据集上可令时间效率提升高达92%. Data cube and frequent itemset mining are essential technologies in the field of data warehouse and data mining respectively.A lot of relevant research work has been conducted,and impressive progress has been achieved.Data cube and frequent itemset mining construct similar algebraic lattice structures according to their data cell and itemset.Simultaneously,the upper bounds of the equivalent class of the data cube correspond to the closed itemset of frequent itemset mining.If the unity of the two lattice structures can be argued,they can provide broader research ideas for each other and facilitate the mutual promotion of the two techniques.For example,we can use iceberg cube computing to implement frequent itemset mining to avoid data migration in databases,and use frequent itemset mining algorithms to optimize data cube computing.Previous studies have not studied the two concepts with systematic combination,nor have they established a complete connection between them.After intensely studying the computation of data cube and the process of frequent itemset mining,this paper combines data cube and frequent itemset mining effectively.The paper proposes a unified computation framework,and gives the mapping relationship between multitudes of computational properties and methods of two lattice structure.Therefore,the high-performance lattice structure computation methods and application algorithms in the two fields can be integrated,thus improving the performance of lattice structure usage and enhancing the efficiency and accuracy of data analysis.On the basis of these results,related concept generalization was performed.Specifically,the corresponding relationship between the computation method of classic data cubee such as the iceberg cube,the condensed cube,and the quotient cube and the corresponding frequent itemset mining method is established.The effectiveness of the unified computation framework is further demonstrated by algorithms and experiments.First,the transaction datasets of frequent itemset mining are imported into the relational database,and the frequent itemset mining is performed with the iceberg cube computation method,so that frequent itemset mining of relational tables can be implemented in the database with standard SQL-92 or extended SQL.Secondly,the unification of the condensed cube and the frequent itemset mining is verified by experiments and the computation efficiency is compared.Finally,the base table is converted into a transaction dataset of frequent itemset mining,and LCM,an effective algorithm for frequent itemset mining,is introduced to implement the quotient cube computation to improve the computation efficiency of the data cube.At the meanwhile,we give an example to illustrate and explain the unified algorithm.The correctness of the combination and unified framework is verified by the experiments both on the publicly available real datasets and the synthetic datasets.Besides,the effectiveness is compared and verified by changing the number of tuples,dimensions,and skewness.It is found that the time efficiency can be improved by up to 92%on large datasets.
作者 徐静文 游进国 王全鹍 黄星瑞 贾连印 XU Jing-Wen;YOU Jin-Guo;WANG Quan-Kun;HUANG Xing-Rui;JIA Lian-Yin(Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650500;Yunnan Key Laboratory of Artificial Intelligence,Kunming 650500)
出处 《计算机学报》 EI CAS CSCD 北大核心 2023年第4期780-802,共23页 Chinese Journal of Computers
基金 国家自然科学基金项目(No.62062046,No.61462050)资助.
关键词 数据立方体 频繁项集挖掘 格结构 统一计算方法 计算效率 data cube frequent itemset mining lattice structure unified computation method computation efficiency
  • 相关文献

参考文献11

二级参考文献48

  • 1曲开社,翟岩慧.偏序集、包含度与形式概念分析[J].计算机学报,2006,29(2):219-226. 被引量:52
  • 2Lakshmanan LVS, Pei J, Han JW. Quotient cube: How to summarize the semantics of a data cube. In: Bressan S, Chaudhri AB, Lee ML, Yu JX, Lacroix Z, eds. Proc. of the 23rd Int'l Conf. on Very Large Data Bases. Hong Kong: Morgan Kaufmann, 2002. 778~789.
  • 3Sismanis Y, Deligiannakis A, Roussopoulos N, Kotidis Y. Dwarf: Shrinking the PetaCube. In: Franklin MJ, Moon B, Ailamaki A, eds. Proc. of the 2002 ACM SIGMOD Int'l Conf. on Management of Data. Madison: ACM Press, 2002. 464~475.
  • 4Mumick IS, Quass D, Mumick BS. Maintenance of data cubes and summary tables in a warehouse. In: Peckham J, ed. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Tucson: ACM Press, 1997. 100-111.
  • 5Hahn C, Warren S, London J. Edited synoptic cloud reports from ships and land stations over the globe. 1996. http://cdiac.esd.ornl.gov/cdiac/ndps/ndp026b.html
  • 6Gray J, Bosworth A, Layman A, Pirahesh H. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In: Su SYW, ed. Proc. of the 12th Int'l Conf. on Data Engineering. New Orleans: IEEE Computer Society, 1996. 152~159.
  • 7Agarwal S, Agrawal R, Deshpande PM, Gupta A, Naughton JF, Ramarkrishman R, Sarawagi S. On the computation of multidimensional aggregates. In: Vijayaraman TM, Buchmann AP, Mohan C, Sarda NL, eds. Proc. of the 22nd Int'l Conf. on Very Large Data Bases. Mumb
  • 8Zhao Y, Deshpande PM, Naughton JF. An array-based algorithm for simultaneous multidimensional. In: Peckham J, ed. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Tucson: ACM Press, 1997. 159-170.
  • 9Ross KA, Srivastava D. Fast computation of sparse datacubes. In: Jarke M, Carey MJ, Dittrich KR, Lochovsky FH, Loucopoulos P, Jeusfeld MA, eds. Proc. of the 23rd Int'l Conf. on Very Large Data Bases. Athens: Morgan Kaufmann, 1997. 116~125.
  • 10Harinarayan V, Rajaraman A, Ullman JD. Implementing data cubes efficiently. In: Jagadish HV, Mumick IS, eds. Proc. of the 1996 ACM SIGMOD Int'l Conf. on Management of Data. Montreal: ACM Press, 1996. 205-216.

共引文献102

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部