Abstract
Traditional high utility itemset mining algorithms report false-positive high utility itemsets when applied to transactions that carry different class labels. To address this problem, this paper proposes two high utility itemset mining algorithms based on statistical significance testing, FHUI and PHUI. Both algorithms first find all candidate high utility itemsets to be tested and group them by itemset length. FHUI then generates the null distribution from each itemset's own frequency distribution, whereas PHUI generates it from permuted transaction sets constructed with either a within-transaction or a between-transaction permutation strategy. Finally, both algorithms compute p-values from the null distributions and apply the false discovery rate to eliminate false-positive high utility itemsets. Experiments on benchmark transaction sets show that FHUI and PHUI eliminate a large number of false-positive high utility itemsets and achieve higher accuracy in subsequent classification tasks. Experiments on synthetic transaction sets show that false-positive high utility itemsets make up less than 4.8% of the itemsets the two algorithms report, and that the average utility of the reported itemsets is higher than 39000. These results show that, in transactions with different class labels, the statistically significant high utility itemsets reported by FHUI and PHUI are more reliable and more practical.
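The final filtering step described in the abstract (p-values from a null distribution, then false discovery rate control) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the test statistic or the FDR procedure, so the sketch assumes a generic numeric score per itemset, an empirical p-value with the usual +1 correction, and the Benjamini-Hochberg step-up procedure. The function names, the itemset labels, and the toy numbers are all hypothetical, and a single length group is treated in isolation.

```python
from typing import Dict, List, Sequence


def empirical_p_value(observed: float, null_values: Sequence[float]) -> float:
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator is a common correction that keeps
    the p-value strictly above zero (an assumption, not taken from the paper).
    """
    at_least = sum(1 for v in null_values if v >= observed)
    return (at_least + 1) / (len(null_values) + 1)


def benjamini_hochberg(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Return the keys retained by the Benjamini-Hochberg step-up procedure.

    Assumed here as one possible way to control the false discovery rate.
    """
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if p <= alpha * rank / m:
            cutoff = rank
    return [key for key, _ in ranked[:cutoff]]


# Toy data for one length group: a shared null distribution of 99 scores
# and three hypothetical candidate itemsets with observed scores.
null_scores = list(range(1, 100))
observed_scores = {"ab": 150, "cd": 50, "ef": 99}

p_values = {
    itemset: empirical_p_value(score, null_scores)
    for itemset, score in observed_scores.items()
}
significant = benjamini_hochberg(p_values, alpha=0.05)
print(p_values)
print(significant)
```

In this toy run the candidate whose score sits in the middle of the null distribution is dropped, while the two candidates at or beyond its upper tail are kept. In the paper's setting the null scores would come from FHUI's frequency-based distribution or PHUI's permuted transaction sets rather than from a fixed list.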
Authors
吴军 (Wu Jun)
魏丹丹 (Wei Dandan)
欧阳艾嘉 (Ouyang Aijia)
王亚 (Wang Ya)
School of Information Engineering, Zunyi Normal University, Zunyi, Guizhou 563000, China
Source
《计算机应用研究》 (Application Research of Computers)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2024, No. 10, pp. 2970-2977 (8 pages)
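Funding
National Natural Science Foundation of China (62066049)
Youth Program for Higher Education Institutions, Guizhou Provincial Department of Education (黔教技[2022]313, 黔教合KY[2022]015)
Science and Technology Support Program, Guizhou Provincial Department of Science and Technology (黔科合支撑[2023]257)
Zunyi Science and Technology Cooperation Program (遵市科合HZ字(2022)123)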