In this letter, on the basis of Frequent Pattern(FP) tree, the support function to update FP-tree is introduced, then an Incremental FP (IFP) algorithm for mining association rules is proposed. IFP algorithm considers...In this letter, on the basis of Frequent Pattern(FP) tree, the support function to update FP-tree is introduced, then an Incremental FP (IFP) algorithm for mining association rules is proposed. IFP algorithm considers not only adding new data into the database but also reducing old data from the database. Furthermore, it can predigest five cases to three cases.The algorithm proposed in this letter can avoid generating lots of candidate items, and it is high efficient.展开更多
Discovering regularities between entities in temporal graphs is vital for many real-world applications(e.g.,social recommendation,emergency event detection,and cyberattack event detection).This paper proposes temporal...Discovering regularities between entities in temporal graphs is vital for many real-world applications(e.g.,social recommendation,emergency event detection,and cyberattack event detection).This paper proposes temporal graph association rules(TGARs)that extend traditional graph-pattern association rules in a static graph by incorporating the unique temporal information and constraints.We introduce quality measures(e.g.,support,confidence,and diversification)to characterize meaningful TGARs that are useful and diversified.In addition,the proposed support metric is an upper bound for alternative metrics,allowing us to guarantee a superset of patterns.We extend conventional confidence measures in terms of maximal occurrences of TGARs.The diversification score strikes a balance between interestingness and diversity.Although the problem is NP-hard,we develop an effective discovery algorithm for TGARs that integrates TGARs generation and TGARs selection and shows that mining TGARs is feasible over a temporal graph.We propose pruning strategies to filter TGARs that have low support or cannot make top-k as early as possible.Moreover,we design an auxiliary data structure to prune the TGARs that do not meet the constraints during the TGARs generation process to avoid conducting repeated subgraph matching for each extension in the search space.We experimentally verify the effectiveness,efficiency,and scalability of our algorithms in discovering diversified top-k TGARs from temporal graphs in real-life applications.展开更多
Maximum frequent pattern generation from a large database of transactions and items for association rule mining is an important research topic in data mining. Association rule mining aims to discover interesting corre...Maximum frequent pattern generation from a large database of transactions and items for association rule mining is an important research topic in data mining. Association rule mining aims to discover interesting correlations, frequent patterns, associations, or causal structures between items hidden in a large database. By exploiting quantum computing, we propose an efficient quantum search algorithm design to discover the maximum frequent patterns. We modified Grover’s search algorithm so that a subspace of arbitrary symmetric states is used instead of the whole search space. We presented a novel quantum oracle design that employs a quantum counter to count the maximum frequent items and a quantum comparator to check with a minimum support threshold. The proposed derived algorithm increases the rate of the correct solutions since the search is only in a subspace. Furthermore, our algorithm significantly scales and optimizes the required number of qubits in design, which directly reflected positively on the performance. Our proposed design can accommodate more transactions and items and still have a good performance with a small number of qubits.展开更多
By analyzing the existing prefix-tree data structure, an improved pattern tree was introduced for processing new transactions. It firstly stored transactions in a lexicographic order tree and then restructured the tre...By analyzing the existing prefix-tree data structure, an improved pattern tree was introduced for processing new transactions. It firstly stored transactions in a lexicographic order tree and then restructured the tree by sorting each path in a frequency-descending order. While updating the improved pattern tree, there was no need to rescan the entire new database or reconstruct a new tree for incremental updating. A test was performed on synthetic dataset T10I4D100K with 100,000 transactions and 870 items. Experimental results show that the smaller the minimum support threshold, the faster the improved pattern tree achieves over CanTree for all datasets. As the minimum support threshold increased from 2% to 3.5%, the runtime decreased from 452.71 s to 186.26 s. Meanwhile, the runtime required by CanTree decreased from 1,367.03 s to 432.19 s. When the database was updated, the execution time of im- proved pattern tree consisted of construction of original improved pattern trees and reconstruction of initial tree. The experiment results showed that the runtime was saved by about 15% compared with that of CanTree. As the number of transactions increased, the runtime of improved pattern tree was about 25% shorter than that of FP-tree. The improved pattern tree also required less memory than CanTree.展开更多
Despite advances in technological complexity and efforts,software repository maintenance requires reusing the data to reduce the effort and complexity.However,increasing ambiguity,irrelevance,and bugs while extracting...Despite advances in technological complexity and efforts,software repository maintenance requires reusing the data to reduce the effort and complexity.However,increasing ambiguity,irrelevance,and bugs while extracting similar data during software development generate a large amount of data from those data that reside in repositories.Thus,there is a need for a repository mining technique for relevant and bug-free data prediction.This paper proposes a fault prediction approach using a data-mining technique to find good predictors for high-quality software.To predict errors in mining data,the Apriori algorithm was used to discover association rules by fixing confidence at more than 40%and support at least 30%.The pruning strategy was adopted based on evaluation measures.Next,the rules were extracted from three projects of different domains;the extracted rules were then combined to obtain the most popular rules based on the evaluation measure values.To evaluate the proposed approach,we conducted an experimental study to compare the proposed rules with existing ones using four different industrial projects.The evaluation showed that the results of our proposal are promising.Practitioners and developers can utilize these rules for defect prediction during early software development.展开更多
This paper presents a new efficient algorithm for mining frequent closed itemsets. It enumerates the closed set of frequent itemsets by using a novel compound frequent itemset tree that facilitates fast growth and eff...This paper presents a new efficient algorithm for mining frequent closed itemsets. It enumerates the closed set of frequent itemsets by using a novel compound frequent itemset tree that facilitates fast growth and efficient pruning of search space. It also employs a hybrid approach that adapts search strategies, representations of projected transaction subsets, and projecting methods to the characteristics of the dataset. Efficient local pruning, global subsumption checking, and fast hashing methods are detailed in this paper. The principle that balances the overheads of search space growth and pruning is also discussed. Extensive experimental evaluations on real world and artificial datasets showed that our algorithm outperforms CHARM by a factor of five and is one to three orders of magnitude more efficient than CLOSET and MAFIA.展开更多
目的:通过数据挖掘总结稳定型心绞痛的证候分布特点及组方用药规律。方法:搜索国家知识基础设施数据库(CNKI)、中国学术期刊数据库(CSPD)、中文科技期刊数据库(CCD)、PubMed、Web of Science数据库从建库至2022年2月收录的有关中医药治...目的:通过数据挖掘总结稳定型心绞痛的证候分布特点及组方用药规律。方法:搜索国家知识基础设施数据库(CNKI)、中国学术期刊数据库(CSPD)、中文科技期刊数据库(CCD)、PubMed、Web of Science数据库从建库至2022年2月收录的有关中医药治疗稳定型心绞痛的相关文献,建立方药数据库,并对高频药物进行关联规则及系统聚类分析。结果:共纳入文献105篇,处方111首,涉及患者9 786例,涵盖中医证型21种,频次最高证型为气虚血瘀,共提取出7个病性证素和2个病位证素。涉及中药123味,共计使用频次1 162次,使用频次前5位的中药分别是丹参70次、川芎69次、甘草54次、黄芪53次、当归50次。前30味高频药物关联规则分析显示,支持度最高的前3位药对为川芎-当归、川芎-赤芍、川芎-红花,聚类分析得到4个药物核心组合。结论:稳定型心绞痛常见气虚血瘀、痰瘀互结、气滞血瘀证;病性证素以实性居多,常见血瘀、痰浊,虚性证素主要为气虚;用药主要以活血化瘀、理气通滞、化痰降浊、益气通脉为法,是现代中医治疗稳定型心绞痛的主导思想。展开更多
目的:运用关联规则和复杂系统熵聚类分析治疗儿童和青少年近视的选穴规律。方法:计算机检索中国知网(CNKI)、万方(Wan-fang)、维普(VIP)、中国生物医学文献数据库(CBM)、PubMed、Web of science、Embase等中英文数据库自建库以来有关耳...目的:运用关联规则和复杂系统熵聚类分析治疗儿童和青少年近视的选穴规律。方法:计算机检索中国知网(CNKI)、万方(Wan-fang)、维普(VIP)、中国生物医学文献数据库(CBM)、PubMed、Web of science、Embase等中英文数据库自建库以来有关耳穴贴压治疗儿童青少年近视的临床研究,建立耳穴处方数据库,并采用中医传承辅助平台(V2.5)进行数据分析。结果:最终纳入文献231篇,包含61310例患者,涉及耳穴51个;选穴数量多为6~8个,干预周期平均6.5周;高频耳穴为肝穴、眼穴、肾穴等;得到由肝穴、眼穴、肾穴、屏间前穴和屏间后穴交叉配伍组成的常用耳穴组合10组,获取强关联耳穴组合规则18条;提取包含3个耳穴的核心组合19组;演化出耳穴新方6个,包括心穴+脑干穴+胰胆穴+肾穴+脾穴、皮质下穴+角窝中穴+脑干穴+胃穴、屏间后穴+屏间前穴+耳背心穴+肝穴+肾穴、屏间后穴+眼穴+大肠穴+肾穴、脑干穴+新眼穴+胃穴+胰胆穴、交感穴+神门穴+新眼穴+内分泌穴+胰胆穴。结论:本研究探索出的耳穴配伍规律、核心组合及新处方,可为今后临床治疗儿童和青少年近视提供诊治新思路。展开更多
挖掘最大频繁项目集是多种数据挖掘应用中的关键问题,之前的很多研究都是采用Apriori类的候选项目集生成-检验方法.然而,候选项目集产生的代价是很高的,尤其是在存在大量强模式和/或长模式的时候.提出了一种快速的基于频繁模式树(FP-tr...挖掘最大频繁项目集是多种数据挖掘应用中的关键问题,之前的很多研究都是采用Apriori类的候选项目集生成-检验方法.然而,候选项目集产生的代价是很高的,尤其是在存在大量强模式和/或长模式的时候.提出了一种快速的基于频繁模式树(FP-tree)的最大频繁项目集挖掘DMFIA(discover maximum frequent itemsets algorithm)及其更新算法UMFIA(update maximum frequent itemsets algorithm).算法UMFIA将充分利用以前的挖掘结果来减少在更新的数据库中发现新的最大频繁项目集的费用.展开更多
基金Supported in part by the National Natural Science Foundation of China(No.60073012),Natural Science Foundation of Jiangsu(BK2001004)
文摘In this letter, on the basis of Frequent Pattern(FP) tree, the support function to update FP-tree is introduced, then an Incremental FP (IFP) algorithm for mining association rules is proposed. IFP algorithm considers not only adding new data into the database but also reducing old data from the database. Furthermore, it can predigest five cases to three cases.The algorithm proposed in this letter can avoid generating lots of candidate items, and it is high efficient.
基金This work was partially supported by the National Key Research and Development Program(No.2018YFB1800203)National Natural Science Foundation of China(No.U19B2024)Postgraduate Scientific Research Innovation Project of Hunan Province(No.CX20210038).
文摘Discovering regularities between entities in temporal graphs is vital for many real-world applications(e.g.,social recommendation,emergency event detection,and cyberattack event detection).This paper proposes temporal graph association rules(TGARs)that extend traditional graph-pattern association rules in a static graph by incorporating the unique temporal information and constraints.We introduce quality measures(e.g.,support,confidence,and diversification)to characterize meaningful TGARs that are useful and diversified.In addition,the proposed support metric is an upper bound for alternative metrics,allowing us to guarantee a superset of patterns.We extend conventional confidence measures in terms of maximal occurrences of TGARs.The diversification score strikes a balance between interestingness and diversity.Although the problem is NP-hard,we develop an effective discovery algorithm for TGARs that integrates TGARs generation and TGARs selection and shows that mining TGARs is feasible over a temporal graph.We propose pruning strategies to filter TGARs that have low support or cannot make top-k as early as possible.Moreover,we design an auxiliary data structure to prune the TGARs that do not meet the constraints during the TGARs generation process to avoid conducting repeated subgraph matching for each extension in the search space.We experimentally verify the effectiveness,efficiency,and scalability of our algorithms in discovering diversified top-k TGARs from temporal graphs in real-life applications.
文摘Maximum frequent pattern generation from a large database of transactions and items for association rule mining is an important research topic in data mining. Association rule mining aims to discover interesting correlations, frequent patterns, associations, or causal structures between items hidden in a large database. By exploiting quantum computing, we propose an efficient quantum search algorithm design to discover the maximum frequent patterns. We modified Grover’s search algorithm so that a subspace of arbitrary symmetric states is used instead of the whole search space. We presented a novel quantum oracle design that employs a quantum counter to count the maximum frequent items and a quantum comparator to check with a minimum support threshold. The proposed derived algorithm increases the rate of the correct solutions since the search is only in a subspace. Furthermore, our algorithm significantly scales and optimizes the required number of qubits in design, which directly reflected positively on the performance. Our proposed design can accommodate more transactions and items and still have a good performance with a small number of qubits.
基金Supported by National Natural Science Foundation of China (No.50975193)Specialized Research Fund for Doctoral Program of Higher Education of China (No.20060056016)
文摘By analyzing the existing prefix-tree data structure, an improved pattern tree was introduced for processing new transactions. It firstly stored transactions in a lexicographic order tree and then restructured the tree by sorting each path in a frequency-descending order. While updating the improved pattern tree, there was no need to rescan the entire new database or reconstruct a new tree for incremental updating. A test was performed on synthetic dataset T10I4D100K with 100,000 transactions and 870 items. Experimental results show that the smaller the minimum support threshold, the faster the improved pattern tree achieves over CanTree for all datasets. As the minimum support threshold increased from 2% to 3.5%, the runtime decreased from 452.71 s to 186.26 s. Meanwhile, the runtime required by CanTree decreased from 1,367.03 s to 432.19 s. When the database was updated, the execution time of im- proved pattern tree consisted of construction of original improved pattern trees and reconstruction of initial tree. The experiment results showed that the runtime was saved by about 15% compared with that of CanTree. As the number of transactions increased, the runtime of improved pattern tree was about 25% shorter than that of FP-tree. The improved pattern tree also required less memory than CanTree.
基金This research was financially supported in part by the Ministry of Trade,Industry and Energy(MOTIE)and Korea Institute for Advancement of Technology(KIAT)through the International Cooperative R&D program.(Project No.P0016038)in part by the MSIT(Ministry of Science and ICT),Korea,under the ITRC(Information Technology Research Center)support program(IITP-2021-2016-0-00312)supervised by the IITP(Institute for Information&communications Technology Planning&Evaluation).
文摘Despite advances in technological complexity and efforts,software repository maintenance requires reusing the data to reduce the effort and complexity.However,increasing ambiguity,irrelevance,and bugs while extracting similar data during software development generate a large amount of data from those data that reside in repositories.Thus,there is a need for a repository mining technique for relevant and bug-free data prediction.This paper proposes a fault prediction approach using a data-mining technique to find good predictors for high-quality software.To predict errors in mining data,the Apriori algorithm was used to discover association rules by fixing confidence at more than 40%and support at least 30%.The pruning strategy was adopted based on evaluation measures.Next,the rules were extracted from three projects of different domains;the extracted rules were then combined to obtain the most popular rules based on the evaluation measure values.To evaluate the proposed approach,we conducted an experimental study to compare the proposed rules with existing ones using four different industrial projects.The evaluation showed that the results of our proposal are promising.Practitioners and developers can utilize these rules for defect prediction during early software development.
文摘This paper presents a new efficient algorithm for mining frequent closed itemsets. It enumerates the closed set of frequent itemsets by using a novel compound frequent itemset tree that facilitates fast growth and efficient pruning of search space. It also employs a hybrid approach that adapts search strategies, representations of projected transaction subsets, and projecting methods to the characteristics of the dataset. Efficient local pruning, global subsumption checking, and fast hashing methods are detailed in this paper. The principle that balances the overheads of search space growth and pruning is also discussed. Extensive experimental evaluations on real world and artificial datasets showed that our algorithm outperforms CHARM by a factor of five and is one to three orders of magnitude more efficient than CLOSET and MAFIA.
文摘目的:通过数据挖掘总结稳定型心绞痛的证候分布特点及组方用药规律。方法:搜索国家知识基础设施数据库(CNKI)、中国学术期刊数据库(CSPD)、中文科技期刊数据库(CCD)、PubMed、Web of Science数据库从建库至2022年2月收录的有关中医药治疗稳定型心绞痛的相关文献,建立方药数据库,并对高频药物进行关联规则及系统聚类分析。结果:共纳入文献105篇,处方111首,涉及患者9 786例,涵盖中医证型21种,频次最高证型为气虚血瘀,共提取出7个病性证素和2个病位证素。涉及中药123味,共计使用频次1 162次,使用频次前5位的中药分别是丹参70次、川芎69次、甘草54次、黄芪53次、当归50次。前30味高频药物关联规则分析显示,支持度最高的前3位药对为川芎-当归、川芎-赤芍、川芎-红花,聚类分析得到4个药物核心组合。结论:稳定型心绞痛常见气虚血瘀、痰瘀互结、气滞血瘀证;病性证素以实性居多,常见血瘀、痰浊,虚性证素主要为气虚;用药主要以活血化瘀、理气通滞、化痰降浊、益气通脉为法,是现代中医治疗稳定型心绞痛的主导思想。
文摘目的:运用关联规则和复杂系统熵聚类分析治疗儿童和青少年近视的选穴规律。方法:计算机检索中国知网(CNKI)、万方(Wan-fang)、维普(VIP)、中国生物医学文献数据库(CBM)、PubMed、Web of science、Embase等中英文数据库自建库以来有关耳穴贴压治疗儿童青少年近视的临床研究,建立耳穴处方数据库,并采用中医传承辅助平台(V2.5)进行数据分析。结果:最终纳入文献231篇,包含61310例患者,涉及耳穴51个;选穴数量多为6~8个,干预周期平均6.5周;高频耳穴为肝穴、眼穴、肾穴等;得到由肝穴、眼穴、肾穴、屏间前穴和屏间后穴交叉配伍组成的常用耳穴组合10组,获取强关联耳穴组合规则18条;提取包含3个耳穴的核心组合19组;演化出耳穴新方6个,包括心穴+脑干穴+胰胆穴+肾穴+脾穴、皮质下穴+角窝中穴+脑干穴+胃穴、屏间后穴+屏间前穴+耳背心穴+肝穴+肾穴、屏间后穴+眼穴+大肠穴+肾穴、脑干穴+新眼穴+胃穴+胰胆穴、交感穴+神门穴+新眼穴+内分泌穴+胰胆穴。结论:本研究探索出的耳穴配伍规律、核心组合及新处方,可为今后临床治疗儿童和青少年近视提供诊治新思路。
文摘挖掘最大频繁项目集是多种数据挖掘应用中的关键问题,之前的很多研究都是采用Apriori类的候选项目集生成-检验方法.然而,候选项目集产生的代价是很高的,尤其是在存在大量强模式和/或长模式的时候.提出了一种快速的基于频繁模式树(FP-tree)的最大频繁项目集挖掘DMFIA(discover maximum frequent itemsets algorithm)及其更新算法UMFIA(update maximum frequent itemsets algorithm).算法UMFIA将充分利用以前的挖掘结果来减少在更新的数据库中发现新的最大频繁项目集的费用.