Association rules mining is a major data mining field that leads to discovery of associations and correlations among items in today’s big data environment. The conventional association rule mining focuses mainly on p...Association rules mining is a major data mining field that leads to discovery of associations and correlations among items in today’s big data environment. The conventional association rule mining focuses mainly on positive itemsets generated from frequently occurring itemsets (PFIS). However, there has been a significant study focused on infrequent itemsets with utilization of negative association rules to mine interesting frequent itemsets (NFIS) from transactions. In this work, we propose an efficient backward calculating negative frequent itemset algorithm namely EBC-NFIS for computing backward supports that can extract both positive and negative frequent itemsets synchronously from dataset. EBC-NFIS algorithm is based on popular e-NFIS algorithm that computes supports of negative itemsets from the supports of positive itemsets. The proposed algorithm makes use of previously computed supports from memory to minimize the computation time. In addition, association rules, i.e. positive and negative association rules (PNARs) are generated from discovered frequent itemsets using EBC-NFIS algorithm. The efficiency of the proposed algorithm is verified by several experiments and comparing results with e-NFIS algorithm. The experimental results confirm that the proposed algorithm successfully discovers NFIS and PNARs and runs significantly faster than conventional e-NFIS algorithm.展开更多
Frequent itemset mining is an essential problem in data mining and plays a key role in many data mining applications.However,users’personal privacy will be leaked in the mining process.In recent years,application of ...Frequent itemset mining is an essential problem in data mining and plays a key role in many data mining applications.However,users’personal privacy will be leaked in the mining process.In recent years,application of local differential privacy protection models to mine frequent itemsets is a relatively reliable and secure protection method.Local differential privacy means that users first perturb the original data and then send these data to the aggregator,preventing the aggregator from revealing the user’s private information.We propose a novel framework that implements frequent itemset mining under local differential privacy and is applicable to user’s multi-attribute.The main technique has bitmap encoding for converting the user’s original data into a binary string.It also includes how to choose the best perturbation algorithm for varying user attributes,and uses the frequent pattern tree(FP-tree)algorithm to mine frequent itemsets.Finally,we incorporate the threshold random response(TRR)algorithm in the framework and compare it with the existing algorithms,and demonstrate that the TRR algorithm has higher accuracy for mining frequent itemsets.展开更多
With the advent of the IoT era, the amount of real-time data that is processed in data centers has increased explosively. As a result, stream mining, extracting useful knowledge from a huge amount of data in real time...With the advent of the IoT era, the amount of real-time data that is processed in data centers has increased explosively. As a result, stream mining, extracting useful knowledge from a huge amount of data in real time, is attracting more and more attention. It is said, however, that real- time stream processing will become more difficult in the near future, because the performance of processing applications continues to increase at a rate of 10% - 15% each year, while the amount of data to be processed is increasing exponentially. In this study, we focused on identifying a promising stream mining algorithm, specifically a Frequent Itemset Mining (FIsM) algorithm, then we improved its performance using an FPGA. FIsM algorithms are important and are basic data- mining techniques used to discover association rules from transactional databases. We improved on an approximate FIsM algorithm proposed recently so that it would fit onto hardware architecture efficiently. We then ran experiments on an FPGA. As a result, we have been able to achieve a speed 400% faster than the original algorithm implemented on a CPU. Moreover, our FPGA prototype showed a 20 times speed improvement compared to the CPU version.展开更多
Numerous models have been proposed to reduce the classification error of Na¨ ve Bayes by weakening its attribute independence assumption and some have demonstrated remarkable error performance. Considering that e...Numerous models have been proposed to reduce the classification error of Na¨ ve Bayes by weakening its attribute independence assumption and some have demonstrated remarkable error performance. Considering that ensemble learning is an effective method of reducing the classification error of the classifier, this paper proposes a double-layer Bayesian classifier ensembles (DLBCE) algorithm based on frequent itemsets. DLBCE constructs a double-layer Bayesian classifier (DLBC) for each frequent itemset the new instance contained and finally ensembles all the classifiers by assigning different weight to different classifier according to the conditional mutual information. The experimental results show that the proposed algorithm outperforms other outstanding algorithms.展开更多
Current technology for frequent itemset mining mostly applies to the data stored in a single transaction database. This paper presents a novel algorithm MultiClose for frequent itemset mining in data warehouses. Multi...Current technology for frequent itemset mining mostly applies to the data stored in a single transaction database. This paper presents a novel algorithm MultiClose for frequent itemset mining in data warehouses. MultiClose respectively computes the results in single dimension tables and merges the results with a very efficient approach. Close itemsets technique is used to improve the performance of the algorithm. The authors propose an efficient implementation for star schemas in which their al- gorithm outperforms state-of-the-art single-table algorithms.展开更多
Most of the existing text clustering algorithms overlook the fact that one document is a word sequence with semantic information. There is some important semantic information existed in the positions of words in the s...Most of the existing text clustering algorithms overlook the fact that one document is a word sequence with semantic information. There is some important semantic information existed in the positions of words in the sequence. In this paper, a novel method named Frequent Itemset-based Clustering with Window (FICW) was proposed, which makes use of the semantic information for text clustering with a window constraint. The experimental results obtained from tests on three (hypertext) text sets show that FICW outperforms the method compared in both clustering accuracy and efficiency.展开更多
The problem of association rule mining has gained considerable prominence in the data mining community for its use as an important tool of knowledge discovery from large scale databases. And there has been a spurt of ...The problem of association rule mining has gained considerable prominence in the data mining community for its use as an important tool of knowledge discovery from large scale databases. And there has been a spurt of research activities around this problem. However, traditional association rule mining may often derive many rules in which people are uninterested. This paper reports a generalization of association rule mining called φ association rule mining. It allows people to have different interests on different itemsets that arethe need of real application. Also, it can help to derive interesting rules and substantially reduce the amount of rules. An algorithm based on FP tree for mining φ frequent itemset is presented. It is shown by experiments that the proposed methodis efficient and scalable over large databases.展开更多
The classical algorithm of finding association rules generated by a frequent itemset has to generate all non-empty subsets of the frequent itemset as candidate set of consequents. Xiongfei Li aimed at this and propose...The classical algorithm of finding association rules generated by a frequent itemset has to generate all non-empty subsets of the frequent itemset as candidate set of consequents. Xiongfei Li aimed at this and proposed an improved algorithm. The algorithm finds all consequents layer by layer, so it is breadth-first. In this paper, we propose a new algorithm Generate Rules by using Set-Enumeration Tree (GRSET) which uses the structure of Set-Enumeration Tree and depth-first method to find all consequents of the association rules one by one and get all association rules correspond to the consequents. Experiments show GRSET algorithm to be practicable and efficient.展开更多
Local differential privacy(LDP)approaches to collecting sensitive information for frequent itemset mining(FIM)can reliably guarantee privacy.Most current approaches to FIM under LDP add"padding and sampling"...Local differential privacy(LDP)approaches to collecting sensitive information for frequent itemset mining(FIM)can reliably guarantee privacy.Most current approaches to FIM under LDP add"padding and sampling"steps to obtain frequent itemsets and their frequencies because each user transaction represents a set of items.The current state-of-the-art approach,namely set-value itemset mining(SVSM),must balance variance and bias to achieve accurate results.Thus,an unbiased FIM approach with lower variance is highly promising.To narrow this gap,we propose an Item-Level LDP frequency oracle approach,named the Integrated-with-Hadamard-Transform-Based Frequency Oracle(IHFO).For the first time,Hadamard encoding is introduced to a set of values to encode all items into a fixed vector,and perturbation can be subsequently applied to the vector.An FIM approach,called optimized united itemset mining(O-UISM),is pro-posed to combine the padding-and-sampling-based frequency oracle(PSFO)and the IHFO into a framework for acquiring accurate frequent itemsets with their frequencies.Finally,we theoretically and experimentally demonstrate that O-UISM significantly outperforms the extant approaches in finding frequent itemsets and estimating their frequencies under the same privacy guarantee.展开更多
A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processin...A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This issue becomes more serious in finding frequent itemsets or frequency counting over an online transactional data stream since there can be a large number of itemsets to be monitored. We have proposed a method called the estDec method for finding frequent itemsets over an online data stream. In order to reduce the number of monitored itemsets in this method, monitoring the count of an itemset is delayed until its support is large enough to become a frequent itemset in the near future. For this purpose, the count of an itemset should be estimated. Consequently, how to estimate the count of an itemset is a critical issue in minimizing memory usage as well as processing time. In this paper, the effects of various count estimation methods for finding frequent itemsets are analyzed in terms of mining accuracy, memory usage and processing time.展开更多
Large high-dimensional data have posed great challenges to existing algorithms for frequent itemsets mining.To solve the problem,a hybrid method,consisting of a novel row enumeration algorithm and a column enumeration...Large high-dimensional data have posed great challenges to existing algorithms for frequent itemsets mining.To solve the problem,a hybrid method,consisting of a novel row enumeration algorithm and a column enumeration algorithm,is proposed.The intention of the hybrid method is to decompose the mining task into two subtasks and then choose appropriate algorithms to solve them respectively.The novel algorithm,i.e.,Inter-transaction is based on the characteristic that there are few common items between or among long transactions.In addition,an optimization technique is adopted to improve the performance of the intersection of bit-vectors.Experiments on synthetic data show that our method achieves high performance in large high-dimensional data.展开更多
文摘Association rules mining is a major data mining field that leads to discovery of associations and correlations among items in today’s big data environment. The conventional association rule mining focuses mainly on positive itemsets generated from frequently occurring itemsets (PFIS). However, there has been a significant study focused on infrequent itemsets with utilization of negative association rules to mine interesting frequent itemsets (NFIS) from transactions. In this work, we propose an efficient backward calculating negative frequent itemset algorithm namely EBC-NFIS for computing backward supports that can extract both positive and negative frequent itemsets synchronously from dataset. EBC-NFIS algorithm is based on popular e-NFIS algorithm that computes supports of negative itemsets from the supports of positive itemsets. The proposed algorithm makes use of previously computed supports from memory to minimize the computation time. In addition, association rules, i.e. positive and negative association rules (PNARs) are generated from discovered frequent itemsets using EBC-NFIS algorithm. The efficiency of the proposed algorithm is verified by several experiments and comparing results with e-NFIS algorithm. The experimental results confirm that the proposed algorithm successfully discovers NFIS and PNARs and runs significantly faster than conventional e-NFIS algorithm.
基金This paper is supported by the Inner Mongolia Natural Science Foundation(Grant Number:2018MS06026,Sponsored Authors:Liu,H.and Ma,X.,Sponsors’Websites:http://kjt.nmg.gov.cn/)the Science and Technology Program of Inner Mongolia Autonomous Region(Grant Number:2019GG116,Sponsored Authors:Liu,H.and Ma,X.,Sponsors’Websites:http://kjt.nmg.gov.cn/).
文摘Frequent itemset mining is an essential problem in data mining and plays a key role in many data mining applications.However,users’personal privacy will be leaked in the mining process.In recent years,application of local differential privacy protection models to mine frequent itemsets is a relatively reliable and secure protection method.Local differential privacy means that users first perturb the original data and then send these data to the aggregator,preventing the aggregator from revealing the user’s private information.We propose a novel framework that implements frequent itemset mining under local differential privacy and is applicable to user’s multi-attribute.The main technique has bitmap encoding for converting the user’s original data into a binary string.It also includes how to choose the best perturbation algorithm for varying user attributes,and uses the frequent pattern tree(FP-tree)algorithm to mine frequent itemsets.Finally,we incorporate the threshold random response(TRR)algorithm in the framework and compare it with the existing algorithms,and demonstrate that the TRR algorithm has higher accuracy for mining frequent itemsets.
文摘With the advent of the IoT era, the amount of real-time data that is processed in data centers has increased explosively. As a result, stream mining, extracting useful knowledge from a huge amount of data in real time, is attracting more and more attention. It is said, however, that real- time stream processing will become more difficult in the near future, because the performance of processing applications continues to increase at a rate of 10% - 15% each year, while the amount of data to be processed is increasing exponentially. In this study, we focused on identifying a promising stream mining algorithm, specifically a Frequent Itemset Mining (FIsM) algorithm, then we improved its performance using an FPGA. FIsM algorithms are important and are basic data- mining techniques used to discover association rules from transactional databases. We improved on an approximate FIsM algorithm proposed recently so that it would fit onto hardware architecture efficiently. We then ran experiments on an FPGA. As a result, we have been able to achieve a speed 400% faster than the original algorithm implemented on a CPU. Moreover, our FPGA prototype showed a 20 times speed improvement compared to the CPU version.
基金supported by National Natural Science Foundation of China (Nos. 61073133, 60973067, and 61175053)Fundamental Research Funds for the Central Universities of China(No. 2011ZD010)
文摘Numerous models have been proposed to reduce the classification error of Na¨ ve Bayes by weakening its attribute independence assumption and some have demonstrated remarkable error performance. Considering that ensemble learning is an effective method of reducing the classification error of the classifier, this paper proposes a double-layer Bayesian classifier ensembles (DLBCE) algorithm based on frequent itemsets. DLBCE constructs a double-layer Bayesian classifier (DLBC) for each frequent itemset the new instance contained and finally ensembles all the classifiers by assigning different weight to different classifier according to the conditional mutual information. The experimental results show that the proposed algorithm outperforms other outstanding algorithms.
文摘Current technology for frequent itemset mining mostly applies to the data stored in a single transaction database. This paper presents a novel algorithm MultiClose for frequent itemset mining in data warehouses. MultiClose respectively computes the results in single dimension tables and merges the results with a very efficient approach. Close itemsets technique is used to improve the performance of the algorithm. The authors propose an efficient implementation for star schemas in which their al- gorithm outperforms state-of-the-art single-table algorithms.
基金Supported by the Natural Science Foundation ofHubei Province(ABA048)
文摘Most of the existing text clustering algorithms overlook the fact that one document is a word sequence with semantic information. There is some important semantic information existed in the positions of words in the sequence. In this paper, a novel method named Frequent Itemset-based Clustering with Window (FICW) was proposed, which makes use of the semantic information for text clustering with a window constraint. The experimental results obtained from tests on three (hypertext) text sets show that FICW outperforms the method compared in both clustering accuracy and efficiency.
文摘The problem of association rule mining has gained considerable prominence in the data mining community for its use as an important tool of knowledge discovery from large scale databases. And there has been a spurt of research activities around this problem. However, traditional association rule mining may often derive many rules in which people are uninterested. This paper reports a generalization of association rule mining called φ association rule mining. It allows people to have different interests on different itemsets that arethe need of real application. Also, it can help to derive interesting rules and substantially reduce the amount of rules. An algorithm based on FP tree for mining φ frequent itemset is presented. It is shown by experiments that the proposed methodis efficient and scalable over large databases.
基金Supported by the National Natural Science Foundation of China (No.60474022) the Natural Science Foundation of Henan Province(No. G2002026,200510475028)
文摘The classical algorithm of finding association rules generated by a frequent itemset has to generate all non-empty subsets of the frequent itemset as candidate set of consequents. Xiongfei Li aimed at this and proposed an improved algorithm. The algorithm finds all consequents layer by layer, so it is breadth-first. In this paper, we propose a new algorithm Generate Rules by using Set-Enumeration Tree (GRSET) which uses the structure of Set-Enumeration Tree and depth-first method to find all consequents of the association rules one by one and get all association rules correspond to the consequents. Experiments show GRSET algorithm to be practicable and efficient.
基金supported by the National Natural Science Foundation of China under Grant Nos.61772537,61772536,62072460,62076245,and 62172424the National Key Research and Development Program of China under Grant No.2018YFB1004401Beijing Natural Science Foundation under Grant No.4212022.
文摘Local differential privacy(LDP)approaches to collecting sensitive information for frequent itemset mining(FIM)can reliably guarantee privacy.Most current approaches to FIM under LDP add"padding and sampling"steps to obtain frequent itemsets and their frequencies because each user transaction represents a set of items.The current state-of-the-art approach,namely set-value itemset mining(SVSM),must balance variance and bias to achieve accurate results.Thus,an unbiased FIM approach with lower variance is highly promising.To narrow this gap,we propose an Item-Level LDP frequency oracle approach,named the Integrated-with-Hadamard-Transform-Based Frequency Oracle(IHFO).For the first time,Hadamard encoding is introduced to a set of values to encode all items into a fixed vector,and perturbation can be subsequently applied to the vector.An FIM approach,called optimized united itemset mining(O-UISM),is pro-posed to combine the padding-and-sampling-based frequency oracle(PSFO)and the IHFO into a framework for acquiring accurate frequent itemsets with their frequencies.Finally,we theoretically and experimentally demonstrate that O-UISM significantly outperforms the extant approaches in finding frequent itemsets and estimating their frequencies under the same privacy guarantee.
基金This work was supported by the National High Technology Research and Development 863 Program of China under Grant Nos. 2015AA011505, 2015AA015306, and 2012AA010902, the National Natural Science Foundation of China under Grant Nos. 61202055, 61221062, 61521092, 61303053, 61432016, 61402445, and 61672492, and the National Key Research and Development Program of China under Grant No. 2016YFB1000402.
文摘A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This issue becomes more serious in finding frequent itemsets or frequency counting over an online transactional data stream since there can be a large number of itemsets to be monitored. We have proposed a method called the estDec method for finding frequent itemsets over an online data stream. In order to reduce the number of monitored itemsets in this method, monitoring the count of an itemset is delayed until its support is large enough to become a frequent itemset in the near future. For this purpose, the count of an itemset should be estimated. Consequently, how to estimate the count of an itemset is a critical issue in minimizing memory usage as well as processing time. In this paper, the effects of various count estimation methods for finding frequent itemsets are analyzed in terms of mining accuracy, memory usage and processing time.
基金This work is partially supported by the Hong Kong RGC Project under Grant No. N_HKUST637/13, the National Basic Research 973 Program of China under Grant No. 2014CB340303, the National Natural Science Foundation of China under Grant Nos. 61328202 and 61300031, Microsoft Research Asia Gift Grant, Google Faculty Award 2013, and Microsoft Research Asia Fellowship 2012.
基金The work was supported in part by Research Fund for the Doctoral Program of Higher Education of China(No.20060255006)
文摘Large high-dimensional data have posed great challenges to existing algorithms for frequent itemsets mining.To solve the problem,a hybrid method,consisting of a novel row enumeration algorithm and a column enumeration algorithm,is proposed.The intention of the hybrid method is to decompose the mining task into two subtasks and then choose appropriate algorithms to solve them respectively.The novel algorithm,i.e.,Inter-transaction is based on the characteristic that there are few common items between or among long transactions.In addition,an optimization technique is adopted to improve the performance of the intersection of bit-vectors.Experiments on synthetic data show that our method achieves high performance in large high-dimensional data.