Funding: This research was financially supported in part by the Ministry of Trade, Industry and Energy (MOTIE) and the Korea Institute for Advancement of Technology (KIAT) through the International Cooperative R&D program (Project No. P0016038), and in part by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2021-2016-0-00312) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).
Abstract: Despite advances in technological complexity and efforts, software repository maintenance requires reusing the data to reduce the effort and complexity. However, increasing ambiguity, irrelevance, and bugs while extracting similar data during software development generate a large amount of data from the data that reside in repositories. Thus, there is a need for a repository mining technique for relevant and bug-free data prediction. This paper proposes a fault prediction approach that uses a data-mining technique to find good predictors for high-quality software. To predict errors in mining data, the Apriori algorithm was used to discover association rules, with confidence fixed at more than 40% and support at no less than 30%. A pruning strategy was adopted based on evaluation measures. Next, the rules were extracted from three projects in different domains; the extracted rules were then combined to obtain the most popular rules based on the evaluation measure values. To evaluate the proposed approach, we conducted an experimental study comparing the proposed rules with existing ones on four different industrial projects. The evaluation showed that the results of our proposal are promising. Practitioners and developers can use these rules for defect prediction during early software development.
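To make the thresholds in this abstract concrete, the following is a minimal sketch of textbook Apriori rule mining with support of at least 30% and confidence above 40%. It is not the authors' implementation: the function names `apriori` and `rules`, the attribute names, and the toy data are all hypothetical, and the paper's pruning strategy and evaluation measures are not reproduced.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    # Level-1 candidates: every single item.
    current = {frozenset([i]) for t in sets for i in t}
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate with one pass over the data.
        level = {}
        for c in current:
            support = sum(1 for t in sets if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-candidates, then
        # prune any candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples
    whose confidence is strictly greater than min_confidence."""
    out = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                # Every subset of a frequent itemset is frequent, so the lookup is safe.
                confidence = support / frequent[ante]
                if confidence > min_confidence:
                    out.append((ante, itemset - ante, support, confidence))
    return out


# Hypothetical toy data: attributes observed for five software modules.
modules = [
    {"high_complexity", "many_changes", "defect"},
    {"high_complexity", "defect"},
    {"many_changes"},
    {"high_complexity", "many_changes", "defect"},
    {"low_complexity"},
]
freq = apriori(modules, min_support=0.30)
found = rules(freq, min_confidence=0.40)
for ante, cons, support, confidence in found:
    print(sorted(ante), "->", sorted(cons), round(support, 2), round(confidence, 2))
```

In this toy data, `low_complexity` appears in only one of five modules, so it falls below the 30% support threshold and never enters a rule.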
Abstract: Maximum frequent pattern generation from a large database of transactions and items for association rule mining is an important research topic in data mining. Association rule mining aims to discover interesting correlations, frequent patterns, associations, or causal structures between items hidden in a large database. By exploiting quantum computing, we propose an efficient quantum search algorithm design to discover the maximum frequent patterns. We modified Grover’s search algorithm so that a subspace of arbitrary symmetric states is used instead of the whole search space. We present a novel quantum oracle design that employs a quantum counter to count the maximum frequent items and a quantum comparator to check the count against a minimum support threshold. The proposed algorithm increases the rate of correct solutions because the search is restricted to a subspace. Furthermore, our algorithm significantly scales and optimizes the number of qubits required in the design, which is directly reflected in improved performance. Our proposed design can accommodate more transactions and items and still perform well with a small number of qubits.
Abstract: Previous weighted frequent pattern (WFP) mining algorithms are not suitable for data streams because they need multiple database scans. In this paper, we present an efficient algorithm, SWFP-Miner, to mine weighted frequent patterns over data streams. SWFP-Miner is based on a sliding window and can discover important frequent patterns from recent data. A new refined weight definition is proposed to keep the downward closure property, and two pruning strategies are presented to prune weighted infrequent patterns. Experimental studies are performed to evaluate the effectiveness and efficiency of SWFP-Miner.
Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2009CB326203, the National Natural Science Foundation of China under Grant Nos. 60828005 and 60674109, and the Chinese Academy of Sciences under International Partnership Grant No. 2F05N01.
Abstract: Due to the increasing availability and sophistication of data recording techniques, multiple information sources and distributed computing are becoming important trends in modern information systems. Many applications, such as security informatics and social computing, require a ubiquitous data analysis platform so that decisions can be made rapidly in distributed and dynamic system environments. Although data mining is now widely used to achieve such goals, building a data mining system is a nontrivial task that may require a complete understanding of numerous data mining techniques as well as solid programming skills. Employing agent techniques for data analysis thus becomes increasingly important for implementing an effective ubiquitous mining platform, especially for users not familiar with engineering and computational sciences. Such data mining agents should, in practice, be intelligent, complete, and compact. In this paper, we present an interactive data mining agent, OIDM (online interactive data mining), which provides three categories of data mining tools (classification, association analysis, and clustering) and interacts with the user to facilitate the mining process. The interactive mining is accomplished by interviewing the user about the data mining task to gain efficient and intelligent data mining control. OIDM can help users find appropriate mining algorithms, refine and compare the mining process, and finally achieve the best mining results. Such interactive data mining agent techniques provide alternative solutions for rapidly deploying data mining techniques to broader areas of data intelligence and knowledge informatics.
Funding: Project supported by the Fundamental Research Funds for the Central Universities, China (No. 2412015KJ005), the Twelfth Five-Year Plan of the Education Department of Jilin Province, China (No. 557), and the Thirteenth Five-Year Plan for Scientific Research of the Education Department of Jilin Province, China (No. JJKH20191197KJ).
Abstract: Frequent itemset mining serves as the main method of association rule mining. With the limitations in computing space and performance, finding associations among frequent items in large datasets requires extensive time and effort, particularly as the datasets grow. In association mining in a big data environment, the MapReduce programming model is typically used to perform task partitioning and parallel processing, which can improve the execution efficiency of the algorithm. However, to ensure that the association rules are not destroyed during task partitioning and parallel processing, the inner-relationship data must be stored in the computer space. Because inner-relationship data are redundant, storing them significantly increases space usage in comparison with the original dataset. In this study, we find that the formation of the frequent pattern (FP) mining algorithm depends mainly on the conditional pattern bases. In the parallel frequent pattern (PFP) algorithm theory, the grouping model divides frequent items into several groups according to their frequencies. We propose a non-group PFP (NG-PFP) mining algorithm that cancels the grouping model and reduces the data redundancy between sub-tasks. Moreover, we present the NG-PFP algorithm for task partitioning and parallel processing, and its performance in the Hadoop cluster environment is analyzed and discussed. Experimental results indicate that the non-group model shows obvious improvement in terms of computational efficiency and the space utilization rate.
Funding: The Fund of the National Management Bureau of Traditional Chinese Medicine (No. 2000JP54)
Abstract: Mining frequent patterns in transaction databases, time series databases, and many other kinds of databases has been widely studied in data mining research. Most previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is very costly. Han J. proposed a novel algorithm, FP-growth, that can generate frequent patterns without candidate sets. Based on an analysis of FP-growth, this paper proposes the concept of an equivalent FP-tree and an improved algorithm, denoted FP-growth*, which is much faster and easy to implement. FP-growth* adopts a modified structure of the FP-tree and header table; in each recursive operation it generates only a header table and projects the tree onto the original FP-tree. The two algorithms produce the same frequent pattern set on the same transaction database, but the performance study shows that the improved algorithm, FP-growth*, is at least twice as fast as FP-growth.
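For readers unfamiliar with the structures named in this abstract, the sketch below builds a standard FP-tree with its header table and extracts a conditional pattern base. It illustrates the baseline FP-growth data structure only; the equivalent FP-tree and the FP-growth* modifications are not described in enough detail here to reproduce. The names `Node`, `build_fp_tree`, and `conditional_pattern_base`, and the sample transactions, are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class Node:
    """One FP-tree node: an item, a count, a parent link, and child links."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep only the frequent ones.
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)  # item -> every tree node carrying that item
    # Pass 2: insert each transaction with its items in descending frequency
    # order (ties broken by name), so shared prefixes share tree nodes.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)
            child.count += 1
            node = child
    return root, header


def conditional_pattern_base(header, item):
    """Return the prefix paths that lead to `item`, each with that node's count."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base


# Hypothetical sample data.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a"]]
root, header = build_fp_tree(transactions, min_count=2)
print(conditional_pattern_base(header, "c"))
```

No candidate itemsets are generated at any point: the database is scanned twice, and all later mining works on the tree and the header table.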
Funding: Supported by the National Natural Science Foundation of China (No. 60474022) and the Henan Innovation Project for University Prominent Research Talents (No. 2007KYCX018)
Abstract: Because mining the complete set of frequent patterns from a dense database can be impractical, an interesting alternative has been proposed recently. Instead of mining the complete set of frequent patterns, the new model finds only the maximal frequent patterns, from which all frequent patterns can be generated. The FP-growth algorithm is one of the most efficient frequent-pattern mining methods published so far. However, because the FP-tree and conditional FP-trees must be two-way traversable, a great deal of memory is needed during mining. This paper proposes an efficient algorithm, Unid_FP-Max, for mining maximal frequent patterns based on a unidirectional FP-tree. Because of the way the unidirectional FP-tree and conditional unidirectional FP-trees are generated, the algorithm reduces space consumption to the fullest extent. With two techniques, single path pruning and header table pruning, which cut down many of the conditional unidirectional FP-trees generated recursively during mining, Unid_FP-Max further lowers the cost in time and space.
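The relationship between frequent and maximal frequent patterns can be illustrated with a short sketch. It uses brute-force enumeration on a tiny hypothetical dataset, not Unid_FP-Max or any FP-tree; the function names `frequent_itemsets` and `maximal` are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Brute-force enumeration of frequent itemsets; only suitable for tiny examples."""
    sets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*sets))
    found = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            candidate = frozenset(combo)
            n = sum(1 for t in sets if candidate <= t)
            if n >= min_count:
                found[candidate] = n
    return found


def maximal(found):
    """Keep only the itemsets that have no frequent proper superset."""
    return {s for s in found if not any(s < other for other in found)}


# Hypothetical sample data.
transactions = [["a", "b", "c"], ["a", "b", "c"], ["a", "d"], ["a", "d"], ["b"]]
found = frequent_itemsets(transactions, min_count=2)
mx = maximal(found)
print(len(found), "frequent itemsets,", len(mx), "maximal")

# Every frequent itemset is a subset of at least one maximal itemset.
covered = all(any(s <= m for m in mx) for s in found)
print(covered)
```

The maximal set determines which patterns are frequent, since every frequent pattern is a subset of some maximal one, but it does not by itself preserve the exact support count of each subset.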
Abstract: In this paper, we propose an efficient algorithm, called FFP-Growth (short for fast FP-Growth), to mine frequent itemsets. Like FP-Growth, FFP-Growth searches the FP-tree in bottom-up order, but it need not construct conditional pattern bases and sub-FP-trees, thus saving a substantial amount of time and space. The FP-tree it creates is also much smaller than that created by TD-FP-Growth, which improves efficiency. At the same time, FFP-Growth can easily be extended to reduce the search space, as in TD-FP-Growth (M) and TD-FP-Growth (C). Experimental results show that the proposed algorithm is effective and efficient.
Abstract: This paper presents a new efficient algorithm for mining frequent closed itemsets. It enumerates the closed set of frequent itemsets by using a novel compound frequent itemset tree that facilitates fast growth and efficient pruning of the search space. It also employs a hybrid approach that adapts search strategies, representations of projected transaction subsets, and projecting methods to the characteristics of the dataset. Efficient local pruning, global subsumption checking, and fast hashing methods are detailed in this paper. The principle that balances the overheads of search space growth and pruning is also discussed. Extensive experimental evaluations on real-world and artificial datasets showed that our algorithm outperforms CHARM by a factor of five and is one to three orders of magnitude more efficient than CLOSET and MAFIA.
Abstract: It is nontrivial to maintain such discovered frequent query patterns in a real XML-DBMS because the transaction database of queries may allow frequent updates, and such updates may not only invalidate some existing frequent query patterns but also generate some new ones. In this paper, two incremental updating algorithms, FUX-QMiner and FUXQMiner, are proposed for efficient maintenance of discovered frequent query patterns and generation of new frequent query patterns when new XML queries are added to the database. Experimental results from our implementation show that the proposed algorithms have good performance.
Key words: XML; frequent query pattern; incremental algorithm; data mining
CLC number: TP 311
Foundation item: Supported by the Youthful Foundation for Scientific Research of University of Shanghai for Science and Technology
Biography: PENG Dun-lu (1974-), male, Associate professor, Ph.D., research direction: data mining, Web service and its application, peer-to-peer computing.
Funding: Supported in part by the National Natural Science Foundation of China (No. 60073012) and the Natural Science Foundation of Jiangsu (BK2001004)
Abstract: In this letter, a support function for updating the Frequent Pattern (FP) tree is introduced, and on that basis an Incremental FP (IFP) algorithm for mining association rules is proposed. The IFP algorithm considers not only adding new data to the database but also removing old data from it. Furthermore, it reduces five cases to three. The proposed algorithm avoids generating large numbers of candidate items and is highly efficient.
Abstract: This paper proposes an intelligent management system (IMS) to help managers in their delicate and tedious task of exploiting the plethora of data (indicators) contained in management dashboards. The system is based on intelligent agents, ontologies, and data mining. It is implemented using PASSI (Process for Agent Societies Specification and Implementation) methods for agent design and implementation, the Methodology for Knowledge Modeling, and Holt-Winters for data prediction. Intelligent agents not only track indicators but also store the knowledge of managers within the company. Ontologies are used to manage the representation and presentation aspects of knowledge. Data mining makes it possible to make the most of all available data; to model the industrial process of data selection, exploration, and modeling; and to transform behaviors into predictive indicators. An instance of the IMS named SYGISS, currently in operation within a large brewery organization, yields the following results: the extraction of indicators takes less than 5 minutes, whereas manual extraction used to take 14 days; the generation of dashboards is instantaneous, whereas it used to take 12 hours; the interpretation of indicators is instantaneous, whereas it used to take a day; and forecasts, which did not exist under the old management, are now produced in less than 5 minutes. These contributions help to optimize the management of this organization.