Funding: Supported by the National Natural Science Foundation of China (No. 30772695, No. 81001500), the 11th Five-Year National Science Support Project of China (No. 2006BA108B01-05), and the National Science and Technology Major Projects (No. 2009ZX10005-019).
Abstract: To select the interestingness measure best suited to evaluating the correlation between Chinese medicine (CM) syndrome elements and symptoms, 60 objective interestingness measures were collected from different disciplines. First, a hypothesis describing a good measure was proposed, and an experiment was designed on that basis to evaluate the measures. The experiment used a clinical record database of past dynasties containing 51 186 clinical cases, from which a data set of 44 600 records was selected. Cold and heat were chosen as the experimental CM syndrome elements. Three indicators, calculated from the distances between the two CM syndrome elements, were obtained in the experiment and combined into a single indicator. After the experiment, the Z score, φ-coefficient, and Kappa were selected from the 60 measures. The Z score and φ-coefficient were then retained according to subjective interestingness. Finally, the φ-coefficient was selected as the best measure for its low [...] The method introduced in this paper may be applicable to other similar fields.
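As a rough illustration of two of the measures named above, the following Python sketch computes the φ-coefficient and Cohen's Kappa from a 2×2 contingency table of a syndrome element against a symptom. The function names, the argument layout (`n11`, `n10`, `n01`, `n00`), and the example counts are assumptions made for illustration; they are not taken from the paper, and the paper's exact formulation of each measure (and of the Z score, which is omitted here) may differ.

```python
import math

def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 table.

    n11: records with both the syndrome element and the symptom
    n10: element present, symptom absent
    n01: element absent, symptom present
    n00: both absent
    (Hypothetical argument layout, chosen for this sketch.)
    """
    r1, r0 = n11 + n10, n01 + n00   # row totals (element present / absent)
    c1, c0 = n11 + n01, n10 + n00   # column totals (symptom present / absent)
    denom = math.sqrt(r1 * r0 * c1 * c0)
    if denom == 0:
        return 0.0                  # degenerate table: report no association
    return (n11 * n00 - n10 * n01) / denom

def cohen_kappa(n11, n10, n01, n00):
    """Cohen's kappa, treating element and symptom as two binary 'raters'."""
    n = n11 + n10 + n01 + n00
    p_observed = (n11 + n00) / n
    p_expected = ((n11 + n10) * (n11 + n01) + (n01 + n00) * (n10 + n00)) / n**2
    if p_expected == 1:
        return 0.0                  # no room for agreement beyond chance
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up counts, for illustration only.
print(round(phi_coefficient(30, 10, 20, 40), 4))
print(round(cohen_kappa(30, 10, 20, 40), 4))
```

Both measures are symmetric in the element and the symptom and are bounded above by 1, which makes values comparable across element/symptom pairs with different frequencies.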
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 60774049 and 40672195), the Natural Science Foundation of Beijing (Grant No. 4062020), the National 973 Fundamental Research Project of China (Grant No. 2002CB312200), and the Youth Foundation of Beijing Normal University.
Abstract: The information content of rules is categorized into inner mutual information content and outer impartation information content. The conventional objective interestingness measures based on information theory all capture inner mutual information: they represent the confidence of rules and the mutual information between the antecedent and the consequent. Almost all of these measures overlook the outer impartation information, which is conveyed to the user and helps the user make decisions. We put forward the view that the outer impartation information content of rules and rule sets can be represented by relations from the input universe to the output universe. With binary relations, the interaction of rules in a rule set can be represented simply by the operators union and intersection. Based on the entropy of relations, the outer impartation information content of rules and rule sets is measured. The conditional information content of rules and rule sets, the independence of rules and rule sets, and the inconsistent knowledge of rule sets are then defined and measured. The properties of these new measures are discussed and some interesting results are proven, for example that the information content of a rule set may be larger than the sum of the information content of the rules it contains, and that the conditional information content of rules may be negative. Finally, applications of the new measures are discussed: a new method for appraising rule mining algorithms and two rule pruning algorithms, λ-choice and RPClC, are put forward. These methods and algorithms are advantageous in meeting the need for more efficient decision information.
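For orientation, the conventional "inner" quantities that the abstract contrasts with its new measures can be sketched as follows. This Python example computes the confidence of a rule A → B and the mutual information between the antecedent and consequent indicators from a 2×2 count table. It does not implement the paper's relation-entropy measure of outer impartation information, whose definition is not given in the abstract; the function names and counts are assumptions for illustration.

```python
import math

def confidence(n_ab, n_a_not_b):
    """Confidence of A -> B: fraction of records matching A that also match B."""
    n_a = n_ab + n_a_not_b
    return n_ab / n_a if n_a else 0.0

def mutual_information(n_ab, n_a_not_b, n_not_a_b, n_neither):
    """Mutual information (in bits) between the indicators of A and B."""
    n = n_ab + n_a_not_b + n_not_a_b + n_neither
    joint = {
        (1, 1): n_ab / n, (1, 0): n_a_not_b / n,
        (0, 1): n_not_a_b / n, (0, 0): n_neither / n,
    }
    p_a = {1: joint[(1, 1)] + joint[(1, 0)], 0: joint[(0, 1)] + joint[(0, 0)]}
    p_b = {1: joint[(1, 1)] + joint[(0, 1)], 0: joint[(1, 0)] + joint[(0, 0)]}
    mi = 0.0
    for (a, b), p in joint.items():
        if p > 0:  # a zero-probability cell contributes nothing
            mi += p * math.log2(p / (p_a[a] * p_b[b]))
    return mi

# Made-up counts, for illustration only.
print(confidence(30, 10))
print(round(mutual_information(30, 10, 20, 40), 4))
```

Both quantities depend only on how the antecedent and consequent co-occur in the data, which is the sense in which the abstract calls them inner measures.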
Funding: Supported by the Australian Research Council Linkage Project under Grant No. LP0775041 and the Early Career Researcher Grant under Grant No. 2007002448 from the University of Technology, Sydney, Australia.
Abstract: From a data mining perspective, sequence classification means building a classifier from frequent sequential patterns. However, mining a complete set of sequential patterns on a large dataset can be extremely time-consuming, and the large number of patterns discovered makes pattern selection and classifier building very time-consuming as well. In sequence classification, discovering discriminative patterns matters much more than obtaining a complete pattern set. In this paper, we propose a novel hierarchical algorithm that builds sequential classifiers from discriminative sequential patterns. First, we mine the sequential patterns that are most strongly correlated with each target class; in this step, an aggressive strategy is employed to select a small set of sequential patterns. Second, pattern pruning and a serial coverage test are applied to the mined patterns, and the patterns that pass the serial test are used to build the sub-classifier at the first level of the final classifier. Third, the training samples that cannot be covered are fed back to the sequential pattern mining stage with updated parameters. This process continues until predefined interestingness measure thresholds are reached or all samples are covered. The patterns generated in each loop form the sub-classifier at the corresponding level of the final classifier. Within this framework, the search space can be reduced dramatically while good classification performance is achieved. The proposed algorithm is tested in a real-world business application for debt prevention in the social security area. The results show that the algorithm is effective and efficient at predicting debt occurrences from customer activity sequence data.
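The mine, cover, and feed-back loop described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' algorithm: candidate patterns are brute-force subsequences of length at most 2, the discriminative score is the support difference between the target class and the other classes, the serial coverage test is reduced to "a sample is covered if it contains any selected pattern", and the parameter update is a simple threshold decay. All names and parameters (`build_hierarchy`, `top_k`, `decay`, `floor`) are hypothetical.

```python
from itertools import combinations

def contains(seq, pattern):
    """True if pattern occurs in seq as a (possibly non-contiguous) subsequence."""
    it = iter(seq)
    return all(item in it for item in pattern)

def candidate_patterns(seqs, max_len=2):
    """Enumerate subsequences up to max_len (toy stand-in for a real miner)."""
    cands = set()
    for seq in seqs:
        for k in range(1, max_len + 1):
            cands.update(combinations(seq, k))
    return cands

def score(pattern, seqs, labels, target):
    """Support in the target class minus support in all other classes."""
    pos = [s for s, y in zip(seqs, labels) if y == target]
    neg = [s for s, y in zip(seqs, labels) if y != target]
    sup_pos = sum(contains(s, pattern) for s in pos) / len(pos) if pos else 0.0
    sup_neg = sum(contains(s, pattern) for s in neg) / len(neg) if neg else 0.0
    return sup_pos - sup_neg

def build_hierarchy(seqs, labels, top_k=1, threshold=0.8, decay=0.5, floor=0.1):
    """Return (levels, uncovered); each level is a list of (pattern, class)."""
    levels, remaining = [], list(zip(seqs, labels))
    while remaining and threshold >= floor:
        r_seqs = [s for s, _ in remaining]
        r_labels = [y for _, y in remaining]
        cands = candidate_patterns(r_seqs)
        level = []
        for target in sorted(set(r_labels)):
            ranked = sorted(((score(p, r_seqs, r_labels, target), p) for p in cands),
                            reverse=True)
            # Aggressive selection: keep only the top_k patterns above threshold.
            level += [(p, target) for sc, p in ranked[:top_k] if sc >= threshold]
        if level:
            levels.append(level)
            # Uncovered samples are fed back to the next mining round.
            remaining = [(s, y) for s, y in remaining
                         if not any(contains(s, p) for p, _ in level)]
        threshold *= decay  # relax the threshold for the next level
    return levels, remaining

def predict(levels, seq, default=None):
    """Classify with the first matching pattern, scanning levels in order."""
    for level in levels:
        for pattern, target in level:
            if contains(seq, pattern):
                return target
    return default

# Made-up activity sequences, for illustration only.
seqs = [("a", "b"), ("a", "c"), ("d", "b"), ("d", "c")]
labels = ["debt", "debt", "ok", "ok"]
levels, uncovered = build_hierarchy(seqs, labels)
print(levels, uncovered)
```

Because each round mines only the samples left uncovered by earlier levels, later rounds work on progressively smaller data, which is the intuition behind the reduced search space claimed in the abstract.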