Abstract: The problem of taking a set of data and separating it into subgroups, where the elements of each subgroup are more similar to each other than they are to elements outside the subgroup, has been extensively studied through the statistical method of cluster analysis. In this paper we discuss the application of this method to the field of education. In particular, we present the use of cluster analysis to separate students into groups that can be recognized and characterized by common traits in their answers to a questionnaire, without any prior knowledge of what form those groups would take (unsupervised classification). We start from a detailed study of the data processing that cluster analysis requires. Two methods commonly used in cluster analysis are then described, first from a purely theoretical point of view and then, in Section 4, through an example application to data from an open-ended questionnaire administered to a sample of university students. In particular, we describe and critically examine the variables and parameters used to present the results of the cluster analysis methods.
Funding: Supported by the National Natural Science Foundation of China (42174142); National Science and Technology Major Project (2017ZX05039-002); Operation Fund of China National Petroleum Corporation Logging Key Laboratory (2021DQ20210107-11); Fundamental Research Funds for Central Universities (19CX02006A); Major Science and Technology Project of China National Petroleum Corporation (ZD2019-183-006).
Abstract: To make the quantitative results of nuclear magnetic resonance (NMR) transverse relaxation (T2) spectra reflect the type and pore structure of a reservoir more directly, an unsupervised clustering method based on the Gaussian mixture model (GMM) was developed to obtain quantitative pore structure information from NMR T2 spectra. First, principal component analysis was applied to the T2 spectra to reduce the dimensionality of the data and the dependence among the original variables. Second, the dimension-reduced data were fitted with the GMM probability density function, and the model parameters and the optimal number of clusters were obtained using the expectation-maximization algorithm and the change in the Akaike information criterion. Finally, the T2 spectrum features and pore structure types of the different clusters were analyzed and compared with the T2 geometric mean and the T2 arithmetic mean. The effectiveness of the algorithm was verified with numerical simulation and field NMR logging data. The research shows that the GMM-based clustering results correlate well with the shape and distribution of the T2 spectrum, pore structure, and petroleum productivity, providing a new means for quantitative identification of pore structure, reservoir grading, and oil and gas productivity evaluation.
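The fitting step in the abstract above (EM for a Gaussian mixture, with the number of clusters chosen by the Akaike information criterion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on synthetic 1-D values rather than PCA-reduced T2 spectra, it omits the PCA step, and it simply picks the smallest AIC rather than examining the change in AIC. The names `fit_gmm_1d`, `aic`, and `data` are hypothetical.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def fit_gmm_1d(data, k, iterations=200, var_floor=1e-4):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, log_likelihood)."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic initialisation: means at evenly spaced quantiles, shared variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    pooled_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [pooled_var] * k
    weights = [1.0 / k] * k

    def component_logs(x):
        return [math.log(weights[j]) + log_normal_pdf(x, means[j], variances[j])
                for j in range(k)]

    for _ in range(iterations):
        # E-step: responsibilities, computed in log space.
        resp = []
        for x in xs:
            logs = component_logs(x)
            total = logsumexp(logs)
            resp.append([math.exp(v - total) for v in logs])
        # M-step: re-estimate weights, means, and variances (with small guards).
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor,
            )
    log_likelihood = sum(logsumexp(component_logs(x)) for x in xs)
    return weights, means, variances, log_likelihood

def aic(log_likelihood, k):
    """AIC = 2p - 2 ln L, with p = 3k - 1 free parameters for a 1-D mixture."""
    return 2 * (3 * k - 1) - 2 * log_likelihood

# Synthetic, illustrative data: two well-separated groups of values.
data = [1.0 + 0.05 * i for i in range(-5, 6)] + [5.0 + 0.05 * i for i in range(-5, 6)]

aic_by_k = {}
for k in (1, 2, 3):
    *_, log_l = fit_gmm_1d(data, k)
    aic_by_k[k] = aic(log_l, k)
best_k = min(aic_by_k, key=aic_by_k.get)
print(aic_by_k, best_k)
```

The variance floor is a design choice of this sketch: it keeps a component from collapsing onto a single point, which would otherwise drive the likelihood to infinity.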
Funding: National Natural Science Foundation of China (No. 41872207).
Abstract: Airborne Light Detection And Ranging (LiDAR) can provide high-quality three-dimensional information for the safety inspection of electricity corridors. However, the robust extraction of transmission lines from airborne point cloud data remains very challenging. This paper therefore proposes a robust transmission line extraction method based on model fitting from airborne point cloud data. First, a candidate power line generation method based on height information is used to reduce the computational complexity of the subsequent steps and the false positives in the extracted results. Then, on the basis of block-and-slice-constrained Euclidean clustering, a linear structure recognition method based on RANdom SAmple Consensus (RANSAC) is proposed to produce the initial individual transmission line components. Finally, a robust nonlinear least-squares fitting method is developed for each individual transmission line to generate the parameters of its mathematical model and further optimize the extraction. Experiments were performed on LiDAR point cloud data captured from helicopter and Unmanned Aerial Vehicle (UAV) platforms. Results indicate that the proposed method can efficiently extract different types of transmission lines along electricity corridors, with an average precision of approximately 98.1%, an average recall of approximately 95.9%, and an average quality of approximately 94.2%.
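As a rough illustration of the RANSAC idea mentioned in the abstract above, the sketch below finds the dominant straight line in a small 2-D point set (for example, a horizontal projection of candidate points). It is a simplified sketch under stated assumptions, not the paper's method: the block-and-slice constraints, the 3-D geometry, and the subsequent nonlinear least-squares fit are all omitted, and the function names, tolerance, and sample points are hypothetical.

```python
import math
import random

def line_through(p, q):
    """Return (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1, or None if p == q."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    if norm == 0:
        return None  # degenerate sample: identical points
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, tol=0.1, iterations=200, seed=0):
    """Repeatedly fit a line to two random points and keep the model with most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        p, q = rng.sample(points, 2)
        model = line_through(p, q)
        if model is None:
            continue
        a, b, c = model
        # With a normalised (a, b), |a*x + b*y + c| is the point-to-line distance.
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Illustrative data: ten points on y = 2x + 1 plus four scattered outliers.
line_points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
outliers = [(1.0, 9.0), (3.0, -4.0), (6.0, 2.0), (8.0, 30.0)]
points = line_points + outliers

model, inliers = ransac_line(points)
print(len(inliers), "inliers of", len(points))
```

Because each candidate model is scored only by its inlier count, a handful of gross outliers does not pull the fitted line away from the dominant structure, which is the property that makes RANSAC attractive for noisy point clouds.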
Abstract: The fight against fraud and trafficking is a fundamental mission of customs. The conditions for carrying out this mission depend both on the evolution of economic issues and on the behaviour of the actors in charge of its implementation. In the customs clearance process, customs authorities are now confronted with an increasing volume of goods as international trade develops. Automated risk management is therefore required to limit intrusive controls. In this article, we propose an unsupervised classification method to extract knowledge rules from a database of customs offences in order to identify abnormal behaviour resulting from customs control. The idea is to apply the Apriori principle, based on frequent patterns, to a database of customs offences in customs procedures, in order to uncover potential association rules between a customs operation and an offence and thereby extract knowledge governing the occurrence of fraud. This mass of often heterogeneous and complex data generates new needs that knowledge extraction methods must be able to meet. The assessment of infringements inevitably requires a proper identification of the risks. The approach is an original one based on data mining, building association rules in two steps: first, search for frequent patterns (support >= minimum support); then, from the frequent patterns, produce association rules (confidence >= minimum confidence). The simulations carried out highlighted three main types of association rule: forecasting rules, targeting rules, and neutral rules, with the Lift measure introduced as a third indicator of rule relevance. Confidence for the first two types of rule was set to at least 50%.
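The two-step procedure described in the abstract above (frequent patterns filtered by support, then rules filtered by confidence, with lift as a third indicator) can be sketched as follows. This is a minimal sketch with hypothetical toy records, thresholds, and item names, and it does not reproduce the paper's data or its rule categories. It uses the standard definitions: support(X) is the fraction of records containing X, confidence(X -> Y) = support(X and Y) / support(X), and lift(X -> Y) = confidence(X -> Y) / support(Y).

```python
from itertools import combinations

# Hypothetical toy records: each set is one declaration (item names are illustrative).
transactions = [
    {"regime_A", "undervaluation"},
    {"regime_A", "undervaluation"},
    {"regime_A", "misclassification"},
    {"regime_B", "misclassification"},
    {"regime_B", "undervaluation"},
    {"regime_A", "undervaluation"},
]

def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= r for r in records) / len(records)

def frequent_itemsets(records, min_support):
    """Level-wise Apriori search: larger itemsets are grown only from frequent ones."""
    items = sorted({i for r in records for i in r})
    level = [frozenset([i]) for i in items if support([i], records) >= min_support]
    frequent = {}
    while level:
        for s in level:
            frequent[s] = support(s, records)
        # Candidates: unions of two frequent k-itemsets that form a (k+1)-itemset.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        # Apriori pruning: every k-subset of a candidate must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, len(c) - 1))
            and support(c, records) >= min_support
        ]
    return frequent

def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence, lift) for rules meeting min_confidence."""
    out = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = supp / frequent[lhs]
                lift = conf / frequent[rhs]
                if conf >= min_confidence:
                    out.append((set(lhs), set(rhs), supp, conf, lift))
    return out

freq = frequent_itemsets(transactions, min_support=0.3)
for lhs, rhs, supp, conf, lift in rules(freq, min_confidence=0.5):
    print(lhs, "->", rhs, f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 indicates that the left-hand side makes the right-hand side more likely than its baseline frequency, a lift near 1 indicates independence, and a lift below 1 indicates a negative association.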
Funding: Supported by the Scientific Research Special Project of TCM Profession (200907001E) and the Science and Technology Special Major Project for "Significant New Drugs Formulation" (2009ZX09301-005-02).
Abstract: Objective: To analyze the composition patterns of Chinese patent medicines for influenza and to develop new anti-influenza prescriptions using unsupervised data mining methods. Methods: Chinese patent medicine recipes for influenza were collected and recorded in a database. The correlation coefficients between herbs, the core combinations of herbs, and new prescriptions were then analyzed using modified mutual information, complex system entropy clustering, and unsupervised hierarchical clustering, respectively. Results: Based on an analysis of 126 Chinese patent medicine recipes, the frequency of occurrence of each herb, 54 frequently used herb pairs, and 34 core combinations were determined, and 4 new recipes for influenza were developed. Conclusion: Unsupervised data mining methods can quickly mine composition patterns and support the development of new prescriptions.
Funding: Supported by the Beijing Natural Science Foundation of China (No. 4132063).
Abstract: Supervised learning methods (e.g., PLS-DA, SVM) have been widely used with laser-induced breakdown spectroscopy (LIBS) to classify materials; however, they may yield a low correct classification rate if the type of a test sample is not included in the training dataset. In this paper, unsupervised cluster analysis methods (hierarchical clustering analysis, K-means clustering analysis, and the iterative self-organizing data analysis technique, ISODATA) are investigated for plastics classification based on the line intensities of LIBS emission. The results of hierarchical clustering analysis using four different similarity measures (single linkage, complete linkage, unweighted pair-group average, and weighted pair-group average) are compared. In K-means clustering analysis, four methods of choosing the initial centers are applied and their results are compared. The classification results of hierarchical clustering analysis, K-means clustering analysis, and ISODATA are analyzed. The experimental results demonstrate that cluster analysis methods can be applied to plastics discrimination with LIBS.
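To make the linkage comparison in the abstract above concrete, the sketch below implements naive agglomerative clustering with three of the four named criteria: single linkage, complete linkage, and unweighted pair-group average. It is an illustrative sketch only. The feature vectors are invented stand-ins for line intensities, the function names are hypothetical, and weighted pair-group average is left out because it requires a recursive distance update rather than a direct computation from the points.

```python
import math

def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of point indices) under a linkage rule."""
    d = [math.dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # closest pair of members
        return min(d)
    if linkage == "complete":    # farthest pair of members
        return max(d)
    if linkage == "average":     # unweighted pair-group average (UPGMA)
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain; returns index lists."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                dist = cluster_distance(clusters[x], clusters[y], points, linkage)
                if best is None or dist < best[0]:
                    best = (dist, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Invented two-feature samples (e.g. normalised intensities of two emission lines).
samples = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]

for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(samples, 2, linkage))
```

On well-separated data such as this, all three criteria produce the same partition; the criteria differ mainly on elongated or overlapping groups, where single linkage tends to chain clusters together and complete linkage favours compact ones.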
Funding: This work was partially supported by the National Natural Science Foundation of China (No. 30470984).
Abstract: Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs that commonly use machine learning methods have emerged, the accuracy of their predictions remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because no training set of abundant validated genes is available. Here we present a new gene-finding program, SCGPred, which improves prediction accuracy by combining multiple sources of evidence. SCGPred can run in supervised mode on previously well-studied genomes and in unsupervised mode on novel genomes. In tests with datasets composed of large DNA sequences from human and from the novel genome of Ustilago maydis, SCGPred achieves a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, an approach that is superior to other unsupervised methods. SCGPred can therefore serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.