Funding: Supported by the National Natural Science Foundation of China (61273209).
Abstract: Because of limited knowledge and hesitation, the membership degree of an element in a given set often takes several different values, a situation that conventional fuzzy sets cannot handle. Hesitant fuzzy sets are a powerful tool for this case. The present paper investigates a clustering technique for hesitant fuzzy sets based on the K-means clustering algorithm, which takes the results of hierarchical clustering as the initial clusters. Finally, two examples demonstrate the validity of our algorithm.
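The idea of seeding K-means with the output of hierarchical clustering can be sketched as follows. This is a minimal illustration on ordinary real-valued vectors with Euclidean distance, not the hesitant fuzzy distance used in the paper; the function names `hierarchical_init` and `kmeans`, the centroid linkage, and the toy data are all assumptions made for the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def hierarchical_init(points, k):
    """Agglomerative clustering (centroid linkage) stopped at k clusters.

    Returns the k cluster centroids, to be used as K-means seeds.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [centroid(c) for c in clusters]

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm started from the given centers."""
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old center if a cluster loses all its members.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, hierarchical_init(points, 2))
```

Because the seeds come from a deterministic merge procedure rather than random sampling, repeated runs on the same data give the same partition, which is one common motivation for this kind of initialization.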
Funding: National Natural Science Foundation of China (No. 60975083); Key Grant Project, Ministry of Education, China (No. 104145).
Abstract: A new algorithm named kernel bisecting k-means and sample removal (KBK-SR) is proposed as a sampling preprocessing step for support vector machine (SVM) training to improve efficiency. The proposed algorithm tends to quickly produce balanced clusters of similar sizes in the kernel feature space, which makes it efficient and effective for reducing training samples. Theoretical analysis and experimental results on three real UCI benchmark data sets both show that, with very short sampling time, the proposed algorithm dramatically accelerates SVM sampling and training while maintaining high test accuracy.
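For readers unfamiliar with the bisecting strategy, the sketch below shows plain bisecting k-means in the input space: the largest cluster is repeatedly split in two until the requested number of clusters is reached. It is a simplification and not KBK-SR itself: there is no kernel feature space and no sample removal step, and the seeding rule, function names, and toy data are assumptions chosen to keep the example deterministic.

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def two_means(points, max_iter=50):
    """Split one cluster in two with Lloyd's algorithm (k = 2)."""
    # Deterministic seeding (an assumption): the point farthest from the
    # mean, then the point farthest from that first seed.
    m = mean(points)
    a = max(points, key=lambda p: dist2(p, m))
    b = max(points, key=lambda p: dist2(p, a))
    for _ in range(max_iter):
        left = [p for p in points if dist2(p, a) <= dist2(p, b)]
        right = [p for p in points if dist2(p, a) > dist2(p, b)]
        if not right:  # all points coincide with one center; cannot split
            break
        new_a, new_b = mean(left), mean(right)
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return left, right

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break  # nothing left to split
        clusters.remove(largest)
        left, right = two_means(largest)
        if not right:
            clusters.append(left)
            break
        clusters.extend([left, right])
    return clusters

# Toy data (hypothetical): two square-shaped groups of four points each.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
group_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
clusters = bisecting_kmeans(group_a + group_b, 2)
```

Always splitting the largest cluster is what pushes the procedure toward clusters of similar size; a kernelized variant would replace `dist2` with distances computed from a kernel matrix.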
Funding: Provided by the Marine S&T Fund of Shandong Province for Pilot National Laboratory for Marine Science and Technology (Qingdao) (No. 2018SDKJ0501-2).
Abstract: Clustering is a group of unsupervised statistical techniques commonly used in many disciplines. When these techniques are applied to fish abundance data, many technical details must be considered to ensure reasonable interpretation. However, the reliability and stability of clustering methods have rarely been studied in the context of fisheries. This study presents an intensive evaluation of three common clustering methods, hierarchical clustering (HC), K-means (KM), and expectation-maximization (EM), based on fish community surveys in the coastal waters of Shandong, China. We evaluated the performance of these three methods across different numbers of clusters, data sizes, and data transformation approaches, focusing on consistency validation using the average proportion of non-overlap (APN) index. The results indicate that the three methods tend to be inconsistent in the optimal number of clusters. EM performed relatively better at avoiding unbalanced classification, whereas HC and KM provided more stable clustering results. Data transformations, including scaling, square-root, and log transformation, had a substantial influence on the clustering results, especially for KM. Moreover, transformation also influenced clustering stability, with scaling tending to provide a stable solution at the same number of clusters. The APN values indicated improved stability with increasing data size; the effect generally leveled off above 70 samples, and most quickly for EM. We conclude that the best clustering method can be chosen depending on the aim of the study and the number of clusters. In general, KM was relatively robust in our tests. We also provide recommendations for future applications of clustering analyses. This study helps ensure the credibility of the application and interpretation of clustering methods.
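The three data transformations named in the abstract can be sketched as below, to show why they change distance-based clustering: each one rescales how much the dominant, high-abundance columns contribute to the distances. The exact forms are assumptions, since the abstract does not specify them: the log transform is taken as log(x + 1) so that zero catches stay finite, and scaling is taken as column-wise standardization with the sample standard deviation. The abundance matrix is invented for illustration.

```python
import math

def sqrt_transform(matrix):
    """Square-root transform of every abundance value."""
    return [[math.sqrt(v) for v in row] for row in matrix]

def log_transform(matrix):
    """log(x + 1) transform; the '+ 1' offset is an assumption."""
    return [[math.log1p(v) for v in row] for row in matrix]

def scale_columns(matrix):
    """Centre each column to mean 0 and divide by its sample standard deviation."""
    cols = list(zip(*matrix))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    # Constant columns (sd == 0) are mapped to 0.0 to avoid division by zero.
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
            for row in matrix]

# Hypothetical abundance matrix: rows are samples, columns are species.
abundance = [[0, 4, 100], [1, 9, 400], [2, 16, 900]]

rooted = sqrt_transform(abundance)
logged = log_transform(abundance)
scaled = scale_columns(abundance)
```

On the raw matrix the third species dominates any Euclidean distance between samples; after scaling, every species contributes on a comparable footing, which is consistent with the abstract's observation that transformation strongly affects K-means results.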
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 30525030, 60701015, and 60736029.
Abstract: In recent years, microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of the expression of thousands of genes. Gene clustering analysis is useful for discovering groups of correlated genes that are potentially co-regulated or associated with the disease or conditions under investigation. Many clustering methods, including k-means, fuzzy c-means, and hierarchical clustering, have been widely used in the literature. Yet no comprehensive comparative study has been performed to evaluate the effectiveness of these methods, especially in the yeast Saccharomyces cerevisiae. In this paper, these three gene clustering methods are compared. Classification accuracy and CPU time cost are used to measure the performance of the algorithms. Our results show that hierarchical clustering outperforms k-means and fuzzy c-means clustering. The analysis provides insight into the complicated problem of clustering gene expression profiles and serves as a practical guideline for routine microarray cluster analysis of gene expression.
Abstract: Machine learning is widely applied in science and technology, especially in the medical field. In this article, we focus on applying machine learning to mall customers, based on their income and how much they are likely to spend in a mall. The data include features such as customer ID, gender, age, income, and spending score, where the spending score reflects purchases of goods in the mall. We apply clustering to the public mall customers dataset and create clusters related to customer purchases. We implement machine learning models to predict whether a visiting customer will purchase a product or not. This kind of work requires many inputs, such as the features mentioned in the paper, and handling these features requires a model with machine learning capability. We perform K-means clustering and hierarchical clustering and, finally, use a confusion matrix to identify which of the two algorithms achieves the higher accuracy. We use machine learning to predict the category of a customer, that is, whether they will buy a product or not, based on the independent variables. This work presents a simple machine learning prediction model with which the category of a customer can be predicted based on clustering. Before clustering, we do not know which group a customer belongs to; after clustering, we can identify the category that a data node belongs to. In this article, we describe the process of determining employee-based information using machine learning clustering mechanisms.
Abstract: Clustering techniques classify raw data in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to handle high-volume datasets. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm combined with a threshold-based clustering technique. This algorithm addresses the shortcomings of the k-means clustering algorithm while overcoming the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris dataset (from the University of California, Irvine, Machine Learning Repository), along with the k-means and threshold-based clustering algorithms. The proposed method results in more segregated datasets with compact clusters, thus achieving higher validity indices. The method also eliminates the limitations of the threshold-based clustering algorithm, and its validity measures and respective indices are compared with those of the k-means and threshold-based clustering algorithms.
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2018YFC0704903).
Abstract: For a city, analyzing its advantages, disadvantages, and level of economic development within a country is important, especially for the rapidly developing cities of China. The existing literature on Chinese cities has not considered indicators of economy and industry in detail. In this paper, based on multiple indicators of economy and industry, the urban hierarchical structure of 285 cities above the prefecture level in China is investigated. Indicators covering the economy, industry, infrastructure, medical care, population, education, culture, and employment are selected to establish a new indicator system for analyzing urban hierarchical structure. Factor analysis is used to investigate the relationships between the selected indicator variables and to obtain the score of each common factor, as well as comprehensive scores and rankings for the 285 cities. According to the comprehensive scores, the 285 cities are clustered into 15 levels using the K-means clustering algorithm. The hierarchical structure of cities above the prefecture level in China is then obtained, and corresponding policy implications are proposed. The results and implications can not only be applied to urban planning and development in China but also offer a reference for other developing countries. The methodologies used in this paper can also be applied to study the urban hierarchical structure in other countries.
Abstract: Big data analytics and data mining are techniques used to analyze data and extract hidden information. Traditional approaches to analysis and extraction do not work well for big data because the data are complex and of very high volume. A major data mining technique known as data clustering groups the data into clusters and makes it easy to extract information from these clusters. However, existing clustering algorithms, such as k-means and hierarchical clustering, are not efficient, as the quality of the clusters they produce is compromised. Therefore, there is a need to design an efficient and highly scalable clustering algorithm. In this paper, we put forward a new clustering algorithm called hybrid clustering to overcome the disadvantages of existing clustering algorithms. We compare the new hybrid algorithm with existing algorithms on the basis of precision, recall, F-measure, execution time, and accuracy of results. The experimental results show that the proposed hybrid clustering algorithm is more accurate and has better precision, recall, and F-measure values.
Abstract: Systems that use solar energy as an energy source are not free of fluctuations, because brief cloud passages affect solar radiation. The amplitude, persistence, and frequency of these fluctuations should be analyzed with appropriate tools, instead of focusing on their location in time. The analysis of these fluctuations should use the instantaneous clearness index, whose distribution is, to a first approximation, independent not only of the season but also of the site. It is important to evaluate the solar energy potential of a region, since such an evaluation helps decision-makers in their deliberations on agricultural or photovoltaic solar projects. This study was therefore conducted for a predictive purpose. The method used in our work combines a classification method, hierarchical ascending classification, with two partitioning methods, principal component analysis and the K-means method. The partitioning method made it possible to obtain a number of situations, known in advance, that are representative of the day. The study was based on data from a weather station in the district of Yamoussoukro, located in the central region of Côte d’Ivoire, for the year 2017. Using the clearness index, the study allowed the classification of solar radiation in the region. It showed that only 346 of the 365 days in 2017 were classified (95%). We identified three clusters of days: cloudy sky (29%), partly cloudy sky (32%), and clear sky (39%). The statistical tests used to characterize these clusters will be detailed in a future study.
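As background for the clearness index mentioned above, the sketch below uses its usual textbook definition: the ratio of the measured global horizontal irradiance to the extraterrestrial irradiance on a horizontal plane at the same instant. The abstract does not state which formula or constants the authors used, so the solar constant of 1367 W/m², the simple eccentricity correction, and the function names are assumptions; the cosine of the solar zenith angle is taken as an input rather than computed.

```python
import math

SOLAR_CONSTANT = 1367.0  # W/m^2; a commonly used value (assumption)

def extraterrestrial_horizontal(day_of_year, cos_zenith):
    """Extraterrestrial irradiance on a horizontal plane, in W/m^2.

    Uses the common first-order correction for the Earth-Sun distance.
    A negative cos_zenith (sun below the horizon) gives 0.
    """
    eccentricity = 1.0 + 0.033 * math.cos(2.0 * math.pi * day_of_year / 365.0)
    return SOLAR_CONSTANT * eccentricity * max(cos_zenith, 0.0)

def clearness_index(ghi, day_of_year, cos_zenith):
    """Instantaneous clearness index: measured GHI over extraterrestrial irradiance.

    Returns None when the sun is at or below the horizon.
    """
    g0 = extraterrestrial_horizontal(day_of_year, cos_zenith)
    if g0 <= 0.0:
        return None
    return ghi / g0

# Hypothetical reading: 600 W/m^2 measured on day 100 with cos(zenith) = 0.8.
kt = clearness_index(600.0, 100, 0.8)
```

A series of such values over a day is the kind of input that the hierarchical and K-means steps described in the abstract could then group into cloudy, partly cloudy, and clear days.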
Abstract: The term “customer churn” is used in the information and communication technology (ICT) industry to indicate customers who are about to leave for a competitor or end their subscription. Predicting this behavior is very important in real-life markets and competition, and managing it is essential. In this paper, three hybrid models are investigated to develop an accurate and efficient churn prediction model. The three models are based on two phases: the clustering phase and the prediction phase. In the first phase, customer data are filtered. The second phase predicts customer behavior. The first model uses the k-means algorithm for data filtering and Multilayer Perceptron Artificial Neural Networks (MLP-ANN) for prediction. The second model uses hierarchical clustering with MLP-ANN. The third uses self-organizing maps (SOM) with MLP-ANN. The three models are developed on real data, and then the accuracy and churn rate values are calculated and compared. The comparison with other models shows that the three hybrid models outperform common single models.