Aiming at the importance of the analysis for public opinion on Internet, the authors propose a high-performance extraction method for public opinion. In this method, the space model for classification is adopted to de...Aiming at the importance of the analysis for public opinion on Internet, the authors propose a high-performance extraction method for public opinion. In this method, the space model for classification is adopted to describe the relationship between words and categories. The combined feature selection method is used to remove noisy words from the original feature space effectively. Then the category weight of words is calculated by the improved formula combining the frequency of words and distribution of words. Finally, the class weights of the not-categorized documents based on the category weight of words are obtained for realizing opinion extraction. Experiment results show that the method has comparatively high classification and good stability.展开更多
In the K-means clustering algorithm, each data point is uniquely placed into one category. The clustering quality is heavily dependent on the initial cluster centroid. Different initializations can yield varied result...In the K-means clustering algorithm, each data point is uniquely placed into one category. The clustering quality is heavily dependent on the initial cluster centroid. Different initializations can yield varied results; local adjustment cannot save the clustering result from poor local optima. If there is an anomaly in a cluster, it will seriously affect the cluster mean value. The K-means clustering algorithm is only suitable for clusters with convex shapes. We therefore propose a novel clustering algorithm CARDBK—"centroid all rank distance(CARD)" which means that all centroids are sorted by distance value from one point and "BK" are the initials of "batch K-means"—in which one point not only modifies a cluster centroid nearest to this point but also modifies multiple clusters centroids adjacent to this point, and the degree of influence of a point on a cluster centroid depends on the distance value between this point and the other nearer cluster centroids. Experimental results showed that our CARDBK algorithm outperformed other algorithms when tested on a number of different data sets based on the following performance indexes: entropy, purity, F1 value, Rand index and normalized mutual information(NMI). Our algorithm manifested to be more stable, linearly scalable and faster.展开更多
Hard competition learning has the feature that each point modifies only one cluster centroid that wins. Correspondingly, soft competition learning has the feature that each point modifies not only the cluster centroid...Hard competition learning has the feature that each point modifies only one cluster centroid that wins. Correspondingly, soft competition learning has the feature that each point modifies not only the cluster centroid that wins, but also many other cluster centroids near this point. A soft competition learning method is proposed. Centroid all rank distance (CARD), CARDx, and centroid all rank distance batch K-means (CARDBK) are three clustering algorithms that adopt the proposed soft competition learning method. Among them the extent to which one point affects a cluster centroid depends on the distances from this point to the other nearer cluster centroids, rather than just the rank number of the distance from this point to this cluster centroid among the distances from this point to all cluster centroids. In addition, the validation experiments are carried out in order to compare the three soft competition learning algorithms CARD, CARDx, and CARDBK with several hard competition learning algorithms as well as neural gas (NG) algorithm on five data sets from different sources. Judging from the values of five performance indexes in the clustering results, this kind of soft competition learning method has better clustering effect and efficiency, and has linear scalability.展开更多
基金Supported by the National High Technology Research and Development Program of China (2005AA147030)
文摘Aiming at the importance of the analysis for public opinion on Internet, the authors propose a high-performance extraction method for public opinion. In this method, the space model for classification is adopted to describe the relationship between words and categories. The combined feature selection method is used to remove noisy words from the original feature space effectively. Then the category weight of words is calculated by the improved formula combining the frequency of words and distribution of words. Finally, the class weights of the not-categorized documents based on the category weight of words are obtained for realizing opinion extraction. Experiment results show that the method has comparatively high classification and good stability.
基金Supported by the Social Science Foundation of Shaanxi Province of China(2018P03)the Humanities and Social Sciences Research Youth Fund Project of Ministry of Education of China(13YJCZH251)
文摘In the K-means clustering algorithm, each data point is uniquely placed into one category. The clustering quality is heavily dependent on the initial cluster centroid. Different initializations can yield varied results; local adjustment cannot save the clustering result from poor local optima. If there is an anomaly in a cluster, it will seriously affect the cluster mean value. The K-means clustering algorithm is only suitable for clusters with convex shapes. We therefore propose a novel clustering algorithm CARDBK—"centroid all rank distance(CARD)" which means that all centroids are sorted by distance value from one point and "BK" are the initials of "batch K-means"—in which one point not only modifies a cluster centroid nearest to this point but also modifies multiple clusters centroids adjacent to this point, and the degree of influence of a point on a cluster centroid depends on the distance value between this point and the other nearer cluster centroids. Experimental results showed that our CARDBK algorithm outperformed other algorithms when tested on a number of different data sets based on the following performance indexes: entropy, purity, F1 value, Rand index and normalized mutual information(NMI). Our algorithm manifested to be more stable, linearly scalable and faster.
基金supported by the Project of Natural Science Foundation Research Project of Shaanxi Province of China (2015JM6318)the Humanities and Social Sciences Research Youth Fund Project of Ministry of Education of China (13YJCZH251)
文摘Hard competition learning has the feature that each point modifies only one cluster centroid that wins. Correspondingly, soft competition learning has the feature that each point modifies not only the cluster centroid that wins, but also many other cluster centroids near this point. A soft competition learning method is proposed. Centroid all rank distance (CARD), CARDx, and centroid all rank distance batch K-means (CARDBK) are three clustering algorithms that adopt the proposed soft competition learning method. Among them the extent to which one point affects a cluster centroid depends on the distances from this point to the other nearer cluster centroids, rather than just the rank number of the distance from this point to this cluster centroid among the distances from this point to all cluster centroids. In addition, the validation experiments are carried out in order to compare the three soft competition learning algorithms CARD, CARDx, and CARDBK with several hard competition learning algorithms as well as neural gas (NG) algorithm on five data sets from different sources. Judging from the values of five performance indexes in the clustering results, this kind of soft competition learning method has better clustering effect and efficiency, and has linear scalability.