Abstract: This paper presents a fuzzy logic approach to efficient unsupervised character classification, intended to improve the robustness, correctness and speed of a character recognition system. The characters are first split into eight typographical categories. The classification scheme then uses pattern matching to classify the characters in each category into a set of fuzzy prototypes based on a nonlinear weighted similarity function. The fuzzy unsupervised character classification, which is natural in the repre...
Abstract: Because CPU caches are small, the scanning speed of DFA-based multi-pattern string-matching algorithms drops rapidly, especially when the number of patterns is very large. To address this problem, we reduce the scanning time of such algorithms by rearranging the state table and shrinking the DFA alphabet. Both methods decrease the probability of large-scale random memory accesses and increase the probability of contiguous memory accesses, which raises the cache hit rate and reduces the search time on the DFA. Shrinking the DFA alphabet also reduces the storage complexity. The AC++ algorithm, obtained by optimizing the Aho-Corasick (AC) algorithm with these methods, confirms the theoretical analysis. Experimental results show that AC++ outperforms AC in both scanning time and storage in most cases, and the advantage is especially pronounced when the number of patterns is very large. Because the DFA underlies many string-matching algorithms, such as DAWG and SBOM, the optimization method discussed is of practical significance.
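The alphabet-shrinking idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the AC++ algorithm itself, whose details the abstract does not give. It assumes that all bytes which never occur in any pattern can share a single symbol class, and it stores the full Aho-Corasick transition table as one flat row-major list so that each state's row is contiguous. The names `build_alphabet_map`, `build_dfa` and `scan` are hypothetical, and Python lists give no real control over cache behavior; the sketch only shows the table layout.

```python
from collections import deque


def build_alphabet_map(patterns):
    """Map each byte to a compact class id (assumption: unused bytes share class 0)."""
    used = sorted({b for p in patterns for b in p})
    byte_to_class = [0] * 256
    for i, b in enumerate(used, start=1):
        byte_to_class[b] = i
    return byte_to_class, len(used) + 1


def build_dfa(patterns):
    """Build a complete Aho-Corasick DFA over the reduced alphabet."""
    byte_to_class, width = build_alphabet_map(patterns)
    goto = [[-1] * width]   # one row per state, -1 marks a missing trie edge
    out = [[]]              # indices of patterns that end at each state
    for idx, p in enumerate(patterns):
        s = 0
        for b in p:
            c = byte_to_class[b]
            if goto[s][c] == -1:
                goto.append([-1] * width)
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(idx)

    # Breadth-first pass: compute failure links and fill in missing edges.
    fail = [0] * len(goto)
    queue = deque()
    for c in range(width):
        t = goto[0][c]
        if t == -1:
            goto[0][c] = 0
        else:
            queue.append(t)
    while queue:
        s = queue.popleft()
        out[s] = out[s] + out[fail[s]]
        for c in range(width):
            t = goto[s][c]
            if t == -1:
                goto[s][c] = goto[fail[s]][c]
            else:
                fail[t] = goto[fail[s]][c]
                queue.append(t)

    # Flatten to a single row-major table: entry = flat[state * width + class].
    flat = [t for row in goto for t in row]
    return flat, width, byte_to_class, out


def scan(text, dfa):
    """Return (end_position, pattern_index) for every match in text (bytes)."""
    flat, width, byte_to_class, out = dfa
    state = 0
    hits = []
    for pos, b in enumerate(text):
        state = flat[state * width + byte_to_class[b]]
        for idx in out[state]:
            hits.append((pos, idx))
    return hits


patterns = [b"he", b"she", b"his", b"hers"]
dfa = build_dfa(patterns)
hits = scan(b"ushers", dfa)
```

For these four patterns only five distinct bytes occur, so each state row holds 6 entries instead of 256, which is the kind of table shrinkage the abstract credits with better cache behavior.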
Abstract: In this work, a system for recognizing newspaper text printed in Gurumukhi script is presented. Four feature extraction techniques, namely zoning features, diagonal features, parabola curve fitting based features, and power curve fitting based features, are considered for extracting the statistical properties of the printed characters. Different combinations of these features are also applied to improve recognition accuracy. For recognition, four classification techniques are used: k-NN, linear SVM, decision tree, and random forest. The experimental database is collected from three major Gurumukhi script newspapers: Ajit, Jagbani and Punjabi Tribune. Using 5-fold cross-validation and a random forest classifier, a recognition accuracy of 96.19% is reported with a combination of zoning, diagonal and parabola curve fitting based features. With a 70%/30% training/testing partition of the data set, a recognition accuracy of 95.21% is achieved.
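A zoning feature is commonly computed as the foreground pixel density in each cell of a grid laid over the character image. The sketch below assumes a binarized image (1 for ink, 0 for background) and a square grid. The abstract states neither the zone count nor the exact definition used, so both are assumptions here, and `zoning_features` is a hypothetical name.

```python
def zoning_features(image, zones=4):
    """Return zones*zones ink densities for a binary image given as a list of rows.

    Assumption: each feature is the fraction of foreground pixels in one grid cell.
    """
    height, width = len(image), len(image[0])
    features = []
    for zr in range(zones):
        r0, r1 = zr * height // zones, (zr + 1) * height // zones
        for zc in range(zones):
            c0, c1 = zc * width // zones, (zc + 1) * width // zones
            area = (r1 - r0) * (c1 - c0)
            ink = sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
            features.append(ink / area if area else 0.0)
    return features


# Tiny 4x4 example split into a 2x2 grid.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
densities = zoning_features(glyph, zones=2)
```

The resulting vector can be concatenated with other feature vectors before classification, which is how feature combinations like those in the abstract are typically formed.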
Abstract: Most clustering algorithms describe the similarity of objects with a predefined distance function. Three distance functions that are widely used in two traditional clustering algorithms, k-means and hierarchical clustering, were investigated, with both theoretical analysis and detailed experimental results. It is shown that the choice of distance function greatly affects clustering results, and that comparing the results obtained with different distance functions can be used to detect the outliers of a cluster and to reveal the shape of clusters. In practice, it is suggested to run the clustering with each distance function separately, compare the results, and pick out the "swing points", since such points may reveal more information to data analysts.
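The comparison suggested in this abstract can be illustrated with a short sketch. The abstract does not name its three distance functions, so Euclidean, Manhattan and Chebyshev distances are assumed here. "Swing points" are read as points whose nearest center changes with the distance function, which is one interpretation rather than the paper's stated definition. All function names are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


# Assumed set of distance functions; the paper's actual choices are not stated.
METRICS = {"euclidean": euclidean, "manhattan": manhattan, "chebyshev": chebyshev}


def assign(point, centers, dist):
    """Index of the nearest center under the given distance function."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def swing_points(points, centers):
    """Points whose nearest center differs between distance functions."""
    swings = []
    for p in points:
        labels = {name: assign(p, centers, d) for name, d in METRICS.items()}
        if len(set(labels.values())) > 1:
            swings.append((p, labels))
    return swings


centers = [(3, 3), (0, 4)]
points = [(0, 0), (3, 2), (0, 5)]
result = swing_points(points, centers)
```

Here only the point `(0, 0)` changes cluster: it is closer to `(0, 4)` under Euclidean and Manhattan distance but closer to `(3, 3)` under Chebyshev distance.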
Funding: Supported by the National Natural Science Foundation of China (No. 61379014)
Abstract: In this paper, a discriminative structured dictionary learning algorithm is presented. To enhance the dictionary's discriminative power, the reconstruction error, classification error and inhomogeneous representation error are integrated into the objective function. The proposed approach jointly learns a single structured dictionary and a linear classifier. The learned dictionary encourages samples from the same class to have similar sparse codes and samples from different classes to have dissimilar sparse codes. The objective function is solved with a feature-sign search algorithm and the Lagrange dual method. Experimental results on three public databases demonstrate that the proposed approach outperforms several recently proposed dictionary learning techniques for classification.
Funding: Funded under the National Science and Technology Support Program of the 12th "Five-Year Plan", China (2012BAK19B02)
Abstract: The source parameters of the 2008 Yingjiang earthquake sequences are obtained by applying spectral analysis and Brune's source model to digital waveform data recorded by the Yunnan Digital Seismic Network. Correlation coefficients are calculated from the low-frequency spectral amplitudes of pairs of events recorded by the same station, and events with similar focal mechanisms are then grouped by cluster analysis. Comparison with the obtained focal mechanisms shows that the azimuths of the P axes correlate well within each cluster, and that the larger the correlation coefficient, the closer the azimuths of the P axes. The Yingjiang area is divided into 3 regions to analyze the stress level and stress direction by combining the source parameters with the mean focal mechanism of each group. The results show the following. The change and transformation of focal mechanism types at different stages can represent the temporal characteristics of the regional stress field. If the focal mechanism types become concentrated within a time period and shift toward the direction of the regional stress field, this may be a sign of a strong earthquake. There is some relationship between stress drop and focal mechanism type: earthquakes whose focal mechanisms indicate a stress field close to the regional tectonic stress field tend to have a higher stress drop, while those whose focal mechanisms indicate a stress field differing greatly from the regional tectonic stress field generally have a lower stress drop.
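For reference, the relations usually associated with Brune's source model are given below. The abstract does not state which exact expressions were used, so this is the standard textbook form rather than the authors' formulation. Here $\Omega_0$ is the low-frequency spectral level, $f_c$ the corner frequency, $\beta$ the shear-wave velocity, $r$ the source radius, $M_0$ the seismic moment and $\Delta\sigma$ the stress drop.

```latex
\Omega(f) = \frac{\Omega_0}{1 + (f/f_c)^2}, \qquad
r = \frac{2.34\,\beta}{2\pi f_c}, \qquad
\Delta\sigma = \frac{7\,M_0}{16\,r^3}
```

Because $\Delta\sigma$ scales with $r^{-3}$ and $r$ scales with $f_c^{-1}$, a modest error in the fitted corner frequency produces a roughly cubed error in the estimated stress drop.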
Abstract: Most earlier work on clustering focused on numeric data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. However, data mining applications frequently involve datasets that consist of mixed numeric and categorical attributes. In this paper we present a clustering algorithm based on the k-means algorithm. The algorithm clusters objects with numeric and categorical attributes in a way similar to k-means, with an object similarity measure derived from both numeric and categorical attributes. When applied to purely numeric data, the algorithm is identical to k-means. The main result of this paper is a method for updating the 'cluster centers' of objects described by mixed numeric and categorical attributes during the clustering process so as to minimize the clustering cost function. The clustering performance of the algorithm is demonstrated on two well-known data sets, the credit approval and abalone databases.
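One common way to realize a mixed similarity measure and center update of this kind is sketched below. The abstract does not give its formulas, so the sketch assumes squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches, with centers updated by the numeric mean and the categorical mode. The weight `gamma` and the names `mixed_distance` and `update_center` are hypothetical.

```python
from collections import Counter


def mixed_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma * categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, c_num))
    categorical = sum(a != b for a, b in zip(x_cat, c_cat))
    return numeric + gamma * categorical


def update_center(members):
    """New center for a cluster: mean of numeric columns, mode of categorical columns.

    members is a non-empty list of (numeric_tuple, categorical_tuple) pairs.
    """
    n = len(members)
    numeric_rows = [m[0] for m in members]
    categorical_rows = [m[1] for m in members]
    c_num = tuple(sum(col) / n for col in zip(*numeric_rows))
    c_cat = tuple(Counter(col).most_common(1)[0][0] for col in zip(*categorical_rows))
    return c_num, c_cat


cluster = [
    ((1.0, 2.0), ("a", "x")),
    ((3.0, 4.0), ("a", "y")),
    ((5.0, 6.0), ("b", "y")),
]
center = update_center(cluster)
d = mixed_distance((1.0, 2.0), ("a", "x"), center[0], center[1], gamma=0.5)
```

With no categorical attributes the mismatch term vanishes and the measure reduces to the squared Euclidean distance used by k-means, which matches the abstract's remark that the algorithm is identical to k-means on numeric data.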