Abstract: Because CPU caches are small, the scanning speed of DFA-based multi-pattern string-matching algorithms drops rapidly, especially when the number of patterns is very large. To address this problem, we reduce the scanning time of DFA-based algorithms by rearranging the state table and shrinking the DFA alphabet. Both methods decrease the probability of large-scale random memory accesses and increase the probability of contiguous memory accesses, which raises the cache hit rate and reduces the search time on the DFA. Shrinking the DFA alphabet also reduces the storage complexity. The AC++ algorithm, obtained by optimizing the Aho-Corasick (AC) algorithm with these methods, confirms the theoretical analysis. Experimental results show that AC++ outperforms AC in both scanning time and storage in most cases, and the advantage is most pronounced when the number of patterns is very large. Because the DFA is a widely used building block in many string-matching algorithms, such as DAWG and SBOM, the optimization methods discussed are significant in practice.
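The alphabet-shrinking idea can be illustrated with a minimal Python sketch. This is not the authors' AC++ implementation: the function names (`build_compact_dfa`, `search`), the byte-class mapping, and the flat-list table layout are assumptions chosen for illustration, and the state-table rearrangement mentioned in the abstract is not shown. Every byte that never occurs in a pattern is folded into a single class, so each state's row in the transition table shrinks from 256 entries to (number of distinct pattern bytes + 1).

```python
from collections import deque

def build_compact_dfa(patterns):
    """Aho-Corasick DFA over a shrunken alphabet (illustrative sketch).

    Assumes `patterns` is a list of non-empty bytes objects.
    """
    # Bytes that occur in some pattern get classes 1..k; all others share class 0.
    used = sorted({b for p in patterns for b in p})
    byte_class = [0] * 256
    for i, b in enumerate(used, start=1):
        byte_class[b] = i
    width = len(used) + 1          # row length per state, instead of 256

    # Trie stored as one flat list: table[state * width + cls]; -1 means "no edge".
    table = [-1] * width
    outputs = [[]]
    for p in patterns:
        s = 0
        for b in p:
            c = byte_class[b]
            if table[s * width + c] == -1:
                table[s * width + c] = len(outputs)
                table.extend([-1] * width)
                outputs.append([])
            s = table[s * width + c]
        outputs[s].append(p)

    # Breadth-first pass: compute failure links and fill every missing edge,
    # turning the trie into a complete DFA.
    fail = [0] * len(outputs)
    queue = deque()
    for c in range(width):
        t = table[c]
        if t == -1:
            table[c] = 0           # missing edges at the root loop back to the root
        else:
            queue.append(t)
    while queue:
        s = queue.popleft()
        outputs[s] = outputs[s] + outputs[fail[s]]
        for c in range(width):
            t = table[s * width + c]
            if t == -1:
                table[s * width + c] = table[fail[s] * width + c]
            else:
                fail[t] = table[fail[s] * width + c]
                queue.append(t)
    return byte_class, width, table, outputs

def search(text, dfa):
    """Scan `text` (bytes) and return (start_index, pattern) for every match."""
    byte_class, width, table, outputs = dfa
    s = 0
    hits = []
    for i, b in enumerate(text):
        s = table[s * width + byte_class[b]]   # one table lookup per input byte
        for p in outputs[s]:
            hits.append((i - len(p) + 1, p))
    return hits

dfa = build_compact_dfa([b"he", b"she", b"his", b"hers"])
print(search(b"ushers", dfa))
```

For these four patterns only five distinct bytes occur, so each row holds 6 entries rather than 256. Python itself will not show the cache effect; the sketch only illustrates the table layout that a lower-level implementation could exploit.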
Funding: supported by the National Natural Science Foundation of China under Grants No. 61100205 and No. 60873001, the Hi-Tech Research and Development Program of China under Grant No. 2011AA010705, and the Fundamental Research Funds for the Central Universities under Grant No. 2009RC0212.
Abstract: Webpage classification differs from traditional text classification in its irregular words and phrases and its massive, unlabeled features, which make it harder to obtain effective features. To cope with this problem, we propose two scenarios for extracting meaningful strings, based on document clustering and on term clustering with multiple strategies, to optimize a Vector Space Model (VSM) and thereby improve webpage classification. The results show that document clustering works better than term clustering in coping with document content. However, a better overall performance is obtained by spectral clustering with document clustering. Moreover, because images appear in the same webpage as the document content, the proposed method is also applied to extract meaningful terms for images, and experimental results show that this is also effective in improving webpage classification.
Funding: the National Natural Science Foundation of China (No. 60675024).
Abstract: A multi-character recognition method based on the hidden Markov model (HMM) is presented. Compared with single-character recognition in video, the method can reduce the computational load of correlation and improve recognition accuracy. The characteristics used for recognition include shape, color, texture, and so on. Humans generally use these same characteristics to recognize objects in practice. A recognition experiment on 17 fishes was carried out in the paper. The experimental results demonstrate the high accuracy of the multi-character recognition algorithm. Together with the tracking process, it can handle dynamic objects, so multi-character recognition is closer to human recognition and has great application value.