Funding: Project (2012AA011205) supported by the National High-Tech Research and Development Program (863 Program) of China; Projects (61272150, 61379109, M1321007, 61301136, 61103034) supported by the National Natural Science Foundation of China; Project (20120162110077) supported by the Research Fund for the Doctoral Program of Higher Education of China; Project (11JJ1012) supported by the Excellent Youth Foundation of Hunan Scientific Committee, China.
Abstract: Many text classification methods depend on statistical term measures to implement document representation. Such representations ignore the lexical semantic content of terms and the distilled mutual information, which leads to text classification errors. This work proposed a document representation method, the WordNet-based lexical semantic VSM, to solve the problem. Using WordNet, the method constructed a data structure of semantic-element information to characterize lexical semantic content, and adjusted EM modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, a lexical-semantic eigenvector for document representation was built by calculating the weight of each synset, and was applied to the widely recognized NWKNN algorithm. On the Reuters-21578 text corpus and on its adjusted version with lexical replacement, the experimental results show that the lexical-semantic eigenvector performs better in both F1 measure and dimensional scale than the term-statistic eigenvector based on TF-IDF. Forming document representation eigenvectors in this way gives the method broad prospects for classification applications in text corpus analysis.
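The idea of weighting synsets rather than surface terms can be sketched as follows. This is a minimal illustration, not the method from the paper: the lexicon `TOY_SYNSETS` is a hypothetical stand-in for WordNet, each term is assumed to have exactly one sense (so the EM-based disambiguation step is omitted), and the weight is an ordinary tf times smoothed idf chosen here for convenience. It shows why a document and its lexically replaced version can share one representation in synset space.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet: term -> synset id.
# Assumption: one sense per term, so no disambiguation is needed here.
TOY_SYNSETS = {
    "car": "syn_automobile",
    "automobile": "syn_automobile",
    "buy": "syn_purchase",
    "purchase": "syn_purchase",
    "price": "syn_price",
    "oil": "syn_oil",
}


def synset_counts(tokens):
    """Count synset occurrences, skipping terms missing from the toy lexicon."""
    return Counter(TOY_SYNSETS[t] for t in tokens if t in TOY_SYNSETS)


def synset_vectors(docs):
    """Return one {synset: weight} dict per document (tf * smoothed idf)."""
    counts = [synset_counts(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each synset.
    doc_freq = Counter(s for c in counts for s in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({
            s: (k / total) * (math.log((1 + n_docs) / (1 + doc_freq[s])) + 1.0)
            for s, k in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(s, 0.0) for s, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    ["buy", "car", "price"],
    ["purchase", "automobile", "price"],  # lexical replacement of docs[0]
    ["oil", "price"],
]
vecs = synset_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # synonym-replaced documents coincide
print(cosine(vecs[0], vecs[2]))  # only "price" is shared
```

In this toy corpus the six surface terms collapse into four synsets, so the synset space is also smaller than the term space. A real implementation would need sense disambiguation before counting, since most WordNet terms belong to several synsets.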
Funding: Supported by the National Natural Science Foundation of China (No. 60402036) and the Natural Science Foundation of Beijing (No. 4042008).
Abstract: The paper describes a fast, texture-based text location scheme that operates directly in the Discrete Wavelet Transform (DWT) domain. Using the distinguishing texture characteristics encoded in the wavelet transform domain, text is detected quickly in complex background images stored in a compressed format such as JPEG2000, without full decompression. Compared with some traditional character location methods, the proposed scheme has the advantages of low computational cost, robustness to character size and font, and high accuracy. Preliminary experimental results show that the proposed scheme is efficient and effective.
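A rough sketch of how wavelet detail coefficients can serve as a texture cue for text is given below. It is an illustration under stated assumptions, not the scheme from the paper: it uses a one-level Haar transform with averaging normalization (JPEG2000 itself uses the 5/3 and 9/7 wavelets), it computes the transform from pixels instead of reading coefficients from a compressed stream, and the function names and the threshold value are hypothetical.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of an even-sized grayscale image.

    `img` is a list of rows. Returns (approx, horiz_diff, vert_diff, diag),
    each half the size of the input, using averaging normalization.
    """
    approx, horiz_diff, vert_diff, diag = [], [], [], []
    for r in range(0, len(img), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            a_row.append((a + b + d + e) / 4.0)  # local average
            h_row.append((a - b + d - e) / 4.0)  # left minus right
            v_row.append((a + b - d - e) / 4.0)  # top minus bottom
            d_row.append((a - b - d + e) / 4.0)  # diagonal difference
        approx.append(a_row)
        horiz_diff.append(h_row)
        vert_diff.append(v_row)
        diag.append(d_row)
    return approx, horiz_diff, vert_diff, diag


def detail_energy(img):
    """Sum of squared detail coefficients at each 2x2 block position."""
    _, h, v, d = haar_dwt2(img)
    return [
        [h[i][j] ** 2 + v[i][j] ** 2 + d[i][j] ** 2 for j in range(len(h[0]))]
        for i in range(len(h))
    ]


def text_candidates(img, threshold):
    """Flag blocks whose detail energy exceeds a (hypothetical) threshold."""
    return [[e > threshold for e in row] for row in detail_energy(img)]


# Left half: flat background. Right half: high-contrast fine pattern,
# standing in for the dense strokes of text.
image = [
    [100, 100, 0, 255],
    [100, 100, 255, 0],
    [100, 100, 0, 255],
    [100, 100, 255, 0],
]
mask = text_candidates(image, threshold=1000.0)
print(mask)
```

Flat regions give zero detail energy while the finely patterned region gives a large value, which is the kind of contrast a texture-based detector relies on. A practical detector would combine several decomposition levels and group flagged blocks into text regions.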