Text extraction is a key step in character recognition; its accuracy relies heavily on locating the text region. In this paper, we propose a new method that locates text automatically, addressing region-level problems such as incomplete regions, false positions, and orientation deviation that occur when extracting text from low-contrast images. First, we pre-process the original image with a color space transform, contrast-limited adaptive histogram equalization, a Sobel edge detector, morphological operations, and an eight-neighborhood processing method (ENPM), among others, and use the results to compare the different methods. Second, we apply connected component analysis (CCA) to obtain connected and non-connected parts, then apply morphological operations and CCA again to the non-connected parts to erode noise, yielding a further set of connected and non-connected parts. Third, we compute edge features for all connected areas and use a Support Vector Machine (SVM) to classify the true text regions and obtain their location coordinates. Finally, we use these coordinates to extract the blocks containing text, then binarize, cluster, and recognize the text. We evaluate the method by computing the precision and recall rates on more than 200 images. The experiments show that the proposed method is robust for low-contrast text images with variations in font size, font color, and language, and for images taken in dark environments.
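The connected component analysis step and the precision/recall evaluation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `connected_components` and `precision_recall`, the 0/1 grid input, and the bounding-box output format are all assumptions made for illustration, and 8-connectivity is chosen here only because the abstract mentions eight-neighborhood processing.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 8-connected
    foreground regions in a binary grid (lists of 0/1 values)."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical edge map: two regions, the second joined only diagonally.
edge_map = [
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
boxes = connected_components(edge_map)
```

In a full pipeline, each bounding box would be passed to a feature extractor and a classifier such as an SVM; that stage is omitted here because the abstract does not specify the features.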
Map recognition is an essential means of data input for Geographic Information Systems (GIS). Solving the problems that arise in this procedure, such as recognizing maps with crisscrossing pipeline networks, classifying buildings and roads, and processing connected text, is critical to the continued rapid development of GIS. In this paper, a new recognition method for pipeline maps is presented, and some common patterns of pipeline connections and component labels are established. Through pattern matching, pipelines and component labels are recognized and stripped from the maps. After this step, the maps consist only of buildings and roads, which are recognized and classified with a fuzzy classification method. In addition, the Double Sides Scan (DSS) technique is described, which eliminates the effect of connected text.
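The abstract does not say which features or membership functions the fuzzy classification uses. As a rough sketch of the general idea only, the example below assigns fuzzy memberships to "road" and "building" from a single hypothetical feature, the elongation of a shape's bounding rectangle. The names `ramp` and `classify_shape` and the thresholds 2.0 and 6.0 are illustrative assumptions, not values from the paper.

```python
def ramp(x, lo, hi):
    """Linear membership function: 0 at or below lo, 1 at or above hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def classify_shape(length, width):
    """Fuzzy-classify a shape as 'road' or 'building' from its elongation.

    Assumption for illustration: roads are long and thin, buildings compact.
    """
    if length <= 0 or width <= 0:
        raise ValueError("length and width must be positive")
    elongation = max(length, width) / min(length, width)
    road = ramp(elongation, 2.0, 6.0)   # hypothetical thresholds
    building = 1.0 - road               # complementary membership
    label = "road" if road > building else "building"
    return label, {"road": road, "building": building}


label_long, degrees_long = classify_shape(100, 5)
label_compact, degrees_compact = classify_shape(12, 10)
```

A practical classifier would likely combine several features (area, contour regularity, adjacency) rather than a single ratio; the membership degrees make borderline shapes visible instead of forcing a hard decision.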
Discourse analysis now has an established place in the study of text linguistics. As one of the essential properties of a text, coherence plays a very important role in understanding it. This thesis aims to illustrate how coherence is realized in two respects, discourse connectives and cohesion, analyzing their differences and elaborating on their different roles in contributing to successful communication.
We often encounter documents with text printed on a complex color background. The readability of such documents is very poor because of the complexity of the background and the mixing of foreground text colors with background colors. Automatic segmentation of the foreground text in such document images is essential for smooth reading of the contents by either humans or machines. In this paper we propose a novel approach to extracting the foreground text from color document images with complex backgrounds. The proposed approach is a hybrid that combines connected component analysis and texture feature analysis of potential text regions. It uses the Canny edge detector to detect all possible text edge pixels. Connected component analysis is performed on these edge pixels to identify candidate text regions. Because of the background complexity, a non-text region may also be identified as a text region. This problem is overcome by analyzing the texture features of the potential text region corresponding to each connected component. An unsupervised local thresholding method is devised to perform foreground segmentation in the detected text regions. Finally, noisy text regions are identified and reprocessed to further enhance the quality of the retrieved foreground. The proposed approach can handle document images with varying backgrounds of multiple colors and textures, and foreground text in any color, font, size, and orientation. Experimental results show that the proposed algorithm detects on average 97.12% of the text regions in the source document. The readability of the extracted foreground text is illustrated through optical character recognition (OCR) when the text is in English. The proposed approach is compared with some existing methods of foreground separation in document images, and the experimental results show that it performs better.
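The abstract does not describe how its unsupervised local threshold is computed. As a stand-in, the sketch below applies Otsu's method separately inside one detected region, which is a common unsupervised choice but not necessarily the authors'. The function names, the inclusive `(top, left, bottom, right)` box format, and the assumption of dark text on a lighter background are all illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) maximizing between-class variance,
    where class 0 is pixels <= t and class 1 is pixels > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    weight0, sum0 = 0, 0
    for t in range(256):
        weight0 += hist[t]
        sum0 += t * hist[t]
        if weight0 == 0:
            continue
        weight1 = total - weight0
        if weight1 == 0:
            break
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        # Proportional to the between-class variance.
        score = weight0 * weight1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def segment_region(image, box):
    """Binarize one region of a grayscale image (lists of 0-255 ints).

    Assumes dark text on a lighter background: pixels at or below the
    region's own threshold are marked 1 (foreground).
    """
    top, left, bottom, right = box
    region = [row[left:right + 1] for row in image[top:bottom + 1]]
    t = otsu_threshold([p for row in region for p in row])
    return [[1 if p <= t else 0 for p in row] for row in region]


# Hypothetical 2x3 region: one dark vertical stroke on a light background.
gray = [
    [200, 30, 210],
    [205, 32, 200],
]
mask = segment_region(gray, (0, 0, 1, 2))
```

Computing the threshold per region rather than once for the whole page is what lets each text block adapt to its own background color, which is the point of local thresholding on multi-colored documents.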