Finding out out-of-vocabulary words is an urgent and difficult task in Chinese words segmentation. To avoid the defect causing by offline training in the traditional method, the paper proposes an improved prediction b...Finding out out-of-vocabulary words is an urgent and difficult task in Chinese words segmentation. To avoid the defect causing by offline training in the traditional method, the paper proposes an improved prediction by partical match (PPM) segmenting algorithm for Chinese words based on extracting local context information, which adds the context information of the testing text into the local PPM statistical model so as to guide the detection of new words. The algorithm focuses on the process of online segmentatien and new word detection which achieves a good effect in the close or opening test, and outperforms some well-known Chinese segmentation system to a certain extent.展开更多
A new joint decoding strategy that combines the character-based and word-based conditional random field model is proposed.In this segmentation framework,fragments are used to generate candidate Out-of-Vocabularies(OOV...A new joint decoding strategy that combines the character-based and word-based conditional random field model is proposed.In this segmentation framework,fragments are used to generate candidate Out-of-Vocabularies(OOVs).After the initial segmentation,the segmentation fragments are divided into two classes as "combination"(combining several fragments as an unknown word) and "segregation"(segregating to some words).So,more OOVs can be recalled.Moreover,for the characteristics of the cross-domain segmentation,context information is reasonably used to guide Chinese Word Segmentation(CWS).This method is proved to be effective through several experiments on the test data from Sighan Bakeoffs 2007 and Bakeoffs 2010.The rates of OOV recall obtain better performance and the overall segmentation performances achieve a good effect.展开更多
To identify Song Ci style automatically,we put forward a novel stylistic text categorization approach based on words and their semantic in this paper. And a modified special word segmentation method,a new semantic rel...To identify Song Ci style automatically,we put forward a novel stylistic text categorization approach based on words and their semantic in this paper. And a modified special word segmentation method,a new semantic relativity computing method based on HowNet along with the corresponding word sense disambiguation method are proposed to extract words and semantic features from Song Ci. Experiments are carried out and the results show that these methods are effective.展开更多
This paper is based on two existing theories about automatic indexing of thematic knowledge concept. The prohibit-word table with position information has been designed. The improved Maximum Matching-Minimum Backtrack...This paper is based on two existing theories about automatic indexing of thematic knowledge concept. The prohibit-word table with position information has been designed. The improved Maximum Matching-Minimum Backtracking method has been researched. Moreover it has been studied on improved indexing algorithm and application technology based on rules and thematic concept word table.展开更多
Automatic translation of Chinese text to Chinese Braille is important for blind people in China to acquire information using computers or smart phones. In this paper, a novel scheme of Chinese-Braille translation is p...Automatic translation of Chinese text to Chinese Braille is important for blind people in China to acquire information using computers or smart phones. In this paper, a novel scheme of Chinese-Braille translation is proposed. Under the scheme, a Braille word segmentation model based on statistical machine learning is trained on a Braille corpus, and Braille word segmentation is carried out using the statistical model directly without the stage of Chinese word segmentation. This method avoids establishing rules concerning syntactic and semantic information and uses statistical model to learn the rules stealthily and automatically. To further improve the performance, an algorithm of fusing the results of Chinese word segmentation and Braille word segmentation is also proposed. Our results show that the proposed method achieves accuracy of 92.81% for Braille word segmentation and considerably outperforms current approaches using the segmentation-merging scheme.展开更多
基金National Natural Science Foundation of China ( No.60903129)National High Technology Research and Development Program of China (No.2006AA010107, No.2006AA010108)Foundation of Fujian Province of China (No.2008F3105)
文摘Finding out out-of-vocabulary words is an urgent and difficult task in Chinese words segmentation. To avoid the defect causing by offline training in the traditional method, the paper proposes an improved prediction by partical match (PPM) segmenting algorithm for Chinese words based on extracting local context information, which adds the context information of the testing text into the local PPM statistical model so as to guide the detection of new words. The algorithm focuses on the process of online segmentatien and new word detection which achieves a good effect in the close or opening test, and outperforms some well-known Chinese segmentation system to a certain extent.
基金supported by the National Natural Science Foundation of China under Grants No.61173100,No.61173101the Fundamental Research Funds for the Central Universities under Grant No.DUT10RW202
文摘A new joint decoding strategy that combines the character-based and word-based conditional random field model is proposed.In this segmentation framework,fragments are used to generate candidate Out-of-Vocabularies(OOVs).After the initial segmentation,the segmentation fragments are divided into two classes as "combination"(combining several fragments as an unknown word) and "segregation"(segregating to some words).So,more OOVs can be recalled.Moreover,for the characteristics of the cross-domain segmentation,context information is reasonably used to guide Chinese Word Segmentation(CWS).This method is proved to be effective through several experiments on the test data from Sighan Bakeoffs 2007 and Bakeoffs 2010.The rates of OOV recall obtain better performance and the overall segmentation performances achieve a good effect.
基金National Natural Science Foundation of China ( No.60903129)National High Technology Research and Development Programs of China ( No.2006AA010107, No.2006AA010108)Foundations of Fujian Province of China ( No.2008F3105, No.2009J05156)
文摘To identify Song Ci style automatically,we put forward a novel stylistic text categorization approach based on words and their semantic in this paper. And a modified special word segmentation method,a new semantic relativity computing method based on HowNet along with the corresponding word sense disambiguation method are proposed to extract words and semantic features from Song Ci. Experiments are carried out and the results show that these methods are effective.
基金the Science Foundation of Shanghai Archive Bureau (0215)
文摘This paper is based on two existing theories about automatic indexing of thematic knowledge concept. The prohibit-word table with position information has been designed. The improved Maximum Matching-Minimum Backtracking method has been researched. Moreover it has been studied on improved indexing algorithm and application technology based on rules and thematic concept word table.
基金Fthe National Key Technology R&D Program of China(No.2014BAK15B02)the National Natural Science Foundation of China(No.61202209)
文摘Automatic translation of Chinese text to Chinese Braille is important for blind people in China to acquire information using computers or smart phones. In this paper, a novel scheme of Chinese-Braille translation is proposed. Under the scheme, a Braille word segmentation model based on statistical machine learning is trained on a Braille corpus, and Braille word segmentation is carried out using the statistical model directly without the stage of Chinese word segmentation. This method avoids establishing rules concerning syntactic and semantic information and uses statistical model to learn the rules stealthily and automatically. To further improve the performance, an algorithm of fusing the results of Chinese word segmentation and Braille word segmentation is also proposed. Our results show that the proposed method achieves accuracy of 92.81% for Braille word segmentation and considerably outperforms current approaches using the segmentation-merging scheme.