Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 60375019, 60373101 and 60575041). Acknowledgement: The composition of this paper benefited greatly from the work of postgraduates and teachers in the MOE-MS Key Laboratory of Natural Language Processing and Speech.
Abstract: This paper presents the main progress and achievements in Chinese information processing. It focuses on six areas: Chinese syntactic analysis, Chinese semantic analysis, machine translation, information retrieval, information extraction, and speech recognition and synthesis. The important techniques of each area, and the key problems each is likely to face in the near future, are also discussed.
Abstract: To facilitate the wider use of computers all over the world, it is necessary to provide National Language Support in computer systems. This paper introduces some aspects of the design and implementation of UNIX-based Chinese Information Processing Systems (CIPS). Because of the special nature of the Oriental languages, and in order to share resources and exchange information between different countries, it is necessary to create a standard multilingual information exchange code. The unified Chinese/Japanese/Korean character code, the Han Character Collection (HCC), was proposed to ISO/IEC JTC1/SC2/WG2 by the China Computer and Information Processing Standardization Technical Committee. Based on this character set and the corresponding coding system, it is possible to create a truly internationalized UNIX system.
Abstract: The resolution of overlapping ambiguity strings (OAS) is studied based on the maximum entropy model. The model has two outputs: either the first two characters form a word, or the last two characters form a word. The features of the model include one word in the context of the OAS, the current OAS, and the word probability relation between the two segmentation results. OAS in the training text are found by combining the FMM and BMM segmentation methods. After feature tagging, they are used to train the maximum entropy model. The People's Daily corpus of January 1998 is used for training and testing. Experimental results show a closed-test precision of 98.64% and an open-test precision of 95.01%. The open-test precision is 3.76% higher than that of the common word probability method.
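The OAS detection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes FMM and BMM denote forward and backward maximum matching, and the lexicon, the `max_len` parameter, and the function names `fmm`, `bmm` and `find_oas` are all hypothetical. The maximum entropy classifier itself is not shown.

```python
def fmm(text, lexicon, max_len):
    """Forward maximum matching: greedily take the longest lexicon word from the left."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # Fall back to a single character when no lexicon word matches.
            if length == 1 or piece in lexicon:
                words.append(piece)
                i += length
                break
    return words


def bmm(text, lexicon, max_len):
    """Backward maximum matching: greedily take the longest lexicon word from the right."""
    words, j = [], len(text)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            piece = text[j - length:j]
            if length == 1 or piece in lexicon:
                words.append(piece)
                j -= length
                break
    words.reverse()
    return words


def boundaries(words):
    """Return the set of character offsets at which a word ends."""
    ends, pos = set(), 0
    for word in words:
        pos += len(word)
        ends.add(pos)
    return ends


def find_oas(text, lexicon, max_len):
    """Return substrings where the FMM and BMM segmentations disagree (candidate OAS)."""
    fb = boundaries(fmm(text, lexicon, max_len))
    bb = boundaries(bmm(text, lexicon, max_len))
    common = sorted((fb & bb) | {0})
    candidates = []
    for start, end in zip(common, common[1:]):
        # A boundary strictly inside the span is used by only one of the two segmenters.
        if any(start < b < end for b in fb | bb):
            candidates.append(text[start:end])
    return candidates


# Hypothetical toy lexicon; a real system would use a full dictionary.
toy_lexicon = {"从小", "小学"}
print(find_oas("他从小学起", toy_lexicon, max_len=2))
```

In the toy sentence, forward matching groups the first two characters of the disputed span as a word and backward matching groups the last two, which corresponds to the two model outputs named in the abstract. A classifier trained on such spans would then choose between the two readings.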