Abstract
This paper presents a new experimental system for Chinese documents, intended as a research platform for overcoming the automatic word segmentation problem that has long constrained the development of Chinese information processing (CIP). The system adopts a new text encoding method based on Chinese words: every word, rather than every character, is assigned a code, and the new codes are linked to the machine internal code (internal statement number, ISN). With this word-based encoding format, the word becomes the smallest unit of information in the computer processing of Chinese text, so word segmentation is no longer required. A keyword extraction experiment was carried out using this method. The results indicate that the word-encoding-based experimental system handles the Chinese segmentation problem well and lays a good foundation for other kinds of Chinese text analysis.
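The idea of storing text as word codes can be illustrated with a minimal sketch. It is not the authors' implementation: the code base value, the use of GB2312 bytes as the internal code, the function names, and the frequency-based keyword ranking are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical starting value for word codes (assumption, not from the paper).
WORD_CODE_BASE = 0x10000


def build_codebook(vocabulary):
    """Assign one integer code per word and link it to the word's internal code.

    The internal code is assumed here to be the word's GB2312 byte string.
    """
    word_to_code = {}
    code_to_internal = {}
    for word in vocabulary:
        if word not in word_to_code:
            code = WORD_CODE_BASE + len(word_to_code)
            word_to_code[word] = code
            code_to_internal[code] = word.encode("gb2312")
    return word_to_code, code_to_internal


def encode(words, word_to_code):
    """Store a text as a sequence of word codes, one code per word."""
    return [word_to_code[w] for w in words]


def decode(codes, code_to_internal):
    """Recover the words from their codes through the linked internal codes."""
    return [code_to_internal[c].decode("gb2312") for c in codes]


def top_keywords(codes, code_to_internal, k=1, stopwords=frozenset()):
    """Rank words by frequency directly on the codes; no segmentation step."""
    counts = Counter(codes)
    ranked = [code_to_internal[c].decode("gb2312") for c, _ in counts.most_common()]
    return [w for w in ranked if w not in stopwords][:k]


# Example text, already entered word by word (the premise of the scheme).
words = ["中文", "信息", "处理", "的", "中文", "分词"]
word_to_code, code_to_internal = build_codebook(words)
codes = encode(words, word_to_code)
keywords = top_keywords(codes, code_to_internal, k=1, stopwords=frozenset({"的"}))
```

Because every stored unit is already a word, counting or comparing words reduces to integer operations on the code sequence. The cost is that the text must be captured word by word at input time and that the codebook has to cover the vocabulary in use.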
Source
《微计算机信息》
Peking University Core Journal (北大核心)
2008, Issue 18, pp. 171-172, 104 (3 pages in total)
Control & Automation
Keywords
Chinese information processing
Chinese character coding
word coding
word platform
automatic word segmentation