Abstract
Text mining applies data mining techniques to automatically discover and extract implicit knowledge from the documents in a text collection, independently of any particular user's information needs. Obtaining Chinese text data depends on Chinese information processing, which makes automatic word segmentation a fundamental problem in that field. For applications that process massive amounts of information, segmentation speed is critical and strongly affects the efficiency of the whole system. This paper analyzes several common word segmentation methods and designs a Chinese automatic word segmentation system based on forward maximum matching. To improve segmentation accuracy, it also studies algorithms that strengthen ambiguity resolution and word optimization.
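As an illustration of the forward maximum matching idea named in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all hypothetical, and the paper's ambiguity resolution and word optimization steps are not modeled here.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment `text` left to right, always taking the longest dictionary word.

    `dictionary` is a set of known words (an assumption for this sketch).
    Characters that start no dictionary word are emitted as single-character tokens.
    """
    if max_len is None:
        # Longest dictionary entry bounds the window; fall back to 1 if empty.
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, chosen to expose a segmentation ambiguity.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))
# prints ['研究生', '命', '起源']
```

With this toy dictionary, the greedy longest match picks 研究生 first and leaves 命 stranded, whereas the intended reading is 研究 / 生命 / 起源. Cases like this are why a pure forward maximum matching segmenter is usually paired with an ambiguity resolution step of the kind the abstract mentions.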
Source
Computer Technology and Development (《计算机技术与发展》)
2007, No. 12, pp. 122-124, 172 (4 pages)
Funding
Anhui Provincial Science and Technology Plan Project (2007ZD-7021010)
Keywords
Chinese word segmentation
ambiguity resolution
maximum matching
word optimization