Abstract
Chinese word segmentation is a basic step in Chinese information processing, and it plays an especially important role in Chinese full-text indexing. This paper first compares common Chinese word segmentation algorithms, then selects the one with the best overall performance, matching-based segmentation driven by word-frequency statistics, and integrates it into the open-source full-text indexing project Lucene. A comparison with traditional mechanical segmentation shows that a full-text index built with the word-frequency-based matching segmenter not only requires far less index space but also significantly improves retrieval quality.
Source
《电脑知识与技术》
2012, No. 1X, pp. 722-726 (5 pages)
Computer Knowledge and Technology