Abstract
Chinese word segmentation is one of the main challenges facing search engines. By analyzing Nutch's document scoring mechanism, and noting that the segmentation produced by Nutch's built-in Chinese word segmentation module does not conform to Chinese language habits, this paper proposes using the dictionary-based Paoding ("Paodingjieniu", 庖丁解牛) word segmentation module to segment the data crawled by Nutch. It describes how to implement the Paoding segmentation module in Nutch and then tests the module. Experiments show that the Paoding module's segmentation results conform better to Chinese language habits, that the coverage of documents by terms is more balanced, and that the index files occupy 20%-65% less storage space.
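For orientation, the sketch below shows one plausible way to wire a dictionary-based analyzer such as Paoding into Nutch's analysis extension point, which is the kind of integration the paper describes. It is a minimal illustration, not the paper's code: it assumes the Nutch 1.x `NutchAnalyzer` base class and the `PaodingAnalyzer` class from the paoding-analysis library, and the hypothetical class name `PaodingNutchAnalyzer` is introduced here for the example; actual class names and plugin registration should be checked against the versions used.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.nutch.analysis.NutchAnalyzer;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

/**
 * Hypothetical Nutch analyzer plugin that delegates tokenization of
 * Chinese text to the dictionary-based Paoding (庖丁解牛) analyzer
 * instead of Nutch's default per-character/bigram handling of CJK text.
 */
public class PaodingNutchAnalyzer extends NutchAnalyzer {

    // PaodingAnalyzer loads its word dictionaries from the directory
    // configured for paoding-analysis (e.g. the paoding dictionary home).
    private final Analyzer delegate = new PaodingAnalyzer();

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Hand every field's text to the dictionary-based segmenter,
        // so indexed terms are whole Chinese words rather than single characters.
        return delegate.tokenStream(fieldName, reader);
    }
}
```

In a real deployment the class would additionally be registered as a Nutch analysis plugin (via the plugin descriptor) so that it is selected for the appropriate language during indexing.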
Source
Computer and Modernization (《计算机与现代化》), 2010, No. 6, pp. 187-190 (4 pages)
Keywords
Chinese word segmentation
scoring mechanism
Paodingjieniu (庖丁解牛)