Abstract
This paper presents a custom Chinese analyzer based on the complex form of the maximum matching algorithm (MMSeg_Complex). The analyzer applies four disambiguation rules and supports user-defined dictionaries, custom synonyms, and stop words. It can be easily integrated into Lucene, which effectively improves Lucene's Chinese processing capability. Experiments show that its segmentation performance is greatly improved over the Chinese analyzers built into Lucene, and an efficient Chinese full-text retrieval system was ultimately built on top of it.
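The abstract does not list the four disambiguation rules. As a minimal sketch, assuming they are the four rules of the commonly cited MMSEG description (maximum total chunk length, largest average word length, smallest variance of word lengths, largest sum of log frequencies of single-character words), complex maximum matching over three-word chunks might look like the following. The dictionary, its frequencies, and all function names are hypothetical and are not taken from the paper or from Lucene.

```python
import math

# Hypothetical toy dictionary: word -> frequency.
# Frequencies are only consulted for single-character words (rule 4).
DICT = {
    "研究": 50, "研究生": 30, "生命": 40, "命": 20, "起源": 30,
    "研": 5, "究": 5, "生": 100, "起": 60, "源": 10, "的": 1000,
}
MAX_WORD_LEN = max(len(w) for w in DICT)


def candidate_words(text, pos):
    """Single character at pos, plus every longer dictionary word starting there."""
    words = [text[pos]]
    for n in range(2, MAX_WORD_LEN + 1):
        if pos + n <= len(text) and text[pos:pos + n] in DICT:
            words.append(text[pos:pos + n])
    return words


def chunks(text, pos, depth=3):
    """Enumerate all chunks of up to `depth` consecutive words starting at pos."""
    if depth == 0 or pos >= len(text):
        return [[]]
    result = []
    for w in candidate_words(text, pos):
        for rest in chunks(text, pos + len(w), depth - 1):
            result.append([w] + rest)
    return result


def total_length(chunk):          # rule 1: maximum matching
    return sum(len(w) for w in chunk)


def average_length(chunk):        # rule 2: largest average word length
    return total_length(chunk) / len(chunk)


def neg_variance(chunk):          # rule 3: smallest variance of word lengths
    avg = average_length(chunk)
    return -sum((len(w) - avg) ** 2 for w in chunk) / len(chunk)


def morphemic_freedom(chunk):     # rule 4: log frequency of one-character words
    return sum(math.log(DICT.get(w, 1)) for w in chunk if len(w) == 1)


RULES = [total_length, average_length, neg_variance, morphemic_freedom]


def best_chunk(candidates):
    """Apply the rules in order, keeping only the top-scoring chunks each time."""
    for rule in RULES:
        best = max(rule(c) for c in candidates)
        candidates = [c for c in candidates if math.isclose(rule(c), best)]
        if len(candidates) == 1:
            break
    return candidates[0]


def segment(text):
    """Emit the first word of the best chunk, advance, and repeat."""
    pos, out = 0, []
    while pos < len(text):
        word = best_chunk(chunks(text, pos))[0]
        out.append(word)
        pos += len(word)
    return out


print(segment("研究生命起源"))
```

With this toy dictionary, the two chunks 研究/生命/起源 and 研究生/命/起源 tie on total length and on average word length, so the variance rule decides in favor of the first. A user-defined dictionary, as described in the abstract, would correspond here to adding entries to `DICT` before segmenting.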
Source
Computer Knowledge and Technology (电脑知识与技术, back issues)
2014, Issue 1X, pp. 430-433 (4 pages)
Keywords
Chinese word segmentation
complex maximum matching
Lucene
analyzer
full-text retrieval