Abstract
As information technology advances, Chinese word segmentation is used ever more widely, for example in search engines and machine translation. This paper describes a dictionary-lookup approach to the parts of a text left unrecognized: the preposition/adverb dictionary, the surname dictionary, and the suffix dictionary are consulted first; prepositions, adverbs, personal names, and suffix words are then handled; and whatever remains is output as single characters. A Chinese word segmentation system based on Lucene was designed and implemented. The system runs well, segments input text relatively accurately and quickly, and its output basically meets the expected goals.
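The lookup order described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not use Lucene. The function name `segment_unrecognized`, the three dictionary arguments, the `max_given` limit on given-name length, and the rule that a suffix attaches to the preceding token are all assumptions made for illustration only.

```python
def segment_unrecognized(fragment, function_words, surnames, suffixes, max_given=2):
    """Segment a fragment that the main dictionary could not recognize.

    Hypothetical sketch: all names and rules here are illustrative assumptions.
    function_words: set of prepositions/adverbs (may be multi-character)
    surnames:       set of single-character surnames
    suffixes:       set of single-character suffix words
    """
    tokens = []
    i, n = 0, len(fragment)
    longest = max((len(w) for w in function_words), default=0)
    while i < n:
        # 1. Preposition/adverb dictionary: longest match starting at i.
        matched = None
        for length in range(min(longest, n - i), 0, -1):
            candidate = fragment[i:i + length]
            if candidate in function_words:
                matched = candidate
                break
        if matched is not None:
            tokens.append(matched)
            i += len(matched)
            continue

        ch = fragment[i]
        # 2. Surname dictionary: surname plus up to max_given following
        #    characters (assumed rule), stopping early at a character that
        #    is itself a function word or a suffix.
        if ch in surnames:
            j = i + 1
            while (j < n and j - i - 1 < max_given
                   and fragment[j] not in function_words
                   and fragment[j] not in suffixes):
                j += 1
            tokens.append(fragment[i:j])
            i = j
            continue

        # 3. Suffix dictionary: attach the suffix to the preceding token
        #    (assumed rule); a suffix with nothing before it falls through.
        if ch in suffixes and tokens:
            tokens[-1] += ch
            i += 1
            continue

        # 4. Whatever remains is output as a single character.
        tokens.append(ch)
        i += 1
    return tokens


if __name__ == "__main__":
    print(segment_unrecognized("王小明已经在", {"已经", "在"}, {"王"}, {"们"}))
```

With these toy dictionaries, the fragment is split into a personal name followed by two function words. A real system would need larger dictionaries and a more careful rule for where a name ends.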
Source
Automation & Instrumentation (《自动化与仪器仪表》)
2016, No. 5, pp. 208-210 (3 pages)
Keywords
Chinese word segmentation
search engine
dictionary retrieval
Lucene-based