Abstract
This paper first introduces an efficient data structure for a Chinese electronic lexicon. It supports hashing on the first character of a word together with standard binary search, and places no limit on word length. It then proposes an improved fast algorithm for Chinese word segmentation: building on fast lookup of two-character words, the algorithm finds multi-character words by neighborhood matching, which markedly improves segmentation efficiency. Theoretical analysis gives the algorithm a time complexity of 1.66, which is better than that of other Chinese word segmentation algorithms.
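The lexicon structure and matching strategy summarized in the abstract can be illustrated with a short Python sketch. This is one reading of the abstract, not the authors' implementation. The names `Lexicon`, `longest_match`, and `segment` are hypothetical, and "neighborhood matching" is assumed here to mean scanning the sorted entries adjacent to the position found by binary search. The first character selects a bucket (the hash step), and the remainder of each word is stored in a sorted list so that entries sharing a second character sit next to each other.

```python
from bisect import bisect_left
from collections import defaultdict


class Lexicon:
    """First-character hash; each bucket is a sorted list of word tails."""

    def __init__(self, words):
        buckets = defaultdict(set)
        for w in words:
            if len(w) >= 2:  # single characters need no lexicon entry
                buckets[w[0]].add(w[1:])
        # Sorting puts tails with the same leading character side by side.
        self.buckets = {head: sorted(tails) for head, tails in buckets.items()}

    def longest_match(self, text, start):
        """Length of the longest lexicon word at text[start:], else 1."""
        tails = self.buckets.get(text[start])
        if tails is None or start + 1 >= len(text):
            return 1
        second = text[start + 1]
        # Binary search for the first tail beginning with the second character.
        i = bisect_left(tails, second)
        best = 1
        # Assumed "neighborhood" step: check only the adjacent entries that
        # share this second character, keeping the longest one that matches.
        while i < len(tails) and tails[i][0] == second:
            if text.startswith(tails[i], start + 1):
                best = max(best, 1 + len(tails[i]))
            i += 1
        return best


def segment(text, lexicon):
    """Greedy left-to-right segmentation using the longest match."""
    out, i = [], 0
    while i < len(text):
        n = lexicon.longest_match(text, i)
        out.append(text[i:i + n])
        i += n
    return out


lex = Lexicon(["中文", "中文信息", "信息", "信息处理", "处理", "分词", "算法"])
print(segment("中文信息处理分词算法", lex))
# ['中文信息', '处理', '分词', '算法']
```

Storing tails as variable-length strings keeps word length unrestricted, in line with the abstract. The greedy longest-match policy is a simplification chosen for the example.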
Source
《计算机研究与发展》
EI
CSCD
PKU Core Journal (北大核心)
2000, No. 4, pp. 418-424 (7 pages)
Journal of Computer Research and Development
Funding
Supported by the National High Technology Research and Development Program of China (863 Program), Project No. 863-ZD03-04-1
Keywords
word segmentation
Chinese information processing
algorithm
Chinese electronic lexicon
computer
word segmentation, hash, binary search, neighborhood matching, time complexity