Abstract
Chinese word segmentation is the foundation of Chinese information processing and plays an important role in fields such as search engines and machine translation. The segmentation dictionary is the basis of mechanical (dictionary-based) segmentation algorithms: it tells the algorithm what counts as a word. Because the algorithm must repeatedly consult the dictionary for string matching during execution, the dictionary's storage structure largely determines which matching algorithm can be used and how well it performs. Building on a study of existing segmentation dictionaries and matching algorithms, this paper adds a multi-level index to the dictionary and proposes a new storage mechanism: a Chinese word segmentation dictionary based on a two-level index. On top of this dictionary, it proposes an improved matching algorithm based on forward matching, which greatly reduces the time complexity of the matching process and thereby increases the overall segmentation speed.
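The abstract does not spell out the layout of the two-level index, so the following is only a minimal sketch under stated assumptions: the first level is keyed by a word's first character and the second level by word length, and matching is plain forward maximum matching restricted to the lengths that actually occur for that first character. The names `build_index` and `forward_max_match` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_index(words):
    """Build an assumed two-level index: first character -> word length -> set of words."""
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if word:
            index[word[0]][len(word)].add(word)
    return index


def forward_max_match(text, index):
    """Segment text left to right, taking the longest dictionary word at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        by_length = index.get(text[pos])  # level 1: look up the first character
        match = text[pos]  # fall back to a single character if nothing matches
        if by_length:
            # Level 2: try only lengths that exist for this first character,
            # longest first, skipping candidates cut short by the end of the text.
            for length in sorted(by_length, reverse=True):
                candidate = text[pos:pos + length]
                if len(candidate) == length and candidate in by_length[length]:
                    match = candidate
                    break
        tokens.append(match)
        pos += len(match)
    return tokens


if __name__ == "__main__":
    # Toy dictionary for illustration only.
    index = build_index(["中文", "分词", "中文分词", "词典", "分词词典", "基础"])
    print(forward_max_match("中文分词词典是基础", index))
```

In this sketch the index narrows each lookup to words that share the current first character and a feasible length, which is one plausible way an index could cut down the number of comparisons per position. The paper's actual structure and algorithm may differ.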
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
PKU Core (北大核心)
2009, Issue 19, pp. 139-141 (3 pages)
Keywords
Chinese word segmentation
two-level index
forward maximum matching