Abstract
Existing Chinese word segmentation algorithms cannot supply user interest and preference information for mobile search. To address this, an improved forward maximum matching (FMM) Chinese word segmentation algorithm is proposed. The algorithm builds on the character-by-character binary-search (逐字二分) dictionary mechanism and adds word category information, storing the category of every entry in the dictionary. During segmentation it performs forward maximum matching using an improved non-uniform segmented hash on the area-position code (区位码) of each word's second character. Experimental results show that, compared with the character-by-character binary-search method, the improved algorithm requires 13% more storage but improves time efficiency by about 20%, and it extracts each word's category information as part of segmentation.
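The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the category labels, and the names `build_index`, `lookup`, and `segment` are all hypothetical, and a plain Python dict keyed by the second character stands in for the paper's non-uniform segmented area-position-code hash, whose details the abstract does not give. What the sketch does show is the general shape: index by first character, narrow by second character, binary-search the remainder, and return the category together with each matched word.

```python
from bisect import bisect_left

# Hypothetical toy lexicon: word -> category label (invented for illustration).
LEXICON = {
    "手机": "digital",
    "手机游戏": "games",
    "游戏": "games",
    "下载": "software",
    "音乐": "entertainment",
}


def build_index(lexicon):
    """Index words by first character, then by second character.

    Each bucket is a sorted list of (remainder, category) pairs so the
    rest of the word can be located by binary search.
    Assumption: a dict on the second character replaces the paper's
    area-position-code hash.
    """
    index, singles = {}, {}
    for word, category in lexicon.items():
        if len(word) == 1:
            singles[word] = category
            continue
        bucket = index.setdefault(word[0], {}).setdefault(word[1], [])
        bucket.append((word[2:], category))
    for second_level in index.values():
        for bucket in second_level.values():
            bucket.sort()
    return index, singles


def lookup(index, singles, word):
    """Return the category of `word`, or None if it is not in the lexicon."""
    if len(word) == 1:
        return singles.get(word)
    bucket = index.get(word[0], {}).get(word[1])
    if not bucket:
        return None
    rest = word[2:]
    # (rest,) sorts immediately before any (rest, category) pair.
    pos = bisect_left(bucket, (rest,))
    if pos < len(bucket) and bucket[pos][0] == rest:
        return bucket[pos][1]
    return None


def segment(text, index, singles, max_len=4):
    """Forward maximum matching: try the longest candidate first.

    Returns (word, category) pairs; unknown single characters are
    emitted with category None.
    """
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            category = lookup(index, singles, candidate)
            if category is not None or size == 1:
                result.append((candidate, category))
                i += size
                break
    return result


INDEX, SINGLES = build_index(LEXICON)
```

Calling `segment("手机游戏下载", INDEX, SINGLES)` prefers the four-character entry over the shorter "手机", which is the defining behavior of forward maximum matching, and each token arrives with its category attached rather than requiring a second dictionary pass.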
Source
Journal of Xi'an University of Posts and Telecommunications (《西安邮电大学学报》)
2015, No. 4, pp. 62-65 (4 pages)
Funding
National Natural Science Foundation of China (61373116)
Xi'an University of Posts and Telecommunications Youth Fund (ZL2014-27)
Keywords
Chinese word segmentation
dictionary mechanism
word category information