Abstract
Chinese automatic word segmentation is a key technology in Chinese information processing, Web document mining, and other document-oriented research, and the segmentation algorithm is its core. The forward maximum matching (FMM) algorithm is fast, simple, and easy to implement, but the initial value of the maximum word length used during segmentation is fixed, which leads to a relatively large number of matching attempts. To address this problem, this paper proposes dynamically determining the length of the text segment to be matched according to the lengths of the entries in the Chinese dictionary, improves the FMM algorithm accordingly, and uses mutual information statistics to eliminate overlapping (crossing-type) ambiguities. Finally, the improved algorithm is tested and validated experimentally; the results show that it achieves higher Chinese word segmentation accuracy than the standard forward maximum matching algorithm.
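The following is a minimal sketch, not the authors' implementation, of the two ideas summarized above: the candidate window lengths for forward maximum matching are taken from the word lengths that actually occur in the dictionary (rather than from a fixed preset maximum), and pointwise mutual information over adjacent strings is one way overlapping ambiguities can be scored. The lexicon contents, frequency counts, and helper names (fmm_dynamic, mutual_information) are hypothetical illustrations.

```python
import math


def dictionary_lengths(dictionary):
    """Distinct word lengths that actually occur in the dictionary,
    longest first; these drive the dynamic maximum word length."""
    return sorted({len(w) for w in dictionary}, reverse=True)


def fmm_dynamic(text, dictionary, lengths=None):
    """Forward maximum matching whose candidate lengths come from the
    dictionary's real word lengths rather than a fixed preset maximum."""
    if lengths is None:
        lengths = dictionary_lengths(dictionary)
    words, i, n = [], 0, len(text)
    while i < n:
        match = None
        for L in lengths:                      # longest usable length first
            if L <= n - i and text[i:i + L] in dictionary:
                match = text[i:i + L]
                break
        if match is None:                      # no dictionary word starts here,
            match = text[i]                    # so emit a single character
        words.append(match)
        i += len(match)
    return words


def mutual_information(x, y, unigram, bigram, total):
    """Pointwise mutual information of adjacent strings x and y, sketched
    here as one way to compare the readings of an overlapping ambiguity."""
    p_xy = bigram.get((x, y), 0) / total
    p_x, p_y = unigram.get(x, 0) / total, unigram.get(y, 0) / total
    if 0 in (p_xy, p_x, p_y):
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))


if __name__ == "__main__":
    # Tiny made-up lexicon; a real system would load a full dictionary.
    lexicon = {"中文", "自动", "分词", "自动分词", "技术"}
    print(fmm_dynamic("中文自动分词技术", lexicon))   # ['中文', '自动分词', '技术']
```

Because the candidate lengths are limited to lengths that actually exist in the dictionary, attempts at lengths that can never match are skipped, which is the source of the reduced matching count claimed in the abstract.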
Source
《贵州大学学报(自然科学版)》
2011, No. 5, pp. 112-115, 119 (5 pages)
Journal of Guizhou University: Natural Sciences
Keywords
automatic word segmentation
Chinese information processing
mining
maximum match