Abstract
As modern society develops rapidly, new terms keep appearing. Most existing word segmentation methods based on string matching rely on a fixed dictionary, so a newly coined word cannot be recognized correctly. This paper therefore proposes a segmentation method that enriches the dictionary progressively: strings that cannot be segmented correctly are recorded together with their frequency counts, and once a string's frequency reaches a given threshold it is treated as a new word and added to the dictionary, so the dictionary grows dynamically. Experiments show that the method improves segmentation accuracy without affecting segmentation speed.
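The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which string-matching algorithm, threshold value, or candidate-extraction rule the paper uses, so everything below is an assumption made for illustration: forward maximum matching as the matcher, a run of consecutive unmatched characters as the candidate string, and the hypothetical names `EnrichingSegmenter`, `threshold`, and `max_len`.

```python
from collections import Counter


class EnrichingSegmenter:
    """Forward maximum matching with a dictionary that grows over time.

    Illustrative sketch only; not the paper's actual implementation.
    """

    def __init__(self, dictionary, threshold=2, max_len=4):
        self.dictionary = set(dictionary)
        self.threshold = threshold    # assumed promotion threshold
        self.max_len = max_len        # longest dictionary word to try
        self.candidates = Counter()   # frequency of unmatched runs

    def segment(self, text):
        tokens, run, i = [], "", 0
        while i < len(text):
            # Try the longest dictionary word starting at position i.
            for size in range(min(self.max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if piece in self.dictionary:
                    break
            else:
                piece = None          # nothing in the dictionary matched
            if piece is None:
                run += text[i]        # extend the current unmatched run
                i += 1
                continue
            if run:
                self._flush(run, tokens)
                run = ""
            tokens.append(piece)
            i += len(piece)
        if run:
            self._flush(run, tokens)
        return tokens

    def _flush(self, run, tokens):
        # Count the unmatched run; promote it once it is frequent enough.
        self.candidates[run] += 1
        if self.candidates[run] >= self.threshold:
            self.dictionary.add(run)
            self.max_len = max(self.max_len, len(run))
            tokens.append(run)        # now treated as one word
        else:
            tokens.extend(run)        # fall back to single characters


seg = EnrichingSegmenter({"我", "喜欢", "写", "看"}, threshold=2)
print(seg.segment("我喜欢写博客"))
print(seg.segment("我喜欢看博客"))
```

In this sketch the unknown string 博客 is split into single characters the first time it is seen and is promoted to a dictionary word on its second occurrence. The lookup path is unchanged by the enrichment step, which is one plausible reading of the claim that speed is not affected.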
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2006, Issue 32, pp. 164-166 (3 pages)
Computer Engineering and Applications
Funding
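Hebei Province Science and Technology Research Program project (05213573)
Hebei Provincial Department of Education Scientific Research Program project (2004406).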