
Study on the Forward Increasing Maximum Matching Algorithm for Chinese Word Segmentation (中文分词中的正向增字最大匹配算法研究)

Cited by: 7
Abstract: The forward maximum matching algorithm suffers from lost long words, a large number of matching attempts, and low accuracy on ambiguous fields. To address these problems, three forward increasing (character-adding) maximum matching algorithms are proposed, all based on a Trie-tree dictionary. They collect ambiguous fields using three scanning methods respectively: word-by-word scanning, tail-halving scanning, and tail-minus-one scanning. A set of ambiguity resolution methods is also established. Experimental results show that all three algorithms significantly improve both segmentation speed and accuracy, with the error rate reduced to less than one third of that of the original algorithm. When the text size exceeds 200 MB, the segmentation speed of each of the three algorithms is more than 30% higher than that of the original maximum matching algorithm.
Source: 《微型机与应用》 (Microcomputer & Its Applications), 2014, Issue 17, pp. 15-18 (4 pages)
Keywords: Chinese word segmentation; Trie-tree; word-by-word scanning; forward increasing maximum matching algorithm
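The abstract describes matching that grows one character at a time against a Trie-tree dictionary. The following is a minimal sketch of that general idea only, not the paper's three algorithms: it omits the ambiguity collection and resolution steps, whose details the abstract does not give. The names `TrieNode`, `build_trie`, and `segment`, and the toy dictionary in the usage note, are assumptions made for illustration.

```python
class TrieNode:
    """One node of a character trie (hypothetical helper, not from the paper)."""

    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if the path from the root spells a dictionary word


def build_trie(words):
    """Build a trie from an iterable of dictionary words."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def segment(text, root):
    """Forward longest-match segmentation by adding one character at a time.

    From each start position, walk the trie while the next character extends
    some dictionary entry, remembering the end of the longest word seen.
    Characters that start no dictionary word are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        node = root
        longest_end = i + 1  # fallback: a single-character token
        j = i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                longest_end = j  # a longer dictionary word ends here
        tokens.append(text[i:longest_end])
        i = longest_end
    return tokens
```

With the toy dictionary `["研究", "研究生", "生命", "起源"]`, `segment("研究生命起源", build_trie(...))` returns `["研究生", "命", "起源"]`. The greedy forward match takes 研究生 and strands 命, although 研究 / 生命 / 起源 is the intended reading. This is the kind of overlapping ambiguous field that the paper's scanning and resolution methods are meant to handle.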