Chinese Word Segmentation Technology Based on Group Hash and Variable-Length Matching
Abstract: Chinese word segmentation is a basic task in large-scale Chinese information processing, where segmentation accuracy and speed matter most. In existing techniques, however, accuracy and speed cannot be reconciled. To increase segmentation speed without the accuracy loss caused by shortening the initial string length, this paper proposes using regular expressions to extract variable-length strings and applying grouped hashing to the lexicon. Theoretical analysis shows that the technique lowers time complexity from O(n^2) to O(n). For accuracy, it takes the sentence length as a dynamically varying initial string length, which avoids losing long words and keeps segmentation accuracy intact.
Authors: Yang Guangbao (杨光豹); Yang Fenghe (杨丰赫); Mao Guijun (毛贵军). Affiliation: Qingtian College, Zhejiang Radio & TV University, Qingtian, Zhejiang 323900, China
Source: Computer Era (《计算机时代》), 2019, No. 4, pp. 52-55 (4 pages)
Keywords: Chinese word segmentation; regular expression; hash; time complexity
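
The approach summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the regular expression, the grouping key, or the matching direction, so everything below is an assumption rather than the authors' implementation: the regex treats any run of CJK ideographs as one clause, the lexicon is grouped by word length with one hash set per group, and matching is forward maximum matching. The names `CLAUSE_RE`, `build_groups`, `segment_clause`, and `segment` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a clause is any run of CJK Unified Ideographs, so punctuation,
# digits, and Latin text act as clause boundaries.
CLAUSE_RE = re.compile(r"[\u4e00-\u9fff]+")


def build_groups(words):
    """Group the lexicon by word length (assumed key); each group is a hash set."""
    groups = defaultdict(set)
    for word in words:
        groups[len(word)].add(word)
    return groups


def segment_clause(clause, groups):
    """Forward maximum matching with a window bounded by the remaining clause."""
    lengths = sorted(groups, reverse=True)  # word lengths present in the lexicon
    tokens = []
    i = 0
    while i < len(clause):
        remaining = len(clause) - i
        match = clause[i]  # fall back to a single character
        # Probe only lengths that exist in the lexicon, longest first.
        for size in lengths:
            if size <= remaining and clause[i:i + size] in groups[size]:
                match = clause[i:i + size]
                break
        tokens.append(match)
        i += len(match)
    return tokens


def segment(text, groups):
    """Split text into clauses with the regex, then segment each clause."""
    tokens = []
    for clause in CLAUSE_RE.findall(text):
        tokens.extend(segment_clause(clause, groups))
    return tokens


if __name__ == "__main__":
    lexicon = ["中文", "分词", "中文分词", "信息", "处理", "信息处理", "基础", "任务"]
    print(segment("中文分词是基础任务,信息处理", build_groups(lexicon)))
    # prints ['中文分词', '是', '基础', '任务', '信息处理']
```

Because the window is limited only by the clause length and the longest lexicon entry, a long word such as 中文分词 is matched whole instead of being cut by a fixed short window. Each position triggers at most one hash probe per distinct word length, which is in the spirit of the linear-time claim, though the sketch is not a proof of it.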