Abstract
Chinese word segmentation is a fundamental task in large-scale Chinese information processing, and both segmentation accuracy and speed are critical. In existing techniques, however, the two are difficult to reconcile. To improve segmentation speed without the loss of accuracy caused by shortening the initial string length, this paper proposes extracting variable-length strings with regular expressions and organizing the dictionary with grouped hashing. Theoretical analysis shows that this technique reduces the time complexity from O(n²) to O(n), while using the sentence length as a dynamically varying initial string length, so that long words are not lost and segmentation accuracy is preserved.
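The abstract's technique can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy lexicon, bucket layout, and forward-maximum-matching loop are assumptions. It shows the two ideas named in the abstract: a regular expression extracts maximal runs of Chinese characters, so each sentence-length run serves as the variable-length initial string, and the dictionary is grouped into hash sets keyed by word length, so a length-k candidate is checked against only the length-k bucket.

```python
import re

# Hypothetical toy dictionary; a real system would load a large lexicon.
LEXICON = {"中文", "分词", "中文分词", "正则", "表达式", "正则表达式", "技术"}

# Grouped hashing: one hash set per word length, so membership tests
# touch only the bucket for the candidate's length.
BUCKETS = {}
for w in LEXICON:
    BUCKETS.setdefault(len(w), set()).add(w)
MAX_LEN = max(BUCKETS)

# Regex extracting maximal runs of CJK characters; each run (bounded by
# punctuation, i.e. at most a sentence) is the variable-length initial
# string, so a fixed window cannot cut off a long word.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def segment(text):
    """Forward maximum matching over regex-extracted CJK runs."""
    tokens = []
    for run in CJK_RUN.findall(text):
        i = 0
        while i < len(run):
            # Try the longest candidate first, bounded by the run length.
            for k in range(min(MAX_LEN, len(run) - i), 0, -1):
                cand = run[i:i + k]
                if cand in BUCKETS.get(k, set()) or k == 1:
                    tokens.append(cand)  # fall back to a single character
                    i += k
                    break
    return tokens

print(segment("中文分词使用正则表达式技术。"))
# → ['中文分词', '使', '用', '正则表达式', '技术']
```

Because each position is resolved with at most MAX_LEN constant-time set lookups, the scan over a sentence is linear in its length, which is the O(n) behavior the abstract claims for the combined approach.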
Authors
Yang Guangbao; Yang Fenghe; Mao Guijun (Qingtian College, Zhejiang Radio & TV University, Qingtian, Zhejiang 323900, China)
Source
Computer Era (《计算机时代》), 2019, No. 4, pp. 52-55
Keywords
Chinese word segmentation
regular expression
hash
time complexity