Abstract
Chinese word segmentation is an essential prerequisite for Chinese-language search engines. To address the problem of choosing the maximum match length in the string matching used by classical mechanical (dictionary-based) segmentation, this paper proposes a Hash-based dictionary structure that keeps the maximum match length from being too long or too short. To detect ambiguity, a backtracking mechanism is introduced: each time the algorithm finishes looking up a word, it starts a new lookup that takes the last character of that word as the first character. Because backtracking multiplies the number of lookups, the paper also proposes an algorithm that tests whether the last character of a word can serve as the first character of a word, which reduces the number of lookups and the time complexity. Compared with other fusion methods, the proposed method offers faster lookup and better ambiguity handling.
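The three ideas in the abstract (a dictionary hashed on the first character, backtracking from the last character of each matched word, and a tail-character test that skips useless backtracking) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names `build_index`, `longest_match`, and `segment` are hypothetical, and storing a per-first-character maximum word length is an assumption about how the Hash dictionary bounds the match length.

```python
from collections import defaultdict


def build_index(words):
    """Group dictionary words by first character (assumed layout).

    index[c] = (set of words starting with c, length of the longest one)
    """
    groups = defaultdict(set)
    for w in words:
        if len(w) >= 2:  # single characters need no dictionary entry
            groups[w[0]].add(w)
    return {c: (ws, max(map(len, ws))) for c, ws in groups.items()}


def longest_match(text, i, index):
    """Longest dictionary word starting at text[i], else the single character."""
    entry = index.get(text[i])
    if entry is None:
        return text[i]
    words, max_len = entry
    # Try candidate lengths from the per-character maximum down to 2.
    for n in range(min(max_len, len(text) - i), 1, -1):
        if text[i:i + n] in words:
            return text[i:i + n]
    return text[i]


def segment(text, index):
    """Forward maximum matching with a tail-character backtracking check.

    Returns (tokens, ambiguities); each ambiguity is a pair
    (matched word, overlapping word starting at its last character).
    """
    tokens, ambiguities = [], []
    i = 0
    while i < len(text):
        word = longest_match(text, i, index)
        tokens.append(word)
        tail_pos = i + len(word) - 1
        # Tail-character test: backtrack only if the last character
        # can start some dictionary word.
        if len(word) > 1 and text[tail_pos] in index:
            overlap = longest_match(text, tail_pos, index)
            if len(overlap) > 1:
                ambiguities.append((word, overlap))
        i += len(word)
    return tokens, ambiguities


index = build_index(["结合", "合成", "成分", "分子"])
tokens, ambiguities = segment("结合成分子", index)
print(tokens)
print(ambiguities)
```

In this sketch the backtracking lookup runs only when the tail character is a key of the index, so words whose last character never begins a dictionary word cost no extra lookup. How the detected overlaps are then resolved is left open here, since the abstract does not specify it.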
Source
《信息技术》
2017, Issue 11, pp. 167-171 (5 pages)
Information Technology
Keywords
word segmentation
Hash dictionary
backtracking
tail character test