Abstract
A directed graph of candidate segmentation paths is built with the maximum and second-maximum (longest and second-longest) matching method, which turns Chinese word segmentation into the problem of selecting the correct path through that graph. Node costs correspond to single-word frequencies, and edge costs correspond to the adjacency frequencies of the two words that an edge connects. A modified Dijkstra minimum-cost-path algorithm then selects the path with the lowest total cost as the segmentation result. Segmentation ambiguities are filtered and resolved step by step. A mechanism driven by unknown-word feature words preprocesses unknown words, which reduces the segmentation errors they cause. Experimental results show that the method effectively improves the precision and recall of Chinese word segmentation.
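The graph formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how Dijkstra's algorithm was modified or how frequencies are mapped to costs, so the sketch assumes plain Dijkstra and negative-log costs with add-one smoothing. The dictionary `WORD_FREQ`, the adjacency counts `BIGRAM_FREQ`, and the helper names `candidates`, `node_cost`, `edge_cost`, and `segment` are all hypothetical, and ambiguity filtering and unknown-word preprocessing are left out.

```python
import heapq
import math

# Hypothetical toy statistics; the paper's dictionary and counts are not given.
WORD_FREQ = {"研究": 50, "研究生": 20, "生命": 40, "生": 8, "命": 5, "的": 200, "起源": 30}
BIGRAM_FREQ = {("研究", "生命"): 10, ("生命", "的"): 15, ("的", "起源"): 12}
MAX_WORD_LEN = 4
TOTAL = sum(WORD_FREQ.values())
VOCAB = len(WORD_FREQ)


def candidates(text, i):
    """Longest and second-longest dictionary words starting at index i."""
    longest_end = min(len(text), i + MAX_WORD_LEN)
    found = [text[i:j] for j in range(longest_end, i, -1) if text[i:j] in WORD_FREQ]
    if not found:
        found = [text[i]]  # fall back to a single character for unknown text
    return found[:2]


def node_cost(word):
    """Assumed node cost: negative log of the smoothed word frequency."""
    return -math.log((WORD_FREQ.get(word, 0) + 1) / (TOTAL + VOCAB))


def edge_cost(prev, word):
    """Assumed edge cost: negative log of the smoothed adjacency frequency."""
    pair = BIGRAM_FREQ.get((prev, word), 0)
    return -math.log((pair + 1) / (WORD_FREQ.get(prev, 0) + VOCAB))


def segment(text):
    """Return the minimum-cost segmentation path found by plain Dijkstra."""
    if not text:
        return []
    # Heap entries are (accumulated cost, start index, word, path so far).
    heap = [(node_cost(w), 0, w, (w,)) for w in candidates(text, 0)]
    heapq.heapify(heap)
    settled = set()
    while heap:
        cost, start, word, path = heapq.heappop(heap)
        if (start, word) in settled:
            continue
        settled.add((start, word))
        end = start + len(word)
        if end == len(text):
            return list(path)  # first complete path popped has minimum cost
        for nxt in candidates(text, end):
            new_cost = cost + edge_cost(word, nxt) + node_cost(nxt)
            heapq.heappush(heap, (new_cost, end, nxt, path + (nxt,)))
    return []


print(segment("研究生命的起源"))
```

With these toy counts, the ambiguous string 研究生命的起源 is segmented as 研究 / 生命 / 的 / 起源 rather than 研究生 / 命 / 的 / 起源, because the second path pays high costs for the rare word 命 and for two unseen adjacencies. All costs are non-negative, which is what allows Dijkstra's algorithm to stop at the first complete path it pops.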
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
PKU Core Journals (北大核心)
2006, No. 3, pp. 516-519 (4 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60373095).
Keywords
Chinese word segmentation
maximum and second-maximum matching
minimum cost path
segmentation ambiguity resolution
unknown-word feature words