Abstract
Automatic word segmentation (WS) is a fundamental task in Chinese information processing. To overcome the difficulties that traditional segmentation methods face with texts from specialized domains, this paper proposes a new method that requires neither a lexicon nor a training corpus. It brings the user into the segmentation process to add linguistic knowledge to the system, so that the system can adapt to different texts and segmentation standards. Using an improved suffix array algorithm, the system repeatedly extracts candidate words and passes them to the user for filtering; the resulting lexicon is then used to segment the text. Experiments on four different texts show that, without manual filtering, the segmentation F-score reaches about 72%, and that a small amount of human-computer interaction raises it by more than 12%. As the user's workload increases, the system can improve segmentation further.
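The pipeline described in the abstract (extract candidates with a suffix array, let the user filter them, segment with the resulting lexicon) can be illustrated with a minimal sketch. The abstract does not specify the paper's improved suffix array algorithm or its segmentation strategy, so everything here is an assumption: the suffix array is built naively by sorting, the function names `extract_candidates` and `segment`, the length and frequency thresholds, and the use of forward maximum matching are all hypothetical choices made for illustration.

```python
from itertools import groupby


def extract_candidates(text, min_len=2, max_len=4, min_count=2):
    """Return {substring: count} for substrings that repeat in `text`.

    Assumption: a naive suffix array (sorted suffix start positions) stands in
    for the paper's improved algorithm, which the abstract does not describe.
    """
    # Suffix array: start positions ordered by the suffix they begin.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    candidates = {}
    for n in range(min_len, max_len + 1):
        # Suffixes sharing the same length-n prefix are adjacent in the array,
        # so grouping neighbours by prefix counts each substring's occurrences.
        for prefix, group in groupby(suffix_array, key=lambda i: text[i:i + n]):
            count = sum(1 for _ in group)
            if len(prefix) == n and count >= min_count:
                candidates[prefix] = count
    return candidates


def segment(text, lexicon, max_len=4):
    """Forward maximum matching (an assumed strategy) against `lexicon`."""
    words, i = [], 0
    while i < len(text):
        # Try the longest piece first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                words.append(piece)
                i += n
                break
    return words


sample = "自动分词是基础,自动分词很重要"
candidates = extract_candidates(sample)
# In the interactive setting, a user would inspect `candidates` and keep only
# the strings that are real words; here that choice is hard-coded.
accepted = {"自动分词"}
print(sorted(candidates))
print(segment(sample, accepted))
```

The hard-coded `accepted` set stands in for the user's filtering step; in the setting the abstract describes, that step would be repeated as new candidates are extracted.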
Source
《中文信息学报》
CSCD
PKU Core (Peking University Core Journals list)
2007, No. 3, pp. 92-98 (7 pages)
Journal of Chinese Information Processing
Funding
Nanjing Normal University 211 Project grant (1240702504)
Keywords
computer application
Chinese information processing
word segmentation
unknown word recognition
unknown text
human-computer interaction