Abstract
Traditional word segmentation methods, when applied to military texts, leave many out-of-vocabulary words and produce some entries whose feature information is incomplete. To address this, the segmentation process is decomposed into several sub-processes, and military texts are segmented with word strings (lexical chunks) as the unit. First, the text is scanned in both directions against a dictionary, and ambiguous segments are marked. For segments on which the two scans agree, stop words are removed, and the mutual information and adjacent co-occurrence frequency of the entries from this first pass are computed. Based on these values, the corresponding entries are combined into word strings and marked. Finally, the marked ambiguous segments and word strings are extracted for manual review. Experimental results show that the combined word strings carry richer feature information and give better segmentation results.
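The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which dictionary scan is used, so greedy maximum matching is assumed here; ambiguity is flagged for the whole input string rather than for individual fields; stop words are treated as boundaries that adjacent pairs may not cross; and the function names and the `max_len`, `min_count`, and `min_pmi` values are hypothetical.

```python
import math
from collections import Counter


def max_match(text, lexicon, max_len=4, reverse=False):
    """Greedy dictionary matching (assumed technique), forward or backward."""
    tokens = []
    if not reverse:
        i = 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in lexicon:
                    tokens.append(piece)
                    i += size
                    break
    else:
        j = len(text)
        while j > 0:
            for size in range(min(max_len, j), 0, -1):
                piece = text[j - size:j]
                if size == 1 or piece in lexicon:
                    tokens.append(piece)
                    j -= size
                    break
        tokens.reverse()
    return tokens


def bidirectional_scan(text, lexicon, max_len=4):
    """Return both segmentations and a flag that is True when they disagree."""
    fwd = max_match(text, lexicon, max_len)
    bwd = max_match(text, lexicon, max_len, reverse=True)
    return fwd, bwd, fwd != bwd


def find_word_strings(sentences, stop_words, min_count=2, min_pmi=1.0):
    """Propose word strings from adjacent entries with high co-occurrence and PMI.

    `sentences` is a list of token lists from the first segmentation pass.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(t for t in tokens if t not in stop_words)
        for a, b in zip(tokens, tokens[1:]):
            # Assumption: a stop word breaks adjacency instead of being bridged.
            if a not in stop_words and b not in stop_words:
                bigrams[(a, b)] += 1
    total = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    candidates = {}
    for (a, b), n in bigrams.items():
        if n < min_count:
            continue
        p_ab = n / total_pairs
        p_a, p_b = unigrams[a] / total, unigrams[b] / total
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            candidates[a + b] = (n, pmi)
    return candidates


if __name__ == "__main__":
    lexicon = {"研究", "研究生", "生命", "起源", "导弹", "发射", "装置"}
    print(bidirectional_scan("研究生命起源", lexicon, max_len=3))
    first_pass = [
        ["导弹", "发射", "装置"],
        ["导弹", "发射", "的", "研究"],
        ["研究", "导弹", "发射"],
    ]
    print(find_word_strings(first_pass, stop_words={"的"}))
```

In this sketch, strings whose two scans disagree and the candidate word strings returned by `find_word_strings` would both be queued for the manual review step rather than accepted automatically.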
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2010, Issue 2, pp. 171-174 (4 pages)
Computer Science
Funding
Supported by the "Eleventh Five-Year Plan" Weapons and Equipment Advance Research Project (513300102)
Keywords
Military
Text
Word segmentation
Words