An Auto-labeling Model of Letter-word Phrases (一种字母词语自动标注算法) Cited by: 2
Abstract: Automatic word segmentation is the foundation of Chinese information processing, and the recognition of unknown (out-of-vocabulary) words is the main factor limiting the precision of segmentation systems. Letter-word phrases are one class of unknown words, and the author's tests show that existing segmentation software still cannot identify them reliably. This paper therefore designs an automatic labeling algorithm that combines rules with statistics. The algorithm first scans the source text and extracts valid letter strings using a letter-string regular expression. Taking each letter string as an anchor, it then scans in both directions, applying in turn the front and back boundary rules, the Chinese-component rules, and the exception-correction rules, together with a collocation probability matrix (called the collocation coefficient matrix in the original English abstract), to identify and label the letter-word phrase. Experiments show that the algorithm achieves a recall of 100% and a precision of about 92%. The algorithm benefits Chinese automatic word segmentation, and the software developed from it can also be used to build a letter-word phrase knowledge base and to study letter-word phrases as a linguistic phenomenon.
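Author: Zheng Zezhi (郑泽芝)
Source: Journal of Xiamen University: Natural Science (《厦门大学学报(自然科学版)》), 2007, No. 5, pp. 630-634 (5 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Language Resources Monitoring and Research Center project (04L2004-01-01-03); Fujian Provincial Social Science Fund (2006B086); Xiamen University Research Start-up Fund.
Keywords: letter-word phrase; collocation coefficient; auto-labeling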
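
The anchor-and-expand procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the regular expression, the rule sets, or the collocation matrix, so `LETTER_STRING`, `BOUNDARY_CHARS`, `RIGHT_SCORES`, `LEFT_SCORES`, the 0.5 threshold, and the bracket labeling format are all hypothetical placeholders. The exception-correction step is omitted.

```python
import re

# Assumption: a letter string is a run of Latin letters (the paper's pattern is not given).
LETTER_STRING = re.compile(r"[A-Za-z]+")

# Assumption: characters that a letter-word phrase may not extend across.
BOUNDARY_CHARS = set("的了在是和与,。、;:!?() ")

# Assumption: toy collocation scores standing in for the paper's matrix.
RIGHT_SCORES = {("X", "光"): 0.9, ("IC", "卡"): 0.9}      # (letters, following Chinese)
LEFT_SCORES = {("卡拉", "OK"): 0.9, ("维生素", "C"): 0.85}  # (preceding Chinese, letters)


def crosses_boundary(segment):
    """Return True if the candidate segment contains a boundary character."""
    return any(ch in BOUNDARY_CHARS for ch in segment)


def expand(text, start, end, max_len=3, threshold=0.5):
    """Grow the span [start, end) around a letter string, longest candidate first."""
    letters = text[start:end]
    new_start, new_end = start, end
    # Scan right of the anchor.
    for n in range(max_len, 0, -1):
        cand = text[end:end + n]
        if len(cand) == n and not crosses_boundary(cand):
            if RIGHT_SCORES.get((letters, cand), 0.0) >= threshold:
                new_end = end + n
                break
    # Scan left of the anchor.
    for n in range(max_len, 0, -1):
        if start - n < 0:
            continue
        cand = text[start - n:start]
        if not crosses_boundary(cand):
            if LEFT_SCORES.get((cand, letters), 0.0) >= threshold:
                new_start = start - n
                break
    return new_start, new_end


def label(text):
    """Wrap each recognized letter-word phrase in square brackets."""
    out, pos = [], 0
    for m in LETTER_STRING.finditer(text):
        if m.start() < pos:  # already covered by the previous phrase
            continue
        s, e = expand(text, m.start(), m.end())
        s = max(s, pos)      # do not reach back into an earlier phrase
        out.append(text[pos:s])
        out.append("[" + text[s:e] + "]")
        pos = e
    out.append(text[pos:])
    return "".join(out)
```

With these toy tables, `label("他做了X光检查,又去唱卡拉OK。")` returns `他做了[X光]检查,又去唱[卡拉OK]。`. Because every matched letter string is labeled even when no neighbor qualifies, a sketch of this shape favors recall over precision, which is at least consistent with the 100% recall and roughly 92% precision reported in the abstract.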