Abstract
Identifying new (out-of-vocabulary, OOV) words is a difficult problem in Chinese information processing and a foundational research topic for natural language processing, information retrieval, and machine translation. Based on the characteristics of new words, this paper proposes a domain-independent concept extraction method that combines probabilistic and statistical techniques with rules. The combined rule-and-statistics method is compared with a purely statistical new word identification method, and weight setting is used to integrate the two schemes effectively.
Source
《郑州大学学报(理学版)》
CAS
2008, No. 3, pp. 67-71 (5 pages)
Journal of Zhengzhou University: Natural Science Edition
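Funding
Natural Science Foundation of Jiangsu Province, grant no. BK2006539
Natural Science Basic Research Project of Jiangsu Higher Education Institutions, grant no. 06KJB520095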
Keywords
new word detection
average mutual information (MI)
frequency ratio
weight setting