Abstract
This paper proposes a method for identifying new words in a large-scale corpus. Starting from repeated-string statistics, it analyzes both the external context and the internal composition of each candidate string. Three linguistic features are checked in turn: the variety of adjacent contexts (context variety), the probability that the first and last characters occupy those positions inside a word (inside word probability), and the coupling degree of two-character pairs (double character coupling). Strings that fail a check are filtered out, and the remaining strings are taken as new words. Experiments on corpora of different sizes show that the method is feasible and effective, and that it can be applied to areas such as dictionary compilation and term extraction.
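The first two stages named in the abstract, repeated-string statistics followed by a context-variety filter, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the length limits, and the thresholds (`min_count`, `min_variety`) are hypothetical, and the abstract does not give the paper's formulas. The inside word probability and double character coupling filters are left out because they need statistics from a dictionary or a segmented corpus that the abstract does not specify.

```python
from collections import Counter


def repeated_strings(text, min_len=2, max_len=4, min_count=2):
    """Count all substrings of length min_len..max_len; keep the repeated ones.

    The length limits and min_count are illustrative assumptions.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def context_variety(text, s):
    """Return (left, right): the number of distinct characters adjacent to s."""
    left, right = set(), set()
    start = text.find(s)
    while start != -1:
        if start > 0:
            left.add(text[start - 1])
        end = start + len(s)
        if end < len(text):
            right.add(text[end])
        # Advance by one character so overlapping occurrences are also seen.
        start = text.find(s, start + 1)
    return len(left), len(right)


def filter_candidates(text, min_variety=2, **kwargs):
    """Keep repeated strings whose left and right context variety both reach
    min_variety (an assumed threshold). Extra keyword arguments are passed
    on to repeated_strings."""
    return sorted(
        s for s in repeated_strings(text, **kwargs)
        if min(context_variety(text, s)) >= min_variety
    )
```

A string that always appears with the same neighbour on one side is likely a fragment of a longer unit, which is why the filter uses the smaller of the two variety counts. Candidates that survive this stage would still need the two remaining filters described in the abstract.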
Source
《计算机工程与应用》
CSCD
Peking University Core Journal (北大核心)
2007, Issue 21, pp. 157-159 (3 pages)
Computer Engineering and Applications
Funding
National Grand Fundamental Research 973 Program of China (Grant No. 2004CB318109)
Knowledge Innovation Program of the Chinese Academy of Sciences (No. 20056550)
Keywords
new words
context variety
inside word probability
double character coupling