Abstract
This paper proposes a new word detection method based on a bigram background model. The method first selects candidate bigrams from the target foreground corpus by the likelihood ratio between the foreground and background bigram models. It then filters and extends these bigrams into new words, fixing the word boundaries according to statistics computed on the foreground corpus, including frequency, rigidity, and conditional probability. Words extracted this way are new relative to the background knowledge and also form valid words. The method makes full use of existing raw background corpora without needing annotations such as word segmentation, and it does not depend on word lists, word segmentation models, or rules, which makes it easy to extend. Strategies for choosing the threshold of each statistic and for removing noise bigrams are also discussed. The method was validated on a corpus of online novels.
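The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the add-alpha smoothing, the function names (`bigram_log_ratio`, `select_candidates`, `extend_right`), the threshold values, and the toy strings are all assumptions made for illustration, and the rigidity statistic is omitted because the abstract does not define it.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return all adjacent character pairs of raw, unsegmented text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_log_ratio(foreground, background, alpha=1.0):
    """Score each foreground bigram by log(P_fg / P_bg).

    Add-alpha smoothing is an assumption of this sketch; the abstract
    does not say how unseen background bigrams are handled.
    """
    fg = Counter(char_bigrams(foreground))
    bg = Counter(char_bigrams(background))
    vocab = len(set(fg) | set(bg))
    fg_total = sum(fg.values()) + alpha * vocab
    bg_total = sum(bg.values()) + alpha * vocab
    scores = {}
    for gram, count in fg.items():
        p_fg = (count + alpha) / fg_total
        p_bg = (bg[gram] + alpha) / bg_total
        scores[gram] = math.log(p_fg / p_bg)
    return fg, scores


def select_candidates(foreground, background, min_freq=3, min_ratio=0.0):
    """Keep bigrams that are frequent in the foreground and unusually
    likely there compared with the background (illustrative thresholds)."""
    fg, scores = bigram_log_ratio(foreground, background)
    kept = [g for g, s in scores.items() if fg[g] >= min_freq and s > min_ratio]
    return sorted(kept, key=lambda g: scores[g], reverse=True)


def extend_right(candidate, text, min_cond_prob=0.8, max_len=8):
    """Grow a candidate to the right while a single next character follows
    it with conditional probability >= min_cond_prob in the foreground."""
    word = candidate
    while len(word) < max_len:
        n = len(word)
        # Occurrences of the current string, including one at the very end.
        base = sum(1 for i in range(len(text) - n + 1) if text.startswith(word, i))
        # Characters that directly follow each occurrence.
        followers = Counter(
            text[i + n] for i in range(len(text) - n) if text.startswith(word, i)
        )
        if base == 0 or not followers:
            break
        char, count = followers.most_common(1)[0]
        if count / base < min_cond_prob:
            break  # no dominant continuation: treat this as the word boundary
        word += char
    return word


# Toy data: Latin letters without spaces stand in for unsegmented text.
background = "thecatsatonthemat" * 5
foreground = "thezorgsatonthezorgmatzorgcat"

candidates = select_candidates(foreground, background)
new_word = extend_right("zo", foreground)
```

On this toy input the string "zorg" never occurs in the background, so its bigrams pass the ratio and frequency filters, and the extension step stops once the following character becomes unpredictable. A symmetric leftward extension would be needed to recover the same word from a bigram in its middle or at its end.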
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2011, No. 9, pp. 1317-1320 (4 pages)
Journal of Tsinghua University (Science and Technology)
Keywords
new word detection
bigram
background model
likelihood ratio