Abstract
To identify new Chinese words in microblog corpora effectively, a new word discovery method based on a word mutual information model and external statistics is proposed, tailored to the textual characteristics of microblog text. A candidate word set is built with a mutual information statistical model that starts from the minimum collocation unit inside a candidate word and extends the statistics to its right neighbors. The candidates are then screened with a low-frequency threshold chosen from the statistical properties and corpus features, and filtered using external statistics. This statistical approach removes the limitation that a mutual information model can measure only two constituent elements when used for new word discovery, and it avoids the N-gram overlap problem that degrades the accuracy of new word discovery. The filtering method works well on microblog corpora, which contain a large number of short phrases and sentences. Examples and comparative experiments verify the effectiveness of the method.
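The pipeline summarized in the abstract (rightward extension scored by mutual information, a low-frequency threshold, then an external-statistics filter) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas or thresholds, so everything concrete here is an assumption: pointwise mutual information stands in for the mutual information score, the number of distinct left and right neighbors stands in for the external statistic, and the names `discover_new_words`, `pmi_min`, `freq_min`, `variety_min`, and `BOUNDARY` are hypothetical. The sketch also records every passing prefix from every start position, so it does not reproduce the paper's handling of N-gram overlap.

```python
import math
from collections import Counter, defaultdict

BOUNDARY = "#"  # hypothetical marker for a sentence start or end


def ngram_counts(sentences, max_len):
    """Count every character n-gram of length 1..max_len."""
    counts = Counter()
    for s in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


def pmi(counts, total, left, right):
    """Pointwise mutual information (bits) of `left` followed directly by `right`."""
    joint = counts[left + right]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * total / (counts[left] * counts[right]))


def discover_new_words(sentences, max_len=4, pmi_min=1.0, freq_min=2, variety_min=2):
    counts = ngram_counts(sentences, max_len)
    total = sum(len(s) for s in sentences)

    # Step 1 (assumed form): grow a unit to the right one character at a
    # time while the unit and its right neighbor stay strongly associated.
    candidates = set()
    for s in sentences:
        for i in range(len(s) - 1):
            j = i + 1
            while (j < len(s) and j - i < max_len
                   and pmi(counts, total, s[i:j], s[j]) >= pmi_min):
                j += 1
                candidates.add(s[i:j])

    # Step 2: low-frequency filter.
    candidates = {w for w in candidates if counts[w] >= freq_min}

    # Step 3 (assumed external statistic): a word should occur with several
    # distinct neighbors on both sides; sentence edges count as a neighbor.
    left, right = defaultdict(set), defaultdict(set)
    for s in sentences:
        for w in candidates:
            start = s.find(w)
            while start != -1:
                end = start + len(w)
                left[w].add(s[start - 1] if start > 0 else BOUNDARY)
                right[w].add(s[end] if end < len(s) else BOUNDARY)
                start = s.find(w, start + 1)
    return {w for w in candidates
            if min(len(left[w]), len(right[w])) >= variety_min}


# Toy corpus of short microblog-style sentences (made up for illustration).
corpus = ["今天真给力", "这个很给力啊", "给力的表现", "太给力了"]
print(discover_new_words(corpus))  # {'给力'}
```

On this toy corpus, strings such as 今天 score a high pointwise mutual information only because each of their characters occurs once, which is why the low-frequency threshold is applied before the external filter.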
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal (北大核心)
2017, No. 3, pp. 789-794 (6 pages)
Funding
National Natural Science Foundation of China (60743008)
Henan Province Key Science and Technology Research Program (142102210045)
Keywords
new word discovery
microblog corpus
mutual information
word internal coupling
external statistic