Abstract
Automatic word segmentation is the foundation of Chinese information processing. A major difficulty for segmentation at present is the automatic detection of new words, especially new words that are not proper nouns. New word detection is also very important for compiling Chinese dictionaries. This paper proposes a new method for automatic new word detection that uses improved forms of two measures, mutual information and the log-likelihood ratio. The method has three phases: first, download a large body of text from the web and build a corpus; then identify multi-word units by statistical means; finally, compare the identified units against an existing word list to decide which are new words. Experiments on a real corpus show that the proposed method is more efficient and robust.
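The three phases can be illustrated with a minimal sketch. The abstract does not specify the improved forms of the two measures, so the code below assumes the textbook definitions instead: pointwise mutual information and the log-likelihood ratio (G² statistic) over a 2x2 contingency table of adjacent characters. The function names, the restriction to two-character candidates, and the default thresholds are all assumptions made for illustration, not the paper's settings.

```python
import math
from collections import Counter


def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information (base 2) of an adjacent pair.

    c_xy: count of the pair; c_x: count of x in first position;
    c_y: count of y in second position; n: total number of pairs.
    """
    return math.log2(c_xy * n / (c_x * c_y))


def llr(c_xy, c_x, c_y, n):
    """Log-likelihood ratio (G^2) from the pair's 2x2 contingency table."""
    k11 = c_xy                    # x followed by y
    k12 = c_x - c_xy              # x followed by something else
    k21 = c_y - c_xy              # something else followed by y
    k22 = n - c_x - c_y + c_xy    # neither
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:  # empty cells contribute nothing to the sum
            g2 += observed * math.log(observed * n / (row * col))
    return 2.0 * g2


def candidate_new_words(sentences, known_words,
                        min_count=2, min_pmi=1.0, min_llr=3.84):
    """Return two-character strings that pass both tests and are unlisted.

    The thresholds are illustrative defaults (3.84 is the 95% chi-square
    critical value for one degree of freedom), not values from the paper.
    """
    pairs = Counter()
    for sentence in sentences:
        pairs.update(zip(sentence, sentence[1:]))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (x, y), count in pairs.items():
        left[x] += count
        right[y] += count
    found = []
    for (x, y), count in pairs.items():
        word = x + y
        if count < min_count or word in known_words:
            continue
        if (pmi(count, left[x], right[y], n) >= min_pmi
                and llr(count, left[x], right[y], n) >= min_llr):
            found.append(word)
    return sorted(found)
```

In this sketch, `sentences` stands in for the corpus built in the first phase and `known_words` for the existing word list used in the third; extending the candidates beyond two characters would need counts over longer n-grams.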
Source
《计算机应用》
CSCD
Peking University Core Journals
2004, No. 7, pp. 132-134 (3 pages)
Journal of Computer Applications
Funding
Supported by the Natural Science Foundation of Hubei Province (2001ABB012)
Keywords
multi-word unit extraction
page parsing
dynamic corpus