
Automatic New Words Detection Based on Corpus and Web (基于语料库和网络的新词自动识别)

Cited by: 11
Abstract: Automatic Chinese word segmentation is the foundation of Chinese information processing. One of the main obstacles it currently faces is the automatic detection of new words, especially new words that are not proper nouns. New word detection also matters a great deal for compiling Chinese dictionaries. This paper proposes a new detection method that uses improved forms of two statistics: mutual information and the log-likelihood ratio. The method runs in three phases: first, download a large volume of web documents and build a corpus; next, identify multi-character words (multi-word units) by statistical means; finally, compare the candidates against an existing word list to decide which are new words. Experiments on a real corpus show that the proposed method is more efficient and robust.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2004, No. 7, pp. 132-134 (3 pages)
Funding: Natural Science Foundation of Hubei Province (2001ABB012)
Keywords: multi-word unit extraction; page parsing; dynamic corpus
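The second and third phases described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: it uses the standard pointwise mutual information and Dunning's log-likelihood ratio rather than the paper's improved forms (which the abstract does not specify), it considers only two-character candidates, and the function names `llr` and `detect_new_words` and all thresholds are hypothetical. The web download and page parsing phase is omitted; the corpus is assumed to be a list of plain-text sentences.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = 0.0
    for i, row in enumerate(((k11, k12), (k21, k22))):
        for j, k in enumerate(row):
            if k > 0:  # empty cells contribute nothing to the sum
                total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def detect_new_words(corpus, known_words, min_pmi=1.0, min_llr=3.84, min_count=2):
    """Return two-character candidates that pass both tests and are not in known_words.

    The thresholds are illustrative defaults, not values from the paper.
    """
    left, right, pairs = Counter(), Counter(), Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            left[a] += 1       # times a occurs as the first character of a pair
            right[b] += 1      # times b occurs as the second character of a pair
            pairs[a + b] += 1
    n = sum(pairs.values())
    found = []
    for pair, k11 in pairs.items():
        # Phase 3 (assumed form): discard anything already in the word list.
        if k11 < min_count or pair in known_words:
            continue
        a, b = pair
        k12 = left[a] - k11            # a followed by something other than b
        k21 = right[b] - k11           # b preceded by something other than a
        k22 = n - k11 - k12 - k21      # pairs involving neither
        pmi = math.log2(k11 * n / (left[a] * right[b]))
        if pmi >= min_pmi and llr(k11, k12, k21, k22) >= min_llr:
            found.append(pair)
    return sorted(found)


corpus = ["博客很好", "博客有用", "博客不错", "我写博客"]
print(detect_new_words(corpus, known_words={"很好", "有用", "不错"}))
# prints ['博客']
```

Requiring both statistics to pass is one plausible way to combine them; the abstract does not say how the paper combines its two parameters, so this conjunction is also an assumption.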

