
Chinese New Word Detection Based on One Character Word and Search Engine (基于单字特征和搜索引擎的新词识别)

Cited by: 2
Abstract: New word recognition is an important factor affecting the accuracy and speed of search. This paper proposes an automatic Chinese new word recognition method based on a statistical model and word collocation. Conditional probability is used to extract single-character word collocation features and boundary word features, and a hierarchical structure is used to locate and recognize new words. The text is first coarsely segmented with bidirectional maximum matching. Candidate new word positions are then obtained from single-character word collocations, and the boundary word method determines the boundaries of each candidate. An improved Nagao string frequency statistics method counts repeated strings for the candidates within the text, and candidates that occur only once in the text are confirmed with the help of a search engine. Tests on recent web articles from Sina (新浪网) show that a system built on this method can recognize new words from different fields and performs well on low-frequency words, longer words, and new word collocations. Locating new words by checking single-character word collocations reaches an overall F-measure of 96.8%.
Source: Journal of Wuhan University: Natural Science Edition (《武汉大学学报(理学版)》), indexed in CAS, CSCD, and the Peking University Core list (北大核心), 2010, No. 6, pp. 704-710 (7 pages)
Keywords: new word recognition; one character word; boundary word; collocation extraction; search engine
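The first two stages named in the abstract (coarse segmentation by bidirectional maximum matching, then looking at single-character tokens to locate candidate new words) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's implementation: the function names, the toy lexicon, the tie-breaking rule between the two matching directions, and the use of plain runs of single characters in place of the paper's conditional-probability collocation features. The repeated-string counter is a naive n-gram count, not the improved Nagao method, and the search-engine check for one-off candidates is omitted.

```python
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest match; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left longest match; unknown characters become single tokens."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def coarse_segment(text, lexicon, max_len=4):
    """Run both directions and keep one result.

    Assumed heuristic (not from the paper): prefer fewer tokens, then fewer
    single-character tokens, and fall back to the backward result on a tie.
    """
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)

    def cost(tokens):
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return fwd if cost(fwd) < cost(bwd) else bwd


def single_char_runs(tokens, min_run=2):
    """Return (start, end) token index ranges covering runs of single-character tokens."""
    runs, start = [], None
    for idx, tok in enumerate(tokens + [""]):  # empty sentinel closes a trailing run
        if len(tok) == 1:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= min_run:
                runs.append((start, idx))
            start = None
    return runs


def repeated_strings(text, min_len=2, max_len=4, min_count=2):
    """Naive substring frequency count, standing in for Nagao-style statistics."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


# Toy lexicon that does not contain the word to be discovered.
toy_lexicon = {"我们", "喜欢"}
toy_tokens = coarse_segment("我们喜欢微博", toy_lexicon)
toy_candidates = ["".join(toy_tokens[a:b]) for a, b in single_char_runs(toy_tokens)]
```

In this toy run the out-of-lexicon string is left as adjacent single characters by the coarse segmentation, which is the signal the candidate-location step picks up. A fuller system would score those positions with collocation probabilities and trim them with boundary words before counting repeats.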

