A Chinese Term Extraction System Based on Multi-Strategies Integration (基于多策略融合的中文术语抽取方法)

Cited by: 28
Abstract: Chinese term extraction is a key technology in information processing tasks such as information extraction, text mining, and knowledge acquisition. Multi-word (phrasal) terms are harder to identify than single-word terms: phrases bring in many non-noun words, which produce more kinds of noise. An extractor must therefore judge whether a phrase is structurally complete, whether the words inside it collocate plausibly, and how much domain information the phrase carries. This paper takes these three problems as its starting point and addresses them with substring reduction, a collocation test, and domain relevance (termhood) computation, respectively, drawing on both the internal structure of multi-word terms and their distribution in the corpus. Experiments on computer-domain corpora show that the method raises the ranking of low-frequency terms and base terms, and thereby improves the precision of Chinese multi-word term extraction.
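The abstract names the three strategies but gives no formulas, so the following is only a minimal sketch of two of them under stated assumptions, not the authors' implementation. It assumes that substring reduction drops a candidate whose frequency is (nearly) matched by a longer candidate containing it, and that domain relevance can be approximated by a smoothed ratio of relative frequencies in a domain corpus and a general corpus. The function names, the `min_gap` parameter, the add-one smoothing, and the sample counts are all hypothetical.

```python
def substring_reduction(freq, min_gap=0):
    """Drop a candidate when a longer candidate contains it and occurs
    almost as often, i.e. the shorter string rarely stands alone.
    (Assumed criterion; the paper's exact rule is not given in the abstract.)"""
    kept = {}
    for cand, f in freq.items():
        absorbed = any(
            cand != other and cand in other and f - other_f <= min_gap
            for other, other_f in freq.items()
        )
        if not absorbed:
            kept[cand] = f
    return kept


def domain_relevance(cand, domain_freq, general_freq, domain_size, general_size):
    """Assumed termhood score: ratio of add-one-smoothed relative
    frequencies in the domain corpus and in a general corpus."""
    p_domain = (domain_freq.get(cand, 0) + 1) / (domain_size + 1)
    p_general = (general_freq.get(cand, 0) + 1) / (general_size + 1)
    return p_domain / p_general


# Hypothetical candidate counts from a computer-domain corpus.
freq = {"操作系统": 12, "操作系": 12, "系统": 40}
kept = substring_reduction(freq)
# "操作系" only ever appears inside "操作系统", so it is dropped;
# "系统" occurs far more often on its own, so it survives.

score = domain_relevance("操作系统", {"操作系统": 12}, {"操作系统": 1}, 1000, 1000)
```

A score above 1 means the candidate is relatively more frequent in the domain corpus than in the general one. A quadratic scan like this is only suitable for small candidate sets; hash-based substring reduction is the usual choice at scale.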
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 2010, No. 3, pp. 460-467 (8 pages)
Funding: National High-Tech Research and Development Program of China (863 Program), grant 2006AA01Z152; National Natural Science Foundation of China, grant 60672149
Keywords: Chinese term extraction; linguistic rule acquisition; substring reduction; collocation test; word activity degree; domain relevance degree