
English Term Extraction Based on Context Analysis & Statistical Characteristic (上下文分析与统计特征相结合的英文术语抽取研究) Cited by: 1
Abstract: This article introduces the basic features of terms and discusses methods for automatically identifying scientific and technical terms. It then proposes V-value, which improves two mainstream statistical indicators, TF-IDF and C-value, according to text characteristics. To reflect how a word's position affects its contribution to document content, candidate terms in different positions are assigned different weights. Finally, a scientific term extraction system combining statistics and rules is designed and implemented. The system identifies terms by jointly computing position weight, C-value, and TF-IDF, which improves extraction precision.
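Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, PKU Core, 2010, No. 12, pp. 28-33 (6 pages)
Funding: One of the research outcomes of the project "Network Scientific and Technical Information Monitoring and Evaluation" (Project No. 2006BAH03B05) under the National Science and Technology Support Program of the 11th Five-Year Plan
Keywords: Term extraction; Multi-word term recognition; Weighted TF-IDF; C-value computing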
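
The abstract names three ingredients: position weights, TF-IDF, and C-value. As a rough illustration only, the sketch below shows one way they could be computed and combined. The position weights, the linear combination in `combined_score`, and all function names are assumptions made for this example; the abstract does not give the paper's actual weights or formula. The C-value part follows the published definition in Frantzi et al. (2000), which scores multi-word candidates only.

```python
import math

# Hypothetical position weights; the abstract does not state the paper's values.
POSITION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def weighted_tf(occurrences):
    """Sum position weights over a candidate's occurrences in one document.

    `occurrences` is a list of position labels, one per occurrence.
    """
    return sum(POSITION_WEIGHTS[pos] for pos in occurrences)


def tf_idf(tf, doc_freq, n_docs):
    """Classic TF-IDF: tf * log(N / df)."""
    return tf * math.log(n_docs / doc_freq)


def is_nested(short, long):
    """True if the word tuple `short` occurs contiguously inside the longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(freqs):
    """C-value (Frantzi et al., 2000) for multi-word candidates.

    `freqs` maps each candidate (a tuple of words) to its corpus frequency.
    A candidate nested in longer candidates is penalised by the mean
    frequency of those longer candidates.
    """
    scores = {}
    for term, f in freqs.items():
        longer = [f_b for b, f_b in freqs.items() if is_nested(term, b)]
        if longer:
            f = f - sum(longer) / len(longer)
        scores[term] = math.log2(len(term)) * f
    return scores


def combined_score(tfidf_score, cval_score, alpha=0.5):
    """Assumed linear combination; not the formula used in the paper."""
    return alpha * tfidf_score + (1 - alpha) * cval_score


freqs = {("soft", "contact", "lens"): 5, ("contact", "lens"): 8}
cvals = c_value(freqs)
wtf = weighted_tf(["title", "body", "body"])
score = combined_score(tf_idf(wtf, doc_freq=1, n_docs=10), cvals[("contact", "lens")])
```

Here "contact lens" occurs 8 times, 5 of them inside "soft contact lens", so its C-value counts only the 3 occurrences that are not explained by the longer candidate. Because the original definition uses log2 of the term length, single-word candidates score zero; a system that also ranks single words would need a variant of the length factor.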

References (9)

  • 1. Krauthammer M, Nenadic G. Term Identification in the Biomedical Literature[J]. Journal of Biomedical Informatics, 2004, 37(6): 512-526.
  • 2. Frantzi K T, Ananiadou S, Tsujii J. The C-value/NC-value Method of Automatic Recognition for Multi-word Terms[C]. In: Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries. 1998: 585-604.
  • 3. Terminology[EB/OL]. [2010-05-29]. http://en.wikipedia.org/wiki/Term_(language.
  • 4. Baidu Baike: Terminology (百度百科-术语)[EB/OL]. [2010-05-29]. http://baike.baidu.com/view/168249.htm?fr=ala1-1.
  • 5. Ha L Q, Sicilia-Garcia E I, Ming J, et al. Extension of Zipf's Law to Word and Character N-grams for English and Chinese[J]. Computational Linguistics and Chinese Language Processing, 2003, 8(1): 77-102.
  • 6. 张玉芳, 陈小莉, 熊忠阳. Research on a Feature-Word Weight Adjustment Algorithm Based on Information Gain[J]. Computer Engineering and Applications (计算机工程与应用), 2007, 43(35): 159-161. Cited by: 33
  • 7. Frantzi K, Ananiadou S, Mima H. Automatic Recognition of Multi-Word Terms: The C-value/NC-value Method[J]. International Journal on Digital Libraries, 2000, 3(2): 115-130.
  • 8. 陈琦, 伍朝辉, 姚芳, 宋秀荣, 张付志. An Improved Feature Selection Algorithm for Spam Filtering Based on TF*IDF[J]. Application Research of Computers (计算机应用研究), 2009, 26(6): 2165-2167. Cited by: 6
  • 9. Sebastiani F. Machine Learning in Automated Text Categorization[J]. ACM Computing Surveys, 2002, 34(1): 1-47.

