
Terminology Extraction Based on Statistical Word Frequency Distribution Variety (cited by 27)
Abstract: This paper proposes a term extraction method that combines rules with statistics to extract phrase-type terms made up of more than one word. Most existing statistical methods concentrate on measuring the structural completeness of a term (its unithood), but they do not capture how strongly the term is tied to a specialist domain (its termhood). Based on observation of how terms are distributed across documents, the paper proposes testing a term's domain relevance with statistics on how much its frequency distribution varies over the corpus. Combined with linguistic knowledge acquired by machine learning, the method extracts multi-word terms with clear domain features from a computer-domain corpus. Experiments show that the method is particularly good at telling low-frequency terms apart from high-frequency common word strings.
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2009, No. 5, pp. 177-180 (4 pages)
Funding: National High Technology Research and Development Program of China (863 Program) (2006AA01Z152); National Natural Science Foundation of China (60672149)
Keywords: terminology extraction; machine learning; distribution variance; knowledge acquisition; termhood; unithood
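
The abstract describes scoring a candidate by how unevenly its frequency is spread over the documents of a corpus, and the keywords name "distribution variance". The record does not give the paper's formula, so the sketch below is only one plausible reading: it takes the population variance of a candidate's per-document relative frequency. The function name `distribution_variance`, the length normalisation, and the toy documents are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import pvariance


def distribution_variance(candidate, documents):
    """Population variance of a candidate's relative frequency across documents.

    ASSUMPTION: the paper's exact statistic is not given in the abstract;
    this is an illustrative stand-in. `candidate` is a tuple of tokens and
    `documents` is a list of token lists.
    """
    n = len(candidate)
    rel_freqs = []
    for tokens in documents:
        windows = max(len(tokens) - n + 1, 0)
        # Count positions where the candidate word sequence occurs.
        hits = sum(
            1 for i in range(windows) if tuple(tokens[i:i + n]) == candidate
        )
        # Normalise by document length so long documents do not dominate.
        rel_freqs.append(hits / len(tokens) if tokens else 0.0)
    return pvariance(rel_freqs)


# Toy corpus (hypothetical): the domain term is concentrated in one document,
# while the common word string is spread over all of them.
docs = [
    "the operating system schedules each process and the operating system "
    "manages memory for example".split(),
    "for example the weather was mild today".split(),
    "the report lists costs for example travel".split(),
]

term_var = distribution_variance(("operating", "system"), docs)
common_var = distribution_variance(("for", "example"), docs)
print(term_var > common_var)
```

Under this reading, a candidate concentrated in a few documents gets a higher variance than a word string spread evenly over the corpus, which matches the abstract's claim that the statistic separates domain terms from common word strings even when the term's raw frequency is low.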
