Abstract
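Chinese term extraction is a key technology in information processing tasks such as information extraction, text mining, and knowledge acquisition. Compared with single-word terms, multi-word (phrasal) terms are more complex to identify. Because phrases admit many non-nominal words, they introduce more kinds of noise: one must judge not only whether the phrase structure is complete, but also whether the words inside the phrase collocate plausibly and how much domain information the phrase carries. Taking these three problems in multi-word term extraction as its starting point, this paper addresses them with substring reduction, collocation testing, and domain relevance computation respectively, analyzing both the structural features of multi-word terms and their distributional features in the corpus in order to improve multi-word term extraction. Experiments confirm that the method effectively raises the ranking positions of low-frequency terms and base terms, thereby improving the performance of a Chinese multi-word term extraction system.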
Term extraction is one of the primary technical challenges in many information processing tasks, such as information extraction, text mining, and knowledge acquisition. Compared with single-word terms, multi-word terms face much more noise, which is introduced by the non-noun words that appear in phrases. Besides structural integrity, collocation and domain relevance are the main problems that complicate term extraction. To solve these problems, three strategies (substring reduction, collocation testing, and termhood computation) are proposed to improve multi-word term extraction. In experiments on computer-domain corpora, low-frequency terms and base terms are ranked higher, and consequently Chinese multi-word term extraction achieves better precision.
Source
《情报学报》
CSSCI
PKU Core Journal (北大核心)
2010, No. 3, pp. 460-467 (8 pages)
Journal of the China Society for Scientific and Technical Information
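Funding
National High Technology Research and Development Program of China (863 Program), grant 2006AA01Z152
National Natural Science Foundation of China, grant 60672149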
Keywords
Chinese term extraction
linguistic rule acquisition
substring reduction
collocation test
word activity degree
domain relevance degree