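Abstract
Term extraction automatically extracts domain-specific terms from unstructured text. It plays an important role in Chinese word segmentation, information extraction, and knowledge base construction. Current term extraction methods depend heavily on word statistics; because terms in basic education subjects have a pronounced long-tail distribution, statistics-based methods struggle to extract the terms in the tail. Drawing on the characteristics of basic education subjects, this paper proposes DRTE, a term extraction method that mines term definitions and term relations and combines word-formation rules with boundary detection. Using junior and senior high school mathematics textbooks as the data source, experiments show that the method reaches an F1 score of 82.7%, an improvement of 40.8% over current methods, and can effectively automate term extraction in the Chinese basic education domain.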
Term extraction is an essential task in which domain-specific terms are extracted automatically from unstructured text. Previous methods rely largely on statistical information about terms. However, terms in the K-12 education domain show a strong long-tail effect, which makes it hard for statistics-based methods to extract terms in the tail. In this paper, we propose DRTE, a method that focuses on extracting terms from their definitions and relations. Our method also uses term-formation rules and boundary detection strategies. Experiments on middle school and high school math textbooks show that our method achieves an F1 score of 82.7%, significantly outperforming the current method by 40.8%.
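The idea of extracting terms from their definitions can be illustrated with a minimal sketch. This is not the DRTE implementation: the cue words (叫做, 称为, 简称), the regular expression, and the function name `extract_defined_terms` are all assumptions chosen for illustration, and stopping at the next punctuation mark is only a crude stand-in for boundary detection.

```python
import re

# Hypothetical definition cues ("is called", "is known as", "abbreviated as").
# The abstract does not list the actual patterns used by DRTE.
CUES = "叫做|称为|简称"

# After a cue, capture 2 to 10 Chinese characters, stopping at the first
# punctuation mark or end of text (a crude boundary rule, assumed here).
PATTERN = re.compile(
    r"(?:" + CUES + r")([\u4e00-\u9fff]{2,10}?)(?=[,。;、,.;]|$)"
)


def extract_defined_terms(text):
    """Return candidate terms that follow a definition cue in the text."""
    return [match.group(1) for match in PATTERN.finditer(text)]


if __name__ == "__main__":
    sentence = "有一个角是直角的平行四边形叫做矩形,也称为长方形。"
    print(extract_defined_terms(sentence))
```

Because a pattern like this fires on a single defining sentence, it does not need a term to occur frequently, which is why definition-based cues are a plausible fit for long-tail terms.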
Authors
李思良
许斌
杨玉基
LI Siliang; XU Bin; YANG Yuji (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal
2018, Issue 3, pp. 101-109 (9 pages)
Journal of Chinese Information Processing
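Funding
National High Technology Research and Development Program of China (863 Program), Ministry of Science and Technology (2015AA015401)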
Keywords
术语抽取 (term extraction)
术语定义 (term definition)
术语关系 (term relation)