Abstract
To find and extract all term definitions from a specialized-domain corpus (here, an aviation corpus), this paper treats definition extraction as a classification problem. It proposes an over-sampling method for minority-class instances based on the distribution of distances between instances, and combines it with random under-sampling of majority-class instances to build a balanced training set. Balanced Random Forest (BRF) is then used to aggregate the outputs of C4.5 decision trees into the final classifier. The method achieves best scores of 65% F1-measure and 78% F2-measure, outperforming a baseline that uses BRF alone.
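The abstract reports both F1-measure and F2-measure. As a minimal sketch of how the two differ, the function below computes the standard F-beta score, the weighted harmonic mean of precision and recall, where beta = 2 weights recall more heavily than precision. The function name `f_beta` and the precision/recall values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0, beta must be > 0")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + b2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical values for illustration only (not results from the paper).
    p, r = 0.6, 0.8
    print(f"F1 = {f_beta(p, r, beta=1.0):.4f}")
    print(f"F2 = {f_beta(p, r, beta=2.0):.4f}")
```

Because F-beta moves toward recall as beta grows, an F2 score above the F1 score indicates that recall exceeds precision, which suits an extraction task whose stated goal is to find all definitions.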
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD, Peking University Core Journals (北大核心)
2011, No. 3, pp. 30-37 (8 pages)
Funding
Special Science and Technology Fund of the Civil Aviation Administration of China (E9905)
Keywords
natural language processing
term definition
definition extraction
text categorization
re-sampling