
Definition Extraction with Improving Re-Sampling and BRF

Cited by: 7
Abstract: To find and extract all term definitions from a domain-specific corpus (here, an aviation corpus), this paper treats definition extraction as a classification problem. It proposes an over-sampling method for minority-class instances based on instance distance distribution information and combines it with random under-sampling of majority-class instances to build a balanced training set. Balanced Random Forest (BRF) is then used to aggregate the classification results of C4.5 decision trees. The method achieves a best F1-measure of 65% and a best F2-measure of 78%, outperforming the baseline that uses BRF alone.
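The two generic building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the distance-distribution over-sampling step is not specified here and is left out, and the function names `balanced_bootstrap` and `f_beta` are hypothetical. The sketch assumes the usual BRF scheme, in which each tree is trained on a bootstrap sample of the minority class plus an equally sized sample of the majority class drawn with replacement, and the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def balanced_bootstrap(
    minority: Sequence[T], majority: Sequence[T], rng: random.Random
) -> Tuple[List[T], List[T]]:
    """Draw one class-balanced sample, as a single BRF tree would see it."""
    n = len(minority)
    # Bootstrap the minority class (sampling with replacement).
    pos = [rng.choice(minority) for _ in range(n)]
    # Under-sample the majority class to the same size, also with replacement.
    neg = [rng.choice(majority) for _ in range(n)]
    return pos, neg


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure; beta > 1 weights recall more heavily than precision."""
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Toy data (made up for illustration): 5 definition sentences, 100 others.
rng = random.Random(0)
definitions = [f"def_{i}" for i in range(5)]
others = [f"other_{i}" for i in range(100)]
pos, neg = balanced_bootstrap(definitions, others, rng)
print(len(pos), len(neg))  # 5 5

# With precision 0.5 and recall 1.0, F2 exceeds F1 because recall dominates.
print(round(f_beta(0.5, 1.0, 1.0), 3), round(f_beta(0.5, 1.0, 2.0), 3))  # 0.667 0.833
```

Because F2 rewards recall more than F1 does, a system whose recall exceeds its precision scores higher on F2 than on F1, which matches the direction of the 65% and 78% figures reported above.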
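Authors: 潘湑 (Pan Xu), 顾宏斌 (Gu Hongbin)
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2011, Issue 3, pp. 30-37 (8 pages)
Funding: Special Science and Technology Fund of the Civil Aviation Administration of China (E9905)
Keywords: natural language processing; term definition; definition extraction; text categorization; re-sampling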










