摘要
Supervised machine learning methods have been employed with great success in the task of biomedical relation extraction. However, existing methods are not practical enough, since manual construction of large training data is very expensive. Therefore, active learning is urgently needed for designing practical relation extraction methods with little human effort. In this paper, we describe a unified active learning framework. Particularly, our framework systematically addresses some practical issues during active learning process, including a strategy for selecting informative data, a data diversity selection algorithm, an active feature acquisition method, and an informative feature selection algorithm, in order to meet the challenges due to the immense amount of complex and diverse biomedical text. The framework is evaluated on protein- protein interaction (PPI) extraction and is shown to achieve promising results with a significant reduction in editorial effort and labeling time.
Supervised machine learning methods have been employed with great success in the task of biomedical relation extraction. However, existing methods are not practical enough, since manual construction of large training data is very expensive. Therefore, active learning is urgently needed for designing practical relation extraction methods with little human effort. In this paper, we describe a unified active learning framework. Particularly, our framework systematically addresses some practical issues during active learning process, including a strategy for selecting informative data, a data diversity selection algorithm, an active feature acquisition method, and an informative feature selection algorithm, in order to meet the challenges due to the immense amount of complex and diverse biomedical text. The framework is evaluated on protein- protein interaction (PPI) extraction and is shown to achieve promising results with a significant reduction in editorial effort and labeling time.
基金
supported by the National Natural Science Foundation of China under Grant No.60973104
the National Basic Research 973 Program of China under Grant No.2012CB316301