Funding: This work was supported by the National Natural Science Foundation of China (No. 61672301), Jilin Provincial Science & Technology Development (20180101054JC), the Science and Technology Innovation Guide Project of Inner Mongolia Autonomous Region of China (2017), and the Talent Development Fund of Jilin Province (2018).
Abstract: Supervised machine learning approaches are effective in text mining, but their success relies heavily on manually annotated corpora. Annotated biomedical event corpora are scarce, however, and the available datasets contain too few examples for training classifiers. The common remedy is to draw large numbers of training samples from unlabeled data, but such datasets often contain many mislabeled samples, which degrade classifier performance. This study therefore proposes a novel error data detection approach for reducing noise in unlabeled biomedical event data. First, we construct a mislabeled dataset through error analysis on the development dataset. Vector representations of sample pairs are then obtained by means of sequence patterns and a joint model that combines a convolutional neural network with a long short-term memory recurrent neural network. Next, we propose a sample identification strategy that applies error detection, based on the pair representations, to the unlabeled data. The selected samples are then added to the training dataset to enrich it and improve classification performance. Experimental results on the BioNLP Shared Task GENIA indicate that the proposed approach is capable of extracting biomedical events from biomedical literature. Our approach can effectively filter some noisy examples and build a satisfactory prediction model.
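The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual detection rule, so everything here is an assumption made for illustration: the function names `cosine` and `filter_noisy`, the cosine-similarity criterion, the 0.9 threshold, and the toy three-dimensional vectors (which stand in for the learned pair representations) are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_noisy(candidates, mislabeled_vectors, threshold=0.9):
    """Return the ids of candidates that do not resemble any known mislabeled sample.

    candidates: list of (sample_id, vector) pairs from the unlabeled pool.
    mislabeled_vectors: vectors of samples known to be mislabeled
        (hypothetically, gathered by error analysis on a development set).
    threshold: assumed similarity cut-off; a candidate is dropped when its
        highest similarity to a mislabeled vector reaches this value.
    """
    kept = []
    for sample_id, vec in candidates:
        score = max((cosine(vec, m) for m in mislabeled_vectors), default=0.0)
        if score < threshold:
            kept.append(sample_id)
    return kept


# Toy vectors standing in for learned pair representations.
mislabeled = [[1.0, 0.0, 0.0]]
candidates = [
    ("a", [0.9, 0.1, 0.0]),  # close to the mislabeled vector, so dropped
    ("b", [0.0, 1.0, 0.0]),  # orthogonal to it, so kept
    ("c", [0.0, 0.0, 0.0]),  # zero vector, similarity defined as 0.0, so kept
]
clean_ids = filter_noisy(candidates, mislabeled)
```

In this sketch only the retained ids (`clean_ids`) would be passed on to enlarge the training set. A real system would replace the toy vectors with the outputs of the trained representation model and tune the threshold on held-out data.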