Funding: Supported by the project "The demonstration system of rich semantic search application in scientific literature" (Grant No. 1734) from the Chinese Academy of Sciences.
Abstract: Purpose: Move recognition in scientific abstracts is an NLP task of classifying the sentences of an abstract into different types of language units. To improve the performance of move recognition in scientific abstracts, a novel model of move recognition is proposed that outperforms the BERT-based method. Design/methodology/approach: Prevalent BERT-based models for sentence classification often classify sentences without considering their context. In this paper, inspired by the BERT masked language model (MLM), we propose a novel model, called the masked sentence model, that integrates the content and contextual information of the sentences in move recognition. Experiments are conducted on the benchmark dataset PubMed 20k RCT in three steps. We then compare our model with HSLN-RNN, the BERT-based model, and SciBERT on the same dataset. Findings: The F1 score of our model exceeds those of the BERT-based and SciBERT models by 4.96% and 4.34%, respectively, which shows the feasibility and effectiveness of the novel model; our result comes closest to the current state-of-the-art result of HSLN-RNN. Research limitations: The sequential features of move labels are not considered, which might be one reason why HSLN-RNN performs better. Our model is restricted to biomedical English literature because we fine-tune it on a dataset from PubMed, a typical biomedical database. Practical implications: The proposed model is better and simpler at identifying move structures in scientific abstracts and is worth applying in text classification experiments that capture the contextual features of sentences. Originality/value: The study proposes a masked sentence model based on BERT that considers the contextual features of the sentences in abstracts in a new way. The performance of this classification model is significantly improved by rebuilding the input layer without changing the structure of the neural network.
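The key idea above is that each sentence is classified together with its abstract context, with the sentence itself masked out of that context, MLM-style. As a minimal illustrative sketch only (the helper name, mask token, and exact input scheme are our assumptions, not the paper's published code), rebuilding such an input pair might look like:

```python
def build_masked_sentence_input(sentences, target_idx, mask_token="[MASK]"):
    """Pair a target sentence with its abstract context, where the
    target's position in the context is replaced by a mask token."""
    context = " ".join(
        mask_token if i == target_idx else s
        for i, s in enumerate(sentences)
    )
    return sentences[target_idx], context

# Toy abstract of three sentences; classify the middle one.
abstract = ["Diabetes is common.", "We tested drug X.", "Drug X worked."]
sent, ctx = build_masked_sentence_input(abstract, 1)
# sent -> "We tested drug X."
# ctx  -> "Diabetes is common. [MASK] Drug X worked."
```

The (sentence, masked context) pair could then be fed to a BERT-style encoder as a two-segment input, which is one way to change the input layer without touching the network architecture.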
Funding: Supported by the project "Deep learning-based scientific literature knowledge engine demonstration system" (Grant No. E0290905) from the Chinese Academy of Sciences.
Abstract: Existing datasets for move recognition, such as PubMed 200k RCT, exhibit several problems that significantly impact recognition performance, especially for the Background and Objective labels. To improve move recognition performance, we introduce a method and construct a refined corpus based on PubMed, named RCMR 280k. This corpus comprises approximately 280,000 structured abstracts totaling 3,386,008 sentences; each sentence is labeled with one of five categories: Background, Objective, Method, Result, or Conclusion. We also construct a subset of RCMR, named RCMR_RCT, corresponding to the medical subdomain of RCTs. We conduct comparison experiments using our RCMR and RCMR_RCT against PubMed 380k and PubMed 200k RCT, respectively. The best results, obtained with the MSMBERT model, show that: (1) our RCMR outperforms PubMed 380k by 0.82%, while our RCMR_RCT outperforms PubMed 200k RCT by 9.35%; (2) compared with PubMed 380k, our corpus achieves greater improvement in the Results and Conclusions categories, with average F1 gains of 1% and 0.82%, respectively; (3) compared with PubMed 200k RCT, our corpus significantly improves performance in the Background and Objective categories, with average F1 gains of 28.31% and 37.22%, respectively. To the best of our knowledge, our RCMR is among the rare high-quality, resource-rich refined PubMed corpora available. The work in this paper has been applied in SciAlEngine, which is openly accessible for researchers to conduct move recognition tasks.
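Corpora of this kind are typically derived by mapping the section headings of structured PubMed abstracts onto the five move labels and splitting each section into sentences. As a hedged sketch under our own assumptions (the heading names, label mapping, and naive sentence splitter below are illustrative, not the authors' actual pipeline):

```python
import re

# Hypothetical mapping from structured-abstract headings to move labels.
LABELS = {
    "BACKGROUND": "Background",
    "OBJECTIVE": "Objective",
    "METHODS": "Method",
    "RESULTS": "Result",
    "CONCLUSIONS": "Conclusion",
}

def label_sentences(structured_abstract):
    """Turn a list of (heading, section_text) pairs into
    (sentence, move_label) pairs, skipping unmapped headings."""
    pairs = []
    for heading, text in structured_abstract:
        label = LABELS.get(heading.upper())
        if label is None:
            continue  # non-standard heading: no reliable label
        # Naive split on sentence-ending punctuation followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sent:
                pairs.append((sent, label))
    return pairs

example = [
    ("OBJECTIVE", "We assess drug X."),
    ("RESULTS", "Drug X reduced symptoms. No harms occurred."),
]
# label_sentences(example) ->
# [("We assess drug X.", "Objective"),
#  ("Drug X reduced symptoms.", "Result"),
#  ("No harms occurred.", "Result")]
```

A real refinement pipeline would additionally normalize heading variants and filter mislabeled sections, which is where a refined corpus such as the one described above can differ substantially from a raw dump.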