Funding: Supported by the project "Deep learning-based scientific literature knowledge engine demonstration system" (Grant No. E0290905) from the Chinese Academy of Sciences.
Abstract: Existing datasets for move recognition, such as PubMed 200k RCT, exhibit several problems that significantly impact recognition performance, especially for the Background and Objective labels. To improve move recognition performance, we introduce a method and construct a refined corpus based on PubMed, named RCMR 280k. This corpus comprises approximately 280,000 structured abstracts totaling 3,386,008 sentences, with each sentence labeled with one of five categories: Background, Objective, Method, Result, or Conclusion. We also construct a subset of RCMR, named RCMR_RCT, corresponding to the medical subdomain of randomized controlled trials (RCTs). We conduct comparison experiments pitting RCMR and RCMR_RCT against PubMed 380k and PubMed 200k RCT, respectively. The best results, obtained using the MSMBERT model, show that: (1) our RCMR outperforms PubMed 380k by 0.82%, while our RCMR_RCT outperforms PubMed 200k RCT by 9.35%; (2) compared with PubMed 380k, our corpus achieves larger gains on the Results and Conclusions categories, with average F1 improving by 1% and 0.82%, respectively; (3) compared with PubMed 200k RCT, our corpus significantly improves performance on the Background and Objective categories, with average F1 scores improving by 28.31% and 37.22%, respectively. To the best of our knowledge, our RCMR is among the few high-quality, resource-rich refined PubMed corpora available. The work in this paper has been applied in SciAIEngine, which is openly accessible for researchers to perform move recognition tasks.
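The per-category comparison reported above rests on label-wise F1 over sentence-level predictions. As a minimal sketch of how such a score could be computed (the function name `per_label_f1` and the toy gold/predicted sequences are illustrative assumptions, not the authors' evaluation code or data):

```python
from collections import Counter

# The five move categories named in the abstract.
LABELS = ["Background", "Objective", "Method", "Result", "Conclusion"]

def per_label_f1(gold, pred):
    """Return {label: F1} for sentence-level move labels (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for this label
        else:
            fp[p] += 1          # predicted label was wrong
            fn[g] += 1          # gold label was missed
    scores = {}
    for label in LABELS:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Toy data (assumed): one Background sentence is mislabeled as Objective.
gold = ["Background", "Objective", "Method", "Result", "Conclusion", "Background"]
pred = ["Objective", "Objective", "Method", "Result", "Conclusion", "Background"]
scores = per_label_f1(gold, pred)
```

In this toy case a single Background/Objective confusion lowers the F1 of both labels while leaving the other three untouched, which is the kind of label-specific effect the abstract describes.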
Funding: the IUGS Deep-time Digital Earth (DDE) Big Science Program; financially supported by the National Key R&D Program of China (No. 2022YFF0711601); the Natural Science Foundation of Hubei Province of China (No. 2022CFB640); the Opening Fund of Key Laboratory of Geological Survey and Evaluation of Ministry of Education (No. GLAB 2023ZR01); the Fundamental Research Funds for the Central Universities, State Key Laboratory of Geo-Information Engineering and Key Laboratory of Surveying and Mapping Science and Geospatial Information Technology of MNR, Chinese Academy of Surveying and Mapping (No. 2022-03-08); the Key Laboratory of Spatial-temporal Big Data Analysis and Application of Natural Resources in Megacities, MNR (No. KFKT-2022-02); the Project of Chengdu Municipal Bureau of Planning and Natural Resources (No. 5101012018002703).
Abstract: Artificial intelligence (AI) is the key to mining and enhancing the value of big data, and the knowledge graph is one of the important cornerstones of artificial intelligence, serving as the core foundation for the integration of statistical and physical representations. Named entity recognition is a fundamental research task for building knowledge graphs and needs to be supported by a high-quality corpus, yet there is currently a lack of high-quality named entity recognition corpora in the field of geology, especially in Chinese. In this paper, based on the conceptual structure of geological ontology and an analysis of the characteristics of geological texts, a classification system of geological named entity types is designed with the guidance and participation of geological experts, a corresponding annotation specification is formulated, an annotation tool is developed, and the first named entity recognition corpus for the geological domain is annotated based on real geological reports. The total number of words annotated was 698 512 and the number of entities was 23 345. The paper also explores the feasibility of a model pre-annotation strategy and presents a statistical analysis of the distribution of technical and term categories across genres and of the consistency of corpus annotation. Based on this corpus, A Lite Bidirectional Encoder Representations from Transformers (ALBERT)-Bi-directional Long Short-Term Memory (BiLSTM)-Conditional Random Fields (CRF) and ALBERT-BiLSTM models are selected for experiments. The results show that the F1-scores of the two models reach 0.75 and 0.65, respectively, providing a corpus basis and technical support for information extraction in the field of geology.
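The F1-scores above measure how well predicted entities match annotated ones. The abstract does not state the tagging scheme or matching criterion, so the following is only a sketch assuming BIO tags and exact span-and-type matching; the entity types `ROCK` and `MIN` and both function names are hypothetical:

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence (assumed scheme)."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))    # close the currently open span
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]       # open a new span
    return spans

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 between two BIO sequences."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy data (assumed): the model finds the ROCK entity but misses the MIN entity.
gold_tags = ["B-ROCK", "I-ROCK", "O", "B-MIN"]
pred_tags = ["B-ROCK", "I-ROCK", "O", "O"]
f1 = entity_f1(gold_tags, pred_tags)
```

Here precision is 1.0 and recall is 0.5, so the missed entity alone pulls the F1 down to 2/3.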