Funding: Supported by the National Natural Science Foundation of China (Nos. 62002206 and 62202373) and the open topic of the Green Development Big Data Decision-Making Key Laboratory (DM202003).
Abstract: Extracting valuable information from biomedical texts is a current research hotspot of concern to a wide range of scholars. The biomedical corpus contains numerous complex long sentences and overlapping relational triples, making most general-domain joint modeling methods difficult to apply effectively in this field. To handle the complex semantic environment of biomedical texts, in this paper we propose a novel perspective on joint entity and relation extraction. Existing studies divide the extraction of relation triples into several steps or modules. However, the three elements of a relation triple are interdependent and inseparable, so we regard joint extraction as a tripartite classification problem. From the perspective of triple classification, we design a multi-granularity 2D convolution to refine the word pair table and better utilize the dependencies between biomedical word pairs. We then use a biaffine predictor to assist in predicting the labels of word pairs for relation extraction. Our model, Multi-granularity Convolutional Tokens Pairs of Labeling (MCTPL), better utilizes the elements of triples and improves the ability to extract overlapping triples compared to previous approaches. We evaluated our model on two publicly accessible datasets. The experimental results show that, for relation triple extraction on the CPI dataset, our model improves the F1 score by 2.34% over the current optimal model. On the DDI dataset, it improves the F1 score by 1.68% over the current optimal model. Our model achieved state-of-the-art performance compared to other baseline models in biomedical text entity relation extraction.
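The biaffine predictor mentioned in this abstract can be illustrated with a generic sketch. This is not the MCTPL implementation: the tensor shapes, the parameter names (`U`, `W`, `b`), the use of NumPy, and the random inputs are all assumptions made for illustration, and the multi-granularity 2D convolution step is omitted. The sketch only shows how one score per label can be produced for every cell of a word pair table.

```python
import numpy as np

def biaffine_scores(heads, tails, U, W, b):
    """Score every (head word, tail word) pair for every label.

    heads, tails: (n, d) token representations (hypothetical encoder outputs)
    U: (d, L, d) bilinear weights, W: (2d, L) linear weights, b: (L,) bias
    Returns an (n, n, L) table of label scores, one cell per word pair.
    """
    n = heads.shape[0]
    # Bilinear term: heads[i]^T U[:, l, :] tails[j] for every i, j, l.
    bilinear = np.einsum("id,dle,je->ijl", heads, U, tails)
    # Linear term over the concatenated pair representation [heads[i]; tails[j]].
    pairs = np.concatenate(
        [np.repeat(heads[:, None, :], n, axis=1),
         np.repeat(tails[None, :, :], n, axis=0)],
        axis=-1,
    )
    linear = pairs @ W
    return bilinear + linear + b

# Toy sizes, chosen only for the example: 5 words, 8 features, 3 labels.
rng = np.random.default_rng(0)
n, d, L = 5, 8, 3
heads = rng.normal(size=(n, d))
tails = rng.normal(size=(n, d))
U = rng.normal(size=(d, L, d))
W = rng.normal(size=(2 * d, L))
b = rng.normal(size=L)

scores = biaffine_scores(heads, tails, U, W, b)
labels = scores.argmax(axis=-1)  # predicted label index for each word pair
```

Each cell `labels[i, j]` holds the highest-scoring label for the pair (word i, word j), which is the general form of table-filling prediction the abstract describes.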
Funding: Supported by the National Key Research and Development Program [2020YFB1006302].
Abstract: Span-based models for the joint entity and relation extraction task have been studied exhaustively. However, these models sample a large number of negative entities and negative relations during training, which are essential but result in grossly imbalanced data distributions and in turn cause suboptimal model performance. To address these issues, we propose a two-phase paradigm for span-based joint entity and relation extraction, which classifies the entities and relations in the first phase and predicts the types of these entities and relations in the second phase. The two-phase paradigm enables our model to significantly reduce the data distribution gap, including the gap between negative entities and other entities, as well as the gap between negative relations and other relations. In addition, we make the first attempt at combining entity type and entity distance as global features, which has proven effective, especially for relation extraction. Experimental results on several datasets demonstrate that the span-based joint extraction model augmented with the two-phase paradigm and the global features consistently outperforms previous state-of-the-art span-based models for the joint extraction task, establishing a new standard benchmark. Qualitative and quantitative analyses further validate the effectiveness of the proposed paradigm and the global features.
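The control flow of a two-phase paradigm of this kind can be sketched as follows, for the entity side only. This is an illustrative reading of the abstract, not the authors' model: the function names, the toy sentence, the maximum span width, and the lookup-based stand-ins for the two classifiers are all hypothetical.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive

def enumerate_spans(n_tokens: int, max_width: int) -> List[Span]:
    """List every candidate span of up to max_width tokens."""
    return [(i, j)
            for i in range(n_tokens)
            for j in range(i + 1, min(i + max_width, n_tokens) + 1)]

def two_phase_entities(spans: List[Span],
                       is_entity: Callable[[Span], bool],
                       entity_type: Callable[[Span], str]) -> List[Tuple[Span, str]]:
    """Phase 1 filters out negative spans; phase 2 types only the survivors."""
    kept = [s for s in spans if is_entity(s)]      # phase 1: entity or not
    return [(s, entity_type(s)) for s in kept]     # phase 2: which type

# Toy stand-ins for the two classifiers (assumptions, not trained models).
tokens = ["Aspirin", "inhibits", "COX-1", "."]
gold_types = {(0, 1): "Drug", (2, 3): "Protein"}

candidates = enumerate_spans(len(tokens), max_width=2)
typed = two_phase_entities(candidates,
                           is_entity=lambda s: s in gold_types,
                           entity_type=lambda s: gold_types[s])
```

In this toy run most candidate spans are negatives, so the type predictor in phase 2 only ever sees the spans that phase 1 kept, which is the sense in which the negative class is separated from the typing decision.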
Funding: Supported by the National Natural Science Foundation of China (71804017), the R&D Program of Beijing Municipal Education Commission (KZ202210005013), and the Sichuan Social Science Planning Project (SC22B151).
Abstract: To address the lack of a classification scheme and a gold-standard corpus for joint entity and relation extraction in the Chinese academic field, this paper builds a management science dataset that can be used for joint entity and relation extraction, and establishes a deep learning model to extract entity and relation information from scientific texts. Based on our definitions of the entity and relation categories, we build a Chinese scientific text corpus from the abstracts of projects funded by the National Natural Science Foundation of China (NSFC) in 2018–2019. By combining word2vec features with clue word features, which are characteristic of scientific documents, we establish a joint entity and relation extraction model based on the BiLSTM-CNN-CRF architecture for scientific information extraction. The dataset we constructed contains 13,060 distinct entities and 9,728 entity relation labels. For entity prediction, the constructed model reaches a precision of 69.15%, a recall of 61.03%, and an F1 score of 64.83%. For relation prediction, the precision is higher than that of entity prediction, which reflects the effectiveness of the mixed input features and of integrating local features through the CNN layer in the model.
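The entity-level figures in this abstract can be cross-checked with the standard F1 formula, F1 = 2PR / (P + R). The short check below assumes that the 69.15% figure is precision rather than overall accuracy, which the reported F1 supports; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Entity-level precision and recall as reported in the abstract (percent).
entity_f1 = f1_score(69.15, 61.03)
print(round(entity_f1, 2))
```

The result is about 64.84, which matches the reported 64.83% to within the rounding of the two inputs.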
Funding: Supported by the National Key Research and Development Program of China [Grant number 2021YFB3900903] and the National Natural Science Foundation of China [Grant number 41971337].
Abstract: Spatial relation extraction is the process of identifying geographic entities in text and determining their corresponding spatial relations. Traditional spatial relation extraction mainly uses rule-based pattern matching, supervised learning, or unsupervised learning methods. However, these methods suffer from poor timeliness, high labor cost, and heavy dependence on large-scale data. As pre-trained language models have greatly alleviated the shortcomings of traditional methods, supervised learning methods that incorporate pre-trained language models have become the mainstream approach to relation extraction. Pipeline extraction and joint extraction, the two dominant approaches to relation extraction, have both obtained good performance on different datasets; the main difference between them is whether the contextual information of entities and relations is shared. In this paper, we compare the performance of the two approaches on spatial relation extraction using Chinese corpus data in the field of geography, and verify which method based on pre-trained language models is more suitable for Chinese spatial relation extraction. We fine-tuned the hyperparameters of the two models to optimize extraction accuracy before the comparison experiments. The results show that pipeline extraction performs better than joint extraction for spatial relation extraction on Chinese text data at sentence granularity, because the two tasks focus on different contextual information and it is difficult to take the needs of both into account when sharing contextual information. In addition, we further compare the performance of the two models with the rule-based template approach in extracting topological, directional, and distance relations, summarize the shortcomings of this experiment, and provide an outlook for future work.