Funding: This research was supported by the National Key Research and Development Program [2020YFB1006302].
Abstract: Named entity recognition (NER) is a fundamental task of information extraction (IE), and it has attracted considerable research attention in recent years. The abundance of annotated English NER datasets has significantly promoted NER research for English. By contrast, much less effort has gone into Chinese NER research, especially in the scientific domain, owing to the scarcity of Chinese NER datasets. To alleviate this problem, we present a Chinese scientific NER dataset, SciCN, which contains entity annotations of titles and abstracts drawn from 3,500 scientific papers. We manually annotate a total of 62,059 entities, which are classified into six types. Compared to English scientific NER datasets, SciCN is larger and more diverse: it contains more paper abstracts, and these abstracts come from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models to evaluate SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have demonstrated the effectiveness of using lexicons to enhance Chinese NER models. Motivated by this, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers better performance gains than lexicons from other domains. We hope that the SciCN dataset and the lexicon will make it possible to benchmark NER in the Chinese scientific domain and support future research. The dataset and lexicon are available at: https://github.com/yangjingla/SciCN.git.
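Lexicon enhancement of the kind mentioned in the abstract above usually begins by locating every lexicon word that occurs in a sentence, so that each character can be linked to the words covering it. The following is a minimal sketch of that matching step only. The function name `match_lexicon`, the `max_len` limit, and the four lexicon entries are hypothetical and are not taken from the SciCN release.

```python
def match_lexicon(sentence, lexicon, max_len=6):
    """Return (start, end, word) for every lexicon word found in the sentence.

    `end` is exclusive. Every substring of up to `max_len` characters is
    checked, so overlapping and nested matches are all reported.
    """
    matches = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


# Hypothetical lexicon entries, chosen only for illustration.
lexicon = {"命名实体", "命名实体识别", "实体识别", "识别"}
sentence = "命名实体识别方法"
matches = match_lexicon(sentence, lexicon)
for start, end, word in matches:
    print(start, end, word)
```

Nested matches such as 命名实体 inside 命名实体识别 are kept on purpose: a character-level model can then decide which word boundary to trust.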
Abstract: Multimodal named entity recognition (MNER) and relation extraction (MRE) are key in social media analysis but face two challenges: inefficient visual processing and non-optimal modality interaction. (1) Heavy visual embedding: visual embedding is expensive in both time and computation because explicit visual cues must be extracted from the original image before it is fed into the multimodal model. Consequently, these approaches cannot achieve efficient online inference. (2) Suboptimal interaction handling: the prevalent way of managing interaction between modalities relies on alternating self-attention and cross-attention mechanisms or depends excessively on the gating mechanism. This explicit modeling may fail to capture some nuanced relations between image and text, ultimately undermining the model’s capability to extract optimal information. To address these challenges, we introduce Implicit Modality Mining (IMM), a novel end-to-end framework for fine-grained image-text correlation without heavy visual embedders. IMM uses an Implicit Semantic Alignment module with a Transformer to obtain cross-modal clues and an Insert-Activation module to use these clues effectively. Our approach achieves state-of-the-art performance on three datasets.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 42050102, the National Science Foundation of China (Grant No. 62001236), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 20KJA520003).
Abstract: An obvious challenge in named entity recognition is the construction of entity data sets. Although some research has been conducted on entity database construction, most of it is directed at Wikipedia, and a smaller share at structured entities such as people, locations, and organizations in the news. This paper focuses on identifying scientific entities in English literature, using carbonate platforms in sedimentology as the example. First, based on the observation that the reasons for writing literature in key disciplines are likely to be provided by multidisciplinary experts, this paper designs a literature content extraction method that can deal with complex text structures. Second, based on the extracted literature content, we formalize the entity extraction task (lexicon and lexical-based entity extraction). Furthermore, to test the accuracy of entity extraction, three currently popular recognition methods are chosen to perform entity detection. Experiments show that the entity data set provided by the lexicon and lexical-based entity extraction method is of significant assistance for the named entity recognition task. This is a pilot study of entity extraction from complex-structured, specialized English literature on carbonate platforms.
Abstract: Named entity recognition, as a sub-task of information extraction, has attracted widespread attention from scholars at home and abroad since it was proposed, and a series of studies and discussions have built on it. This paper discusses existing named entity recognition technology along its history of development.
Funding: Supported by the National Natural Science Foundation of China (Nos. 42488201, 42172137, 42050104, and 42050102), the National Key R&D Program of China (No. 2023YFF0804000), and the Sichuan Provincial Youth Science & Technology Innovative Research Group Fund (No. 2022JDTD0004).
Abstract: Geological reports are a significant accomplishment for geologists involved in geological investigations and scientific research, as they contain rich data and textual information. With the rapid development of science and technology, a large number of textual reports have accumulated in the field of geology. However, many less popular topics and non-English-speaking regions are neglected in mainstream geoscience databases for geological information mining, making it more challenging for some researchers to extract necessary information from these texts. Natural Language Processing (NLP) has obvious advantages in processing large amounts of textual data. The objective of this paper is to identify geological named entities in Chinese geological texts using NLP techniques. We propose the RoBERTa-Prompt-Tuning-NER method, which leverages the concept of Prompt Learning and requires only a small amount of annotated data to train superior models for recognizing geological named entities in low-resource dataset configurations. The RoBERTa layer captures context-based information and longer-distance dependencies through dynamic word vectors. Finally, we conducted experiments on the constructed Geological Named Entity Recognition (GNER) dataset. Our experimental results show that the proposed model achieves the highest F1 score of 80.64% among the four baseline algorithms, demonstrating the reliability and robustness of using the model for named entity recognition in geological texts.
Funding: Supported by the National Key Research and Development Program [2020YFB1006302].
Abstract: Span-based models for the joint entity and relation extraction task have been studied extensively. However, these models sample a large number of negative entities and negative relations during training. These samples are essential, but they produce grossly imbalanced data distributions and in turn cause suboptimal model performance. To address these issues, we propose a two-phase paradigm for span-based joint entity and relation extraction, which classifies the entities and relations in the first phase and predicts the types of these entities and relations in the second phase. The two-phase paradigm enables our model to significantly reduce the data distribution gap, including the gap between negative entities and other entities, as well as the gap between negative relations and other relations. In addition, we make the first attempt at combining entity type and entity distance as global features, which has proven effective, especially for relation extraction. Experimental results on several datasets demonstrate that the span-based joint extraction model augmented with the two-phase paradigm and the global features consistently outperforms previous state-of-the-art span-based models for the joint extraction task, establishing a new standard benchmark. Qualitative and quantitative analyses further validate the effectiveness of the proposed paradigm and the global features.
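The imbalance described in the abstract above follows directly from how span-based models build candidates: every contiguous token span up to some width is a candidate entity, and almost all of them are negatives. The sketch below only illustrates that counting argument. It is not the authors' two-phase model, and the sentence, the gold spans, and the `max_width` value are hypothetical.

```python
def enumerate_spans(tokens, max_width):
    """List every (start, end) token span, end exclusive, up to max_width tokens."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(len(tokens), start + max_width) + 1):
            spans.append((start, end))
    return spans


tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw"]
gold_entities = {(0, 2), (5, 6)}  # hypothetical gold spans: "Marie Curie", "Warsaw"

spans = enumerate_spans(tokens, max_width=3)
negatives = [span for span in spans if span not in gold_entities]

# Candidate spans, gold entities, and negative spans for this sentence.
print(len(spans), len(gold_entities), len(negatives))
```

Even in this six-token sentence the negatives outnumber the gold entities several times over, and the gap widens with longer sentences and larger widths. Candidate relations are built from pairs of spans, so their imbalance is typically larger still.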
Abstract: The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3 covered clinical named entity recognition and event extraction for Chinese electronic medical records. Two annotated data sets and some additional resources for these two subtasks were provided to participants. The evaluation attracted 354 teams, 46 of which successfully submitted valid results. Pre-trained language models were widely applied in this evaluation task. Data augmentation and external resources were also helpful.
Funding: The National Natural Science Foundation of China (Nos. 61862063, 61502413, and 61262025), the National Social Science Foundation of China (No. 18BJL104), the Natural Science Foundation of Key Laboratory of Software Engineering of Yunnan Province, China (No. 2020SE301), the Yunnan Science and Technology Major Project (Nos. 202002AE090010 and 202002AD080002-5), and the Data Driven Software Engineering Innovative Research Team Funding of Yunnan Province, China (No. 2017HC012).
Abstract: With the rapid development of Internet technology and the advent of the era of big data, more and more cyber security texts are available on the Internet. These texts include not only security concepts, incidents, tools, guidelines, and policies, but also risk management approaches, best practices, assurances, technologies, and more. By integrating large-scale, heterogeneous, unstructured cyber security information, the identification and classification of cyber security entities can help handle cyber security issues. Because texts in the cyber security domain are complex and diverse, it is difficult to identify security entities with traditional named entity recognition (NER) methods. This paper describes various approaches and techniques for NER in this domain, including the rule-based approach, the dictionary-based approach, and the machine-learning-based approach, and discusses the problems faced by NER research in this domain, such as conjunction and disjunction, non-standardized naming conventions, abbreviations, and massive nesting. Three future directions for NER in cyber security are proposed: (1) application of unsupervised or semi-supervised technology; (2) development of a more comprehensive cyber security ontology; (3) development of a more comprehensive deep learning model.
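The rule-based and dictionary-based approaches named in the abstract above can be combined in a few lines. The following is a minimal sketch, not a method from the surveyed paper: the rule is a regular expression for CVE identifiers (`CVE-<year>-<four or more digits>`), while the gazetteer entries, the label names, and the function name `find_entities` are hypothetical.

```python
import re

# Rule: CVE identifiers follow the pattern CVE-<year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b")

# Hypothetical gazetteer; a real dictionary would be far larger.
GAZETTEER = {
    "sql injection": "ATTACK",
    "wireshark": "TOOL",
    "openssl": "SOFTWARE",
}


def find_entities(text):
    """Return sorted (start, end, label, surface) tuples from the rule and the dictionary."""
    entities = []
    # Rule-based pass on the original text.
    for m in CVE_PATTERN.finditer(text):
        entities.append((m.start(), m.end(), "VULNERABILITY", m.group()))
    # Dictionary pass, case-insensitive via a lowercased copy of the text.
    lowered = text.lower()
    for term, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            entities.append((m.start(), m.end(), label, text[m.start():m.end()]))
    return sorted(entities)


text = "CVE-2014-0160 affects OpenSSL and was analyzed with Wireshark."
entities = find_entities(text)
for entity in entities:
    print(entity)
```

A lookup like this is precise on known names but misses abbreviations, non-standard naming, and nested entities, which are among the problems the abstract lists.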
Funding: The Artificial Intelligence Innovation and Development Project of Shanghai Municipal Commission of Economy and Information (No. 2019-RGZN-01081).
Abstract: Electronic medical records (EMRs) contain rich biomedical information and have great potential in disease diagnosis and biomedical research. However, EMR information is usually in the form of unstructured text, which increases the cost of use and hinders its applications. In this work, an effective named entity recognition (NER) method is presented for information extraction on Chinese EMRs. It uses word-embedding-bootstrapped deep active learning to promote the acquisition of medical information from Chinese EMRs and to release its value. Deep active learning with a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) is used to capture the characteristics of different information from a labeled corpus, and the continuous bag-of-words and skip-gram word embedding models are combined with this model to capture the text features of Chinese EMRs from an unlabeled corpus. To evaluate the performance of the method, NER tasks on Chinese EMRs with “medical history” content were used. Experimental results show that the word-embedding-bootstrapped deep active learning method using an unlabeled medical corpus achieves better performance than other models.
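Active learning, as used in the abstract above, repeatedly picks the unlabeled sentences the current model is least sure about and sends them for annotation. The abstract does not state which query strategy is used, so the sketch below assumes least-confidence sampling as one common choice. The function names and the per-token probabilities are hypothetical stand-ins for the output of a tagger such as Bi-LSTM+CRF.

```python
def least_confidence(prob_rows):
    """Uncertainty of one sentence: 1 minus the mean top label probability per token."""
    top = [max(row) for row in prob_rows]
    return 1.0 - sum(top) / len(top)


def select_for_annotation(pool, k):
    """Pick the k most uncertain sentence ids from {id: per-token label probabilities}."""
    ranked = sorted(pool, key=lambda sid: least_confidence(pool[sid]), reverse=True)
    return ranked[:k]


# Hypothetical model outputs: one probability row per token, one column per label.
pool = {
    "s1": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
    "s2": [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]],
    "s3": [[0.6, 0.2, 0.2], [0.7, 0.2, 0.1]],
}
chosen = select_for_annotation(pool, k=2)
print(chosen)
```

In a full loop the chosen sentences would be labeled, added to the training set, and the model retrained before the next selection round.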
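Abstract: To obtain structured phenotypic and genetic descriptions of wheat varieties, and to address the fuzzy entity boundaries and overlapping relations found in unstructured wheat germplasm data, this paper proposes WGIE-DCWF (wheat germplasm information extraction model based on deep character and word fusion), a joint entity and relation extraction model for wheat germplasm information. The encoding layer improves the recognition of dense entity features through deep character-word fusion and contextual semantic feature fusion. The triple extraction layer builds a cascaded pointer network to improve the extraction of overlapping relations. A series of comparative experiments on a wheat germplasm dataset and public datasets show that WGIE-DCWF effectively improves joint entity and relation extraction on wheat germplasm data, generalizes well, and can provide technical support for building a wheat germplasm knowledge base.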
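Abstract: Document-level relation extraction aims to identify the relations between entity pairs in a document. Compared with traditional sentence-level relation extraction, the document-level task is closer to practical applications, but it poses new challenges such as cross-sentence reasoning over entity pairs and awareness of contextual information. This paper proposes a document-level relation extraction method that fuses entity and context information (Fuse entity and context information, FECI). It contains two modules: an entity information extraction module and a context information extraction module. The entity information extraction module automatically extracts, from the two entities, features that can represent the relation of the entity pair. The context information extraction module extracts different contextual relation features from the document according to the mention positions of the entity pair. Experiments on three document-level relation extraction datasets show significant improvements.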
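Abstract: Ancient Chinese texts carry rich historical and cultural information, and studying entity and relation extraction on such texts and building the corresponding knowledge graphs plays an important role in cultural heritage. To address the many rare characters, semantic vagueness, and ambiguity found in ancient Chinese texts, this paper proposes an entity relation joint extraction model based on the BERT-ancient-Chinese pre-trained model (JEBAC). First, a BERT-ancient-Chinese pre-trained model integrated with a BiLSTM neural network and an attention mechanism (BACBA) identifies all subject entities and object entities in a sentence, providing the basis for the joint extraction of relations and object entities. Next, the normalized encoding vector of the subject entity is added to the embedding vector of the whole sentence, so as to better capture the semantic features of the subject entity in the sentence. Finally, the sentence vector carrying the subject entity features is combined with the prompt information of the object entities, and BACBA jointly extracts the relations and object entities in the sentence, yielding all triples (subject entity, relation, object entity). The method is compared with existing methods on the Chinese entity relation extraction dataset DuIE2.0 and on the CCKS 2021 classical Chinese entity relation extraction few-shot dataset CCLUE. Experimental results show that the method is more effective in extraction performance, reaching F1 scores of 79.2% and 55.5%, respectively.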
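Abstract: Research results in the earth sciences are usually recorded in technical reports, journal papers, books, and other literature, but many detailed geoscience reports go unused, which creates an opportunity for information extraction. To this end, we propose a deep neural network model named GMNER (Geological Minerals named entity recognize, MNER) for identifying and extracting key information such as mineral types, geological structures, rocks, and geological time. Unlike traditional methods, this work uses the large-scale pre-trained model BERT (Bidirectional Encoder Representations from Transformers) and a deep neural network to capture contextual information, combined with a conditional random field (CRF) to obtain accurate results. Experimental results show that the MNER model performs well on Chinese geological literature, with an average precision of 0.8984, an average recall of 0.9227, and an average F1 score of 0.9104. The study not only offers a new route to automatic mineral information extraction but may also promote mineral resource management and sustainable use.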
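The three figures reported in the abstract above can be cross-checked with the standard definition of F1 as the harmonic mean of precision and recall. This is a small worked check, with one caveat: an F1 averaged over entity types need not equal the harmonic mean of the averaged precision and recall, so the agreement below shows only that the reported numbers are mutually consistent.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Average precision and recall as reported in the abstract.
f1 = f1_score(0.8984, 0.9227)
print(round(f1, 4))  # 0.9104
```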