Geological reports are an important form of geological data and contain rich expert and geological knowledge. The challenge facing current research into geological knowledge extraction and mining is how to understand geological reports accurately under the guidance of domain knowledge. Generic named entity recognition models and tools can be applied to geoscience reports and documents, but their effectiveness is hampered by a lack of domain-specific knowledge, which leads to a pronounced decline in recognition accuracy. This study summarizes six types of typical geological entities with reference to the ontological system of the geological domain and builds a high-quality corpus for the task of geological named entity recognition (GNER). In addition, GeoWoBERT-advBGP (Geological Word-base BERT-adversarial training Bi-directional Long Short-Term Memory Global Pointer) is proposed to address the ambiguity, diversity, and nesting of geological entities. The model first uses the fine-tuned, word-granularity-based pre-trained model GeoWoBERT (Geological Word-base BERT) and combines it with text features extracted by a BiLSTM (Bi-directional Long Short-Term Memory). An adversarial training algorithm then improves the robustness of the model and its resistance to interference, and decoding is finally performed with a global association pointer algorithm. The experimental results show that the proposed model achieves high performance on the constructed dataset and is capable of mining rich geological information.
Named Entity Recognition (NER) is a fundamental task in biomedical text mining. It aims to extract specific types of entities, such as genes, proteins, and diseases, from complex biomedical texts and to categorize them into predefined entity types, which provides basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer (RoBGP). Our model first uses the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network to capture bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Finally, the Global Pointer model is employed to recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE, and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP in Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
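The span-based decoding idea behind Global Pointer can be illustrated with a minimal sketch. It assumes a score matrix per entity type has already been produced by an encoder; the scores, the threshold, and the names `decode_spans`, `tokens`, and `scores` are hypothetical and are not taken from the RoBGP paper. Every span scores independently, so a nested span and its enclosing span can both be kept.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start index, inclusive end index, entity type)


def decode_spans(scores: Dict[str, List[List[float]]],
                 threshold: float = 0.0) -> List[Span]:
    """Keep every span (i, j) with i <= j whose score exceeds the threshold."""
    spans: List[Span] = []
    for entity_type, matrix in scores.items():
        n = len(matrix)
        for i in range(n):
            for j in range(i, n):  # only the upper triangle is a valid span
                if matrix[i][j] > threshold:
                    spans.append((i, j, entity_type))
    return sorted(spans)


# Toy, hand-written scores for a three-token sentence (illustrative only).
NEG = -10.0
tokens = ["left", "lung", "cancer"]
scores = {
    "body": [[NEG, 2.1, NEG], [NEG, NEG, NEG], [NEG, NEG, NEG]],
    "disease": [[NEG, NEG, 3.4], [NEG, NEG, NEG], [NEG, NEG, NEG]],
}
nested = decode_spans(scores)
```

In this toy input the "body" span covering tokens 0 to 1 lies inside the "disease" span covering tokens 0 to 2. A sequence tagger that assigns one label per token could not output both.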
Named entity recognition (NER) is a fundamental task of information extraction (IE), and it has attracted considerable research attention in recent years. The abundance of annotated English NER datasets has significantly promoted NER research for English. By contrast, far less effort has gone into Chinese NER research, especially in the scientific domain, because of the scarcity of Chinese NER datasets. To alleviate this problem, we present a Chinese scientific NER dataset, SciCN, which contains entity annotations of titles and abstracts derived from 3,500 scientific papers. We manually annotate a total of 62,059 entities, classified into six types. Compared with English scientific NER datasets, SciCN is larger and more diverse: it contains more paper abstracts, and these abstracts are drawn from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models and evaluate them on SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have shown that lexicons are effective for enhancing Chinese NER models. Motivated by this, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers better performance gains than lexicons from other domains. We hope that the SciCN dataset and the lexicon will make it possible to benchmark NER in the Chinese scientific domain and support future research. The dataset and lexicon are available at: https://github.com/yangjingla/SciCN.git.
Chinese named entity recognition (CNER) has received widespread attention as an important task of Chinese information extraction. Most previous research has studied flat CNER, overlapped CNER, or discontinuous CNER individually. However, a unified CNER is often needed in real-world scenarios. Recent studies have shown that grid tagging methods based on character-pair relationship classification hold great potential for achieving unified NER. Nevertheless, how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge. In this study, we enhance the character-pair grid representation by incorporating both local and global information. Significantly, we treat the character-pair grid representation matrix as a specialized image, converting the classification of character-pair relationships into a pixel-level semantic segmentation task. We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image, allowing a more comprehensive understanding of the associative features between character pairs. This leads to improved accuracy in predicting their relationships and ultimately enhances entity recognition performance. We conducted experiments on two public CNER datasets in the biomedical domain, CMeEE-V2 and Diakg. The results demonstrate the effectiveness of our approach, which achieves F1-score improvements of 7.29 and 1.64 percentage points, respectively, over the current state-of-the-art (SOTA) models.
Named Entity Recognition (NER) is crucial for extracting structured information from text. While traditional methods rely on rules, Conditional Random Fields (CRFs), or deep learning, the advent of large-scale Pre-trained Language Models (PLMs) offers new possibilities. PLMs excel at contextual learning, potentially simplifying many natural language processing tasks. However, their application to NER remains underexplored. This paper investigates leveraging the GPT-3 PLM for NER without fine-tuning. We propose a novel scheme that utilizes carefully crafted templates and context examples selected based on semantic similarity. Our experimental results demonstrate the feasibility of this approach, suggesting a promising direction for harnessing PLMs in NER.
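A rough sketch of similarity-based example selection for a prompt template follows. The abstract does not give the template or the similarity measure, so everything here is an assumption: bag-of-words cosine similarity stands in for a real semantic similarity model, and the template wording, the example pool, and the names `cosine` and `build_prompt` are hypothetical. No language model is called; the sketch only assembles the prompt text.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for semantic similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def build_prompt(query: str, pool: List[Tuple[str, str]], k: int = 2) -> str:
    """Pick the k pool examples most similar to the query and fill a template."""
    ranked = sorted(pool, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    parts = ["Extract the named entities from each sentence."]
    for sentence, entities in ranked:
        parts.append(f"Sentence: {sentence}\nEntities: {entities}")
    parts.append(f"Sentence: {query}\nEntities:")  # left open for the model
    return "\n\n".join(parts)


# Hypothetical annotated pool (sentence, gold entities).
pool = [
    ("Alice works at Acme in Paris", "Alice (PER); Acme (ORG); Paris (LOC)"),
    ("The cat sat on the mat", "none"),
    ("Bob works at Initech in Austin", "Bob (PER); Initech (ORG); Austin (LOC)"),
]
prompt = build_prompt("Carol works at Globex in Berlin", pool)
```

The unrelated "cat" sentence shares no words with the query, so it is not selected. The resulting prompt ends with an open `Entities:` slot for the model to complete.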
Power grid operation is a complex process, and much of its operational data involves national security, business secrets, and user privacy. Labeled datasets may exist on many different operation platforms, but they cannot be shared directly because power grid data is highly privacy-sensitive. How to make the fullest possible use of these multi-source heterogeneous data to build a power grid knowledge graph while protecting privacy has become an urgent problem in smart grid development. This paper therefore proposes a federated learning method for named entity recognition in the power grid field, which aims to train a named entity recognition model covering the entire power grid process on data with different security requirements. We decompose the named entity recognition (NER) model FLAT (Chinese NER Using Flat-Lattice Transformer) on each platform into a global part and a local part. The local part captures the characteristics of the local data on each platform and is updated using locally labeled data. The global part is learned across different operation platforms to capture shared NER knowledge. Its local gradients from the different platforms are aggregated to update the global model, which is then delivered to each platform to update its global part. Experiments on two publicly available Chinese datasets and one power grid dataset validate the effectiveness of our method.
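The update of the shared global part can be sketched in a few lines. The abstract does not say how the gradients are aggregated, so this sketch assumes a plain unweighted average followed by one gradient descent step; the function names, parameter values, and learning rate are hypothetical. The local part of each platform's model is never sent anywhere and is therefore not shown.

```python
from typing import List


def aggregate(gradients: List[List[float]]) -> List[float]:
    """Average the global-part gradients reported by each platform."""
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Apply one gradient descent step to the shared global parameters."""
    return [p - lr * g for p, g in zip(params, grad)]


# Toy values: two platforms report gradients for a two-parameter global part.
global_params = [1.0, 2.0]
platform_grads = [[0.5, 1.0], [1.5, 0.0]]

avg_grad = aggregate(platform_grads)
global_params = sgd_step(global_params, avg_grad, lr=0.5)
# The updated global_params would then be sent back to every platform.
```

Only gradients leave a platform in this scheme; the labeled text itself stays where it was produced.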
In recent years, cyber attacks have been intensifying and causing great harm to individuals, companies, and countries. The mining of cyber threat intelligence (CTI) can facilitate intelligence integration and help combat cyber attacks. Named Entity Recognition (NER), as a crucial component of text mining, can structure complex CTI text and help cybersecurity professionals counter threats effectively. However, current CTI NER research has mainly focused on English CTI, and in the limited studies conducted on Chinese text, existing models have performed poorly. To fully utilize the power of Chinese pre-trained language models (PLMs) and overcome the problem of long, infrequent English words mixed into Chinese CTI text, we propose a residual dilated convolutional neural network (RDCNN) with a conditional random field (CRF) based on a robustly optimized bidirectional encoder representation from transformers pre-training approach with whole word masking (RoBERTa-wwm), abbreviated as RoBERTa-wwm-RDCNN-CRF. We are the first to experiment on the relevant open-source dataset and achieve an F1-score of 82.35%, which exceeds the common baseline model in this field, bidirectional encoder representation from transformers (BERT)-bidirectional long short-term memory (BiLSTM)-CRF, by about 19.52% and exceeds the current state-of-the-art model, BERT-RDCNN-CRF, by about 3.53%. In addition, we conducted an ablation study on the encoder part of the model and an in-depth investigation of the PLMs and the encoder part to verify the effectiveness of the proposed model. The RoBERTa-wwm-RDCNN-CRF model and the shared pre-processing and augmentation methods can serve subsequent fundamental tasks such as cybersecurity information extraction and knowledge graph construction, contributing to downstream applications such as intrusion detection and advanced persistent threat (APT) attack detection.
With the rapid development of information technology, the electronification of medical records has gradually become a trend. China has a huge population and numerous supporting medical institutions, which drives the conversion of paper medical records to electronic medical records. Electronic medical records are the basis for establishing a smart hospital and an important guarantee for achieving medical intelligence, and the massive amount of electronic medical record data is also an important data set for research in the medical field. However, electronic medical records contain a large amount of private patient information, which must be desensitized before the records are used as open resources. To solve this problem, this paper proposes data masking for Chinese electronic medical records based on named entity recognition. Firstly, the text is vectorized to satisfy the input format required by the model. Secondly, because input sentences vary in length and the contextual relationship between sentences cannot be ignored, a neural network model for named entity recognition based on bidirectional long short-term memory (BiLSTM) with conditional random fields (CRF) is constructed. Finally, the data masking operation is performed on the named entity recognition results, mainly using regular expression filtering encryption and principal component analysis (PCA) word vector compression and replacement. In addition, comparison experiments with the hidden Markov model (HMM), the LSTM-CRF model, and the BiLSTM model are conducted. The experimental results show that the method used in this paper achieves 92.72% accuracy, 92.30% recall, and a 92.51% F1 score, which is higher than the other models.
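The final masking step can be sketched as follows, assuming the NER model has already returned character offsets for sensitive entities. The entity spans, the placeholder format, and the 11-digit phone pattern are illustrative assumptions; the paper's regular expressions and its PCA-based word vector replacement are not reproduced here.

```python
import re
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start offset, exclusive end offset, label)


def mask(text: str, entities: List[Entity]) -> str:
    """Replace recognized entities with placeholders, then apply a regex filter."""
    # Work from right to left so that earlier offsets stay valid after edits.
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    # Regex fallback for an 11-digit phone number the NER step did not tag.
    return re.sub(r"\b\d{11}\b", "[PHONE]", text)


# Hypothetical record and a hypothetical NER result for it.
record = "Patient Zhang Wei, phone 13800000000, admitted to Ward 3."
name_start = record.index("Zhang Wei")
entities = [(name_start, name_start + len("Zhang Wei"), "NAME")]

masked = mask(record, entities)
```

Replacing spans from the end of the string backwards avoids recomputing offsets when a placeholder is shorter or longer than the text it replaces.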
A clearly challenging problem in named entity recognition is the construction of entity data sets. Although some research has been conducted on entity database construction, most of it targets Wikipedia, and a minority targets structured entities such as people, locations, and organizations in the news. This paper focuses on the identification of scientific entities in English-language literature, using carbonate platforms in sedimentology as an example. Firstly, because literature in key disciplines is likely to be written by multidisciplinary experts, this paper designs a literature content extraction method that can deal with complex text structures. Secondly, based on the extracted content, we formalize the entity extraction task (lexicon and lexical-based entity extraction). Furthermore, to test the accuracy of entity extraction, three currently popular recognition methods are chosen to perform entity detection. Experiments show that the entity data set provided by the lexicon and lexical-based entity extraction method is of significant assistance for the named entity recognition task. This is a pilot study of entity extraction from structurally complex, specialized English literature on carbonate platforms.
Computational linguistics is an engineering-based scientific discipline. It deals with understanding written and spoken language from a computational viewpoint. The domain also helps construct artefacts that are useful in processing and producing language, either in bulk or in a dialogue setting. Named Entity Recognition (NER) is a fundamental task in the data extraction process. It concentrates on identifying and labelling the atomic components of texts and grouping them under different entities, such as organizations, people, places, and times. Further, the NER mechanism identifies and removes more types of entities as required. The significance of NER has been well established in Natural Language Processing (NLP) tasks, and various research investigations have been conducted to develop novel NER methods. The conventional approaches range from rule-related and hand-crafted feature-related Machine Learning (ML) techniques to Deep Learning (DL) techniques. In this context, the current study introduces a novel Dart Games Optimizer with Hybrid Deep Learning-Driven Computational Linguistics (DGOHDL-CL) model for NER. The presented DGOHDL-CL technique aims to determine and label the atomic components of texts as a collection of named entities. In the DGOHDL-CL technique, word embedding is performed at the initial stage with the word2vec model. For the NER mechanism, the Convolutional Gated Recurrent Unit (CGRU) model is employed. Finally, the DGO technique is used as a hyperparameter tuning strategy for the CGRU algorithm to improve the NER outcomes. No earlier studies have integrated the DGO mechanism with the CGRU model for NER. To demonstrate the superiority of the proposed DGOHDL-CL technique, an extensive simulation analysis was executed on two datasets, CoNLL-2003 and OntoNotes 5.0. The experimental outcomes establish the promising performance of the DGOHDL-CL technique over other models.
Using knowledge graph technology to integrate multi-source heterogeneous crop and pest data and to fully mine the knowledge hidden in text is significant for intelligent agricultural knowledge services. However, only a small amount of labeled data is available for training in the agricultural knowledge graph domain, and labeling is costly because the data lack openness and standardization. This paper proposes a novel model that uses knowledge distillation for weakly supervised entity recognition in ontology construction. Knowledge distillation is performed between the target and source data domains, where Bi-LSTM and CRF models are constructed for entity recognition. The experimental results show that less than one-tenth of the data needs to be labeled for model training. Furthermore, the agricultural domain ontology is constructed with the BiLSTM-CRF named entity recognition model and a relationship extraction model. In total, 13,983 entities and 26,498 relationships are built in the Neo4j graph database.
Named Entity Recognition (NER) is one of the fundamental tasks in Natural Language Processing (NLP). It aims to locate, extract, and classify named entities into predefined categories such as person, organization, and location. Most earlier research on identifying named entities relied on handcrafted features and very large knowledge resources, which is time-consuming and inadequate for resource-scarce languages such as Arabic. Recently, deep learning has achieved state-of-the-art performance on many NLP tasks, including NER, without requiring hand-crafted features. In addition, transfer learning has proven its efficiency in several NLP tasks by exploiting pretrained language models that transfer knowledge learned from large-scale datasets to domain-specific tasks. Bidirectional Encoder Representations from Transformers (BERT) is a contextual language model that generates semantic vectors dynamically according to the context of the words. The BERT architecture relies on multi-head attention, which allows it to capture global dependencies between words. In this paper, we propose a deep learning-based model that fine-tunes BERT to recognize and classify Arabic named entities. The pre-trained BERT context embeddings were used as input features to a Bidirectional Gated Recurrent Unit (BGRU) and were fine-tuned using two annotated Arabic Named Entity Recognition (ANER) datasets. Experimental results demonstrate that the proposed model outperformed state-of-the-art ANER models, achieving F-measure values of 92.28% and 90.68% on the ANERCorp dataset and the merged ANERCorp and AQMAR dataset, respectively.
Owing to the continuous barrage of cyber threats, there is a massive amount of cyber threat intelligence, and a great deal of it comes from textual sources. To analyze cyber threat intelligence, many security analysts rely on cumbersome and time-consuming manual effort. Cybersecurity knowledge graphs play a significant role in the automatic analysis of cyber threat intelligence. As the foundation for constructing a cybersecurity knowledge graph, named entity recognition (NER) is required to identify critical threat-related elements in textual cyber threat intelligence. Recently, deep neural network-based models have attained very good results in NER. However, the performance of these models relies heavily on the amount of labeled data. Since labeled data in cybersecurity is scarce, we propose an adversarial active learning framework to effectively select informative samples for further annotation. In addition, leveraging the long short-term memory (LSTM) network and the bidirectional LSTM (BiLSTM) network, we propose a novel NER model that introduces a dynamic attention mechanism into the BiLSTM-LSTM encoder-decoder. Once the selected informative samples are annotated, the proposed NER model is retrained. As a result, the performance of the NER model is incrementally enhanced at low labeling cost. Experimental results show the effectiveness of the proposed method.
With the application of artificial intelligence technology in the power industry, the knowledge graph is expected to play a key role in power grid dispatch processes, intelligent maintenance, and customer service response provision. Knowledge graphs are usually constructed based on entity recognition. Specifically, based on the mining of entity attributes and relationships, domain knowledge graphs can be constructed through knowledge fusion. In this work, the entities and characteristics of power entity recognition are analyzed, the mechanism of entity recognition is clarified, and entity recognition techniques are analyzed in the context of the power domain. Power entity recognition based on the conditional random fields (CRF) and bidirectional long short-term memory (BLSTM) models is investigated, and the two methods are comparatively analyzed. The results indicated that the CRF model, with an accuracy of 83%, can better identify the power entities compared to the BLSTM. The CRF approach can thus be applied to the entity extraction for knowledge graph construction in the power field.
Named entity recognition is a fundamental task in biomedical data mining. In this letter, a named entity recognition system for biomedical texts based on CRFs (Conditional Random Fields) is presented. The system makes extensive use of a diverse set of features, including local features, full-text features, and external resource features. All features incorporated in the system are described in detail, and the impact of different feature sets on system performance is evaluated. To improve performance, post-processing modules are used to handle abbreviations, cascaded named entities, and boundary error identification. Evaluation of the system shows that feature selection has an important impact on performance and that the post-processing contributes substantially to achieving better results.
In the era of big data, E-commerce plays an increasingly important role, and steel E-commerce holds a strong position within it. However, it is very difficult for purchasing staff to choose satisfactory steel raw materials from the diverse steel commodities offered on steel E-commerce platforms. To improve the efficiency with which purchasers search for commodities on these platforms, we propose a novel deep learning-based loss function for named entity recognition (NER). Considering the impact of small samples and imbalanced data, our NER scheme incorporates the focal loss, label smoothing, and the cross entropy into a lite bidirectional encoder representations from transformers (ALBERT) model to avoid over-fitting. Moreover, by analyzing the classic annotation techniques used to tag data, an ideal one is chosen for training the model in our proposed scheme. Experiments are conducted on Chinese steel E-commerce datasets. The experimental results show that the training time of the ALBERT-based method is much shorter than that of BERT-based models, while its precision, recall, and F1 are similar to those of BERT-based models. Meanwhile, our proposed approach performs much better than the combination of Word2Vec, bidirectional long short-term memory (Bi-LSTM), and conditional random field (CRF) models in terms of training time and F1.
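One possible way to combine the three ingredients is sketched below. The abstract does not give the loss formula, so this is an assumption and not the paper's definition: a label-smoothed target distribution is scored with cross entropy, and each class term is scaled by the focal factor (1 - p)^gamma. The function names and the default values of `gamma` and `eps` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_focal_loss(logits: List[float], target: int,
                        gamma: float = 2.0, eps: float = 0.1) -> float:
    """Cross entropy against a smoothed target, scaled per class by (1 - p)^gamma."""
    probs = softmax(logits)
    k = len(logits)
    loss = 0.0
    for c, p in enumerate(probs):
        # Label smoothing: the true class keeps 1 - eps plus its uniform share.
        q = (1.0 - eps) + eps / k if c == target else eps / k
        loss -= q * (1.0 - p) ** gamma * math.log(p)
    return loss


logits = [2.0, 0.5, -1.0]
plain_ce = smoothed_focal_loss(logits, target=0, gamma=0.0, eps=0.0)
focal_only = smoothed_focal_loss(logits, target=0, gamma=2.0, eps=0.0)
combined = smoothed_focal_loss(logits, target=0)
```

With `gamma=0` and `eps=0` the function reduces to ordinary cross entropy. With `gamma > 0`, examples the model already classifies confidently contribute less, which is the usual motivation for focal loss on imbalanced data.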
Traditional named entity recognition methods need professional domain knowledge and a large amount of human participation to extract features, while Chinese named entity recognition methods based on neural network models suffer from character vector representations that are too uniform. To solve these problems, we propose a Chinese named entity recognition method based on the BERT-BiLSTM-ATT-CRF model. Firstly, we use the bidirectional encoder representations from transformers (BERT) pre-trained language model to obtain the semantic vector of each word according to its context. Secondly, the word vectors produced by BERT are fed into a bidirectional long short-term memory network with an embedded attention mechanism (BiLSTM-ATT) to capture the most important semantic information in the sentence. Finally, a conditional random field (CRF) is used to learn the dependence between adjacent tags and obtain the globally optimal sentence-level tag sequence. The experimental results show that the proposed model achieves state-of-the-art performance on both the Microsoft Research Asia (MSRA) corpus and the People's Daily corpus, with F1 values of 94.77% and 95.97%, respectively.
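How a CRF layer turns per-token scores into a globally optimal tag sequence can be illustrated with Viterbi decoding. This is a generic sketch, not the paper's implementation: the emission scores stand in for BiLSTM-ATT outputs, and the tag set, the transition scores, and the name `viterbi` are hypothetical toy values.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the tag sequence with the highest total emission + transition score."""
    best = {t: emissions[0][t] for t in tags}  # best score of a path ending in t
    backpointers: List[Dict[str, str]] = []
    for emission in emissions[1:]:
        pointers: Dict[str, str] = {}
        new_best: Dict[str, float] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, cur)])
            pointers[cur] = prev
            new_best[cur] = best[prev] + transitions[(prev, cur)] + emission[cur]
        backpointers.append(pointers)
        best = new_best
    # Follow the backpointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B", "I"]
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -100.0  # an I tag may not directly follow O

emissions = [
    {"O": 1.0, "B": 0.2, "I": 0.0},
    {"O": 0.1, "B": 0.6, "I": 1.0},
    {"O": 1.0, "B": 0.0, "I": 0.2},
]
best_path = viterbi(emissions, transitions, tags)
greedy_path = [max(tags, key=lambda t: e[t]) for e in emissions]
```

Choosing the best tag per token independently gives `O I O`, which violates the tagging scheme. The transition penalty lets Viterbi prefer the valid sequence `O B O`, which is the dependence between adjacent tags that the CRF is meant to model.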
Purpose: The purpose of the study is to explore the potential use of natural language processing (NLP) and machine learning (ML) techniques and to find a feasible strategy and an effective approach for the NER task in Web-oriented person-specific information extraction. Design/methodology/approach: An SVM-based multi-classification approach combined with a rich set of NLP features derived from state-of-the-art NLP techniques is proposed for the NER task. A group of experiments was designed to investigate the influence of various NLP-based features, especially the semantic features, on the performance of the system. Optimal parameter settings for the SVM models, including the kernel function, the margin parameter, and the context window size, were also explored through experiments. Findings: The SVM-based multi-classification approach proved to be effective for the NER task. This work shows that NLP-based features, particularly the semantic features, are of great importance in data-driven NE recognition. The study indicates that a higher-order kernel function may not be desirable for this specific classification problem in practical applications; the simple linear-kernel SVM model performed better in this case. Moreover, the modified SVM models with an uneven margin parameter are more general and flexible and were shown to handle the imbalanced data problem better. Research limitations/implications: The SVM-based approach to the NER problem has only been shown to be effective on limited experimental data. Further research needs to be conducted on large batches of real Web data. In addition, the performance of the NER system needs to be tested when it is incorporated into a complete IE framework. Originality/value: The specially designed experiments make it feasible to fully explore the characteristics of the data and obtain the optimal parameter settings for the NER task, leading to favorable recall, precision, and F1 measures. The overall system performance (F1 value) for all types of named entities exceeds 88.6%, which meets the requirements of practical application.
Named entity recognition, as a sub-task of information extraction, has attracted widespread attention from scholars at home and abroad since it was proposed, and a series of studies and discussions have been carried out based on it. This paper discusses the existing named entity recognition technology based on its history of development.
Research on named entity recognition for domains with few labels is becoming increasingly important. In this paper, a novel algorithm, positive unlabeled named entity recognition (PUNER) with multi-granularity language information, is proposed. It combines positive unlabeled (PU) learning and deep learning to obtain multi-granularity language information from a few labeled instances and many unlabeled instances in order to recognize named entities. First, PUNER selects reliable negative instances from unlabeled datasets, uses positive instances and a corresponding number of negative instances to train the PU learning classifier, and iterates continuously to label all unlabeled instances. Second, a neural network-based architecture is used to implement the PU learning classifier, and comprehensive text semantics are obtained through multi-granularity language information, which helps the classifier recognize named entities correctly. Performance tests of PUNER are carried out on three multilingual NER datasets: CoNLL 2003, CoNLL 2002, and SIGHAN Bakeoff 2006. Experimental results demonstrate the effectiveness of the proposed PUNER.
Funding: Financially supported by the Natural Science Foundation of China (Grant No. 42301492); the National Key R&D Program of China (Grant Nos. 2022YFF0711600, 2022YFF0801201, 2022YFF0801200); the Major Special Project of Xinjiang (Grant No. 2022A03009-3); the Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (Grant No. KF-2022-07014); the Opening Fund of the Key Laboratory of the Geological Survey and Evaluation of the Ministry of Education (Grant No. GLAB 2023ZR01); and the Fundamental Research Funds for the Central Universities.
Abstract: As important geological data, geological reports contain rich expert and geological knowledge, but the challenge facing current research into geological knowledge extraction and mining is how to achieve an accurate understanding of geological reports guided by domain knowledge. While generic named entity recognition models and tools can be used to process geoscience reports and documents, their effectiveness is hampered by a dearth of domain-specific knowledge, which leads to a pronounced decline in recognition accuracy. This study summarizes six types of typical geological entities with reference to the ontological system of geological domains and builds a high-quality corpus for the task of geological named entity recognition (GNER). In addition, GeoWoBERT-advBGP (Geological Word-base BERT-adversarial training Bi-directional Long Short-Term Memory Global Pointer) is proposed to address the ambiguity, diversity and nesting of geological entities. The model first uses the fine-tuned, word-granularity-based pre-training model GeoWoBERT (Geological Word-base BERT) and combines it with the text features extracted by the BiLSTM (Bi-directional Long Short-Term Memory), followed by an adversarial training algorithm that improves the robustness of the model and enhances its resistance to interference; decoding is finally performed using a global association pointer algorithm. The experimental results show that the proposed model achieves high performance on the constructed dataset and is capable of mining rich geological information.
Funding: Supported by the Outstanding Youth Team Project of Central Universities (QNTD202308) and the Ant Group through the CCF-Ant Research Fund (CCF-AFSG 769498 RF20220214).
Abstract: Named Entity Recognition (NER) stands as a fundamental task within the field of biomedical text mining, aiming to extract specific types of entities such as genes, proteins, and diseases from complex biomedical texts and categorize them into predefined entity types. This process can provide basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer (RoBGP). Our model initially utilizes the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network for capturing bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Furthermore, the Global Pointer model is employed to comprehensively recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE, and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP in Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
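As a rough illustration of how a Global Pointer-style decoder can recover nested entities, the sketch below assumes the network has already produced, for each entity type, a score for every (start, end) token pair. Each span is accepted or rejected independently, so overlapping spans can coexist. The function names, the threshold of 0, the type labels and the toy scores are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple


def decode_spans(scores: Dict[str, List[List[float]]],
                 threshold: float = 0.0) -> List[Tuple[str, int, int]]:
    """Return every (type, start, end) whose span score exceeds the threshold."""
    entities = []
    for etype, matrix in scores.items():
        n = len(matrix)
        for start in range(n):
            # Only start <= end is a valid span, so scan the upper triangle.
            for end in range(start, n):
                if matrix[start][end] > threshold:
                    entities.append((etype, start, end))
    return sorted(entities, key=lambda e: (e[1], e[2], e[0]))


def toy_matrix(n: int, hits: Dict[Tuple[int, int], float]) -> List[List[float]]:
    """Build an n x n score matrix that is strongly negative except at `hits`."""
    matrix = [[-10.0] * n for _ in range(n)]
    for (i, j), score in hits.items():
        matrix[i][j] = score
    return matrix


# Hypothetical 4-token sentence: a "disease" span 0..2 that contains a "body" span 0..1.
toy_scores = {
    "disease": toy_matrix(4, {(0, 2): 3.2}),
    # The (3, 0) entry lies below the diagonal and is never read by the decoder.
    "body": toy_matrix(4, {(0, 1): 1.5, (3, 0): 4.0}),
}
found = decode_spans(toy_scores)
print(found)
```

Because no span competes with any other for a label, the nested "body" span survives alongside the enclosing "disease" span, which a single tag-per-token sequence labeller cannot express.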
Funding: This research was supported by the National Key Research and Development Program [2020YFB1006302].
Abstract: Named entity recognition (NER) is a fundamental task of information extraction (IE), and it has attracted considerable research attention in recent years. The abundance of annotated English NER datasets has significantly promoted NER research for English. By contrast, far less effort has been devoted to Chinese NER research, especially in the scientific domain, owing to the scarcity of Chinese NER datasets. To alleviate this problem, we present a Chinese scientific NER dataset, SciCN, which contains entity annotations of titles and abstracts derived from 3,500 scientific papers. We manually annotate a total of 62,059 entities, which are classified into six types. Compared to English scientific NER datasets, SciCN has a larger scale and is more diverse, for it not only contains more paper abstracts but these abstracts are also derived from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models to evaluate SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have proven the effectiveness of using lexicons to enhance Chinese NER models. Motivated by this fact, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers better performance gains than lexicons of other domains. We hope that the SciCN dataset and the lexicon will enable us to benchmark the NER task in the Chinese scientific domain and make progress for future research. The dataset and lexicon are available at: https://github.com/yangjingla/SciCN.git.
Funding: Supported by Yunnan Provincial Major Science and Technology Special Plan Projects (Grant Nos. 202202AD080003, 202202AE090008, 202202AD080004, 202302AD080003), the National Natural Science Foundation of China (Grant Nos. U21B2027, 62266027, 62266028, 62266025), and the Yunnan Province Young and Middle-Aged Academic and Technical Leaders Reserve Talent Program (Grant No. 202305AC160063).
Abstract: Chinese named entity recognition (CNER) has received widespread attention as an important task of Chinese information extraction. Most previous research has focused on individually studying flat CNER, overlapped CNER, or discontinuous CNER. However, a unified CNER is often needed in real-world scenarios. Recent studies have shown that grid tagging-based methods based on character-pair relationship classification hold great potential for achieving unified NER. Nevertheless, how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge. In this study, we enhance the character-pair grid representation by incorporating both local and global information. Significantly, we introduce a new approach by considering the character-pair grid representation matrix as a specialized image, converting the classification of character-pair relationships into a pixel-level semantic segmentation task. We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image, allowing for a more comprehensive understanding of associative features between character pairs. This approach leads to improved accuracy in predicting their relationships, ultimately enhancing entity recognition performance. We conducted experiments on two public CNER datasets in the biomedical domain, namely CMeEE-V2 and Diakg. The results demonstrate the effectiveness of our approach, which achieves F1-score improvements of 7.29 percentage points and 1.64 percentage points compared to the current state-of-the-art (SOTA) models, respectively.
Abstract: Named Entity Recognition (NER) is crucial for extracting structured information from text. While traditional methods rely on rules, Conditional Random Fields (CRFs), or deep learning, the advent of large-scale Pre-trained Language Models (PLMs) offers new possibilities. PLMs excel at contextual learning, potentially simplifying many natural language processing tasks. However, their application to NER remains underexplored. This paper investigates leveraging the GPT-3 PLM for NER without fine-tuning. We propose a novel scheme that utilizes carefully crafted templates and context examples selected based on semantic similarity. Our experimental results demonstrate the feasibility of this approach, suggesting a promising direction for harnessing PLMs in NER.
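To make the idea of similarity-selected context examples concrete, here is a minimal sketch that ranks a pool of annotated sentences against a query and fills a prompt template with the closest ones. It is not the authors' scheme: bag-of-words cosine similarity stands in for whatever semantic similarity measure the paper uses, and the template wording, function names and example sentences are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def bow(text: str) -> Counter:
    """Lower-cased bag-of-words vector (a crude stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query: str, pool: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Pick the k pool sentences most similar to the query."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex["text"])), reverse=True)[:k]


def build_prompt(query: str, examples: List[Dict[str, str]]) -> str:
    """Fill a simple few-shot template; the wording is an assumption."""
    lines = ["Extract the named entities from each sentence."]
    for ex in examples:
        lines.append(f"Sentence: {ex['text']}")
        lines.append(f"Entities: {ex['entities']}")
    lines.append(f"Sentence: {query}")
    lines.append("Entities:")
    return "\n".join(lines)


pool = [
    {"text": "Alice works at Acme in Paris", "entities": "Alice (PER); Acme (ORG); Paris (LOC)"},
    {"text": "The river flooded the valley", "entities": "none"},
    {"text": "Bob works at Initech in Berlin", "entities": "Bob (PER); Initech (ORG); Berlin (LOC)"},
]
query = "Carol works at Globex in Rome"
chosen = select_examples(query, pool, k=2)
prompt = build_prompt(query, chosen)
print(prompt)
```

The resulting string would then be sent to the language model, which is expected to complete the final "Entities:" line.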
Funding: This work was supported by the State Grid Science and Technology Research Program (SGSCJY00NYJS2200026).
Abstract: The power grid operation process is complex, and many operation process data involve national security, business secrets, and user privacy. Meanwhile, labeled datasets may exist in many different operation platforms, but they cannot be directly shared since power grid data is highly privacy-sensitive. How to use these multi-source heterogeneous data as much as possible to build a power grid knowledge map while protecting privacy and security has become an urgent problem in developing the smart grid. Therefore, this paper proposes a federated learning named entity recognition method for the power grid field, aiming to train a named entity recognition model that covers the entire power grid process using data with different security requirements. We decompose the named entity recognition (NER) model FLAT (Chinese NER Using Flat-Lattice Transformer) in each platform into a global part and a local part. The local part is used to capture the characteristics of the local data in each platform and is updated using locally labeled data. The global part is learned across different operation platforms to capture the shared NER knowledge. Its local gradients from different platforms are aggregated to update the global model, which is then delivered to each platform to update its global part. Experiments on two publicly available Chinese datasets and one power grid dataset validate the effectiveness of our method.
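The global/local split described above can be sketched as one communication round: each platform reports the gradient of its global part, a server averages those gradients, takes one update step, and sends the new global weights back, while local parts never leave their platform. The unweighted average, the plain gradient-descent step, the learning rate and all names below are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List


def aggregate_global_gradients(platform_grads: List[List[float]]) -> List[float]:
    """Average the global-part gradients reported by the platforms (unweighted)."""
    count = len(platform_grads)
    size = len(platform_grads[0])
    return [sum(grad[k] for grad in platform_grads) / count for k in range(size)]


def federated_round(global_weights: List[float],
                    platform_grads: List[List[float]],
                    lr: float = 0.1) -> List[float]:
    """One server step: aggregate, then apply a gradient-descent update."""
    avg = aggregate_global_gradients(platform_grads)
    return [w - lr * g for w, g in zip(global_weights, avg)]


# Two hypothetical platforms, each holding a copy of the global part and a private local part.
platforms: List[Dict[str, List[float]]] = [
    {"global": [1.0, -2.0], "local": [0.3]},
    {"global": [1.0, -2.0], "local": [-0.7]},
]
reported_grads = [[0.5, 1.0], [1.5, -1.0]]  # gradients of the global part only

new_global = federated_round(platforms[0]["global"], reported_grads, lr=0.1)
for platform in platforms:
    platform["global"] = list(new_global)  # broadcast; the local part is left untouched

print(new_global)
```

Only gradients of the shared part cross platform boundaries in this sketch, which is the property the method relies on to keep raw labeled data private.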
Funding: Funded by the Double Top-Class Innovation Research Project in Cyberspace Security Enforcement Technology of People's Public Security University of China (No. 2023SYL07).
Abstract: In recent years, cyber attacks have been intensifying and causing great harm to individuals, companies, and countries. The mining of cyber threat intelligence (CTI) can facilitate intelligence integration and serve well in combating cyber attacks. Named Entity Recognition (NER), as a crucial component of text mining, can structure complex CTI text and aid cybersecurity professionals in effectively countering threats. However, current CTI NER research has mainly focused on English CTI. In the limited studies conducted on Chinese text, existing models have shown poor performance. To fully utilize the power of Chinese pre-trained language models (PLMs) and overcome the problem of lengthy, infrequent English words mixed into Chinese CTI, we propose a residual dilated convolutional neural network (RDCNN) with a conditional random field (CRF) based on a robustly optimized bidirectional encoder representation from transformers pre-training approach with whole word masking (RoBERTa-wwm), abbreviated as RoBERTa-wwm-RDCNN-CRF. We are the first to experiment on the relevant open source dataset and achieve an F1-score of 82.35%, which exceeds the common baseline model in this field, bidirectional encoder representation from transformers (BERT)-bidirectional long short-term memory (BiLSTM)-CRF, by about 19.52% and exceeds the current state-of-the-art model, BERT-RDCNN-CRF, by about 3.53%. In addition, we conducted an ablation study on the encoder part of the model and an in-depth investigation of the PLMs and the encoder part to verify the effectiveness of the proposed model. The RoBERTa-wwm-RDCNN-CRF model and the shared pre-processing and augmentation methods can serve subsequent fundamental tasks such as cybersecurity information extraction and knowledge graph construction, contributing to important downstream applications such as intrusion detection and advanced persistent threat (APT) attack detection.
Funding: This research was supported by the National Natural Science Foundation of China under Grant No. 42050102 and the Postgraduate Education Reform Project of Jiangsu Province under Grant No. SJCX22_0343. This research was also supported by the Dou Wanchun Expert Workstation of Yunnan Province (No. 202205AF150013).
Abstract: With the rapid development of information technology, the electronification of medical records has gradually become a trend. In China, the population base is huge and the supporting medical institutions are numerous, which drives the conversion of paper medical records to electronic medical records. Electronic medical records are the basis for establishing a smart hospital and an important guarantee for achieving medical intelligence, and the massive amount of electronic medical record data is also an important data set for conducting research in the medical field. However, electronic medical records contain a large amount of private patient information, which must be desensitized before the records are used as open resources. Therefore, to solve the above problems, data masking for Chinese electronic medical records with named entity recognition is proposed in this paper. Firstly, the text is vectorized to satisfy the required format of the model input. Secondly, since the input sentences vary in length and the relationship between sentences in context is not negligible, a neural network model for named entity recognition based on bidirectional long short-term memory (BiLSTM) with conditional random fields (CRF) is constructed. Finally, the data masking operation is performed based on the named entity recognition results, mainly using regular expression filtering encryption and principal component analysis (PCA) word vector compression and replacement. In addition, comparison experiments with the hidden Markov model (HMM), LSTM-CRF model, and BiLSTM model are conducted in this paper. The experimental results show that the method used in this paper achieves 92.72% Accuracy, 92.30% Recall, and 92.51% F1_score, which is higher than the other models.
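A minimal sketch of the masking step, under stated assumptions: the NER model is taken to return non-overlapping character spans with a label, each span is replaced by a placeholder, and a regular expression then catches structured identifiers the tagger may miss. The 11-digit phone pattern, the placeholder format, the function names and the sample record are hypothetical, and the PCA word-vector replacement mentioned in the abstract is not shown.

```python
import re
from typing import List, Tuple

# Assumed pattern: a standalone run of exactly 11 digits (e.g. a mobile number).
PHONE = re.compile(r"\b\d{11}\b")


def mask_entities(text: str, entities: List[Tuple[int, int, str]]) -> str:
    """Replace each non-overlapping (start, end, label) character span with [LABEL]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # keep the text before the entity
        pieces.append(f"[{label}]")        # substitute the entity itself
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def mask_record(text: str, entities: List[Tuple[int, int, str]]) -> str:
    """Mask NER spans first, then apply the regular-expression filter."""
    return PHONE.sub("[PHONE]", mask_entities(text, entities))


record = "Patient Zhang Wei, phone 13800000000, admitted to Ward 3."
ner_output = [(8, 17, "NAME")]  # hypothetical tagger output; end offset is exclusive
masked = mask_record(record, ner_output)
print(masked)
```

Applying the span replacement before the regular expression keeps the character offsets from the tagger valid, since they refer to the unmodified text.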
Funding: Supported by the National Natural Science Foundation of China under Grant No. 42050102, the National Science Foundation of China (Grant No. 62001236), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 20KJA520003).
Abstract: An obviously challenging problem in named entity recognition is the construction of entity data sets. Although some research has been conducted on entity database construction, most of it is directed at Wikipedia, and a minority at structured entities such as people, locations and organization names in the news. This paper focuses on the identification of scientific entities in English literature, using the example of carbonate platforms in sedimentology. Firstly, based on the fact that the reasons for writing literature in key disciplines are likely to be provided by multidisciplinary experts, this paper designs a literature content extraction method that can deal with complex text structures. Secondly, based on the extracted literature content, we formalize the entity extraction task (lexicon and lexical-based entity extraction). Furthermore, to test the accuracy of entity extraction, three currently popular recognition methods are chosen to perform entity detection. Experiments show that the entity data set provided by the lexicon and lexical-based entity extraction method is of significant assistance for the named entity recognition task. This study presents a pilot study of entity extraction from complexly structured, specialized English literature on carbonate platforms.
Funding: Princess Nourah Bint Abdulrahman University Researchers Supporting Project Number (PNURSP2022R281), Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia. The authors would like to thank the Deanship of Scientific Research at Umm Al-Qura University for supporting this work by Grant Code: (22UQU4331004DSR10).
Abstract: Computational linguistics is an engineering-based scientific discipline. It deals with understanding written and spoken language from a computational viewpoint. Further, the domain also helps construct the artefacts that are useful in processing and producing a language either in bulk or in a dialogue setting. Named Entity Recognition (NER) is a fundamental task in the data extraction process. It concentrates on identifying and labelling the atomic components from several texts grouped under different entities, such as organizations, people, places, and times. Further, the NER mechanism identifies and removes more types of entities as per the requirements. The significance of the NER mechanism has been well established in Natural Language Processing (NLP) tasks, and various research investigations have been conducted to develop novel NER methods. The conventional ways of managing the tasks range from rule-related and hand-crafted feature-related Machine Learning (ML) techniques to Deep Learning (DL) techniques. In this aspect, the current study introduces a novel Dart Games Optimizer with Hybrid Deep Learning-Driven Computational Linguistics (DGOHDL-CL) model for NER. The presented DGOHDL-CL technique aims to determine and label the atomic components from several texts as a collection of named entities. In the presented DGOHDL-CL technique, the word embedding process is executed at the initial stage with the help of the word2vec model. For the NER mechanism, the Convolutional Gated Recurrent Unit (CGRU) model is employed in this work. At last, the DGO technique is used as a hyperparameter tuning strategy for the CGRU algorithm to boost the NER outcomes. No earlier studies integrated the DGO mechanism with the CGRU model for NER. To exhibit the superiority of the proposed DGOHDL-CL technique, a widespread simulation analysis was executed on two datasets, CoNLL-2003 and OntoNotes 5.0. The experimental outcomes establish the promising performance of the DGOHDL-CL technique over other models.
Funding: Supported by Heilongjiang NSF funding (No. LH202F022); Heilongjiang research and application of key technologies (No. 2021ZXJ05A03); and the New Generation Artificial Intelligence program (No. 21ZD0110900), China.
Abstract: It is significant for agricultural intelligent knowledge services to use knowledge graph technology to integrate multi-source heterogeneous crop and pest data and fully mine the knowledge hidden in the text. However, only a limited amount of labeled data is available for training in the agricultural knowledge graph domain. Furthermore, labeling is costly because data openness and standardization are insufficient. This paper proposes a novel model using knowledge distillation for weakly supervised entity recognition in ontology construction. Knowledge distillation between the target and source data domains is performed, where Bi-LSTM and CRF models are constructed for entity recognition. The experimental results show that less than one-tenth of the data needs to be labeled for model training. Furthermore, the agricultural domain ontology is constructed with the BiLSTM-CRF named entity recognition model and a relationship extraction model. In total, 13,983 entities and 26,498 relationships are built in the Neo4j graph database.
Funding: Funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University through the Graduate Students Research Support Program.
Abstract: Named Entity Recognition (NER) is one of the fundamental tasks in Natural Language Processing (NLP), which aims to locate, extract, and classify named entities into predefined categories such as person, organization and location. Most of the earlier research on identifying named entities relied on handcrafted features and very large knowledge resources, which is time-consuming and not adequate for resource-scarce languages such as Arabic. Recently, deep learning has achieved state-of-the-art performance on many NLP tasks, including NER, without requiring handcrafted features. In addition, transfer learning has also proven its efficiency in several NLP tasks by exploiting pretrained language models that are used to transfer knowledge learned from large-scale datasets to domain-specific tasks. Bidirectional Encoder Representation from Transformer (BERT) is a contextual language model that generates semantic vectors dynamically according to the context of the words. The BERT architecture relies on multi-head attention, which allows it to capture global dependencies between words. In this paper, we propose a deep learning-based model that fine-tunes BERT to recognize and classify Arabic named entities. The pre-trained BERT context embeddings were used as input features to a Bidirectional Gated Recurrent Unit (BGRU) and were fine-tuned using two annotated Arabic Named Entity Recognition (ANER) datasets. Experimental results demonstrate that the proposed model outperformed state-of-the-art ANER models, achieving F-measure values of 92.28% and 90.68% on the ANERCorp dataset and the merged ANERCorp and AQMAR dataset, respectively.
Funding: The National Natural Science Foundation of China under grant 61501515.
Abstract: Owing to the continuous barrage of cyber threats, there is a massive amount of cyber threat intelligence. However, a great deal of cyber threat intelligence comes from textual sources. For analysis of cyber threat intelligence, many security analysts rely on cumbersome and time-consuming manual efforts. A cybersecurity knowledge graph plays a significant role in the automatic analysis of cyber threat intelligence. As the foundation for constructing a cybersecurity knowledge graph, named entity recognition (NER) is required for identifying critical threat-related elements from textual cyber threat intelligence. Recently, deep neural network-based models have attained very good results in NER. However, the performance of these models relies heavily on the amount of labeled data. Since labeled data in cybersecurity is scarce, in this paper we propose an adversarial active learning framework to effectively select informative samples for further annotation. In addition, leveraging the long short-term memory (LSTM) network and the bidirectional LSTM (BiLSTM) network, we propose a novel NER model by introducing a dynamic attention mechanism into the BiLSTM-LSTM encoder-decoder. With the selected informative samples annotated, the proposed NER model is retrained. As a result, the performance of the NER model is incrementally enhanced with low labeling cost. Experimental results show the effectiveness of the proposed method.
Funding: Supported by the Science and Technology Project of State Grid Corporation (Research and Application of Intelligent Energy Meter Quality Analysis and Evaluation Technology Based on Full Chain Data).
Abstract: With the application of artificial intelligence technology in the power industry, the knowledge graph is expected to play a key role in power grid dispatch processes, intelligent maintenance, and customer service response provision. Knowledge graphs are usually constructed based on entity recognition. Specifically, based on the mining of entity attributes and relationships, domain knowledge graphs can be constructed through knowledge fusion. In this work, the entities and characteristics of power entity recognition are analyzed, the mechanism of entity recognition is clarified, and entity recognition techniques are analyzed in the context of the power domain. Power entity recognition based on the conditional random fields (CRF) and bidirectional long short-term memory (BLSTM) models is investigated, and the two methods are comparatively analyzed. The results indicate that the CRF model, with an accuracy of 83%, identifies power entities better than the BLSTM. The CRF approach can thus be applied to entity extraction for knowledge graph construction in the power field.
Funding: Supported by the National Natural Science Foundation of China (No. 60302021).
Abstract: Named entity recognition is a fundamental task in biomedical data mining. In this letter, a named entity recognition system based on CRFs (Conditional Random Fields) for biomedical texts is presented. The system makes extensive use of a diverse set of features, including local features, full text features and external resource features. All features incorporated in this system are described in detail, and the impacts of different feature sets on the performance of the system are evaluated. To improve the performance of the system, post-processing modules are used to deal with abbreviation phenomena, cascaded named entities and the identification of boundary errors. Evaluation of this system shows that feature selection has an important impact on system performance, and that the post-processing makes an important contribution to achieving better results.
Funding: This work was supported in part by the National Natural Science Foundation of China under Grants U1836106 and 81961138010; in part by the Beijing Natural Science Foundation under Grants M21032 and 19L2029; in part by the Beijing Intelligent Logistics System Collaborative Innovation Center under Grant BILSCIC-2019KF-08; in part by the Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB, under Grants BK20BF010 and BK19BF006; and in part by the Fundamental Research Funds for the University of Science and Technology Beijing under Grant FRF-BD-19-012A.
Abstract: In the era of big data, E-commerce plays an increasingly important role, and steel E-commerce certainly occupies a significant position. However, it is very difficult for purchasing staff to choose satisfactory steel raw materials from the diverse steel commodities on online steel E-commerce platforms. To improve the efficiency with which purchasers search for commodities on steel E-commerce platforms, we propose a novel deep learning-based loss function for named entity recognition (NER). Considering the impacts of small samples and imbalanced data, in our NER scheme the focal loss, label smoothing, and the cross entropy are incorporated into a lite bidirectional encoder representations from transformers (BERT) model to avoid over-fitting. Moreover, through the analysis of different classic annotation techniques used to tag data, an ideal one is chosen for the training model in our proposed scheme. Experiments are conducted on Chinese steel E-commerce datasets. The experimental results show that the training time of the lite BERT (ALBERT)-based method is much shorter than that of BERT-based models, while it achieves similar performance to BERT-based models in terms of precision, recall, and F1. Meanwhile, our proposed approach performs much better than the combination of Word2Vec, bidirectional long short-term memory (Bi-LSTM), and conditional random field (CRF) models in terms of training time and F1.
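For reference, the three loss ingredients named above can each be written in a few lines for a single token. The sketch uses the standard textbook definitions (focal loss with focusing parameter gamma, uniform label smoothing with strength eps); how the paper weights or combines them is not stated in the abstract, so no combined loss is shown, and the default values, function names and toy logits are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target: int) -> float:
    return -math.log(probs[target])


def focal_loss(probs: List[float], target: int, gamma: float = 2.0) -> float:
    """Down-weight easy examples: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def label_smoothing_loss(probs: List[float], target: int, eps: float = 0.1) -> float:
    """Cross entropy against a target softened by spreading eps uniformly over all classes."""
    k = len(probs)
    smoothed = [eps / k + (1.0 - eps) * (1.0 if i == target else 0.0) for i in range(k)]
    return -sum(q * math.log(p) for q, p in zip(smoothed, probs))


probs = softmax([2.0, 0.5, -1.0])  # toy logits for a three-tag problem
target = 0
ce = cross_entropy(probs, target)
fl = focal_loss(probs, target)
ls = label_smoothing_loss(probs, target)
print(ce, fl, ls)
```

With gamma set to 0 the focal loss reduces to the cross entropy, and with eps set to 0 so does the label-smoothing loss, which makes the two terms easy to sanity-check.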
Abstract: Traditional named entity recognition methods need professional domain knowledge and a large amount of human participation to extract features, while Chinese named entity recognition methods based on neural network models suffer from the problem that the vector representation is too singular in the process of character vector representation. To solve the above problems, we propose a Chinese named entity recognition method based on the BERT-BiLSTM-ATT-CRF model. Firstly, we use the bidirectional encoder representations from transformers (BERT) pre-training language model to obtain the semantic vector of each word according to its context information. Secondly, the word vectors trained by BERT are input into the bidirectional long short-term memory network embedded with an attention mechanism (BiLSTM-ATT) to capture the most important semantic information in the sentence. Finally, the conditional random field (CRF) is used to learn the dependence between adjacent tags to obtain the globally optimal sentence-level tag sequence. The experimental results show that the proposed model achieves state-of-the-art performance on both the Microsoft Research Asia (MSRA) corpus and the People's Daily corpus, with F1 values of 94.77% and 95.97%, respectively.
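The role of the CRF layer, choosing the globally best tag sequence rather than the best tag at each position, can be illustrated with Viterbi decoding. The sketch below assumes per-token emission scores (as a BiLSTM would supply) and a tag-to-tag transition matrix are already available; the three-tag scheme and all numbers are made up for illustration.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag path.

    emissions[t][y] is the score of tag y at position t;
    transitions[p][y] is the score of moving from tag p to tag y.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n_tags):
            best_prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emissions[t][y])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(range(n_tags), key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B", "I"]
emissions = [[1.0, 0.8, 0.0],   # token 0 slightly prefers O
             [0.0, 0.1, 2.0]]   # token 1 strongly prefers I
transitions = [[0.0, 0.0, -100.0],  # O -> I is heavily penalised
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
best = [tags[i] for i in viterbi(emissions, transitions)]
greedy = [tags[max(range(3), key=lambda y: row[y])] for row in emissions]
print(best, greedy)
```

Picking the best tag per token gives the invalid sequence O, I, whereas the transition penalty lets Viterbi settle on B, I, which is the kind of adjacent-tag dependence the CRF is meant to capture.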
Funding: Supported by the Special Research Foundation for Young Teachers of Sun Yat-sen University (Grant No. 2000-3161101) and the Humanity and Social Science Youth Foundation of the Ministry of Education of China (Grant No. 08JC870013).
Abstract: Purpose: The purpose of the study is to explore the potential use of natural language processing (NLP) and machine learning (ML) techniques, and it intends to find a feasible strategy and effective approach to fulfill the NER task for Web-oriented person-specific information extraction. Design/methodology/approach: An SVM-based multi-classification approach combined with a set of rich NLP features derived from state-of-the-art NLP techniques has been proposed to fulfill the NER task. A group of experiments has been designed to investigate the influence of various NLP-based features, especially the semantic features, on the performance of the system. Optimal parameter settings for the SVM models, including kernel functions, the margin parameter of the SVM model and the context window size, have been explored through experiments as well. Findings: The SVM-based multi-classification approach has been proved to be effective for the NER task. This work shows that NLP-based features are of great importance in data-driven NE recognition, particularly the semantic features. The study indicates that a higher-order kernel function may not be desirable for the specific classification problem in practical application. The simple linear-kernel SVM model performed better in this case. Moreover, the modified SVM models with an uneven margin parameter are more common and flexible, and have been proved to solve the imbalanced data problem better. Research limitations/implications: The SVM-based approach for the NER problem has only been proved effective on limited experimental data. Further research needs to be conducted on large batches of real Web data. In addition, the performance of the NER system needs to be tested when incorporated into a complete IE framework. Originality/value: The specially designed experiments make it feasible to fully explore the characteristics of the data and obtain the optimal parameter settings for the NER task, leading to favorable recall, precision and F1 measures. The overall system performance (F1 value) for all types of named entities exceeds 88.6%, which meets the requirements of practical application.
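Since the results above are reported as precision, recall and F1, a short sketch of how these measures are commonly computed at the entity level may help. It assumes exact-match scoring, in which a predicted entity counts only if its type and boundaries both agree with a gold entity; the abstract does not state the matching criterion, and the entities below are invented.

```python
from typing import Set, Tuple

Entity = Tuple[str, int, int]  # (type, start, end)


def prf1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of entities."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {("PER", 0, 2), ("ORG", 5, 7), ("LOC", 9, 10)}
predicted = {("PER", 0, 2), ("ORG", 5, 6)}  # the ORG boundary is wrong, so it does not count
precision, recall, f1 = prf1(gold, predicted)
print(precision, recall, f1)
```

F1 is the harmonic mean of the two rates, so a system cannot score well by maximising only precision or only recall.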
Abstract: Named entity recognition, as a sub-task of information extraction, has attracted widespread attention from scholars at home and abroad since it was proposed, and a series of studies and discussions have been carried out on it. This paper discusses existing named entity recognition technology in light of its history of development.
Funding: The National Natural Science Foundation of China (No. 61876144) and the Strategy Priority Research Program of Chinese Academy of Sciences (No. XDC02070600).
Abstract: Research on named entity recognition for label-scarce domains is becoming increasingly important. In this paper, a novel algorithm, positive unlabeled named entity recognition (PUNER) with multi-granularity language information, is proposed. It combines positive unlabeled (PU) learning and deep learning to obtain multi-granularity language information from a few labeled instances and many unlabeled instances in order to recognize named entities. First, PUNER selects reliable negative instances from unlabeled datasets, uses positive instances and a corresponding number of negative instances to train the PU learning classifier, and iterates continuously to label all unlabeled instances. Second, a neural network-based architecture is used to implement the PU learning classifier, and comprehensive text semantics are obtained through multi-granular language information, which helps the classifier correctly recognize named entities. Performance tests of PUNER are carried out on three multilingual NER datasets: CoNLL 2003, CoNLL 2002 and SIGHAN Bakeoff 2006. Experimental results demonstrate the effectiveness of the proposed PUNER.