Funding: Supported by 242 National Information Security Projects (2017A149).
Abstract: Unlike named entity recognition (NER) for English, the absence of word boundaries reduces the final accuracy for Chinese NER. To avoid the accumulated error introduced by word segmentation, a deep model extracting character-level features is carefully built and becomes the basis for a new Chinese NER method proposed in this paper. This method converts the raw text to a character vector sequence, extracts global text features with a bidirectional long short-term memory (BiLSTM) and extracts local text features with a soft attention model. A linear-chain conditional random field is then used to label all the characters with the help of the global and local text features. Experiments based on the Microsoft Research Asia (MSRA) dataset are designed and implemented. Results show that the proposed method performs well compared with other methods, which proves that the extracted global and local text features have a positive influence on Chinese NER. For more variety in the test domains, a resume dataset from Sina Finance is also used to prove the effectiveness of the proposed method.
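The labeling step above can be made concrete. Below is a minimal, self-contained sketch of Viterbi decoding for a linear-chain CRF, using toy hand-set emission and transition scores; the paper's actual scores come from the learned BiLSTM and attention features, which are not modeled here:

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions   -- one dict per character: label -> emission score
    transitions -- dict: (prev_label, label) -> transition score
    labels      -- list of all label names
    """
    # First character: emission scores only.
    score = {y: emissions[0][y] for y in labels}
    backpointers = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for y in labels:
            best = max(labels, key=lambda p: score[p] + transitions[(p, y)])
            new_score[y] = score[best] + transitions[(best, y)] + em[y]
            ptr[y] = best
        score = new_score
        backpointers.append(ptr)
    # Backtrack from the best final label.
    y = max(labels, key=lambda l: score[l])
    path = [y]
    for ptr in reversed(backpointers):
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

With a transition table that penalizes an I label after an O label, the decoder prefers well-formed B-I-O sequences even when emission scores alone are ambiguous.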
Funding: the National Natural Science Foundation of China (No. 61876144); the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02070600).
Abstract: Research on named entity recognition for label-scarce domains is becoming increasingly important. In this paper, a novel algorithm, positive-unlabeled named entity recognition (PUNER) with multi-granularity language information, is proposed, which combines positive-unlabeled (PU) learning and deep learning to obtain multi-granularity language information from a few labeled instances and many unlabeled instances to recognize named entities. First, PUNER selects reliable negative instances from the unlabeled dataset, uses the positive instances and a corresponding number of negative instances to train the PU learning classifier, and iterates continuously to label all unlabeled instances. Second, a neural-network-based architecture is used to implement the PU learning classifier, and comprehensive text semantics are obtained through multi-granularity language information, which helps the classifier correctly recognize named entities. Performance tests of PUNER are carried out on three multilingual NER datasets: CoNLL 2003, CoNLL 2002 and SIGHAN Bakeoff 2006. Experimental results demonstrate the effectiveness of the proposed PUNER.
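The first step described above, selecting reliable negative instances from unlabeled data, can be illustrated with a toy stand-in. Here "reliable" is approximated as "farthest from the centroid of the positive instances"; that criterion and the 2-D toy vectors are illustrative assumptions, not PUNER's actual selection rule:

```python
def reliable_negatives(positives, unlabeled, k):
    """Pick the k unlabeled vectors farthest from the positive centroid.

    Toy proxy for the PU-learning step: instances far from the positive
    class are treated as reliably negative.
    """
    dim = len(positives[0])
    centroid = [sum(v[i] for v in positives) / len(positives) for i in range(dim)]

    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, centroid)) ** 0.5

    return sorted(unlabeled, key=dist, reverse=True)[:k]
```

In the full algorithm this selection would seed a classifier that is then re-applied to the remaining unlabeled pool over several iterations.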
Funding: supported by the National Natural Science Foundation of China (Nos. 42488201, 42172137, 42050104, and 42050102); the National Key R&D Program of China (No. 2023YFF0804000); the Sichuan Provincial Youth Science & Technology Innovative Research Group Fund (No. 2022JDTD0004).
Abstract: Geological reports are a significant accomplishment for geologists involved in geological investigations and scientific research, as they contain rich data and textual information. With the rapid development of science and technology, a large number of textual reports have accumulated in the field of geology. However, many non-hot topics and non-English-speaking regions are neglected in mainstream geoscience databases for geological information mining, making it more challenging for some researchers to extract necessary information from these texts. Natural Language Processing (NLP) has obvious advantages in processing large amounts of textual data. The objective of this paper is to identify geological named entities from Chinese geological texts using NLP techniques. We propose the RoBERTa-Prompt-Tuning-NER method, which leverages the concept of Prompt Learning and requires only a small amount of annotated data to train superior models for recognizing geological named entities in low-resource dataset configurations. The RoBERTa layer captures context-based information and longer-distance dependencies through dynamic word vectors. Finally, we conducted experiments on the constructed Geological Named Entity Recognition (GNER) dataset. Our experimental results show that the proposed model achieves the highest F1 score, 80.64%, among the four baseline algorithms, demonstrating the reliability and robustness of using the model for named entity recognition of geological texts.
Funding: supported by the National Natural Science Foundation of China (No. 62006243).
Abstract: Class-Incremental Few-Shot Named Entity Recognition (CIFNER) aims to identify entity categories that have appeared with only a few newly added (novel) class examples. However, existing class-incremental methods typically introduce new parameters to adapt to new classes and treat all information equally, resulting in poor generalization. Meanwhile, few-shot methods necessitate samples for all observed classes, making them difficult to transfer into a class-incremental setting. Thus, a decoupled two-phase framework for the CIFNER task is proposed to address the above issues. The whole task is converted into two separate tasks, named Entity Span Detection (ESD) and Entity Class Discrimination (ECD), which leverage parameter cloning and label fusion to separately learn different levels of knowledge, such as class-generic knowledge and class-specific knowledge. Moreover, different variants, such as the Conditional Random Field-based (CRF-based) and word-pair-based methods in the ESD module, and the add-based, Natural Language Inference-based (NLI-based) and prompt-based methods in the ECD module, are investigated to demonstrate the generalizability of the decoupled framework. Extensive experiments on three Named Entity Recognition (NER) datasets reveal that our method achieves state-of-the-art performance in the CIFNER setting.
Funding: The State Key Program of the National Natural Science Foundation of China, Grant/Award Number: 61533018; the National Natural Science Foundation of China, Grant/Award Number: 61402220; the Philosophy and Social Science Foundation of Hunan Province, Grant/Award Number: 16YBA323; the Natural Science Foundation of Hunan Province, Grant/Award Numbers: 2020J4525, 2022JJ30495; the Scientific Research Fund of Hunan Provincial Education Department, Grant/Award Numbers: 18B279, 19A439, 22A0316.
Abstract: Few-shot learning has been proposed and is rapidly emerging as a viable means for completing various tasks. Recently, few-shot models have been used for Named Entity Recognition (NER). The prototypical network shows high efficiency on few-shot NER. However, existing prototypical methods only consider the similarity of tokens in query sets and support sets, and ignore the semantic similarity among the sentences which contain these entities. We present a novel model, Few-shot Named Entity Recognition with Joint Token and Sentence Awareness (JTSA), to address this issue. Sentence awareness is introduced to probe the semantic similarity among sentences; token awareness is used to explore the similarity of tokens. To further improve the robustness and results of the model, we adopt a joint learning scheme for few-shot NER. Experimental results demonstrate that our model outperforms state-of-the-art models on two standard few-shot NER datasets.
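The prototypical-network core that JTSA builds on can be sketched in a few lines: each entity type gets a prototype (the mean of its support-set embeddings), and a query token is assigned to the nearest prototype. The 2-D vectors below are toy values, not the model's learned representations:

```python
def build_prototypes(support):
    """Mean embedding per entity type from a few support examples."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos

def classify(vec, protos):
    """Assign the entity type whose prototype is nearest (squared Euclidean)."""
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(vec, p))
    return min(protos, key=lambda label: dist(protos[label]))
```

JTSA's contribution sits on top of this: the distances are informed not only by token similarity but also by sentence-level similarity, which the sketch does not attempt to reproduce.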
Funding: supported by the National Key Research and Development Project (2021YFF0901701).
Abstract: Few-shot named entity recognition (NER) aims to identify named entities in new domains using a limited amount of annotated data. Previous methods divided this task into entity span detection and entity classification, achieving good results. However, these methods are limited by the imbalance between the entity and non-entity categories caused by using sequence labeling for entity span detection. To this end, a point-proto network (PPN) combining pointer and prototypical networks is proposed. Specifically, the pointer network generates the positions of entities in sentences in the entity span detection stage. The prototypical network builds semantic prototypes of entity types and classifies entities based on their distance from these prototypes in the entity classification stage. Moreover, the low-rank adaptation (LoRA) fine-tuning method, which freezes the pre-trained weights and injects a trainable decomposition matrix, reduces the number of parameters that need to be trained and saved. Extensive experiments on the few-shot NER dataset (Few-NERD) and Cross-Dataset demonstrate the superiority of PPN in this domain.
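The LoRA idea mentioned above is compact enough to sketch directly: the frozen weight matrix W is left untouched, and a trainable low-rank product A·B is added to its output, so only A and B need training. A minimal forward pass with toy matrices (plain nested lists, no deep-learning framework):

```python
def matmul_vec(x, M):
    """Row vector x times matrix M (list of rows)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def lora_forward(x, W, A, B, alpha=1.0):
    """Output of a frozen linear layer W plus the trainable low-rank
    correction alpha * (x A) B, as in LoRA fine-tuning.

    If W is d_in x d_out, then A is d_in x r and B is r x d_out with
    r much smaller than d_in and d_out.
    """
    base = matmul_vec(x, W)                       # frozen path
    low_rank = matmul_vec(matmul_vec(x, A), B)    # trainable rank-r path
    return [b + alpha * l for b, l in zip(base, low_rank)]
```

With rank r = 1 in the example below, only d_in + d_out numbers are trainable instead of d_in × d_out, which is the parameter saving the abstract refers to.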
Abstract: Conventional named entity recognition methods usually assume that the model can be trained with sufficient annotated data to obtain good recognition results. However, for Chinese named entity recognition in the electric power domain, existing methods still face the challenges of a lack of annotated data and new entities of unseen types. To address these challenges, this paper proposes a meta-learning-based continuous cue adjustment method. A generative pre-trained language model is used so that the model structure does not need to change when dealing with new entity types. To guide the pre-trained model to make full use of its own latent knowledge, a vector of learnable parameters is set as a cue to compensate for the lack of training data. To further improve the model's few-shot learning capability, a meta-learning strategy is used to train the model. Experimental results show that the proposed approach achieves the best results on a few-shot Chinese electric power named entity recognition dataset compared with several traditional named entity recognition approaches.
Abstract: Traditional named entity recognition methods need professional domain knowledge and a large amount of human participation to extract features, as does the Chinese named entity recognition method based on a neural network model, which suffers from character vector representations that are too uniform (context-independent). To solve this problem, we propose a Chinese named entity recognition method based on the BERT-BiLSTM-ATT-CRF model. Firstly, we use the bidirectional encoder representations from transformers (BERT) pre-trained language model to obtain the semantic vector of each word according to its context; secondly, the word vectors trained by BERT are input into the bidirectional long short-term memory network embedded with an attention mechanism (BiLSTM-ATT) to capture the most important semantic information in the sentence; finally, a conditional random field (CRF) is used to learn the dependence between adjacent tags and obtain the globally optimal sentence-level tag sequence. The experimental results show that the proposed model achieves state-of-the-art performance on both the Microsoft Research Asia (MSRA) corpus and the People's Daily corpus, with F1 values of 94.77% and 95.97%, respectively.
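The attention step in the BiLSTM-ATT stage can be sketched as softmax-weighted pooling over the hidden states: each state is scored against a query vector, the scores are normalized, and the states are averaged by those weights. The query vector and toy 2-D states below are illustrative assumptions, not the model's learned parameters:

```python
import math

def soft_attention(hidden_states, query):
    """Weight hidden states by softmax(dot(h, query)); return the weights
    and the attention-pooled context vector."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    m = max(scores)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, context
```

A state that aligns strongly with the query dominates the pooled context, which is how the attention layer emphasizes "the most important semantic information in the sentence".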
Funding: supported by the National Key Research and Development Program of China under Grant No. 2018YFC0830900; the National Natural Science Foundation of China under Grant No. 62076068; the Shanghai Municipal Science and Technology Project under Grant No. 21511102800.
Abstract: Inspired by the concept of content-addressable retrieval from cognitive science, we propose a novel fragment-based Chinese named entity recognition (NER) model augmented with a lexicon-based memory, in which both character-level and word-level features are combined to generate better feature representations for possible entity names. Observing that the boundary information of entity names is particularly useful for locating and classifying them into pre-defined categories, position-dependent features, such as prefix and suffix, are introduced and taken into account for NER tasks in the form of distributed representations. The lexicon-based memory is built to help generate such position-dependent features and deal with the problem of out-of-vocabulary words. Experimental results show that the proposed model, called LEMON, achieved state-of-the-art performance, with an increase in F1-score of up to 3.2% over the previous state-of-the-art models on four different widely used NER datasets.
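A toy version of the lexicon-derived, position-dependent evidence described above: for each character, collect the lexicon words that begin or end at that character, giving rough prefix/suffix signals for entity boundaries. This simple matching scheme is a simplification for illustration, not LEMON's actual memory mechanism:

```python
def lexicon_features(sentence, lexicon, max_len=4):
    """For each character position, list lexicon words that begin or end
    there -- crude position-dependent (prefix/suffix) features."""
    begins = [[] for _ in sentence]
    ends = [[] for _ in sentence]
    for i in range(len(sentence)):
        # Try every candidate word of length 1..max_len starting at i.
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            word = sentence[i:j]
            if word in lexicon:
                begins[i].append(word)
                ends[j - 1].append(word)
    return begins, ends
```

Conflicting matches (e.g. "南京市" vs "市长" in "南京市长江大桥") are exactly the ambiguity the learned model has to resolve; the features only surface the candidates.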
Abstract: Currently, as a basic task of military document information extraction, Named Entity Recognition (NER) for military documents has received great attention. In 2020, the China Conference on Knowledge Graph and Semantic Computing (CCKS) and the System Engineering Research Institute of the Academy of Military Sciences (AMS) issued an NER task for test evaluation, which requires the recognition of four types of entities: Test Elements (TE), Performance Indicators (PI), System Components (SC) and Task Scenarios (TS). Due to the particularity and confidentiality of the military field, only 400 items of annotated data are provided by the organizer. In this paper, the task is regarded as a few-shot learning problem for NER, and a method based on BERT and two-level model fusion is proposed. Firstly, the proposed method builds on several basic models obtained by fine-tuning BERT on the training data. Then, a two-level fusion strategy applied to the prediction results of the multiple basic models is proposed to alleviate the over-fitting problem. Finally, labeling errors are eliminated by post-processing. This method achieves an F1 score of 0.7203 on the test set of the evaluation task.
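As a rough illustration of combining the predictions of several base models, here is a token-level majority vote over tag sequences; the paper's actual two-level fusion strategy is more elaborate than this sketch:

```python
from collections import Counter

def vote_fusion(predictions):
    """Token-level majority vote over the tag sequences produced by
    several base models (all sequences must have equal length)."""
    fused = []
    for tags_at_position in zip(*predictions):
        fused.append(Counter(tags_at_position).most_common(1)[0][0])
    return fused
```

Averaging over models in this way tends to smooth out the idiosyncratic errors of any single fine-tuned model, which is the over-fitting relief the abstract attributes to fusion.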
Funding: supported by the National Natural Science Foundation of China (Nos. 61702508, 61802404, and U1836209); the National Key Research and Development Program of China (Nos. 2018YFB0803602 and 2016QY06X1204); the National Social Science Foundation of China (No. 19BSH022); also supported by the Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences, and the Beijing Key Laboratory of Network Security and Protection Technology.
Abstract: Network texts have become important carriers of cybersecurity information on the Internet. These texts include the latest security events, such as vulnerability exploitations, attack discoveries, advanced persistent threats, and so on. Extracting cybersecurity entities from these unstructured texts is a critical and fundamental task in many cybersecurity applications. However, most Named Entity Recognition (NER) models are suitable only for general fields, and there has been little research focusing on cybersecurity entity extraction in the security domain. To this end, in this paper, we propose a novel cybersecurity entity identification model based on Bidirectional Long Short-Term Memory with Conditional Random Fields (Bi-LSTM with CRF) to extract security-related concepts and entities from unstructured text. This model, which we have named XBiLSTM-CRF, consists of a word-embedding layer, a bidirectional LSTM layer, and a CRF layer, and concatenates the X input with the bidirectional LSTM output. Via extensive experiments on an open-source dataset containing an office security bulletin, security blogs, and the Common Vulnerabilities and Exposures list, we demonstrate that XBiLSTM-CRF achieves better cybersecurity entity extraction than state-of-the-art models.
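Downstream of a tagger like XBiLSTM-CRF, the predicted BIO tag sequence still has to be turned into entity spans. A small, generic decoder for that step (not specific to this paper's implementation; the example tokens are hypothetical):

```python
def bio_to_entities(tokens, tags):
    """Convert a BIO tag sequence into (text, type, start, end) spans.

    A stray I- tag whose type does not match the open entity closes the
    current span and is otherwise ignored.
    """
    entities = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes a trailing span
        continuing = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continuing:
            entities.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities
```

In a cybersecurity pipeline the extracted spans (malware names, CVE identifiers, vendor names) would then feed whatever application consumes the entities.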
Funding: the National Natural Science Foundation of China (Nos. 61862063, 61502413, and 61262025); the National Social Science Foundation of China (No. 18BJL104); the Natural Science Foundation of the Key Laboratory of Software Engineering of Yunnan Province, China (No. 2020SE301); the Yunnan Science and Technology Major Project (Nos. 202002AE090010 and 202002AD080002-5); the Data Driven Software Engineering Innovative Research Team Funding of Yunnan Province, China (No. 2017HC012).
Abstract: With the rapid development of Internet technology and the advent of the era of big data, more and more cybersecurity texts are provided on the Internet. These texts include not only security concepts, incidents, tools, guidelines, and policies, but also risk management approaches, best practices, assurances, technologies, and more. Through the integration of large-scale, heterogeneous, unstructured cybersecurity information, the identification and classification of cybersecurity entities can help handle cybersecurity issues. Due to the complexity and diversity of texts in the cybersecurity domain, it is difficult to identify security entities with traditional named entity recognition (NER) methods. This paper describes various approaches and techniques for NER in this domain, including the rule-based approach, the dictionary-based approach, and machine-learning-based approaches, and discusses the problems faced by NER research in this domain, such as conjunction and disjunction, non-standardized naming conventions, abbreviation, and massive nesting. Three future directions for NER in cybersecurity are proposed: (1) application of unsupervised or semi-supervised technology; (2) development of a more comprehensive cybersecurity ontology; and (3) development of more comprehensive deep learning models.
Abstract: To address the problem that most current named entity recognition (NER) models encode only character-level information and lack extraction of hierarchical text information, a Chinese NER (CNER) model fusing multi-granularity language knowledge and hierarchical information (CMH) is proposed. First, the text is encoded with a model pre-trained on multi-granularity language knowledge, so that the model captures both fine-grained and coarse-grained linguistic information and thus better represents the corpus. Second, the ON-LSTM (Ordered Neurons Long Short-Term Memory network) model is used to extract hierarchical information, exploiting the hierarchical structure of the text itself to strengthen the temporal relations between encodings. Finally, at the decoding end, the word segmentation information of the text is incorporated, and the entity recognition problem is converted into a table-filling problem, which better resolves entity overlap and yields more accurate recognition results. Meanwhile, to address the poor transferability of current models across domains, the idea of general entity recognition is proposed: by filtering general entity types from multiple domains, a general NER dataset, MDNER (Multi-Domain NER dataset), is constructed to improve the model's generalization across domains. To verify the effectiveness of the proposed model, experiments were conducted on the Resume, Weibo and MSRA datasets; compared with the MECT (Multi-metadata Embedding based Cross-Transformer) model, the F1 scores improved by 0.94, 4.95 and 1.58 percentage points, respectively. To verify the model's entity recognition performance across domains, experiments on MDNER reached an F1 score of 95.29%. The experimental results show that multi-granularity language knowledge pre-training, hierarchical text structure extraction, and the efficient pointer decoder are crucial to the model's performance.
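The table-filling formulation mentioned above represents every candidate span (i, j) as one table cell, which is why overlapping entities pose no problem: two overlapping spans simply occupy two different cells. A minimal sketch of reading entities back out of such a table (the types and spans below are hypothetical examples, not from the paper):

```python
def decode_table(table):
    """Read entity spans from a span table, where table[i][j] holds the
    entity type of the span covering positions i..j (inclusive), or None.
    Overlapping entities are recovered naturally, unlike with a single
    BIO tag sequence."""
    entities = []
    n = len(table)
    for i in range(n):
        for j in range(i, n):
            if table[i][j] is not None:
                entities.append((i, j, table[i][j]))
    return entities
```

A sequence-labeling decoder could emit at most one of two overlapping spans; here both survive because each is a separate cell.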
Abstract: To address the problem that models for named entity recognition (NER) tasks usually model only characters and related words, without fully exploiting the glyph structure information unique to Chinese characters or entity type information, an NER model fusing prior knowledge and glyph features is proposed. First, a Transformer combined with a Gaussian attention mechanism encodes the input sequence; Chinese definitions of the entity types are obtained from Chinese Wikipedia, a Bidirectional Gated Recurrent Unit (BiGRU) encodes the entity type information as prior knowledge, and an attention mechanism combines it with the character representations. Second, a bidirectional long short-term memory (BiLSTM) network encodes the long-distance dependencies of the input sequence; the Cangjie codes of traditional characters and the modern Wubi codes of simplified characters are obtained through a glyph encoding table, a convolutional neural network (CNN) extracts glyph feature representations, the traditional and simplified glyph features are combined with different weights, and a gating mechanism combines them with the BiLSTM-encoded character representations. Finally, a conditional random field (CRF) decodes the result to obtain the named entity label sequence. Experimental results on the colloquial Weibo dataset, the small Boson dataset and the large PeopleDaily dataset show that, compared with the baseline MECT (Multi-metadata Embedding based Cross-Transformer) model, the proposed model improves the F1 score by 2.47, 1.20 and 0.98 percentage points, respectively, verifying its effectiveness.
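The gating mechanism described above, which mixes the glyph features into the BiLSTM character representation, has this general shape: a sigmoid gate computed from both inputs interpolates between them. The gate weights below are toy values, not learned parameters, and the exact gate parameterization is an assumption for illustration:

```python
import math

def gated_fusion(char_vec, glyph_vec, gate_weights, gate_bias):
    """Combine character and glyph representations with a scalar gate:
    g = sigmoid(w . [char; glyph] + b),  out = g*char + (1-g)*glyph."""
    concat = char_vec + glyph_vec                      # [char; glyph]
    z = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-z))                     # sigmoid gate
    return [g * c + (1.0 - g) * s for c, s in zip(char_vec, glyph_vec)]
```

During training the gate learns, per context, how much glyph evidence (Cangjie/Wubi-derived features) to let through relative to the contextual character encoding.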