Chinese named entity recognition(CNER)has received widespread attention as an important task of Chinese information extraction.Most previous research has focused on individually studying flat CNER,overlapped CNER,or d...Chinese named entity recognition(CNER)has received widespread attention as an important task of Chinese information extraction.Most previous research has focused on individually studying flat CNER,overlapped CNER,or discontinuous CNER.However,a unified CNER is often needed in real-world scenarios.Recent studies have shown that grid tagging-based methods based on character-pair relationship classification hold great potential for achieving unified NER.Nevertheless,how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge.In this study,we enhance the character-pair grid representation by incorporating both local and global information.Significantly,we introduce a new approach by considering the character-pair grid representation matrix as a specialized image,converting the classification of character-pair relationships into a pixel-level semantic segmentation task.We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image,allowing for a more comprehensive understanding of associative features between character pairs.This approach leads to improved accuracy in predicting their relationships,ultimately enhancing entity recognition performance.We conducted experiments on two public CNER datasets in the biomedical domain,namely CMeEE-V2 and Diakg.The results demonstrate the effectiveness of our approach,which achieves F1-score improvements of 7.29 percentage points and 1.64 percentage points compared to the current state-of-the-art(SOTA)models,respectively.展开更多
With the rapid development of information technology,the electronifi-cation of medical records has gradually become a trend.In China,the population base is huge and the supporting medical institutions are numerous,so ...With the rapid development of information technology,the electronifi-cation of medical records has gradually become a trend.In China,the population base is huge and the supporting medical institutions are numerous,so this reality drives the conversion of paper medical records to electronic medical records.Electronic medical records are the basis for establishing a smart hospital and an important guarantee for achieving medical intelligence,and the massive amount of electronic medical record data is also an important data set for conducting research in the medical field.However,electronic medical records contain a large amount of private patient information,which must be desensitized before they are used as open resources.Therefore,to solve the above problems,data masking for Chinese electronic medical records with named entity recognition is proposed in this paper.Firstly,the text is vectorized to satisfy the required format of the model input.Secondly,since the input sentences may have a long or short length and the relationship between sentences in context is not negligible.To this end,a neural network model for named entity recognition based on bidirectional long short-term memory(BiLSTM)with conditional random fields(CRF)is constructed.Finally,the data masking operation is performed based on the named entity recog-nition results,mainly using regular expression filtering encryption and principal component analysis(PCA)word vector compression and replacement.In addi-tion,comparison experiments with the hidden markov model(HMM)model,LSTM-CRF model,and BiLSTM model are conducted in this paper.The experi-mental results show that the method used in this paper achieves 92.72%Accuracy,92.30%Recall,and 92.51%F1_score,which has higher accuracy compared with other models.展开更多
Unlike named entity recognition(NER)for English,the absence of word boundaries reduces the final accuracy for Chinese NER.To avoid accumulated error introduced by word segmentation,a deep model extracting character-le...Unlike named entity recognition(NER)for English,the absence of word boundaries reduces the final accuracy for Chinese NER.To avoid accumulated error introduced by word segmentation,a deep model extracting character-level features is carefully built and becomes a basis for a new Chinese NER method,which is proposed in this paper.This method converts the raw text to a character vector sequence,extracts global text features with a bidirectional long short-term memory and extracts local text features with a soft attention model.A linear chain conditional random field is also used to label all the characters with the help of the global and local text features.Experiments based on the Microsoft Research Asia(MSRA)dataset are designed and implemented.Results show that the proposed method has good performance compared to other methods,which proves that the global and local text features extracted have a positive influence on Chinese NER.For more variety in the test domains,a resume dataset from Sina Finance is also used to prove the effectiveness of the proposed method.展开更多
由于社交媒体平台上所发布的非结构化信息存在数据不一致、重要程度不同等问题,使自动准确抽取所需信息并标注受灾级别成为一个有挑战性的工作。因此,结合形式概念分析(FCA)、词共现关系和上下文语义信息构建了水灾事件知识体系。利用...由于社交媒体平台上所发布的非结构化信息存在数据不一致、重要程度不同等问题,使自动准确抽取所需信息并标注受灾级别成为一个有挑战性的工作。因此,结合形式概念分析(FCA)、词共现关系和上下文语义信息构建了水灾事件知识体系。利用所构建的知识体系,基于TencentPretrain框架对大规模语言预训练模型(LLM)进行指令微调,构建了ChatFlowFlood信息抽取模型,可以在少量人工标记情况下,准确自动抽取被困情况、紧缺物资等信息;在信息抽取模型的基础上,通过模糊层次分析法(FAHP)和CRITIC法(CRiteria Importance Through Intercriteria Correlation)主客观结合评定求助信息的救援优先级,帮助决策者理解灾情紧急程度。实验结果表明,在中文社交媒体数据上,与ChatFlow-7B模型相比,ChatFlowFlood模型的FBERT指标提升了73.09%。展开更多
基金supported by Yunnan Provincial Major Science and Technology Special Plan Projects(Grant Nos.202202AD080003,202202AE090008,202202AD080004,202302AD080003)National Natural Science Foundation of China(Grant Nos.U21B2027,62266027,62266028,62266025)Yunnan Province Young and Middle-Aged Academic and Technical Leaders Reserve Talent Program(Grant No.202305AC160063).
文摘Chinese named entity recognition(CNER)has received widespread attention as an important task of Chinese information extraction.Most previous research has focused on individually studying flat CNER,overlapped CNER,or discontinuous CNER.However,a unified CNER is often needed in real-world scenarios.Recent studies have shown that grid tagging-based methods based on character-pair relationship classification hold great potential for achieving unified NER.Nevertheless,how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge.In this study,we enhance the character-pair grid representation by incorporating both local and global information.Significantly,we introduce a new approach by considering the character-pair grid representation matrix as a specialized image,converting the classification of character-pair relationships into a pixel-level semantic segmentation task.We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image,allowing for a more comprehensive understanding of associative features between character pairs.This approach leads to improved accuracy in predicting their relationships,ultimately enhancing entity recognition performance.We conducted experiments on two public CNER datasets in the biomedical domain,namely CMeEE-V2 and Diakg.The results demonstrate the effectiveness of our approach,which achieves F1-score improvements of 7.29 percentage points and 1.64 percentage points compared to the current state-of-the-art(SOTA)models,respectively.
基金This research was supported by the National Natural Science Foundation of China under Grant(No.42050102)the Postgraduate Education Reform Project of Jiangsu Province under Grant(No.SJCX22_0343)Also,this research was supported by Dou Wanchun Expert Workstation of Yunnan Province(No.202205AF150013).
文摘With the rapid development of information technology,the electronifi-cation of medical records has gradually become a trend.In China,the population base is huge and the supporting medical institutions are numerous,so this reality drives the conversion of paper medical records to electronic medical records.Electronic medical records are the basis for establishing a smart hospital and an important guarantee for achieving medical intelligence,and the massive amount of electronic medical record data is also an important data set for conducting research in the medical field.However,electronic medical records contain a large amount of private patient information,which must be desensitized before they are used as open resources.Therefore,to solve the above problems,data masking for Chinese electronic medical records with named entity recognition is proposed in this paper.Firstly,the text is vectorized to satisfy the required format of the model input.Secondly,since the input sentences may have a long or short length and the relationship between sentences in context is not negligible.To this end,a neural network model for named entity recognition based on bidirectional long short-term memory(BiLSTM)with conditional random fields(CRF)is constructed.Finally,the data masking operation is performed based on the named entity recog-nition results,mainly using regular expression filtering encryption and principal component analysis(PCA)word vector compression and replacement.In addi-tion,comparison experiments with the hidden markov model(HMM)model,LSTM-CRF model,and BiLSTM model are conducted in this paper.The experi-mental results show that the method used in this paper achieves 92.72%Accuracy,92.30%Recall,and 92.51%F1_score,which has higher accuracy compared with other models.
基金Supported by 242 National Information Security Projects(2017A149)。
文摘Unlike named entity recognition(NER)for English,the absence of word boundaries reduces the final accuracy for Chinese NER.To avoid accumulated error introduced by word segmentation,a deep model extracting character-level features is carefully built and becomes a basis for a new Chinese NER method,which is proposed in this paper.This method converts the raw text to a character vector sequence,extracts global text features with a bidirectional long short-term memory and extracts local text features with a soft attention model.A linear chain conditional random field is also used to label all the characters with the help of the global and local text features.Experiments based on the Microsoft Research Asia(MSRA)dataset are designed and implemented.Results show that the proposed method has good performance compared to other methods,which proves that the global and local text features extracted have a positive influence on Chinese NER.For more variety in the test domains,a resume dataset from Sina Finance is also used to prove the effectiveness of the proposed method.
文摘由于社交媒体平台上所发布的非结构化信息存在数据不一致、重要程度不同等问题,使自动准确抽取所需信息并标注受灾级别成为一个有挑战性的工作。因此,结合形式概念分析(FCA)、词共现关系和上下文语义信息构建了水灾事件知识体系。利用所构建的知识体系,基于TencentPretrain框架对大规模语言预训练模型(LLM)进行指令微调,构建了ChatFlowFlood信息抽取模型,可以在少量人工标记情况下,准确自动抽取被困情况、紧缺物资等信息;在信息抽取模型的基础上,通过模糊层次分析法(FAHP)和CRITIC法(CRiteria Importance Through Intercriteria Correlation)主客观结合评定求助信息的救援优先级,帮助决策者理解灾情紧急程度。实验结果表明,在中文社交媒体数据上,与ChatFlow-7B模型相比,ChatFlowFlood模型的FBERT指标提升了73.09%。