Named entity recognition (NER) is a fundamental task of information extraction (IE), and it has attracted considerable research attention in recent years. The abundance of annotated English NER datasets has significantly advanced NER research for English. By contrast, far less effort has been devoted to Chinese NER research, especially in the scientific domain, owing to the scarcity of Chinese NER datasets. To alleviate this problem, we present SciCN, a Chinese scientific NER dataset that contains entity annotations of titles and abstracts derived from 3,500 scientific papers. We manually annotate a total of 62,059 entities, which are classified into six types. Compared to English scientific NER datasets, SciCN is larger and more diverse: it contains more paper abstracts, and these abstracts are drawn from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models to evaluate SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have demonstrated the effectiveness of using lexicons to enhance Chinese NER models. Motivated by this, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers larger performance gains than lexicons from other domains. We hope that the SciCN dataset and the lexicon will make it possible to benchmark NER in the Chinese scientific domain and support progress in future research. The dataset and lexicon are available at: https://github.com/yangjingla/SciCN.git.
Named Entity Recognition (NER) stands as a fundamental task within the field of biomedical text mining, aiming to extract specific types of entities such as genes, proteins, and diseases from complex biomedical texts and categorize them into predefined entity types. This process can provide basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer (RoBGP). Our model first uses the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network to capture bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Furthermore, the Global Pointer model is employed to comprehensively recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE, and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP in Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
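The abstract above does not describe how nested entities are decoded. As a rough illustration only, the sketch below assumes a span-scoring formulation in which every (start, end) pair receives one score per entity type and any span scoring above a threshold is emitted, so an inner span and an outer span can both be kept. The function name `decode_spans`, the score layout, and the threshold of 0 are assumptions made for this example and are not taken from the RoBGP paper.

```python
from typing import Dict, List, Tuple

# (entity type, start index, inclusive end index)
Span = Tuple[str, int, int]


def decode_spans(scores: Dict[str, List[List[float]]],
                 threshold: float = 0.0) -> List[Span]:
    """Return every (type, start, end) span whose score exceeds the threshold.

    scores[label][i][j] is a hypothetical score for the span that starts at
    token i and ends at token j. Only i <= j is inspected, and each span is
    judged independently, so nested spans can be returned together.
    """
    entities: List[Span] = []
    for label, matrix in scores.items():
        n = len(matrix)
        for start in range(n):
            for end in range(start, n):  # upper triangle only
                if matrix[start][end] > threshold:
                    entities.append((label, start, end))
    # Sort by position so the output order is deterministic.
    return sorted(entities, key=lambda e: (e[1], e[2], e[0]))
```

Because each span is thresholded on its own, a short span such as a body part can survive inside a longer disease span, which is the property a nested NER decoder needs.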
Chinese named entity recognition (CNER) has received widespread attention as an important task of Chinese information extraction. Most previous research has studied flat CNER, overlapped CNER, or discontinuous CNER individually. However, a unified CNER is often needed in real-world scenarios. Recent studies have shown that grid-tagging methods built on character-pair relationship classification hold great potential for achieving unified NER. Nevertheless, how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge. In this study, we enhance the character-pair grid representation by incorporating both local and global information. Significantly, we introduce a new approach that treats the character-pair grid representation matrix as a specialized image, converting the classification of character-pair relationships into a pixel-level semantic segmentation task. We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image, allowing for a more comprehensive understanding of associative features between character pairs. This approach leads to improved accuracy in predicting their relationships, ultimately enhancing entity recognition performance. We conducted experiments on two public CNER datasets in the biomedical domain, namely CMeEE-V2 and Diakg. The results demonstrate the effectiveness of our approach, which achieves F1-score improvements of 7.29 percentage points and 1.64 percentage points, respectively, compared to the current state-of-the-art (SOTA) models.
In the context of recognizing handwritten city names, this research addresses the challenges posed by the manual inscription of Bangladeshi city names in the Bangla script. In today’s technology-driven era, where precise tools for reading handwritten text are essential, this study focuses on leveraging deep learning to understand the intricacies of Bangla handwriting. The dearth of dedicated datasets has impeded the progress of Bangla handwritten city name recognition systems, particularly in critical areas such as postal automation and document processing. Notably, no prior research has specifically targeted the unique needs of Bangla handwritten city name recognition. To bridge this gap, the study collects real-world images from diverse sources to construct a comprehensive dataset for Bangla handwritten city name recognition. The emphasis on practical data for system training enhances accuracy. The research further conducts a comparative analysis, pitting state-of-the-art (SOTA) deep learning models, including EfficientNetB0, VGG16, ResNet50, DenseNet201, InceptionV3, and Xception, against a custom Convolutional Neural Network (CNN) model named “Our CNN.” The results showcase the superior performance of “Our CNN,” with a test accuracy of 99.97% and an F1 score of 99.95%. These metrics underscore its potential for automating city name recognition, particularly in postal services. The study concludes by highlighting the significance of meticulous dataset curation and the promising outlook for custom CNN architectures. It encourages future research avenues, including dataset expansion, algorithm refinement, exploration of recurrent neural networks and attention mechanisms, real-world deployment of models, and extension to other regional languages and scripts. These recommendations offer possibilities for advancing handwritten recognition technology and hold practical implications for enhancing global postal services.
Dear Jack, I'm very glad to know that you'll come to China to learn Chinese. And you want to know about Chinese names. Now, I'd like to tell you something about them. Chinese names are different from English names. In Chinese, family names always come first and given names come last. Given names usually have some special meanings. We also had informal names when we were little kids, such as Congcong, Nana and so on.
Named Entity Recognition (NER) is crucial for extracting structured information from text. While traditional methods rely on rules, Conditional Random Fields (CRFs), or deep learning, the advent of large-scale Pre-trained Language Models (PLMs) offers new possibilities. PLMs excel at contextual learning, potentially simplifying many natural language processing tasks. However, their application to NER remains underexplored. This paper investigates leveraging the GPT-3 PLM for NER without fine-tuning. We propose a novel scheme that utilizes carefully crafted templates and context examples selected based on semantic similarity. Our experimental results demonstrate the feasibility of this approach, suggesting a promising direction for harnessing PLMs in NER.
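The idea of choosing context examples by semantic similarity and placing them into a template can be sketched roughly as follows. Everything specific here is an assumption for illustration: the toy two-dimensional vectors stand in for real sentence embeddings, the template wording is invented, and `EXAMPLE_POOL`, `select_examples`, and `build_prompt` are hypothetical names, not the scheme used in the paper. No model is called; the sketch only builds the prompt text.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy labelled pool; the vectors are made-up stand-ins for sentence embeddings.
EXAMPLE_POOL: List[Dict] = [
    {"text": "Paris is in France.", "entities": "Paris (LOC); France (LOC)", "vec": [0.9, 0.1]},
    {"text": "Aspirin treats pain.", "entities": "Aspirin (DRUG)", "vec": [0.0, 1.0]},
    {"text": "Berlin is in Germany.", "entities": "Berlin (LOC); Germany (LOC)", "vec": [0.8, 0.3]},
]


def select_examples(query_vec: Sequence[float], pool: List[Dict], k: int = 2) -> List[Dict]:
    """Pick the k pool examples whose vectors are most similar to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Dict]) -> str:
    """Fill a simple (hypothetical) template with the selected examples and the query."""
    lines = ["Extract the named entities from each sentence."]
    for ex in examples:
        lines.append(f"Sentence: {ex['text']}")
        lines.append(f"Entities: {ex['entities']}")
    lines.append(f"Sentence: {query}")
    lines.append("Entities:")
    return "\n".join(lines)
```

A query about a city would pull the two location examples ahead of the drug example, so the prompt shows the model demonstrations that resemble the input.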
The variety of Chinese food, together with the complexity of its naming elements, has drawn the interest of numerous scholars and prompted extensive analyses of Chinese dish names. Most of these analyses, however, were carried out within traditional linguistics, rhetoric, translatology, and cross-cultural communication. Corpus-based studies of the naming elements of Chinese dishes under cognitive linguistic theories remain almost entirely absent. This paper aims to conduct a quantitative analysis of 4,000 Chinese dish names (500 selected freely from each of the eight cuisines), based on the Prominence Principle, in order to identify the specific naming elements of Chinese dishes and report related statistics and ratios.
In this paper, geographic names in Southwest China are regarded as symbolic representations of human beings, and the dynamic social and historical process behind the place names is restored from the perspective of symbolic anthropology. There are three paths in the construction and evolution of geographic names in Southwest China: ethnic information, sacred systems, and local representation, which have been rewritten, masked, and reconstructed over the years. As a result, the system of geographical names has gradually formed and been integrated into local memory through space building, culture filling, and so on, influencing local group identity and cognitive concepts.
The scientific names of organisms are key identifiers of plants and animals. Correctly treating scientific names is a prerequisite for biodiversity research and documentation. Here, we present an R package, 'U.Taxonstand', which can standardize and harmonize scientific names in plant and animal species lists at high speed and with a high rate of matching success. Unlike most similar R packages, each of which works with only one taxonomic database, U.Taxonstand can work with all taxonomic databases, as long as they are properly formatted. Multiple databases for plants and animals that can be used directly by U.Taxonstand, covering bryophytes, vascular plants, amphibians, birds, fishes, mammals, and reptiles, are available online. U.Taxonstand can be a very useful tool for botanists, zoologists, ecologists, and biogeographers to standardize and harmonize scientific names of organisms.
The power grid operation process is complex, and much of the operation process data involves national security, business secrets, and user privacy. Meanwhile, labeled datasets may exist on many different operation platforms, but they cannot be shared directly because power grid data is highly privacy-sensitive. How to use these multi-source heterogeneous data as fully as possible to build a power grid knowledge map while protecting privacy and security has become an urgent problem in developing the smart grid. Therefore, this paper proposes a federated learning-based named entity recognition method for the power grid field, aiming to build a named entity recognition model that covers the entire power grid process and is trained on data with different security requirements. We decompose the named entity recognition (NER) model FLAT (Chinese NER Using Flat-Lattice Transformer) on each platform into a global part and a local part. The local part captures the characteristics of the local data on each platform and is updated using locally labeled data. The global part is learned across different operation platforms to capture the shared NER knowledge. Its local gradients from different platforms are aggregated to update the global model, which is then delivered to each platform to update its global part. Experiments on two publicly available Chinese datasets and one power grid dataset validate the effectiveness of our method.
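The aggregation step described above, in which gradients of the shared global part are combined across platforms while local parts stay on each platform, can be sketched as follows. The abstract does not say how the gradients are combined, so weighting by each platform's sample count is an assumption borrowed from common federated averaging practice. The function names, the parameter layout, and the learning rate are likewise hypothetical.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def aggregate_global_gradients(updates: List[Tuple[int, Params]]) -> Params:
    """Average per-platform gradients of the global part, weighted by sample count.

    Each update is (num_local_samples, {param_name: gradient_values}). Only
    gradients of the shared global part are sent; local-part parameters never
    leave the platform in this sketch.
    """
    total = sum(n for n, _ in updates)
    first = updates[0][1]
    aggregated: Params = {}
    for name, values in first.items():
        aggregated[name] = [
            sum(n * grads[name][i] for n, grads in updates) / total
            for i in range(len(values))
        ]
    return aggregated


def apply_update(params: Params, grads: Params, lr: float = 0.1) -> Params:
    """One plain gradient-descent step on the global part."""
    return {
        name: [w - lr * g for w, g in zip(params[name], grads[name])]
        for name in params
    }
```

After `apply_update`, the server would send the new global parameters back to every platform, where they replace the global part while the locally trained part is left untouched.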
Named Data Networking (NDN) is gaining significant attention in Vehicular Ad-hoc Networks (VANET) due to its in-network content caching, name-based routing, and mobility-supporting characteristics. Nevertheless, existing NDN faces three significant challenges: security, privacy, and routing. In particular, security attacks, such as Content Poisoning Attacks (CPA), can jeopardize legitimate vehicles with malicious content. For instance, attacker host vehicles can serve consumers with invalid information, which has dire consequences, including road accidents. In such a situation, trust in the content-providing vehicles becomes a new challenge. In addition, ensuring privacy and preventing unauthorized access in vehicular NDN (VNDN) is another challenge. Moreover, NDN’s pull-based content retrieval mechanism is inefficient for delivering emergency messages in VNDN. Our contribution is therefore threefold. First, unlike existing rule-based reputation evaluation, we propose a Machine Learning (ML)-based reputation evaluation mechanism that identifies CPA attackers and legitimate nodes. Based on the ML evaluation results, vehicles accept or discard served content. Second, we exploit a decentralized blockchain system to ensure vehicles’ privacy by maintaining their information in a secure digital ledger. Finally, we change the default routing mechanism of VNDN from pull-based to push-based content dissemination using a Publish-Subscribe (Pub-Sub) approach. We implemented and evaluated our ML-based classification model on the publicly accessible BurST-Australian dataset for Misbehavior Detection (BurST-ADMA). We used five ML classifiers: Logistic Regression, Decision Tree, K-Nearest Neighbors, Random Forest, and Gaussian Naive Bayes. The results indicate that Random Forest achieved the highest average accuracy rate of 100%. Our proposed research offers the most accurate solution to detect CPA in VNDN for safe, secure, and reliable vehicle communication.
In recent years, cyber attacks have been intensifying and causing great harm to individuals, companies, and countries. The mining of cyber threat intelligence (CTI) can facilitate intelligence integration and help in combating cyber attacks. Named Entity Recognition (NER), as a crucial component of text mining, can structure complex CTI text and aid cybersecurity professionals in effectively countering threats. However, current CTI NER research has mainly focused on English CTI. In the limited studies conducted on Chinese text, existing models have shown poor performance. To fully utilize the power of Chinese pre-trained language models (PLMs) and overcome the problem of long, infrequent English words mixed into Chinese CTI text, we propose a residual dilated convolutional neural network (RDCNN) with a conditional random field (CRF) based on a robustly optimized bidirectional encoder representation from transformers pre-training approach with whole word masking (RoBERTa-wwm), abbreviated as RoBERTa-wwm-RDCNN-CRF. We are the first to experiment on the relevant open source dataset and achieve an F1-score of 82.35%, which exceeds the common baseline model in this field, bidirectional encoder representation from transformers (BERT)-bidirectional long short-term memory (BiLSTM)-CRF, by about 19.52% and exceeds the current state-of-the-art model, BERT-RDCNN-CRF, by about 3.53%. In addition, we conducted an ablation study on the encoder part of the model and an in-depth investigation of the PLMs and the encoder to verify the effectiveness of the proposed model. The RoBERTa-wwm-RDCNN-CRF model, together with the shared pre-processing and augmentation methods, can serve subsequent fundamental tasks such as cybersecurity information extraction and knowledge graph construction, contributing to important downstream applications such as intrusion detection and advanced persistent threat (APT) attack detection.
With the rapid development of information technology, the electronification of medical records has gradually become a trend. In China, the population base is huge and the supporting medical institutions are numerous, which drives the conversion of paper medical records to electronic medical records. Electronic medical records are the basis for establishing a smart hospital and an important guarantee for achieving medical intelligence, and the massive amount of electronic medical record data is also an important data set for conducting research in the medical field. However, electronic medical records contain a large amount of private patient information, which must be desensitized before they are used as open resources. Therefore, to solve the above problems, data masking for Chinese electronic medical records with named entity recognition is proposed in this paper. Firstly, the text is vectorized to satisfy the required format of the model input. Secondly, since the input sentences vary in length and the relationship between sentences in context is not negligible, a neural network model for named entity recognition based on bidirectional long short-term memory (BiLSTM) with conditional random fields (CRF) is constructed. Finally, the data masking operation is performed based on the named entity recognition results, mainly using regular expression filtering encryption and principal component analysis (PCA) word vector compression and replacement. In addition, comparison experiments with the hidden Markov model (HMM), the LSTM-CRF model, and the BiLSTM model are conducted in this paper. The experimental results show that the method used in this paper achieves 92.72% accuracy, 92.30% recall, and a 92.51% F1 score, which is higher than the other models.
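As a rough sketch of the masking stage described above, the code below replaces spans returned by an NER model with placeholder tags and then hides long digit runs with a regular expression. It is an illustration under stated assumptions, not the paper's pipeline: the span format, the `[LABEL]` placeholders, the six-digit rule, and the function names are all hypothetical, and the PCA word-vector replacement is not reproduced. The last helper checks the reported metrics: the stated F1 score matches the harmonic mean of the two other figures, which suggests (but does not confirm) that the figure reported as accuracy is a precision value.

```python
import re
from typing import List, Tuple

# (start offset, exclusive end offset, entity label) as an NER model might return
EntitySpan = Tuple[int, int, str]


def mask_entities(text: str, spans: List[EntitySpan]) -> str:
    """Replace each recognized span with a [LABEL] placeholder.

    Spans are processed from right to left so that earlier offsets remain
    valid after each replacement. Spans are assumed not to overlap.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def mask_long_digit_runs(text: str, min_len: int = 6) -> str:
    """Hide digit runs of at least min_len characters (e.g. phone or ID numbers)."""
    pattern = re.compile(r"\d{%d,}" % min_len)
    return pattern.sub(lambda m: "*" * len(m.group()), text)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Running `f1_score(92.72, 92.30)` and rounding to two decimals gives 92.51, in line with the figures in the abstract.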
The naming convention in English-speaking countries (e.g., USA and UK), and several others in Western culture, where women traditionally have adopted their husbands’ surnames, is compared with the naming convention in Spain and Latin America, where women do not relinquish their maiden surnames. From a cross-cultural perspective spanning over three centuries, from Madame de Staël and Virginia Woolf to Hillary Clinton, this essay presents instances of women who took on the surname of their spouse upon marriage. It appears that even nowadays many women, including feminists, choose to comply with this patriarchal habit. Entanglements arising upon divorce or remarriage, such as traceability and perception of selfhood, especially for women with academic and professional profiles, are discussed here. Samples collected from life and literature across a fairly representative cultural range and diverse moments in history help to reach conclusions and build a consistent argument. Winds of change seem to be blowing with Vice President Kamala Harris, whose case is mentioned at the end of this essay. To circumvent the confusion for individuals and families (especially “blended” ones) that could result from the discrimination between males and females, on the one hand, and between married and unmarried women, on the other, the Spanish naming convention is proposed as a perfect compromise. This consists in every person bearing two surnames from birth and for good: one from each parent. Thus, women would keep their name(s), and along with them their perception of their self and their social and professional identity.
Recent advancements in the Vehicular Ad-hoc Network (VANET) have substantially addressed road-related challenges. Specifically, Named Data Networking (NDN) in VANET has emerged as a vital technology due to its outstanding features. However, the NDN communication framework fails to address two important issues. The current NDN employs a pull-based content retrieval network, which is inefficient in disseminating crucial content in Vehicular Named Data Networking (VNDN). Additionally, VNDN is vulnerable to illusion attackers due to the administrative-less network of autonomous vehicles. Although various solutions have been proposed for detecting vehicles’ behavior, they inadequately address the challenges specific to VNDN. To deal with these two issues, we propose a novel push-based crucial content dissemination scheme that extends the scope of VNDN from pull-based content retrieval to a push-based content forwarding mechanism. In addition, we exploit Machine Learning (ML) techniques within VNDN to detect the behavior of vehicles and classify them as attackers or legitimate. We trained and tested our system on the publicly accessible dataset Vehicular Reference Misbehavior (VeReMi). We employed five ML classification algorithms and constructed the best model for illusion attack detection. Our results indicate that Random Forest (RF) achieved excellent accuracy in detecting all illusion attack types in VeReMi, with an accuracy rate of 100% for type 1 and type 2, 96% for type 4 and type 16, and 95% for type 8. Thus, RF can effectively evaluate the behavior of vehicles and identify attacker vehicles with high accuracy. The ultimate goal of our research is to improve content exchange and secure VNDN from attackers. Thus, our ML-based attack detection and prevention mechanism ensures trustworthy content dissemination and prevents attacker vehicles from sharing misleading information in VNDN.
Vehicular data misuse may lead to traffic accidents and even loss of life, so it is crucial to achieve secure vehicular data communications. This paper focuses on secure vehicular data communications in Named Data Networking (NDN). In NDN, names, provider IDs, and data are transmitted in plaintext, which exposes vehicular data to security threats and leads to considerable data communication costs and failure rates. This paper proposes a Secure vehicular Data Communication (SDC) approach in NDN to suppress data communication costs and failure rates. SDC constructs a vehicular backbone to reduce the number of authenticated nodes involved in reverse paths. Only the ciphertext of the name and data is included in the signed Interest and Data and transmitted along the backbone, so secure data communications are achieved. SDC is evaluated, and the results demonstrate that SDC achieves the above objectives.
A clearly challenging problem in named entity recognition is the construction of entity data sets. Although some research has been conducted on entity database construction, most of it is directed at Wikipedia, and a minority at structured entities such as people, locations, and organizations in the news. This paper focuses on the identification of scientific entities in English literature, using carbonate platforms in sedimentology as the example. Firstly, because literature in key disciplines is likely to be written by experts from multiple disciplines, this paper designs a literature content extraction method that can deal with complex text structures. Secondly, based on the extracted literature content, we formalize the entity extraction task (lexicon and lexical-based entity extraction). Furthermore, to test the accuracy of entity extraction, three currently popular recognition methods are chosen to perform entity detection. Experiments show that the entity data set provided by the lexicon and lexical-based entity extraction method is of significant assistance for the named entity recognition task. This study presents a pilot study of entity extraction on structurally complex, specialized English literature on carbonate platforms.
Computational linguistics is an engineering-based scientific discipline. It deals with understanding written and spoken language from a computational viewpoint. Further, the domain also helps construct the artefacts that are useful in processing and producing a language either in bulk or in a dialogue setting. Named Entity Recognition (NER) is a fundamental task in the data extraction process. It concentrates on identifying and labelling the atomic components from several texts grouped under different entities, such as organizations, people, places, and times. Further, the NER mechanism identifies and removes more types of entities as per the requirements. The significance of the NER mechanism has been well-established in Natural Language Processing (NLP) tasks, and various research investigations have been conducted to develop novel NER methods. The conventional ways of managing the tasks range from rule-related and hand-crafted feature-related Machine Learning (ML) techniques to Deep Learning (DL) techniques. In this aspect, the current study introduces a novel Dart Games Optimizer with Hybrid Deep Learning-Driven Computational Linguistics (DGOHDL-CL) model for NER. The presented DGOHDL-CL technique aims to determine and label the atomic components from several texts as a collection of named entities. In the presented DGOHDL-CL technique, the word embedding process is executed at the initial stage with the help of the word2vec model. For the NER mechanism, the Convolutional Gated Recurrent Unit (CGRU) model is employed in this work. Finally, the DGO technique is used as a hyperparameter tuning strategy for the CGRU algorithm to boost the NER outcomes. No earlier studies integrated the DGO mechanism with the CGRU model for NER. To demonstrate the performance of the proposed DGOHDL-CL technique, an extensive simulation analysis was executed on two datasets, CoNLL-2003 and OntoNotes 5.0. The experimental outcomes establish the promising performance of the DGOHDL-CL technique over other models.
Settlement naming is an important carrier of settlement society and culture. It conveys regional culture, historical information, beliefs, and other information, and is an important clue to understanding regional culture and development characteristics. In this paper, the naming of traditional mountain settlements in Mentougou in western Beijing was studied through literature review and field research, and the correlation between the naming and the distribution characteristics of the settlements was discussed to provide a reference for the protection and construction of mountain settlements in Mentougou.
Guangzhou and Foshan enjoy a relatively mature metro network. However, some names of metro stations are over-transliterated in Pinyin. This translation method is used in translating general names, nouns of locality, and some names of tourist destinations. Drawing on translation landscape and linguistic landscape theories, the reasons for and impacts of over-transliteration in the Guangzhou and Foshan metro are discussed from the perspective of symbolic function. English names of metro stations in other cities serve as a reference for arriving at appropriate solutions.
Funding: This research was supported by the National Key Research and Development Program [2020YFB1006302].
Funding: Supported by the Outstanding Youth Team Project of Central Universities (QNTD202308) and the Ant Group through the CCF-Ant Research Fund (CCF-AFSG 769498 RF20220214).
Abstract: Named Entity Recognition (NER) stands as a fundamental task within the field of biomedical text mining, aiming to extract specific types of entities such as genes, proteins, and diseases from complex biomedical texts and categorize them into predefined entity types. This process can provide basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer (RoBGP). Our model first uses the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network to capture bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Furthermore, the Global Pointer model is employed to comprehensively recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE, and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP in Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
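To illustrate why a span-scoring head can recover nested entities, the following is a minimal sketch of the decoding step only. It assumes a hypothetical score matrix in which `scores[i][j]` rates the span from token `i` to token `j`, and it keeps every span scoring above a threshold. The function name, the threshold of 0 and the toy scores are assumptions for illustration; this is not the RoBGP implementation.

```python
from typing import List, Tuple


def decode_spans(scores: List[List[float]], threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Return every (start, end) token span whose score exceeds the threshold.

    scores[i][j] is a hypothetical score for the span that starts at token i
    and ends at token j (inclusive). Only entries with i <= j are inspected.
    """
    spans = []
    n = len(scores)
    for start in range(n):
        for end in range(start, n):
            if scores[start][end] > threshold:
                spans.append((start, end))
    return spans


# Toy 4-token sentence: span (0, 3) contains span (1, 2), i.e. a nested entity.
toy_scores = [
    [-1.0, -1.0, -1.0, 2.5],
    [-1.0, -1.0, 1.2, -1.0],
    [-1.0, -1.0, -1.0, -1.0],
    [-1.0, -1.0, -1.0, -1.0],
]
nested_spans = decode_spans(toy_scores)
```

Because each span is scored independently, an outer span and an inner span can both be accepted, which a single tag-per-token sequence labeller cannot express directly.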
Funding: Supported by the Yunnan Provincial Major Science and Technology Special Plan Projects (Grant Nos. 202202AD080003, 202202AE090008, 202202AD080004, 202302AD080003), the National Natural Science Foundation of China (Grant Nos. U21B2027, 62266027, 62266028, 62266025), and the Yunnan Province Young and Middle-Aged Academic and Technical Leaders Reserve Talent Program (Grant No. 202305AC160063).
Abstract: Chinese named entity recognition (CNER) has received widespread attention as an important task of Chinese information extraction. Most previous research has studied flat CNER, overlapped CNER, or discontinuous CNER individually. However, a unified CNER is often needed in real-world scenarios. Recent studies have shown that grid tagging methods based on character-pair relationship classification hold great potential for achieving unified NER. Nevertheless, how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge. In this study, we enhance the character-pair grid representation by incorporating both local and global information. Significantly, we introduce a new approach that treats the character-pair grid representation matrix as a specialized image, converting the classification of character-pair relationships into a pixel-level semantic segmentation task. We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image, allowing for a more comprehensive understanding of associative features between character pairs. This approach leads to improved accuracy in predicting their relationships, ultimately enhancing entity recognition performance. We conducted experiments on two public CNER datasets in the biomedical domain, namely CMeEE-V2 and Diakg. The results demonstrate the effectiveness of our approach, which achieves F1-score improvements of 7.29 percentage points and 1.64 percentage points, respectively, over the current state-of-the-art (SOTA) models.
Funding: MMU Postdoctoral and Research Fellow (Account: MMUI/230023.02).
Abstract: In the context of recognizing handwritten city names, this research addresses the challenges posed by the manual inscription of Bangladeshi city names in the Bangla script. In today's technology-driven era, where precise tools for reading handwritten text are essential, this study focuses on leveraging deep learning to understand the intricacies of Bangla handwriting. The existing dearth of dedicated datasets has impeded the progress of Bangla handwritten city name recognition systems, particularly in critical areas such as postal automation and document processing. Notably, no prior research has specifically targeted the unique needs of Bangla handwritten city name recognition. To bridge this gap, the study collects real-world images from diverse sources to construct a comprehensive dataset for Bangla handwritten city name recognition. The emphasis on practical data for system training enhances accuracy. The research further conducts a comparative analysis, pitting state-of-the-art (SOTA) deep learning models, including EfficientNetB0, VGG16, ResNet50, DenseNet201, InceptionV3, and Xception, against a custom Convolutional Neural Network (CNN) model named "Our CNN." The results showcase the superior performance of "Our CNN," with a test accuracy of 99.97% and an F1 score of 99.95%. These metrics underscore its potential for automating city name recognition, particularly in postal services. The study concludes by highlighting the significance of meticulous dataset curation and the promising outlook for custom CNN architectures. It encourages future research avenues, including dataset expansion, algorithm refinement, exploration of recurrent neural networks and attention mechanisms, real-world deployment of models, and extension to other regional languages and scripts. These recommendations offer possibilities for advancing handwritten recognition technology and hold practical implications for enhancing global postal services.
Abstract: Dear Jack, I'm very glad to know that you'll come to China to learn Chinese. And you want to know about Chinese names. Now, I'd like to tell you something about them. Chinese names are different from English names. In Chinese, family names always come first and given names come last. Given names usually have some special meanings. We also had informal names when we were little kids, such as Congcong, Nana and so on.
Abstract: Named Entity Recognition (NER) is crucial for extracting structured information from text. While traditional methods rely on rules, Conditional Random Fields (CRFs), or deep learning, the advent of large-scale Pre-trained Language Models (PLMs) offers new possibilities. PLMs excel at contextual learning, potentially simplifying many natural language processing tasks. However, their application to NER remains underexplored. This paper investigates leveraging the GPT-3 PLM for NER without fine-tuning. We propose a novel scheme that uses carefully crafted templates and context examples selected based on semantic similarity. Our experimental results demonstrate the feasibility of this approach, suggesting a promising direction for harnessing PLMs in NER.
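The idea of a template filled with similarity-selected context examples can be sketched as follows. This is only an illustration under stated assumptions: the template wording, the function names and the bag-of-words cosine similarity are all hypothetical stand-ins (the abstract does not say which similarity measure or template the authors use), and no model is called here.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_prompt(query: str, pool: List[Tuple[str, str]], k: int = 2) -> str:
    """Pick the k pool examples most similar to the query and fill a template."""
    q = bag_of_words(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bag_of_words(ex[0])), reverse=True)
    parts = ["Extract the named entities from each sentence."]
    for sentence, entities in ranked[:k]:
        parts.append(f"Sentence: {sentence}\nEntities: {entities}")
    parts.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(parts)


# Hypothetical labelled pool and query.
example_pool = [
    ("Bob flew to Berlin", "Bob (PER); Berlin (LOC)"),
    ("The cat sat on the mat", "none"),
    ("Carol works at Acme", "Carol (PER); Acme (ORG)"),
]
prompt = build_prompt("Alice flew to Paris", example_pool, k=1)
```

The resulting string would be sent to the language model as-is; swapping the bag-of-words vectors for sentence embeddings changes only `bag_of_words` and `cosine`.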
Abstract: The variety of Chinese food, together with the complexity of its naming elements, has attracted the interest of numerous scholars and prompted abundant analyses of Chinese dish names. Most of these, however, were done within traditional linguistics, rhetoric, translatology and cross-cultural communication. Corpus-based studies of the naming elements of Chinese dishes under cognitive linguistic theories remain almost a blank. This paper conducts a quantitative analysis of 4,000 Chinese dish names (500 selected freely from each of the eight cuisines), based on the Prominence Principle, in order to identify the specific naming elements of Chinese dishes and present related statistics and ratios.
Abstract: In this paper, geographic names in Southwest China are regarded as a symbolic representation of human beings, and the dynamic social and historical process behind the place names is restored from the perspective of symbolic anthropology. There are three paths in the construction and evolution of geographic names in Southwest China: ethnic information, sacred systems, and local representation, which have been rewritten, masked, and reconstructed over the years. As a result, the system of geographical names is gradually formed and integrated into local memory through space building, culture filling, and so on, influencing local group identity and cognitive concepts.
Funding: Supported by the National Natural Science Foundation of China (32030068) and the Shanghai Municipal Natural Science Foundation (20ZR1418100) to J.Z.
Abstract: The scientific names of organisms are key identifiers of plants and animals. Correctly treating scientific names is a prerequisite for biodiversity research and documentation. Here, we present an R package, 'U.Taxonstand', which can standardize and harmonize scientific names in plant and animal species lists at a fast speed and with a high rate of matching success. Unlike most similar R packages, each of which works with only one taxonomic database, U.Taxonstand can work with all taxonomic databases, as long as they are properly formatted. Multiple databases for plants and animals that can be used directly by U.Taxonstand, covering bryophytes, vascular plants, amphibians, birds, fishes, mammals, and reptiles, are available online. U.Taxonstand can be a very useful tool for botanists, zoologists, ecologists and biogeographers to standardize and harmonize scientific names of organisms.
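To make concrete what matching a species list against a reference database involves, here is a minimal Python sketch that normalizes whitespace and then looks for the closest reference name with the standard-library `difflib`. It is not U.Taxonstand (which is an R package) and does not reproduce its matching rules; the function name, the cutoff of 0.85 and the tiny reference list are assumptions for illustration.

```python
import difflib
from typing import Dict, List, Optional


def standardize_names(names: List[str], reference: List[str],
                      cutoff: float = 0.85) -> Dict[str, Optional[str]]:
    """Map each submitted name to its closest reference name, or None.

    Whitespace is collapsed first; then difflib's similarity ratio is used
    as a simple stand-in for a real fuzzy-matching rule.
    """
    result: Dict[str, Optional[str]] = {}
    for name in names:
        cleaned = " ".join(name.split())
        matches = difflib.get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
        result[name] = matches[0] if matches else None
    return result


# Hypothetical reference database and a species list with a typo and a stray space.
reference_names = ["Quercus robur", "Pinus sylvestris"]
submitted = ["Quercus robar", "Pinus  sylvestris", "Unknown plant"]
standardized = standardize_names(submitted, reference_names)
```

Names that fall below the cutoff are returned as `None` rather than being forced onto the nearest entry, so unmatched records can be reviewed by hand.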
Funding: This work was supported by the State Grid Science and Technology Research Program (SGSCJY00NYJS2200026).
Abstract: The power grid operation process is complex, and many operation process data involve national security, business secrets, and user privacy. Meanwhile, labeled datasets may exist in many different operation platforms, but they cannot be directly shared since power grid data is highly privacy-sensitive. How to use these multi-source heterogeneous data as much as possible to build a power grid knowledge map while protecting privacy and security has become an urgent problem in developing the smart grid. Therefore, this paper proposes a federated learning named entity recognition method for the power grid field, aiming to train a named entity recognition model that covers the entire power grid process using data with different security requirements. We decompose the named entity recognition (NER) model FLAT (Chinese NER Using Flat-Lattice Transformer) in each platform into a global part and a local part. The local part is used to capture the characteristics of the local data in each platform and is updated using locally labeled data. The global part is learned across different operation platforms to capture the shared NER knowledge. Its local gradients from different platforms are aggregated to update the global model, which is then delivered to each platform to update its global part. Experiments on two publicly available Chinese datasets and one power grid dataset validate the effectiveness of our method.
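The aggregation step described above, in which gradients of the shared global part are combined across platforms while local parts stay private, can be sketched as follows. The unweighted mean, the learning rate and all names are assumptions for illustration; the abstract does not specify the aggregation rule, and this is not the authors' code.

```python
from typing import Dict, List

Gradients = Dict[str, List[float]]


def aggregate_global_gradients(platform_grads: List[Gradients]) -> Gradients:
    """Average the global-part gradients reported by each platform.

    Each platform sends gradients only for the shared (global) parameters;
    its local-part parameters and raw data never leave the platform.
    """
    n = len(platform_grads)
    names = platform_grads[0].keys()
    return {
        name: [sum(values) / n for values in zip(*(g[name] for g in platform_grads))]
        for name in names
    }


def apply_update(params: Gradients, grads: Gradients, lr: float = 0.1) -> Gradients:
    """One plain gradient-descent step on the global parameters."""
    return {name: [p - lr * g for p, g in zip(params[name], grads[name])] for name in params}


# Two hypothetical platforms reporting gradients for one shared weight vector.
grads_a = {"w": [1.0, 3.0]}
grads_b = {"w": [3.0, 5.0]}
mean_grads = aggregate_global_gradients([grads_a, grads_b])
new_global = apply_update({"w": [1.0, 1.0]}, mean_grads, lr=0.5)
```

The updated `new_global` parameters would then be sent back to every platform, which overwrites its copy of the global part and continues training its local part on its own labelled data.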
Funding: Supporting Project Number (RSPD2023R553), King Saud University, Riyadh, Saudi Arabia.
Abstract: Named Data Networking (NDN) is gaining significant attention in Vehicular Ad-hoc Networks (VANET) due to its in-network content caching, name-based routing, and mobility-supporting characteristics. Nevertheless, existing NDN faces three significant challenges: security, privacy, and routing. In particular, security attacks, such as Content Poisoning Attacks (CPA), can jeopardize legitimate vehicles with malicious content. For instance, attacker host vehicles can serve consumers with invalid information, which has dire consequences, including road accidents. In such a situation, trust in the content-providing vehicles brings a new challenge. On the other hand, ensuring privacy and preventing unauthorized access in vehicular NDN (VNDN) is another challenge. Moreover, NDN's pull-based content retrieval mechanism is inefficient for delivering emergency messages in VNDN. In this connection, our contribution is threefold. First, unlike existing rule-based reputation evaluation, we propose a Machine Learning (ML)-based reputation evaluation mechanism that identifies CPA attackers and legitimate nodes. Based on the ML evaluation results, vehicles accept or discard served content. Secondly, we exploit a decentralized blockchain system to ensure vehicles' privacy by maintaining their information in a secure digital ledger. Finally, we change the default routing mechanism of VNDN from pull-based to push-based content dissemination using a Publish-Subscribe (Pub-Sub) approach. We implemented and evaluated our ML-based classification model on the publicly accessible BurST-Australian dataset for Misbehavior Detection (BurST-ADMA). We used five hybrid ML classifiers: Logistic Regression, Decision Tree, K-Nearest Neighbors, Random Forest, and Gaussian Naive Bayes. The qualitative results indicate that Random Forest achieved the highest average accuracy rate of 100%. Our proposed research offers the most accurate solution to detect CPA in VNDN for safe, secure, and reliable vehicle communication.
Funding: Funded by the Double Top-Class Innovation Research Project in Cyberspace Security Enforcement Technology of People's Public Security University of China (No. 2023SYL07).
Abstract: In recent years, cyber attacks have been intensifying and causing great harm to individuals, companies, and countries. The mining of cyber threat intelligence (CTI) can facilitate intelligence integration and serve well in combating cyber attacks. Named Entity Recognition (NER), as a crucial component of text mining, can structure complex CTI text and aid cybersecurity professionals in effectively countering threats. However, current CTI NER research has mainly focused on English CTI. In the limited studies conducted on Chinese text, existing models have shown poor performance. To fully utilize the power of Chinese pre-trained language models (PLMs) and overcome the problem of lengthy, infrequent English words mixed into Chinese CTI, we propose a residual dilated convolutional neural network (RDCNN) with a conditional random field (CRF) based on a robustly optimized bidirectional encoder representation from transformers pre-training approach with whole word masking (RoBERTa-wwm), abbreviated as RoBERTa-wwm-RDCNN-CRF. We are the first to experiment on the relevant open source dataset and achieve an F1-score of 82.35%, which exceeds the common baseline model in this field, bidirectional encoder representation from transformers (BERT)-bidirectional long short-term memory (BiLSTM)-CRF, by about 19.52% and exceeds the current state-of-the-art model, BERT-RDCNN-CRF, by about 3.53%. In addition, we conducted an ablation study on the encoder part of the model and an in-depth investigation of the PLMs and the encoder to verify the effectiveness of the proposed model. The RoBERTa-wwm-RDCNN-CRF model and the shared pre-processing and augmentation methods can serve subsequent fundamental tasks such as cybersecurity information extraction and knowledge graph construction, contributing to important downstream applications such as intrusion detection and advanced persistent threat (APT) attack detection.
Funding: This research was supported by the National Natural Science Foundation of China under Grant (No. 42050102) and the Postgraduate Education Reform Project of Jiangsu Province under Grant (No. SJCX22_0343). This research was also supported by the Dou Wanchun Expert Workstation of Yunnan Province (No. 202205AF150013).
Abstract: With the rapid development of information technology, the electronification of medical records has gradually become a trend. In China, the population base is huge and the supporting medical institutions are numerous, and this reality drives the conversion of paper medical records to electronic medical records. Electronic medical records are the basis for establishing a smart hospital and an important guarantee for achieving medical intelligence, and the massive amount of electronic medical record data is also an important data set for conducting research in the medical field. However, electronic medical records contain a large amount of private patient information, which must be desensitized before the records are used as open resources. Therefore, to solve the above problems, this paper proposes data masking for Chinese electronic medical records with named entity recognition. Firstly, the text is vectorized to satisfy the required format of the model input. Secondly, since input sentences vary in length and the contextual relationship between sentences is not negligible, a neural network model for named entity recognition based on bidirectional long short-term memory (BiLSTM) with conditional random fields (CRF) is constructed. Finally, the data masking operation is performed based on the named entity recognition results, mainly using regular expression filtering encryption and principal component analysis (PCA) word vector compression and replacement. In addition, comparison experiments with the hidden Markov model (HMM), the LSTM-CRF model, and the BiLSTM model are conducted in this paper. The experimental results show that the method used in this paper achieves 92.72% Accuracy, 92.30% Recall, and a 92.51% F1_score, which is higher than the other models.
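The final masking stage, which acts on recognized entity spans and on regular-expression matches, can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline: placeholder substitution stands in for the encryption and PCA-based replacement mentioned in the abstract, and the entity offsets, labels and the 11-digit phone pattern are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: any standalone run of exactly 11 digits is treated as a phone number.
PHONE_PATTERN = re.compile(r"\b\d{11}\b")


def mask_text(text: str, entities: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with [label], then mask phone numbers.

    `end` is exclusive. Spans are processed from right to left so that
    replacing one span does not shift the offsets of the spans before it.
    """
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return PHONE_PATTERN.sub("[PHONE]", text)


# Hypothetical record; in practice the span would come from the NER model.
record = "Patient Li Hua, phone 13800000000, admitted to Ward 3."
name_start = record.index("Li Hua")
masked = mask_text(record, [(name_start, name_start + len("Li Hua"), "NAME")])
```

Here the NER output handles free-text identifiers such as names, while the regular expression catches identifiers with a fixed surface form.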
Abstract: The naming convention in English-speaking countries (e.g., the USA and the UK), and in several other Western cultures, where women have traditionally adopted their husbands' surnames, is compared with the naming convention in Spain and Latin America, where women do not relinquish their maiden surnames. From a cross-cultural perspective spanning over three centuries, from Madame de Staël and Virginia Woolf to Hillary Clinton, this essay presents instances of women who took on the surname of their spouse upon marriage. It appears that even nowadays many women, including feminists, choose to comply with this patriarchal habit. Entanglements arising upon divorce or remarriage, such as traceability and perception of selfhood, especially for women with academic and professional profiles, are discussed here. Samples collected from life and literature across a fairly representative cultural range and diverse moments in history help to reach conclusions and build a consistent argument. Winds of change seem to be blowing with Vice President Kamala Harris, whose case is mentioned at the end of this essay. To circumvent the confusion for individuals and families (especially "blended" ones) that could result in discrimination between males and females, on the one hand, and between married and unmarried women, on the other, the Spanish naming convention is proposed as a perfect compromise. This consists in every person bearing two surnames from birth and for good: one from each parent. Thus, women would keep their name(s), and along with them their perception of their self and their social and professional identity.
Funding: Supported by the Researchers Supporting Project Number (RSP2023R34), King Saud University, Riyadh, Saudi Arabia.
Abstract: Recent advancements in the Vehicular Ad-hoc Network (VANET) have substantially addressed road-related challenges. Specifically, Named Data Networking (NDN) in VANET has emerged as a vital technology due to its outstanding features. However, the NDN communication framework fails to address two important issues. The current NDN employs a pull-based content retrieval network, which is inefficient in disseminating crucial content in Vehicular Named Data Networking (VNDN). Additionally, VNDN is vulnerable to illusion attackers due to the administrative-less network of autonomous vehicles. Although various solutions have been proposed for detecting vehicles' behavior, they inadequately address the challenges specific to VNDN. To deal with these two issues, we propose a novel push-based crucial content dissemination scheme that extends the scope of VNDN from pull-based content retrieval to a push-based content forwarding mechanism. In addition, we exploit Machine Learning (ML) techniques within VNDN to detect the behavior of vehicles and classify them as attackers or legitimate. We trained and tested our system on the publicly accessible dataset Vehicular Reference Misbehavior (VeReMi). We employed five ML classification algorithms and constructed the best model for illusion attack detection. Our results indicate that Random Forest (RF) achieved excellent accuracy in detecting all illusion attack types in VeReMi, with an accuracy rate of 100% for type 1 and type 2, 96% for type 4 and type 16, and 95% for type 8. Thus, RF can effectively evaluate the behavior of vehicles and identify attacker vehicles with high accuracy. The ultimate goal of our research is to improve content exchange and secure VNDN from attackers. Thus, our ML-based attack detection and prevention mechanism ensures trustworthy content dissemination and prevents attacker vehicles from sharing misleading information in VNDN.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62032013 and the LiaoNing Revitalization Talents Program under Grant No. XLYC1902010.
Abstract: Vehicular data misuse may lead to traffic accidents and even loss of life, so it is crucial to achieve secure vehicular data communications. This paper focuses on secure vehicular data communications in Named Data Networking (NDN). In NDN, names, provider IDs and data are transmitted in plaintext, which exposes vehicular data to security threats and leads to considerable data communication costs and failure rates. This paper proposes a Secure vehicular Data Communication (SDC) approach in NDN to suppress data communication costs and failure rates. SDC constructs a vehicular backbone to reduce the number of authenticated nodes involved in reverse paths. Only the ciphertext of the name and data is included in the signed Interest and Data and transmitted along the backbone, so secure data communications are achieved. SDC is evaluated, and the results demonstrate that SDC achieves the above objectives.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 42050102, the National Science Foundation of China (Grant No. 62001236), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 20KJA520003).
Abstract: An obviously challenging problem in named entity recognition is the construction of entity data sets. Although some research has been conducted on entity database construction, most of it is directed at Wikipedia, and a minority at structured entities such as people, locations and organizational nouns in the news. This paper focuses on the identification of scientific entities in English literature, using the example of carbonate platforms in sedimentology. Firstly, based on the fact that literature in key disciplines is likely to be written by multidisciplinary experts, this paper designs a literature content extraction method that can deal with complex text structures. Secondly, based on the extracted literature content, we formalize the entity extraction task (lexicon and lexical-based entity extraction). Furthermore, to test the accuracy of entity extraction, three currently popular recognition methods are chosen to perform entity detection in this paper. Experiments show that the entity data set provided by the lexicon and lexical-based entity extraction method is of significant assistance for the named entity recognition task. This study presents a pilot study of entity extraction that involves complex-structured, specialized English literature on carbonate platforms.
Funding: Princess Nourah Bint Abdulrahman University Researchers Supporting Project Number (PNURSP2022R281), Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia. The authors would like to thank the Deanship of Scientific Research at Umm Al-Qura University for supporting this work by Grant Code: (22UQU4331004DSR10).
Abstract: Computational linguistics is an engineering-based scientific discipline. It deals with understanding written and spoken language from a computational viewpoint. Further, the domain also helps construct artefacts that are useful in processing and producing a language either in bulk or in a dialogue setting. Named Entity Recognition (NER) is a fundamental task in the data extraction process. It concentrates on identifying and labelling the atomic components from several texts grouped under different entities, such as organizations, people, places, and times. Further, the NER mechanism identifies and removes more types of entities as per the requirements. The significance of the NER mechanism has been well-established in Natural Language Processing (NLP) tasks, and various research investigations have been conducted to develop novel NER methods. The conventional ways of managing the tasks range from rule-related and hand-crafted feature-related Machine Learning (ML) techniques to Deep Learning (DL) techniques. In this aspect, the current study introduces a novel Dart Games Optimizer with Hybrid Deep Learning-Driven Computational Linguistics (DGOHDL-CL) model for NER. The presented DGOHDL-CL technique aims to determine and label the atomic components from several texts as a collection of named entities. In the presented DGOHDL-CL technique, the word embedding process is executed at the initial stage with the help of the word2vec model. For the NER mechanism, the Convolutional Gated Recurrent Unit (CGRU) model is employed in this work. At last, the DGO technique is used as a hyperparameter tuning strategy for the CGRU algorithm to boost the NER outcomes. No earlier studies integrated the DGO mechanism with the CGRU model for NER. To exhibit the superiority of the proposed DGOHDL-CL technique, a widespread simulation analysis was executed on two datasets, CoNLL-2003 and OntoNotes 5.0. The experimental outcomes establish the promising performance of the DGOHDL-CL technique over other models.
Abstract: Settlement naming is an important carrier of settlement society and culture. It carries regional culture, historical information, beliefs and other information, and is an important clue for understanding regional culture and development characteristics. In this paper, the naming of traditional mountain settlements in Mentougou in western Beijing was studied through literature review and field research, and the correlation between the naming and distribution characteristics of the settlements was discussed to provide a reference for the protection and construction of mountain settlements in Mentougou.
Abstract: Guangzhou and Foshan enjoy a relatively mature metro network. However, some names of metro stations are over-transliterated in Pinyin. Such a translation method is used in translating general names, nouns of locality and some names of tourist destinations. Drawing on translation landscape and linguistic landscape theories, the reasons for and impacts of over-transliteration in the Guangzhou and Foshan metro are discussed from the perspective of symbolic function. English names of metro stations in other cities serve as a reference for finding appropriate solutions.