In the field of natural language processing(NLP),there have been various pre-training language models in recent years,with question answering systems gaining significant attention.However,as algorithms,data,and comput...In the field of natural language processing(NLP),there have been various pre-training language models in recent years,with question answering systems gaining significant attention.However,as algorithms,data,and computing power advance,the issue of increasingly larger models and a growing number of parameters has surfaced.Consequently,model training has become more costly and less efficient.To enhance the efficiency and accuracy of the training process while reducing themodel volume,this paper proposes a first-order pruningmodel PAL-BERT based on the ALBERT model according to the characteristics of question-answering(QA)system and language model.Firstly,a first-order network pruning method based on the ALBERT model is designed,and the PAL-BERT model is formed.Then,the parameter optimization strategy of the PAL-BERT model is formulated,and the Mish function was used as an activation function instead of ReLU to improve the performance.Finally,after comparison experiments with traditional deep learning models TextCNN and BiLSTM,it is confirmed that PALBERT is a pruning model compression method that can significantly reduce training time and optimize training efficiency.Compared with traditional models,PAL-BERT significantly improves the NLP task’s performance.展开更多
Named Entity Recognition(NER)stands as a fundamental task within the field of biomedical text mining,aiming to extract specific types of entities such as genes,proteins,and diseases from complex biomedical texts and c...Named Entity Recognition(NER)stands as a fundamental task within the field of biomedical text mining,aiming to extract specific types of entities such as genes,proteins,and diseases from complex biomedical texts and categorize them into predefined entity types.This process can provide basic support for the automatic construction of knowledge bases.In contrast to general texts,biomedical texts frequently contain numerous nested entities and local dependencies among these entities,presenting significant challenges to prevailing NER models.To address these issues,we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer(RoBGP).Our model initially utilizes the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors.It then incorporates a Bidirectional Long Short-Term Memory network for capturing bidirectional semantic information,effectively addressing the issue of long-distance dependencies.Furthermore,the Global Pointer model is employed to comprehensively recognize all nested entities in the text.We conduct extensive experiments on the Chinese medical dataset CMeEE and the results demonstrate the superior performance of RoBGP over several baseline models.This research confirms the effectiveness of RoBGP in Chinese biomedical NER,providing reliable technical support for biomedical information extraction and knowledge base construction.展开更多
Purpose:Automatic keyphrase extraction(AKE)is an important task for grasping the main points of the text.In this paper,we aim to combine the benefits of sequence labeling formulation and pretrained language model to p...Purpose:Automatic keyphrase extraction(AKE)is an important task for grasping the main points of the text.In this paper,we aim to combine the benefits of sequence labeling formulation and pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.Design/methodology/approach:We regard AKE from Chinese text as a character-level sequence labeling task to avoid segmentation errors of Chinese tokenizer and initialize our model with pretrained language model BERT,which was released by Google in 2018.We collect data from Chinese Science Citation Database and construct a large-scale dataset from medical domain,which contains 100,000 abstracts as training set,6,000 abstracts as development set and 3,094 abstracts as test set.We use unsupervised keyphrase extraction methods including term frequency(TF),TF-IDF,TextRank and supervised machine learning methods including Conditional Random Field(CRF),Bidirectional Long Short Term Memory Network(BiLSTM),and BiLSTM-CRF as baselines.Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.Findings:Compared with character-level BiLSTM-CRF,the best baseline model with F1 score of 50.16%,our character-level sequence labeling model based on BERT obtains F1 score of 59.80%,getting 9.64%absolute improvement.Research limitations:We just consider automatic keyphrase extraction task rather than keyphrase generation task,so only keyphrases that are occurred in the given text can be extracted.In addition,our proposed dataset is not suitable for dealing with nested keyphrases.Practical implications:We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts(CAKE)publicly available for the benefits of research community,which is available at:https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.Originality/value:By designing comparative experiments,our study demonstrates that character-level formulation is more suitable for Chinese automatic keyphrase extraction task under the general trend of pretrained language models.And our proposed dataset provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.展开更多
In this paper we present the results of the Interactive Argument-Pair Extraction in Judgement Document Challenge held by both the Chinese AI and Law Challenge(CAIL)and the Chinese National Social Media Processing Conf...In this paper we present the results of the Interactive Argument-Pair Extraction in Judgement Document Challenge held by both the Chinese AI and Law Challenge(CAIL)and the Chinese National Social Media Processing Conference(SMP),and introduce the related data set-SMP-CAIL2020-Argmine.The task challenged participants to choose the correct argument among five candidates proposed by the defense to refute or acknowledge the given argument made by the plaintiff,providing the full context recorded in the judgement documents of both parties.We received entries from 63 competing teams,38 of which scored higher than the provided baseline model(BERT)in the first phase and entered the second phase.The best performing system in the two phases achieved accuracy of 0.856 and 0.905,respectively.In this paper,we will present the results of the competition and a summary of the systems,highlighting commonalities and innovations among participating systems.The SMP-CAIL2020-Argmine data set and baseline modelshave been already released.展开更多
基金Supported by Sichuan Science and Technology Program(2021YFQ0003,2023YFSY0026,2023YFH0004).
文摘In the field of natural language processing(NLP),there have been various pre-training language models in recent years,with question answering systems gaining significant attention.However,as algorithms,data,and computing power advance,the issue of increasingly larger models and a growing number of parameters has surfaced.Consequently,model training has become more costly and less efficient.To enhance the efficiency and accuracy of the training process while reducing themodel volume,this paper proposes a first-order pruningmodel PAL-BERT based on the ALBERT model according to the characteristics of question-answering(QA)system and language model.Firstly,a first-order network pruning method based on the ALBERT model is designed,and the PAL-BERT model is formed.Then,the parameter optimization strategy of the PAL-BERT model is formulated,and the Mish function was used as an activation function instead of ReLU to improve the performance.Finally,after comparison experiments with traditional deep learning models TextCNN and BiLSTM,it is confirmed that PALBERT is a pruning model compression method that can significantly reduce training time and optimize training efficiency.Compared with traditional models,PAL-BERT significantly improves the NLP task’s performance.
基金supported by the Outstanding Youth Team Project of Central Universities(QNTD202308)the Ant Group through CCF-Ant Research Fund(CCF-AFSG 769498 RF20220214).
文摘Named Entity Recognition(NER)stands as a fundamental task within the field of biomedical text mining,aiming to extract specific types of entities such as genes,proteins,and diseases from complex biomedical texts and categorize them into predefined entity types.This process can provide basic support for the automatic construction of knowledge bases.In contrast to general texts,biomedical texts frequently contain numerous nested entities and local dependencies among these entities,presenting significant challenges to prevailing NER models.To address these issues,we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer(RoBGP).Our model initially utilizes the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors.It then incorporates a Bidirectional Long Short-Term Memory network for capturing bidirectional semantic information,effectively addressing the issue of long-distance dependencies.Furthermore,the Global Pointer model is employed to comprehensively recognize all nested entities in the text.We conduct extensive experiments on the Chinese medical dataset CMeEE and the results demonstrate the superior performance of RoBGP over several baseline models.This research confirms the effectiveness of RoBGP in Chinese biomedical NER,providing reliable technical support for biomedical information extraction and knowledge base construction.
基金This work is supported by the project“Research on Methods and Technologies of Scientific Researcher Entity Linking and Subject Indexing”(Grant No.G190091)from the National Science Library,Chinese Academy of Sciencesthe project“Design and Research on a Next Generation of Open Knowledge Services System and Key Technologies”(2019XM55).
文摘Purpose:Automatic keyphrase extraction(AKE)is an important task for grasping the main points of the text.In this paper,we aim to combine the benefits of sequence labeling formulation and pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.Design/methodology/approach:We regard AKE from Chinese text as a character-level sequence labeling task to avoid segmentation errors of Chinese tokenizer and initialize our model with pretrained language model BERT,which was released by Google in 2018.We collect data from Chinese Science Citation Database and construct a large-scale dataset from medical domain,which contains 100,000 abstracts as training set,6,000 abstracts as development set and 3,094 abstracts as test set.We use unsupervised keyphrase extraction methods including term frequency(TF),TF-IDF,TextRank and supervised machine learning methods including Conditional Random Field(CRF),Bidirectional Long Short Term Memory Network(BiLSTM),and BiLSTM-CRF as baselines.Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.Findings:Compared with character-level BiLSTM-CRF,the best baseline model with F1 score of 50.16%,our character-level sequence labeling model based on BERT obtains F1 score of 59.80%,getting 9.64%absolute improvement.Research limitations:We just consider automatic keyphrase extraction task rather than keyphrase generation task,so only keyphrases that are occurred in the given text can be extracted.In addition,our proposed dataset is not suitable for dealing with nested keyphrases.Practical implications:We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts(CAKE)publicly available for the benefits of research community,which is available at:https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.Originality/value:By designing comparative experiments,our study demonstrates that character-level formulation is more suitable for Chinese automatic keyphrase extraction task under the general trend of pretrained language models.And our proposed dataset provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
基金supported by National Key Research and Development Plan(No.2018YFC0830600),and is cooperated with China Justice Big Data Institute,which provided judgement documents and the employment of professional annotators.The competition is also sponsored by Beijing Thunisoft Information Technology Co.,Ltd.,and supported by both CAIL and SMP organizers.
文摘In this paper we present the results of the Interactive Argument-Pair Extraction in Judgement Document Challenge held by both the Chinese AI and Law Challenge(CAIL)and the Chinese National Social Media Processing Conference(SMP),and introduce the related data set-SMP-CAIL2020-Argmine.The task challenged participants to choose the correct argument among five candidates proposed by the defense to refute or acknowledge the given argument made by the plaintiff,providing the full context recorded in the judgement documents of both parties.We received entries from 63 competing teams,38 of which scored higher than the provided baseline model(BERT)in the first phase and entered the second phase.The best performing system in the two phases achieved accuracy of 0.856 and 0.905,respectively.In this paper,we will present the results of the competition and a summary of the systems,highlighting commonalities and innovations among participating systems.The SMP-CAIL2020-Argmine data set and baseline modelshave been already released.