Geological reports are an important source of geological data and contain rich expert and geological knowledge, but a central challenge in geological knowledge extraction and mining is understanding these reports accurately under the guidance of domain knowledge. Generic named entity recognition models and tools can be applied to geoscience reports and documents, but their lack of domain-specific knowledge leads to a pronounced decline in recognition accuracy. This study summarizes six types of typical geological entities with reference to the ontological system of the geological domain and builds a high-quality corpus for geological named entity recognition (GNER). In addition, GeoWoBERT-advBGP (Geological Word-base BERT, adversarial training, Bi-directional Long Short-Term Memory, Global Pointer) is proposed to address the ambiguity, diversity, and nesting of geological entities. The model first uses the fine-tuned, word-granularity-based pre-trained model GeoWoBERT (Geological Word-base BERT) and combines it with text features extracted by a BiLSTM (Bi-directional Long Short-Term Memory) network. An adversarial training algorithm then improves the robustness of the model and its resistance to interference, and decoding is finally performed with a global association pointer algorithm. The experimental results show that the proposed model achieves high performance on the constructed dataset and is capable of mining rich geological information.
Handheld ultrasound devices are portable and affordable, so they are widely used in underdeveloped areas and community healthcare for rapid diagnosis and early screening. However, their image quality is not always satisfactory because of the limited equipment size, which hinders accurate diagnosis by doctors. At the same time, paired ultrasound images are difficult to obtain in the clinic because the imaging process is complicated. We therefore propose a modified cycle generative adversarial network (cycleGAN) for enhancing ultrasound images of multiple organs via unpaired pre-training. We introduce an ultrasound image pre-training method that does not require paired images, which alleviates the need for large-scale paired datasets. We also propose an enhanced block whose structure differs between the pre-training and fine-tuning phases, which helps each phase achieve its own goal. To improve the robustness of the model, we add Gaussian noise to the training images as data augmentation. Our approach obtains the best quantitative evaluation results while using a small number of parameters and a lower training cost, improving the image quality of handheld ultrasound devices.
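The Gaussian-noise augmentation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `add_gaussian_noise`, the noise level `sigma`, and the assumption that pixel values lie in [0, 1] are all hypothetical choices made for illustration.

```python
import random


def add_gaussian_noise(image, sigma=0.05, seed=None):
    """Return a noisy copy of a 2-D image given as a list of rows.

    Assumption (not from the source): pixel values are floats in [0, 1],
    so noisy values are clipped back into that range.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        # Add zero-mean Gaussian noise to each pixel, then clip to [0, 1].
        noisy.append([min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row])
    return noisy


# Hypothetical 2x3 "image" used only to show the call.
image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
noisy = add_gaussian_noise(image, sigma=0.05, seed=0)
```

Applying the noise to a copy at load time, rather than overwriting the stored image, means each training epoch can see a differently perturbed version of the same sample.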
Entity relation extraction (ERE) is an important task in the field of information extraction. With the wide application of pre-trained language models (PLMs) in natural language processing (NLP), using PLMs has become a new research direction for ERE. In this paper, BERT is used to extract entity relations, and a separated pipeline architecture is proposed. ERE is decomposed into an entity-relation classification sub-task and an entity-pair annotation sub-task, and both sub-tasks are pre-trained and fine-tuned independently. Combining dynamic and static masking, new Verb-MLM and Entity-MLM BERT pre-training tasks are put forward to strengthen the correlation between BERT pre-training and the targeted NLP downstream task, ERE. An inter-layer shared attention mechanism is added to the model, which shares attention parameters according to the similarity of the attention matrices. Comparative experiments on the SemEval-2010 Task 8 dataset demonstrate that the new MLM tasks and the inter-layer shared attention mechanism effectively improve the performance of BERT on entity relation extraction.
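Classifying accident hazards gives a direct view of the weak points in an enterprise's work-safety management and directly determines the direction in which the enterprise optimizes that management. In oilfield production there are many types of hazards and large volumes of data; relying solely on manual classification and management is inefficient and makes it difficult to uncover latent patterns in the data. Based on the needs of oilfield work safety and the characteristics of accident hazards, a BERT-BiLSTM classification model is proposed for automatic topic classification of oilfield work-safety hazard texts. The bidirectional encoder representations from Transformer (BERT) model extracts character-level features of the input text and generates a vector representation of the global text information. A bi-directional long short-term memory (BiLSTM) network then extracts local key information and deep contextual features, and a Softmax activation function computes the probabilities that yield the classification result. Comparison with traditional classification methods shows that the BERT-BiLSTM model improves weighted-average precision, weighted-average recall, and weighted-average F1. Integrating the model with the existing safety management information systems of oilfield enterprises will provide important technical support for making accident-hazard management more targeted and for shifting enterprise safety management from passive after-the-fact response to proactive prevention.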
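The final Softmax step described in the abstract above can be sketched in a few lines. This is an illustration only: the logits stand in for the output of an upstream BERT-BiLSTM encoder that is not shown, and the label names and the helper `classify` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical hazard categories and logits, for illustration only.
labels = ["equipment", "electrical", "fire"]
label, prob = classify([2.0, 1.0, 0.1], labels)
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow when scores are large.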
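Ancient Chinese texts carry rich historical and cultural information, and extracting entity relations from such texts and building knowledge graphs from them plays an important role in cultural heritage. To address the many rare characters, semantic vagueness, and polysemy found in ancient Chinese texts, an entity relation joint extraction model based on a BERT-ancient-Chinese pre-trained model (JEBAC) is proposed. First, a BERT-ancient-Chinese pre-trained model integrated with a BiLSTM neural network and an attention mechanism (BACBA) identifies all subject entities and object entities in a sentence, providing a basis for the joint extraction of relations and object entities. Next, the normalized encoding vector of the subject entity is added to the embedding vector of the whole sentence to better capture the semantic features of the subject entity in the sentence. Finally, combining the sentence vector that carries the subject-entity features with prompt information for the object entities, BACBA jointly extracts the relations and object entities in the sentence, yielding all triples (subject entity, relation, object entity). Performance was compared with existing methods on the Chinese entity relation extraction dataset DuIE2.0 and on the CCKS 2021 classical Chinese entity relation extraction CCLUE few-shot dataset. Experimental results show that the method is more effective in extraction performance, reaching F1 scores of 79.2% and 55.5%, respectively.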
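Identifying frontier topics in interdisciplinary research and analyzing how they evolve helps reveal the direction of disciplinary integration and provides a reference for future innovative and breakthrough research. First, an indicator measuring the interdisciplinarity of papers is constructed from a citation perspective to identify interdisciplinary research papers. Second, research topics are identified with a BERT-LDA model, cosine similarity is used to compute the similarity between topics, and topic evolution paths are constructed. Finally, a frontier-topic identification indicator system is built on novelty, growth, attention, and impact to identify frontier interdisciplinary research topics. Taking Library and Information Science (LIS) as an example, the results show that from 2004 to 2023 interdisciplinary research topics in the field became progressively more refined and in-depth, concentrating on three areas: information mining and knowledge discovery, internet information behavior, and medical informatics. The current frontier interdisciplinary topics are medical data models, public opinion governance and sentiment analysis, and machine learning and deep learning. Research methods based on information technology and their applications in different fields have good prospects and may become core research topics in LIS in the future.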
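The use of cosine similarity to connect topics across time periods, as described in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions: the topic vectors, the topic names, the helper `link_topics`, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def link_topics(earlier, later, threshold=0.5):
    """Link each earlier topic to later topics whose similarity meets the threshold."""
    links = []
    for name_a, vec_a in earlier.items():
        for name_b, vec_b in later.items():
            sim = cosine_similarity(vec_a, vec_b)
            if sim >= threshold:
                links.append((name_a, name_b, round(sim, 3)))
    return links


# Hypothetical topic vectors for two adjacent time windows.
earlier = {"information retrieval": [0.9, 0.1, 0.0]}
later = {
    "knowledge discovery": [0.8, 0.2, 0.1],
    "medical informatics": [0.0, 0.1, 0.9],
}
links = link_topics(earlier, later)
```

Chaining such links over consecutive time windows gives one simple way to represent an evolution path; the threshold controls how readily two topics are treated as the same thread.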
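Corpora for civil aviation air-ground communications are difficult to obtain, entities are unevenly distributed, and intent information extraction suffers from insufficient entity standardization and accuracy that leaves room for improvement. To extract intent information from air-ground communications more effectively, an ontology-integrated mining method is proposed that is based on bidirectional encoder representations from transformers (BERT) and a generative adversarial network (GAN). Flight pool information is introduced to verify and correct part of the extracted information, producing structured information that air traffic control (ATC) systems can interpret. First, an improved GAN model generates air-ground communication text, which provides effective data augmentation, balances the distribution of entity types, and expands the dataset. Then, intents are classified and annotated according to the ontology rules defined by the Single European Sky air traffic management project. Next, a BERT pre-trained model generates character vectors and resolves polysemy, and a bidirectional long short-term memory (BiLSTM) network encodes in both directions to extract contextual semantic features. These features are fed into a conditional random field (CRF) model for inference, which learns and enforces the dependencies between labels to obtain a globally optimal result. Finally, an edit distance (ED) algorithm is used to check the plausibility of the intent information and correct it. Comparative experiments show that the proposed method reaches a macro-average F1 of 98.75% and outperforms other mainstream models at intent mining on a civil aviation air-ground communication dataset, laying a foundation for bringing air-ground communications into the digitalization process.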
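The last step of the abstract above, checking extracted information against a flight pool with an edit distance algorithm, can be sketched as follows. This is an illustration, not the paper's procedure: it assumes Levenshtein distance as the edit distance, and the helper `correct_with_pool`, the callsigns, and the `max_distance` cutoff are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct_with_pool(extracted, pool, max_distance=2):
    """Replace an extracted value with the closest pool entry if it is close enough."""
    if not pool:
        return extracted
    best = min(pool, key=lambda candidate: edit_distance(extracted, candidate))
    return best if edit_distance(extracted, best) <= max_distance else extracted


# Hypothetical flight pool and a misrecognized callsign (letter O instead of zero).
flight_pool = ["CCA101", "CES2345", "CSN3012"]
corrected = correct_with_pool("CCA1O1", flight_pool)
```

The `max_distance` cutoff keeps the check conservative: a value far from every pool entry is left unchanged rather than forced onto an unrelated flight.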
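In the internet era, more and more financial companies publish their views on financial news platforms. As carriers of public opinion, these commentary texts reflect the sentiment of the financial companies and influence public investment decisions and market trends. Sentiment analysis provides an effective means of analyzing the sentiment of massive volumes of economic text. However, because domain-specific text is highly specialized and large labeled datasets are not applicable, sentiment analysis of economic text poses a major challenge to traditional sentiment analysis models. When general sentiment analysis models are applied to specific domains such as economics, they perform poorly in precision and recall. To overcome these challenges, the paper targets sentiment analysis of economic texts on financial news platforms and, starting from the word representation model, proposes a two-way BERT sentiment analysis model based on knowledge distillation (Two-way BERT based on knowledge distillation method). In comparison experiments against text convolutional neural network (Text-CNN), convolutional recurrent neural network (CRNN), and bidirectional long short-term memory (Bi-LSTM) algorithms, the improved method raises precision, recall, and F1 by 1% to 3% over the other algorithms and shows good generalization performance.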
Funding (geological named entity recognition study): financially supported by the Natural Science Foundation of China (Grant No. 42301492), the National Key R&D Program of China (Grant Nos. 2022YFF0711600, 2022YFF0801201, 2022YFF0801200), the Major Special Project of Xinjiang (Grant No. 2022A03009-3), the Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (Grant No. KF-2022-07014), the Opening Fund of the Key Laboratory of the Geological Survey and Evaluation of the Ministry of Education (Grant No. GLAB 2023ZR01), and the Fundamental Research Funds for the Central Universities.
Funding (ERE study on SemEval-2010 Task 8): Hainan Province High-Level Talent Project of the Basic and Applied Basic Research Plan (Natural Science Field), 2019 (No. 2019RC100); Haikou City Key Science and Technology Plan Project (2020-049); Hainan Province Key Research and Development Project (ZDYF2020018).