Two-stage open information extraction method for the defense technology field (国防科技领域两阶段开放信息抽取方法)


Abstract: Open-source channels on the internet hold a large volume of defense technology information and are an important data source for obtaining high-value military intelligence. Open information extraction (OpenIE) in the defense technology field aims to extract tuples with a subject-predicate-object plus object-complement (SAO-C) structure from massive information resources, which is of great significance for ontology induction and knowledge graph construction in this field. Compared with information extraction in other domains, however, OpenIE in the defense technology field faces overlapping and nested tuples, long entity spans that are hard to recognize, and a lack of domain-annotated data. This paper proposes a two-stage open information extraction method for the defense technology field: it first extracts predicates with a sequence labeling algorithm based on a pretrained language model, and then introduces a multihead attention mechanism to learn to predict argument boundaries. Drawing on domain expert knowledge, an annotated dataset for the defense technology field was built with an entity-boundary-based annotation strategy, and experiments were run on it. The results show that, compared with the long short-term memory plus conditional random field (LSTM+CRF) method, the F1 value of the proposed method improved by 3.92% and 16.67 percentage points in the two stages, respectively.

[Objective] The abundant defense technology information available on the internet is a vital data source for obtaining high-value military intelligence. Open information extraction in the defense technology field aims to extract structured triplets containing subject, predicate, object, and other arguments from this massive amount of information. The technology has important implications for ontology induction and knowledge graph construction in the defense technology domain. However, while information extraction in the general domain yields good results, open information extraction in the defense technology domain faces several challenges: a lack of domain-annotated data, poor handling of overlapping arguments, and long entities that are hard to recognize.

[Methods] This paper proposes an annotation strategy based on entity boundaries and, drawing on the experience of domain experts, constructs an annotated dataset for the defense technology field. It further proposes a two-stage open information extraction method for this field that uses a sequence labeling algorithm based on a pretrained language model to extract predicates and a multihead attention mechanism to learn to predict argument boundaries. In the first stage, the input sentence was converted into the input sequence <[CLS], input sentence, [SEP]>, which was encoded with a pretrained language model to obtain a hidden state representation of the sequence. Based on this representation, a conditional random field (CRF) layer was used to predict the positions of the predicates, i.e., to predict the BIO labels of the words. In the second stage, each predicate predicted in the first stage was concatenated with the original sentence and converted into the input sequence <[CLS], predicate, [SEP], input sentence, [SEP]>, which was again encoded with a pretrained language model to obtain a hidden state representation. This representation was fed to a multihead pointer network to predict the positions of the arguments. The predicted positions were compared with the gold positions to compute the cross-entropy loss. Finally, the predicates and arguments predicted by the predicate and argument extraction models were combined to obtain the complete triplets.

[Results] Experiments on a self-built annotated dataset in the defense technology field show the following. (1) In predicate extraction, the method improved the F1 value by 3.92% compared with LSTM methods and by more than 10% compared with syntactic analysis methods. (2) In argument extraction, the method improved the F1 value by more than 16% compared with LSTM methods and by about 11% compared with the BERT+CRF method.

[Conclusions] The proposed two-stage open information extraction method can handle overlapping arguments and long-span entities, addressing shortcomings of existing open information extraction methods. Experimental analysis on the self-built annotated dataset demonstrates the effectiveness of the proposed method.
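The two stages described in the abstract can be illustrated with a minimal Python sketch of the decoding steps only. It is not the authors' implementation: the encoder, CRF, and pointer network are replaced by hand-written tags and scores, and the function names (`decode_bio`, `build_stage2_input`, `decode_pointer_heads`), the 0.5 threshold, the role names, and the choice of one pointer head per argument role are all assumptions made for illustration. For simplicity, the pointer scores are indexed over the sentence tokens rather than over the full stage-2 input sequence.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def decode_bio(tags: List[str]) -> List[Span]:
    """Stage 1 decoding: turn per-token BIO labels into predicate spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # close the previous span
                spans.append((start, i - 1))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i - 1))
                start = None
        elif tag == "I" and start is None:
            start = i  # tolerate an "I" that has no preceding "B"
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


def build_stage2_input(tokens: List[str], predicate: Span) -> List[str]:
    """Build <[CLS], predicate, [SEP], input sentence, [SEP]> for stage 2."""
    s, e = predicate
    return ["[CLS]"] + tokens[s:e + 1] + ["[SEP]"] + tokens + ["[SEP]"]


def decode_pointer_heads(
    heads: Dict[str, Tuple[List[float], List[float]]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Dict[str, List[Span]]:
    """Stage 2 decoding: each head holds (start_probs, end_probs) for one role.

    Every start position above the threshold is paired with the nearest end
    position at or after it. Heads are decoded independently, so spans of
    different roles may overlap.
    """
    result: Dict[str, List[Span]] = {}
    for role, (starts, ends) in heads.items():
        spans: List[Span] = []
        for i, p_start in enumerate(starts):
            if p_start < threshold:
                continue
            for j in range(i, len(ends)):
                if ends[j] >= threshold:
                    spans.append((i, j))
                    break
        result[role] = spans
    return result


# Hypothetical example sentence with made-up model outputs.
tokens = ["The", "radar", "detects", "low", "flying", "targets"]
predicates = decode_bio(["O", "O", "B", "O", "O", "O"])
stage2_input = build_stage2_input(tokens, predicates[0])
arguments = decode_pointer_heads({
    "subject": ([0.9, 0.1, 0.0, 0.0, 0.0, 0.0], [0.1, 0.8, 0.0, 0.0, 0.0, 0.0]),
    "object": ([0.0, 0.0, 0.0, 0.7, 0.2, 0.1], [0.0, 0.0, 0.0, 0.1, 0.2, 0.9]),
})
```

Because each role is decoded from its own start and end scores, one token can belong to arguments of several tuples, which is one plausible reading of how a pointer-style decoder copes with overlapping arguments better than a single BIO tag sequence.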

Authors: HU Minghao (胡明昊); WANG Fang (王芳); XU Xiantao (徐先涛); LUO Wei (罗威); LIU Xiaopeng (刘晓鹏); LUO Zhunchen (罗准辰); TAN Yushan (谭玉珊) (Information Research Center of Military Science, PLA Academy of Military Science, Beijing 100142, China)

Affiliation: Information Research Center of Military Science, PLA Academy of Military Science

Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报（自然科学版）》), 2023, No. 9, pp. 1309-1316 (8 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.

Funding: National Natural Science Foundation of China, Young Scientists Fund (62006243).

Keywords: defense technology; open information extraction; subject-verb-object complement; knowledge graph; pretrained language model

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

