Abstract
To address the lack of corpora in the information security domain, this paper proposes an event corpus annotation method. Sentences from news texts are taken as the unit of study. Building on the analysis produced by the Language Technology Platform (LTP) of Harbin Institute of Technology, features including part of speech, syntax, and semantic roles are combined in a conditional random field (CRF) model that labels each segmented word in a sentence. The resulting word labels are then used to augment the XML output of LTP. The experiments compare the method both with manual annotation and with a CRF model whose feature vectors use only common features. The results show that the F1 scores of the annotated event elements all exceed 60%, and that F1 improves markedly over the model without syntactic and semantic role features.
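The feature fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token fields (`word`, `pos`, `dep`, `srl`), the function names, the one-token context window, and the example sentence with its tags are all assumptions chosen for illustration. It only shows how a baseline feature set (word and part of speech) might be extended with syntactic and semantic role features before being passed to a CRF toolkit.

```python
from typing import Dict, List

# Hypothetical token record. The fields stand in for the kinds of analysis
# a platform such as LTP produces; the field names are assumptions.
Token = Dict[str, str]


def token_features(sent: List[Token], i: int,
                   use_syntax_srl: bool = True) -> Dict[str, str]:
    """Build the feature dict for token i of a segmented sentence."""
    tok = sent[i]
    # Baseline ("common") features: the word itself and its POS tag.
    feats = {"word": tok["word"], "pos": tok["pos"]}

    # Context features from the previous token, or a sentence-start marker.
    if i > 0:
        feats["-1:word"] = sent[i - 1]["word"]
        feats["-1:pos"] = sent[i - 1]["pos"]
    else:
        feats["BOS"] = "1"

    # Context features from the next token, or a sentence-end marker.
    if i < len(sent) - 1:
        feats["+1:word"] = sent[i + 1]["word"]
        feats["+1:pos"] = sent[i + 1]["pos"]
    else:
        feats["EOS"] = "1"

    # Additional features: dependency relation and semantic role label.
    if use_syntax_srl:
        feats["dep"] = tok["dep"]
        feats["srl"] = tok["srl"]
    return feats


def sentence_features(sent: List[Token],
                      use_syntax_srl: bool = True) -> List[Dict[str, str]]:
    """Feature dicts for every token in the sentence, in order."""
    return [token_features(sent, i, use_syntax_srl) for i in range(len(sent))]


# Illustrative sentence ("hackers attacked the website"); tags are made up
# for the example and are not taken from the paper's corpus.
example = [
    {"word": "黑客", "pos": "n", "dep": "SBV", "srl": "A0"},
    {"word": "攻击", "pos": "v", "dep": "HED", "srl": "PRED"},
    {"word": "了", "pos": "u", "dep": "RAD", "srl": "O"},
    {"word": "网站", "pos": "n", "dep": "VOB", "srl": "A1"},
]

multi_feature = sentence_features(example, use_syntax_srl=True)
baseline = sentence_features(example, use_syntax_srl=False)
```

Calling `sentence_features` with `use_syntax_srl=False` gives the baseline feature set, so the same code can prepare inputs for both models in a comparison like the one the abstract reports.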
Authors
GUO Ting-ting (郭婷婷)
LIU Jia-yong (刘嘉勇)
Affiliations: College of Electronics and Information Engineering, Sichuan University, Chengdu 610065; College of Cybersecurity, Sichuan University, Chengdu 610065
Source
Modern Computer (《现代计算机》)
2019, No. 5, pp. 27-32 (6 pages)
Keywords
Event Tagging
Information Security
Multi-Feature
Conditional Random Field