摘要
【目的】识别出科技论文中分布在不同段落、在语义及版面视觉上具有并列关系的文本块,捕捉并列关系文本特征,为并列关系知识对象识别提供预训练模型。【方法】以段落为处理单元,在字符向量和词向量的基础上附加版面视觉特征,对不同层级具有并列关系的文本进行多维特征表征,利用卷积神经网络(Convolutional Neural Networks, CNN)模型对标注数据进行文本分类训练,得到并列关系文本块识别模型。【结果】在人工标注的科技论文数据集上展开实验,对并列关系文本块分类准确率达96%,比基准模型高出约3%,召回率高出约2%。【局限】仅适用于HTML网页文本数据,对于其他格式的文本数据还有待进一步研究和实验。【结论】以段落为处理单元,综合多种特征后利用卷积神经网络模型能够高效识别篇章级并列关系文本块,可以作为并列关系知识对象识别预训练模型。
[Objective] This paper proposes a method to identify the coordinate text blocks by semantic and layout features, which are distributed in different paragraphs. It also provides a pre-trained model for these knowledge objects.[Methods] First, we used each paragraph as a processing unit and added the layout features based on the character and word vectors. Then, we concatenated multi-dimensional features to represent each paragraph. Third, we employed the convolutional neural network(CNN) model to train the annotated data and obtained the recognition model for coordinate relationship text blocks.[Results] The proposed approach achieved a precision of 96% with manually annotated scientific papers, which was 3% higher than those of the baseline model. The recall was also improved by 2%.[Limitations] Our model can only work with HTML files. More research is needed to examine it with other data formats.[Conclusions] The proposed method is able to effectively identify coordinate text blocks in discourses, which can be used as a pre-trained model for coordinate knowledge objects.
作者
裴晶晶
乐小虬
Pei Jingjing;Le Xiaoqiu(National Science Library,Chinese Academy of Sciences,Beijing 100190,China;Department of Library,Information and Archives Management,School of Economics and Management, University of Chinese Academy of Sciences,Beijing 100190,China)
出处
《数据分析与知识发现》
CSSCI
CSCD
北大核心
2019年第5期51-56,共6页
Data Analysis and Knowledge Discovery
关键词
并列关系
文本表示
文本块
深度学习
Coordinate Relationship
Text Representation
Text Block
Deep Learning