Research on Methods for Identifying Discourse-Level Coordinate Text Blocks (篇章级并列关系文本块识别方法研究) Cited by: 1

Identifying Coordinate Text Blocks in Discourses
Abstract [Objective] To identify text blocks in scientific papers that are distributed across different paragraphs and stand in a coordinate relationship both semantically and in page layout, to capture the features of coordinate-relationship text, and to provide a pre-trained model for recognizing coordinate-relationship knowledge objects. [Methods] Taking the paragraph as the processing unit, layout (visual) features are appended to character and word vectors so that coordinate text at different levels is represented with multi-dimensional features. A convolutional neural network (CNN) is then trained as a text classifier on annotated data, yielding a recognition model for coordinate text blocks. [Results] In experiments on a manually annotated dataset of scientific papers, the model classified coordinate text blocks with a precision of 96%, about 3% higher than the baseline model; recall was about 2% higher. [Limitations] The method applies only to HTML web page text; other text formats require further research and experiments. [Conclusions] With the paragraph as the processing unit and multiple features combined, a CNN model can identify discourse-level coordinate text blocks effectively and can serve as a pre-trained model for recognizing coordinate-relationship knowledge objects.
Authors Pei Jingjing (裴晶晶); Le Xiaoqiu (乐小虬) (National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China)
Source Data Analysis and Knowledge Discovery (《数据分析与知识发现》), indexed in CSSCI, CSCD, and the Peking University Core Journals list, 2019, No. 5, pp. 51-56 (6 pages)
Keywords Coordinate Relationship; Text Representation; Text Block; Deep Learning
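
The abstract describes appending layout features to character and word vectors and feeding the result to a CNN. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the use of a mean over character vectors, the two layout flags, and all numeric values are assumptions made for illustration. It shows one convolution filter with ReLU and max-over-time pooling; a full model would use many filters followed by a classification layer.

```python
from typing import List

Vector = List[float]


def token_features(word_vec: Vector, char_vecs: List[Vector], layout: Vector) -> Vector:
    """Concatenate a word vector, the mean of its character vectors, and layout features."""
    dim = len(char_vecs[0])
    char_mean = [sum(v[i] for v in char_vecs) / len(char_vecs) for i in range(dim)]
    return word_vec + char_mean + layout


def conv1d_max_pool(rows: List[Vector], kernel: List[Vector], bias: float = 0.0) -> float:
    """Slide one filter over consecutive rows, apply ReLU, then max-pool over positions."""
    width = len(kernel)
    best = 0.0  # ReLU output is never below zero, so 0.0 is a safe starting maximum
    for start in range(len(rows) - width + 1):
        total = bias
        for k in range(width):
            total += sum(w * x for w, x in zip(kernel[k], rows[start + k]))
        best = max(best, total)
    return best


# Toy values for illustration only (not taken from the paper).
layout = [1.0, 0.0]  # hypothetical flags: [starts_with_list_marker, is_heading]
rows = [
    token_features([0.5, 0.5], [[1.0], [3.0]], layout),
    token_features([1.0, 0.0], [[2.0]], layout),
]
# One hypothetical filter spanning two tokens, each row 5 features wide.
kernel = [[1.0, 0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
feature = conv1d_max_pool(rows, kernel)
```

Repeating the paragraph-level layout vector on every token row is one simple way to let each convolution window see both the text and the layout signal; the paper may combine them differently.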