Abstract
Spreadsheets are ubiquitous in the era of big data, with varied structures and styles and rich semantics. A key step in automatically understanding the logical structure of a spreadsheet is tabular cell classification, which distinguishes header cells from content cells. To perform this classification, this paper first extracts six features from the structure, style, and semantics of spreadsheets, then encodes and fuses these diverse features with deep learning methods, and finally builds a neural network model with a U-Net structure to learn the relationship between the features and the cell types. Experimental results show that the feature selection and the model structure design are reasonable, and demonstrate the effectiveness of the proposed method.
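The feature extraction and fusion step can be illustrated with a minimal sketch. The abstract does not name the six features, so the ones below (normalized row and column position, a merged flag, a bold flag, a numeric flag, and a scaled text length) are hypothetical placeholders, as are the `Cell` type and the function names. Plain concatenation stands in for the learned encoding and fusion, and the U-Net classifier itself is not shown; the sketch only builds the rows x columns x channels grid that such a model might take as input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical cell record; the fields are illustrative assumptions."""
    text: str = ""
    bold: bool = False    # style cue
    merged: bool = False  # structure cue


def is_number(text: str) -> bool:
    """Return True if the text parses as a number (thousands commas allowed)."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def cell_features(cell: Cell, row: int, col: int,
                  n_rows: int, n_cols: int) -> List[float]:
    """Build one feature vector per cell from three assumed feature groups."""
    text = cell.text.strip()
    structure = [row / max(n_rows - 1, 1),   # normalized row position
                 col / max(n_cols - 1, 1),   # normalized column position
                 float(cell.merged)]
    style = [float(cell.bold)]
    semantics = [float(is_number(text)),
                 min(len(text), 50) / 50.0]  # text length, capped and scaled
    # Assumption: fusion by simple concatenation, not the paper's encoder.
    return structure + style + semantics


def sheet_to_tensor(sheet: List[List[Cell]]) -> List[List[List[float]]]:
    """Convert a ragged sheet into a rows x cols x 6 nested list."""
    n_rows = len(sheet)
    n_cols = max(len(r) for r in sheet)
    grid = []
    for r, row in enumerate(sheet):
        padded = row + [Cell()] * (n_cols - len(row))  # pad short rows
        grid.append([cell_features(c, r, k, n_rows, n_cols)
                     for k, c in enumerate(padded)])
    return grid


sheet = [
    [Cell("Region", bold=True), Cell("Sales", bold=True)],
    [Cell("North"), Cell("1,200")],
]
tensor = sheet_to_tensor(sheet)
```

Each cell then carries a six-channel vector, so a segmentation-style model could predict a header or content label for every grid position.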
Authors
PENG Ying (彭滢); WU Jie (吴杰); QI Weigang (齐伟钢)
Westone Information Industry Inc., Chengdu, Sichuan 610041, China
Source
Communications Technology (《通信技术》)
2022, No. 9, pp. 1146-1152 (7 pages in total)
Keywords
spreadsheet
tabular cell classification
deep learning
feature fusion