Abstract
Complex tables present data in a simple, intuitive way and are widely used across industries. However, their structures are complex, their cell types are diverse, and table documents are composed in inconsistent ways, so the data they contain must be digitized before it can be shared and reused. This paper therefore constructs a cell semantic relationship recognition model based on unsupervised learning to digitize complex tables. First, machine vision techniques are used to segment the complex tables. Next, tables that share the same template are identified based on the similarity of their structure and content. On this basis, heuristic rules are defined to recognize the semantic relationships between cells, drawing on the value and position characteristics of three cell types: header cells, explanatory cells, and table body cells. Finally, an empirical study shows that the method achieves high accuracy and recall in digitizing complex tables, which supports its feasibility.
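The abstract does not say how structure and content similarity are measured or combined. As a rough illustration of the same-template step only, the sketch below assumes a Jaccard overlap on cell layout and on position-tagged cell text, mixed with a fixed weight. The `Cell` type, the weight `w_structure`, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record produced by an earlier segmentation step."""
    row: int
    col: int
    rowspan: int
    colspan: int
    text: str


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def structure_signature(cells):
    # Layout only: where each cell sits and how far it spans.
    return {(c.row, c.col, c.rowspan, c.colspan) for c in cells}


def content_signature(cells):
    # Non-empty cell text tagged with its position.
    return {(c.row, c.col, c.text.strip()) for c in cells if c.text.strip()}


def template_similarity(a, b, w_structure=0.5):
    # Assumed combination: weighted mean of the two Jaccard scores.
    s = jaccard(structure_signature(a), structure_signature(b))
    t = jaccard(content_signature(a), content_signature(b))
    return w_structure * s + (1 - w_structure) * t


def same_template(a, b, threshold=0.6):
    # Assumed decision rule: the threshold is illustrative, not from the paper.
    return template_similarity(a, b) >= threshold
```

Under these assumptions, two filled-in copies of one form share their layout and header text and differ only in body values, so they score high, while tables with unrelated layouts score near zero.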
Authors
林鑫
余华娟
闫奕臻
LIN Xin; YU HuaJuan; YAN YiZhen (School of Information Management, Central China Normal University, Wuhan 430079, P.R. China; Research Center for Data Governance and Intelligent Decision Making of Hubei Province, Wuhan 430079, P.R. China)
Source
《数字图书馆论坛》
CSSCI
2022, No. 9, pp. 28-35 (8 pages)
Digital Library Forum
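Funding
Supported by the National Social Science Fund of China, Youth Project "Research on Knowledge Annotation Based on User Cognitive Structure in Social Networks" (No. 17CTQ024).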
Keywords
Complex Table
Semantic Relationship
Form Digitization
Machine Vision