Frontier Advances in Non-Relational Table Understanding
Authors: 罗平, 杨清平, 曹逸轩, 曹荣禹, 何清. 《中文信息学报》 (Journal of Chinese Information Processing), 2024, No. 5, pp. 1-21 (21 pages).
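Table understanding is the process by which computers automatically recognize, parse, and apply the tables found widely on the internet and in vertical domains. Tables fall roughly into two classes: relational and non-relational. Relational tables resemble relational database tables; they have a fixed structure, are easy for machines to parse, and have a long research history. Non-relational tables usually have varied layouts and flexible syntax and show more pronounced linguistic characteristics, which makes them very challenging for computers to parse and apply. Non-relational table understanding is an important emerging field at the multimodal intersection of natural language processing and computer vision. With the spread of deep learning in recent years, research on non-relational tables has advanced considerably in three directions: table recognition, semantic analysis, and novel applications. This paper introduces the structural characteristics of non-relational tables, describes the distinctive challenges they pose for research, briefly reviews recent progress in those three directions, and summarizes the relevant datasets. It closes by outlining the problems that most urgently need solving in non-relational table understanding and by looking ahead to future research directions.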
Keywords: table intelligence; deep learning; multimodal natural language processing
Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application
Authors: 曹荣禹, 曹逸轩, 周干斌, 罗平. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2022, No. 3, pp. 699-718 (20 pages).
In this paper, we study the problem of extracting a variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. The discovery of the logical document hierarchy is a vital step in supporting many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, containing hundreds or even thousands of pages and a variable-depth hierarchy, challenge the existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), where we "sequentially" insert each physical object at the proper position on the current tree. Determining whether each possible position is proper or not can be formulated as a binary classification problem. To further improve its effectiveness and efficiency, we study the design variants in HELD, including traversal orders of the insertion positions, explicit or implicit heading extraction, tolerance to insertion errors in predecessor steps, and so on. As for evaluation, we find that previous studies ignore the error in which the depth of a node is correct while its path to the root is wrong. Since such mistakes may seriously degrade downstream applications, a new measure is developed for a more careful evaluation. Empirical experiments on thousands of long documents from the Chinese financial market, the English financial market, and English scientific publications show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction is the best choice for trading off effectiveness against efficiency, with accuracies of 0.9726, 0.7291, and 0.9578 on the Chinese financial, English financial, and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study of this task in terms of methods, evaluations, and applications.
Keywords: logical document hierarchy; long documents; passage retrieval
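The sequential insertion described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which positions are candidates, so the sketch assumes they are the nodes on the rightmost path of the current tree, visited root to leaf, and that the first position judged proper is taken. The names `Node`, `toy_is_proper`, `build_tree`, and `path_to_root`, along with the integer `level` feature, are hypothetical, and a hand-written rule stands in for the learned binary classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    text: str
    # Hypothetical feature: 0 = root, 1..n = heading depth, 99 = body text.
    level: int
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def rightmost_path(root: Node) -> List[Node]:
    """Candidate parents, ordered root to leaf (an assumption of this sketch)."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def toy_is_proper(parent: Node, obj: Node) -> bool:
    """Stand-in for the binary classifier: is 'last child of parent' proper?"""
    if parent.level >= obj.level:
        return False
    last = parent.children[-1] if parent.children else None
    # Proper only if obj should not go deeper, under the parent's last child.
    return last is None or last.level >= obj.level


def build_tree(
    objects: List[Tuple[str, int]],
    is_proper: Callable[[Node, Node], bool] = toy_is_proper,
) -> Node:
    """Insert objects one by one at the first position judged proper."""
    root = Node("ROOT", 0)
    for text, level in objects:
        obj = Node(text, level)
        candidates = rightmost_path(root)
        # Fall back to the deepest candidate if no position is accepted.
        parent = next((p for p in candidates if is_proper(p, obj)), candidates[-1])
        obj.parent = parent
        parent.children.append(obj)
    return root


def path_to_root(node: Node) -> List[str]:
    """Ancestor labels from the node's parent up to the root."""
    labels = []
    while node.parent is not None:
        node = node.parent
        labels.append(node.text)
    return labels


doc = [("1 Intro", 1), ("para a", 99), ("1.1 Scope", 2),
       ("para b", 99), ("2 Method", 1), ("para c", 99)]
tree = build_tree(doc)
```

In this toy document, "para a" and "para c" sit at the same depth but have different paths to the root. This is the kind of difference the abstract says a depth-only evaluation would miss, and comparing `path_to_root` outputs rather than depths exposes it.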