
Research on a Text Copy Optimization Method for PDF Documents (一种面向PDF文档的文本复制优化方法研究)

Cited by: 1
Abstract: Mainstream PDF readers often produce full-width characters, incorrect punctuation marks, and redundant line breaks and spaces when text is copied, especially text with mixed Chinese and English typesetting. To address these problems, this paper proposes a text copy optimization method for PDF documents. The method monitors the clipboard to detect changes in the copied content automatically, uses regular expressions to analyze the characteristics of the copied text and applies different optimization strategies to correct its formatting errors, and introduces three paragraph segmentation strategies to identify the paragraphs in the text correctly, so that copied text is optimized automatically without the user noticing. Experiments on four types of PDF datasets (newspapers and social science, science and engineering, and national defense journals) show that, compared with direct copying, the proposed method eliminates more than 95% of formatting errors, greatly reducing manual effort and improving processing efficiency.
Authors: HE Weixiong (贺伟雄), BAI Linyuan (柏林元), GUO Wenjuan (郭文娟) (Academy of People's Armed Police, Beijing 100010; Army Engineering University of PLA, Nanjing, Jiangsu 210001)
Source: Software (《软件》), 2022, No. 7, pp. 63-67 (5 pages)
Keywords: PDF document; text copy; text optimization; paragraph segmentation
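
The abstract names the ingredients of the method (regular-expression analysis, format correction, paragraph segmentation) but does not give the actual rules. The following is therefore only a minimal sketch of how such a clean-up could look, not the authors' implementation. All function names, the character ranges in `CJK`, and the single blank-line paragraph strategy are assumptions made for illustration; the paper's three segmentation strategies are not described in the abstract, and clipboard monitoring is omitted.

```python
import re

# Assumed character class: CJK ideographs plus CJK and full-width punctuation.
CJK = "\u4e00-\u9fff\u3001-\u303f\uff01-\uff0f\uff1a-\uff20"

# Full-width digits and Latin letters; punctuation is deliberately left alone,
# because full-width punctuation is correct in Chinese sentences.
FULLWIDTH_ALNUM = re.compile("[\uff10-\uff19\uff21-\uff3a\uff41-\uff5a]")
SPACE_BETWEEN_CJK = re.compile(f"(?<=[{CJK}])[ \t\u3000]+(?=[{CJK}])")
MULTI_SPACE = re.compile("[ \t\u3000]+")


def to_halfwidth(text):
    """Map full-width digits and letters to their ASCII counterparts."""
    return FULLWIDTH_ALNUM.sub(lambda m: chr(ord(m.group()) - 0xFEE0), text)


def join_lines(lines):
    """Merge the visual lines of one paragraph into a single line.

    A space is inserted at a line break only when neither side is a CJK
    character, so broken English keeps its word boundaries and broken
    Chinese is joined seamlessly.
    """
    merged = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        starts_cjk = re.match(f"[{CJK}]", line) is not None
        ends_cjk = re.search(f"[{CJK}]$", merged) is not None
        if merged and not (starts_cjk or ends_cjk):
            merged += " "
        merged += line
    return merged


def clean_copied_text(raw):
    """Normalize copied text; blank lines are treated as paragraph breaks (assumed strategy)."""
    text = to_halfwidth(raw.replace("\r\n", "\n"))
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        para = join_lines(block.split("\n"))
        para = SPACE_BETWEEN_CJK.sub("", para)   # drop spaces inside Chinese text
        para = MULTI_SPACE.sub(" ", para)        # collapse remaining space runs
        if para:
            paragraphs.append(para)
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "ＰＤＦ 文档的文 本\n复制 optimization\nmethod。\n\n第二段"
    print(clean_copied_text(sample))
```

In a complete tool this function would be called whenever the clipboard content changes, and the result written back to the clipboard; that part depends on the platform and is left out here.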