Abstract
Mainstream PDF readers often produce flawed text when copying, especially from documents that mix Chinese and English: full-width characters, wrong punctuation marks, and redundant line breaks and spaces. To address these problems, a text copying optimization method for PDF documents is proposed. The method monitors the clipboard to detect changes in the copied content automatically, analyzes the characteristics of the copied text with regular expressions, and applies different optimization strategies to correct formatting errors. Three paragraph segmentation strategies are proposed to identify paragraphs in the text correctly, so that copied text is optimized automatically without the user noticing ("no perception"). Experiments on four types of PDF datasets (newspapers and journals in social science, science and engineering, and national defense) show that, compared with direct copying, the proposed method eliminates more than 95% of formatting errors, greatly reducing manual effort and improving processing efficiency.
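The regular-expression cleanup described in the abstract might look roughly like the following. This is a minimal sketch and not the authors' implementation: the function name `clean_copied_text`, the character ranges, and the individual rules are all assumptions chosen for illustration. Clipboard monitoring and the three paragraph segmentation strategies are omitted because the abstract does not describe them in detail, so the sketch treats its input as a single paragraph.

```python
import re

# Assumed character class for CJK ideographs, CJK punctuation and full-width forms.
CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"

# Map full-width digits and Latin letters to their ASCII counterparts.
_FULLWIDTH_ALNUM = {
    code: code - 0xFEE0
    for block in (range(0xFF10, 0xFF1A), range(0xFF21, 0xFF3B), range(0xFF41, 0xFF5B))
    for code in block
}


def clean_copied_text(text: str) -> str:
    """Illustrative cleanup of one paragraph copied from a PDF (hypothetical rules)."""
    # 1. Full-width digits and letters become half-width.
    text = text.translate(_FULLWIDTH_ALNUM)
    # 2. A line break between two CJK characters is dropped entirely.
    text = re.sub(rf"(?<=[{CJK}])\s*\n\s*(?=[{CJK}])", "", text)
    # 3. Any remaining line break becomes a single space (e.g. between English words).
    text = re.sub(r"\s*\n\s*", " ", text)
    # 4. Spaces between two CJK characters are removed.
    text = re.sub(rf"(?<=[{CJK}])[ \t]+(?=[{CJK}])", "", text)
    # 5. Runs of spaces collapse to one.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # 6. Half-width punctuation directly after a Chinese character becomes full-width.
    text = re.sub(
        r"(?<=[\u4e00-\u9fff])[,;:?!]",
        lambda m: chr(ord(m.group()) + 0xFEE0),
        text,
    )
    return text.strip()


if __name__ == "__main__":
    raw = "针对ＰＤＦ文档的文 本\n复制,提出 a method for\nPDF  documents"
    print(clean_copied_text(raw))
```

In a complete tool, a function like this would be called whenever the clipboard content changes, and the cleaned text would be written back to the clipboard so that the user simply pastes the corrected result.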
Authors
贺伟雄
柏林元
郭文娟
HE Weixiong; BAI Linyuan; GUO Wenjuan (Academy of People's Armed Police, Beijing 100010; Army Engineering University of PLA, Nanjing, Jiangsu 210001)
Source
Software (《软件》)
2022, No. 7, pp. 63-67 (5 pages)
Keywords
PDF document
text copy
text optimization
paragraph segmentation