A Statistics-Based Complex Web Text Extraction Method (cited by: 1)
Abstract: As information technology develops, web pages are becoming increasingly complex and diverse, and conventional methods for extracting the main text of a page suffer from low efficiency and accuracy. To address this, we propose a statistics-based text extraction algorithm. Using the features of HTML tags, the algorithm extracts the filtered text between each pair of ">" and "<", records the length of each text segment, and orders the segments by their matching sequence. Based on an optimal text-length threshold, it delimits a range of text line numbers, and finally refines the result using common subsequences to complete the extraction. Experimental results show that the method extracts the main text of complex pages accurately and efficiently and generalizes well.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2015, No. 7, pp. 90-92, 147 (4 pages in total)
Keywords: complex web pages; text extraction; statistics; common subsequence; optimal text-length threshold; text line number range
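
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the noise-tag filter, the `threshold` and `max_gap` values, and the rule of choosing the densest cluster of long segments are all assumptions made for illustration. The paper's optimal threshold and its common-subsequence refinement are not described in the abstract, so the sketch uses placeholder values and omits that final step.

```python
import re

# Assumed filter: drop script/style blocks and comments before measuring text.
NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->", re.S | re.I)
# Text lying between a '>' and the next '<'.
SEGMENT = re.compile(r">([^<>]+)<")


def text_segments(html):
    """Return non-empty text segments between tags, in document order."""
    cleaned = NOISE.sub("", html)
    return [s.strip() for s in SEGMENT.findall(cleaned) if s.strip()]


def body_range(lengths, threshold, max_gap=2):
    """Return (first, last) indices of the heaviest cluster of long segments.

    A segment is "long" if its length is >= threshold. Long segments separated
    by at most max_gap short segments belong to the same cluster (assumption).
    """
    best, best_score = None, 0
    start = last = None
    score = 0
    for i, n in enumerate(lengths):
        if n < threshold:
            continue
        if start is not None and i - last - 1 <= max_gap:
            last = i          # extend the current cluster
            score += n
        else:
            start = last = i  # begin a new cluster
            score = n
        if score > best_score:
            best, best_score = (start, last), score
    return best


def extract_body(html, threshold=20, max_gap=2):
    """Extract the text inside the selected line-number range."""
    segs = text_segments(html)
    rng = body_range([len(s) for s in segs], threshold, max_gap)
    if rng is None:
        return ""
    first, last = rng
    return "\n".join(segs[first:last + 1])
```

Calling `extract_body(page_source)` on a page whose navigation links and footer are short while its article paragraphs are long returns the paragraphs, including any short lines that fall between them. In practice the threshold would need tuning per corpus, which is the role the abstract assigns to the optimal text-length threshold.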