Abstract
An effective Web page content extraction technique can strip redundant, duplicated, and useless information from massive volumes of Internet data and retain the data that carries real meaning and value. Observing that pages under the same Web site are highly similar in content layout and style structure, this paper proposes and implements a content extraction method based on layout similarity: the main content is extracted by comparing the similarity of node data in the DOM trees of pages that belong to the same topic of the same site. The paper also presents exploratory research and implementation on related issues. Experiments show that the method is simple in concept, practical, and broadly applicable, and that it achieves high accuracy while supporting a wide range of Internet content analysis applications.
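The abstract states only the high-level idea, so the following is a minimal sketch of that idea rather than the paper's actual algorithm. It assumes well-formed HTML and exactly two pages from the same section of one site, and it treats text that appears unchanged at the same DOM path in both pages as shared template, keeping only the text that differs. The names `PathTextCollector`, `path_texts`, and `extract_main_content` are hypothetical, and exact string equality stands in for whatever similarity measure the paper uses.

```python
from html.parser import HTMLParser

# Tags with no closing tag; skipped so the path stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collect text per DOM path, e.g. 'html[0]/body[0]/div[1]/p[0]'."""

    def __init__(self):
        super().__init__()
        self.stack = []           # path segments of the currently open elements
        self.child_counts = [{}]  # per-depth tag counters used as sibling indexes
        self.texts = {}           # DOM path -> concatenated text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        self.stack.append(f"{tag}[{index}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so every end tag closes the top element.
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def path_texts(html):
    """Return a mapping from DOM path to the text found at that path."""
    collector = PathTextCollector()
    collector.feed(html)
    return collector.texts


def extract_main_content(page, sibling_page):
    """Keep text from `page` whose DOM path is absent or differs in `sibling_page`."""
    own = path_texts(page)
    other = path_texts(sibling_page)
    return [text for path, text in own.items() if other.get(path) != text]


# Two made-up pages sharing a menu and footer but differing in the article body.
PAGE_A = ("<html><body><div>Site Menu</div>"
          "<div><p>Story about layout similarity.</p></div>"
          "<div>Copyright</div></body></html>")
PAGE_B = ("<html><body><div>Site Menu</div>"
          "<div><p>Another story entirely.</p></div>"
          "<div>Copyright</div></body></html>")

print(extract_main_content(PAGE_A, PAGE_B))
```

Comparing against a single sibling page keeps the sketch short; a practical version would likely compare against several pages and tolerate small layout differences, which this sketch does not attempt.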
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journals (北大核心)
2015, No. 9, pp. 2581-2586 (6 pages)
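Funding
National Natural Science Foundation of China, General Program (61375039)
National Natural Science Foundation of China, Young Scientists Fund (61005029)
Computer Network Information Center, Chinese Academy of Sciences, "One-Three-Five" Plan special fund for key cultivation directions (CNIC_PY_1402)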
Keywords
layout similarity
Web page content extraction
information retrieval