Abstract
This paper proposes and implements a position-based page segmentation method for HTML documents, with the aim of effectively extracting the body text of news web pages for data mining. The basic idea is to simulate part of the rendering work of a web browser in order to recover the position at which each tag of the HTML document is displayed in the browser window, and then to segment the page according to these positions so that information in important regions can be extracted. In experiments, the method was used to extract the news body text from the web pages of more than ten well-known news sites, including Sina, NetEase, and TOM News. The accuracy was about 88.5%, which indicates that the method is effective and feasible.
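The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of position-based segmentation, not the authors' implementation. It assumes a drastically simplified layout model (rows stacked vertically, cells placed left to right with fixed pixel sizes) and a hypothetical heuristic that takes the main content to be the box covering the horizontal center of the page with the most text. The names `Box`, `layout_rows`, `pick_main_box`, and `demo_rows` are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Estimated on-screen rectangle of one element (illustrative type)."""
    tag: str
    x: int
    y: int
    width: int
    height: int
    text: str = ""


def layout_rows(rows):
    """Assign positions under a toy layout model (an assumption, not a browser).

    `rows` is a list of rows; each row is a list of
    (tag, width_px, height_px, text) cells. Rows stack vertically and
    cells within a row are placed left to right.
    """
    boxes = []
    y = 0
    for row in rows:
        x = 0
        for tag, width, height, text in row:
            boxes.append(Box(tag, x, y, width, height, text))
            x += width
        # The next row starts below the tallest cell of this row.
        y += max(height for _, _, height, _ in row)
    return boxes


def pick_main_box(boxes, page_width):
    """Hypothetical heuristic: among boxes that cover the page's
    horizontal center, return the one holding the most text."""
    center = page_width / 2
    candidates = [b for b in boxes if b.x <= center < b.x + b.width]
    return max(candidates, key=lambda b: len(b.text))


def demo_rows():
    """A made-up three-row news page: banner, three columns, footer."""
    article = "Body text of the news story. " * 40
    return [
        [("div#banner", 1000, 80, "Site banner")],
        [
            ("td#nav", 200, 600, "Home News Sports"),
            ("td#article", 600, 600, article),
            ("td#ads", 200, 600, "Sponsored links"),
        ],
        [("div#footer", 1000, 60, "Copyright notice")],
    ]


if __name__ == "__main__":
    main = pick_main_box(layout_rows(demo_rows()), page_width=1000)
    print(main.tag, main.x, main.y, main.width, main.height)
```

In this toy page the banner and footer also span the center of the page, so the text-length criterion is what separates them from the article column. A real implementation would need a far more faithful model of table and CSS layout than this sketch assumes.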
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2009, No. 7, pp. 155-159 (5 pages)