A Position Information-Based Web Page Segmentation Method (一种基于位置信息的Web页面分割方法)

Cited by: 3

Abstract: This paper proposes and implements a page segmentation method for HTML documents, intended to extract the body text of news web pages effectively for data mining. The basic idea is to simulate part of a web browser's rendering work in order to recover the display position of each tag of the HTML document in the browser window, then to segment the page according to those positions and extract the information in important regions. In experiments on more than ten well-known news sites, including Sina, NetEase, and TOM News, the method extracted the news body text with an accuracy of about 88.5%, which indicates that the approach is effective and feasible.
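The abstract gives only the outline of the approach, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a drastically simplified layout model: every text run inside a block-level tag is stacked vertically, and its height is estimated from its character count. The names `LayoutSimulator` and `main_block`, the constants `PAGE_WIDTH`, `CHAR_WIDTH`, and `LINE_HEIGHT`, and the "tallest box is the body text" heuristic are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed rendering parameters (hypothetical, not taken from the paper).
PAGE_WIDTH = 800    # viewport width in pixels
CHAR_WIDTH = 8      # average glyph width in pixels
LINE_HEIGHT = 20    # height of one text line in pixels
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3"}


class LayoutSimulator(HTMLParser):
    """Assign a rough vertical position and height to each block of text."""

    def __init__(self):
        super().__init__()
        self.cursor_y = 0   # y-coordinate where the next box starts
        self.stack = []     # currently open block-level tags
        self.boxes = []     # estimated boxes, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.stack:
            return
        chars_per_line = PAGE_WIDTH // CHAR_WIDTH
        lines = -(-len(text) // chars_per_line)   # ceiling division
        height = lines * LINE_HEIGHT
        self.boxes.append({
            "tag": self.stack[-1],
            "top": self.cursor_y,
            "height": height,
            "text": text,
        })
        self.cursor_y += height


def main_block(boxes):
    """Heuristic (an assumption): the tallest box is the body-text candidate."""
    return max(boxes, key=lambda box: box["height"])


if __name__ == "__main__":
    page = (
        "<div>Home | News | Sports</div>"
        "<div><p>" + "Body text of the news article. " * 20 + "</p></div>"
        "<div>Copyright notice</div>"
    )
    sim = LayoutSimulator()
    sim.feed(page)
    for box in sim.boxes:
        print(box["tag"], box["top"], box["height"])
    print(main_block(sim.boxes)["text"][:40])
```

A real position-based segmenter would also need widths, horizontal offsets, table layout, and CSS handling, and would have to skip `script` and `style` content. This sketch ignores all of that and only shows how estimated positions can drive region selection.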

Authors: 陈翰生 (Chen Hansheng), 曾剑平 (Zeng Jianping), 张世永 (Zhang Shiyong)

Affiliation: Department of Computing and Information Technology, Fudan University

Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2009, No. 7, pp. 155-159 (5 pages)

Keywords: page segmentation; HTML document; web browser; information extraction

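Classification codes: TP393.092 [Automation and Computer Technology: Computer Application Technology]; TN929.53 [Electronics and Telecommunications: Communication and Information Systems]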
