Content Extraction from Topic-Oriented Web Pages with Block-Based Layouts (cited by: 3)

Page Content Extraction Based on Web Page Segmentation


Abstract: This paper proposes a method for extracting the content of topic-oriented Web pages according to their layout, with the goals of removing page noise and consolidating page content. The algorithm first analyzes the DOM structure of the original page to build a tag tree. It then partitions the tag tree bottom-up into blocks, using tag categories and the information attached to the corresponding nodes, and classifies the resulting blocks by their text density, link density, and image density. Next, using the vector space model (VSM), it derives a text feature vector for the page topic and computes a term-space text similarity between each block and that vector to obtain the block's topic relevance. Topic relevance serves as the quantitative criterion for discarding noise blocks; the blocks with high relevance are kept as the main content and used to reconstruct the page description. The method has been applied in an information collection project for talent (recruitment) information. Experiments indicate that it is suitable for cleaning topic-oriented pages and extracting their content, and that it performs fairly well in this application.
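The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` structure, the naive tokenizer, the raw term-frequency weighting, and both thresholds (`max_link_density`, `min_relevance`) are assumptions made for illustration, and the sketch covers only link density and topic relevance, leaving out DOM parsing, bottom-up partitioning, and the text and image densities.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """A candidate page block (hypothetical structure, not taken from the paper)."""
    text: str       # all visible text in the block
    link_text: str  # the portion of that text that sits inside <a> elements


def tokenize(text):
    # Naive tokenizer: lowercase alphanumeric runs.
    # Chinese text would need a word segmenter instead.
    return re.findall(r"[a-z0-9]+", text.lower())


def link_density(block):
    """Share of the block's characters that are anchor text (0.0 for an empty block)."""
    total = len(block.text)
    return len(block.link_text) / total if total else 0.0


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_relevance(block, topic_text):
    """Similarity between a block's term vector and the topic's term vector."""
    return cosine_similarity(Counter(tokenize(block.text)),
                             Counter(tokenize(topic_text)))


def keep_content_blocks(blocks, topic_text,
                        max_link_density=0.5,  # assumed threshold
                        min_relevance=0.1):    # assumed threshold
    """Drop link-heavy blocks and blocks unrelated to the topic; keep the rest."""
    kept = []
    for block in blocks:
        if link_density(block) > max_link_density:
            continue  # likely navigation or a link list
        if topic_relevance(block, topic_text) < min_relevance:
            continue  # likely off-topic noise such as an advertisement
        kept.append(block)
    return kept
```

With a navigation bar made up entirely of anchor text, an on-topic paragraph, and an unrelated advertisement, only the on-topic paragraph survives the filter. In practice the thresholds would have to be tuned on real pages.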

Authors: Nie Hui, Zhang Jinhua

Affiliation: School of Information Management, Sun Yat-sen University

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 2012, No. 1, pp. 31-39 (9 pages)

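Funding: This work was supported by the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (grant no. 08JC870013) and the 2009 Sun Yat-sen University Young Teacher Training Project "Research on Implementation Techniques for Intelligent Deep Search Engines" (project no. 2000-3161101).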

Keywords: Web page content extraction, page segmentation, Web page cleaning

Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]


