
Content Extraction from Topic-Oriented Web Pages under a Block Layout (cited by: 3)

Page Content Extraction Based on Web Page Segmentation
Abstract This paper aims to remove noise from Web pages and consolidate their content. It proposes a method that extracts the content of topic-oriented Web pages according to their layout. The algorithm first analyzes the DOM structure of the original page to build a tag tree. It then partitions the tree bottom-up into blocks according to tag categories and the information attached to the corresponding nodes, and classifies the resulting blocks by their text density, link density, and image density. Next, a text feature vector describing the page's topic is derived, and a term-space text similarity under the vector space model (VSM) gives each block's relevance to that topic. Using topic relevance as the quantitative criterion, the method discards noise blocks, identifies the main content of the page, and reconstructs the page description from the blocks that are kept. The algorithm has been applied in an information collection project for talent information (Talent Information Collection). Experiments indicate that it is suitable for cleaning and content extraction on topic-oriented Web pages and performs well in that application.
Authors 聂卉 (Nie Hui), 张津华 (Zhang Jinhua)
Source Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 2012, No. 1, pp. 31-39 (9 pages)
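Funding This work is an outcome of the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (grant no. 08JC870013) and the 2009 Sun Yat-sen University Young Teachers Cultivation Project "Research on Implementation Techniques for an Intelligent Deep Search Engine" (project no. 2000-3161101).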
Keywords Web page content extraction; page segmentation; Web page cleaning
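
The filtering steps outlined in the abstract (classify blocks by density, then keep the blocks that are relevant to the page topic) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Block` structure, the function names, the word tokenizer, and both thresholds are assumptions made for illustration. The sketch also assumes the page has already been segmented into blocks, and it uses only link density and raw term-frequency cosine similarity, whereas the paper also considers text and image density.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """One segmented page block (hypothetical structure, not from the paper)."""
    text: str        # all visible text in the block
    link_text: str   # the part of that text that sits inside <a> elements


def tokenize(text):
    # Naive word tokenizer (assumption); Chinese pages would need a word segmenter.
    return re.findall(r"\w+", text.lower())


def link_density(block):
    # Share of the block's characters that belong to anchor text.
    return len(block.link_text) / len(block.text) if block.text else 0.0


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_content(blocks, topic_text, max_link_density=0.5, min_relevance=0.2):
    # Both thresholds are illustrative assumptions, not values from the paper.
    topic_vec = Counter(tokenize(topic_text))
    kept = []
    for block in blocks:
        if link_density(block) > max_link_density:
            continue  # mostly links: treat as a navigation-like noise block
        relevance = cosine_similarity(Counter(tokenize(block.text)), topic_vec)
        if relevance >= min_relevance:
            kept.append(block)  # sufficiently related to the page topic
    return kept
```

Calling `extract_content(blocks, "engineer job opening")` on a page whose blocks are a navigation bar, a job posting, and a copyright footer would be expected to keep only the job posting: the navigation bar fails the link-density check and the footer shares no terms with the topic text.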

