
A Method Based on Node Density Segmentation and Label Propagation for Mining Web Page

Cited by: 13
Abstract: Extracting the important content of a Web page, such as text and links, is valuable in many areas of Web mining research. The prevailing approach combines Web page segmentation with the recognition of informative blocks. Existing methods, however, treat the identification of important text and of important links as two independent problems, which ignores the inherent semantic relationship between the text and links in a page and lowers processing efficiency. This paper proposes a unified framework for mining important Web page content. The framework has three components. First, a Web page is converted into a DOM tree, which is then segmented into page blocks using node density entropy as the metric. Second, a semi-supervised method based on K-nearest-neighbor label propagation automatically extends the training set of page blocks. Third, an SVM classifier is trained on the extended training set and then used to classify page blocks. The framework can classify page blocks into multiple types and is independent of the type and layout of the Web page. Extensive experiments in a real Web environment show that the proposed method is effective.
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the PKU Core Journals list; 2015, No. 2, pp. 349-364 (16 pages)
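Funding: Supported by the National Natural Science Foundation of China (61272109, 61202285), the National Spark Program of China (2012GA750007), the Basic and Frontier Technology Research Program of the Henan Provincial Department of Science and Technology (122300410378), and the Key Science and Technology Research Project of the Henan Provincial Department of Education (13A520032).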
Keywords: Web page segmentation; node density; label propagation; DOM tree; block classification; social computing; social networks
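
The second component of the framework described in the abstract, extending a small labeled set of page blocks by K-nearest-neighbor label propagation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_label_propagation`, the two-dimensional feature vectors, the Euclidean distance, the majority-vote rule, and the label names are all assumptions made for illustration, and the record does not state which block features or propagation rule the paper uses.

```python
import math
from collections import Counter


def knn_label_propagation(features, labels, k=2, max_rounds=10):
    """Spread labels from labeled to unlabeled points over a k-NN graph.

    features: list of equal-length numeric tuples, one per page block
              (hypothetical features; the paper's own are not given here).
    labels:   list of the same length; None marks an unlabeled block.
    Returns a new label list; blocks that no label reaches stay None.
    """
    n = len(features)
    # Precompute each point's k nearest neighbours by Euclidean distance.
    neighbours = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(features[i], features[j]))
        neighbours.append(others[:k])

    current = list(labels)
    for _ in range(max_rounds):
        updates = {}
        for i in range(n):
            if current[i] is not None:
                continue  # never overwrite an existing label
            # Majority vote among the neighbours that already carry a label.
            votes = Counter(
                current[j] for j in neighbours[i] if current[j] is not None
            )
            if votes:
                updates[i] = votes.most_common(1)[0][0]
        if not updates:
            break  # nothing left to propagate
        # Apply all updates at once so each round uses the previous state.
        for i, label in updates.items():
            current[i] = label
    return current


# Two well-separated groups of blocks with one seed label each (toy data).
blocks = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
seed = ["content", None, None, None, "navigation", None, None, None]
extended = knn_label_propagation(blocks, seed, k=2)
print(extended)
```

In a pipeline like the one the abstract outlines, the extended label list would then serve as training data for the SVM classifier; that step is omitted here.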


