An Effective Approach to Eliminating Noises in HTML Pages

Cited by: 3
Abstract: Most Web pages contain noisy blocks such as advertisements, copyright notices, and navigation links, which degrade the quality of Web application systems. Removing this noise quickly and accurately is therefore a key technique for improving the performance of Web applications. This paper proposes a Web page purification method. It uses a tree structure, called a pattern tree (PT), to capture the common layout of the pages in a given Web site, and it eliminates noisy blocks according to the information entropy of the nodes in the pattern tree. The method is applied to an SVM-based Web page classification system, and the experimental results show that classifying purified pages improves both accuracy and efficiency.
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2007, No. 8, pp. 89-91 (3 pages)
Keywords: document tree; pattern tree; element node; style node; Web page purification
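
The abstract names the idea (score pattern-tree nodes by entropy and drop the noisy ones) but does not give the formula. The following is only a minimal sketch under stated assumptions, not the authors' measure: each page is assumed to be already reduced to a mapping from a layout path to the text found there, and a node is treated as noise when its content barely varies across the pages of a site. The names `normalized_entropy`, `find_noisy_paths`, `purify`, the path strings, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalized_entropy(values):
    """Shannon entropy of the distribution of values, scaled to [0, 1]."""
    n = len(values)
    if n <= 1:
        return 0.0
    counts = Counter(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)


def find_noisy_paths(pages, threshold=0.3):
    """Return layout paths whose content is (almost) identical on every page.

    Assumption: low content entropy across a site marks template blocks
    such as navigation bars or copyright footers.
    """
    contents = defaultdict(list)
    for page in pages:
        for path, text in page.items():
            contents[path].append(text)
    return {
        path
        for path, texts in contents.items()
        if len(texts) > 1 and normalized_entropy(texts) <= threshold
    }


def purify(page, noisy_paths):
    """Keep only the blocks of a page that were not flagged as noise."""
    return {path: text for path, text in page.items() if path not in noisy_paths}


# Toy site: navigation and footer repeat verbatim, main content differs.
pages = [
    {"body/div.nav": "Home | News | About",
     "body/div.main": "Article one",
     "body/div.footer": "Copyright 2007"},
    {"body/div.nav": "Home | News | About",
     "body/div.main": "Article two",
     "body/div.footer": "Copyright 2007"},
    {"body/div.nav": "Home | News | About",
     "body/div.main": "Article three",
     "body/div.footer": "Copyright 2007"},
]

noisy = find_noisy_paths(pages)
cleaned = [purify(page, noisy) for page in pages]
```

In this toy input the navigation and footer paths are flagged and only the main block of each page survives. A real system would build the paths from the parsed document tree and would likely weigh presentation style as well as text, which this sketch ignores.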

References (4)

  • [1] Lin Shianhua, Ho Janming. Discovering Informative Content Blocks from Web Documents[C]//Proc. of Conference on Knowledge Discovery and Data Mining. 2002.
  • [2] Bar-Yossef Z, Rajagopalan S. Template Detection via Data Mining and Its Applications[M]. Association for Computing Machinery Press, 2002: 580-591.
  • [3] Davison B D. Recognizing Nepotistic Links on the Web[M]. American Association for Artificial Intelligence Press, 2000: 23-28.
  • [4] Yi Lan, Liu Bing, Li Xiaoli. Eliminating Noisy Information in Web Pages for Data Mining[M]. Association for Computing Machinery Press, 2003.

