Content Extraction Algorithm of HTML Pages Based on Optimized Weight

Cited by: 8
Abstract: Web pages carry a growing volume of advertising, which makes it increasingly difficult to extract their main content accurately. To address this problem, the paper proposes a content extraction algorithm for web pages based on weight optimization. The algorithm first analyzes the characteristics of main content to determine the feature attributes of topic blocks, and derives the statistical features of those attributes. Then, because the feature attributes differ in importance, it applies particle swarm optimization to optimize and fix the feature weights and the threshold, which further improves performance. Finally, experiments verify the method. The results show that, compared with the extraction algorithm without weight optimization, the proposed method raises the recall of main-content extraction to 95.8% while keeping precision essentially unchanged.
Source: Journal of South China University of Technology (Natural Science Edition), 2011, No. 4, pp. 32-37 (6 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National "973" Program of China (2007CB311106)
Keywords: weight optimization; content extraction; feature attribute; statistical feature; precision; recall rate
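
The abstract describes scoring page blocks by weighted feature attributes and tuning the weights and threshold with particle swarm optimization. The following is a minimal Python sketch of that general idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the three feature names, the toy `BLOCKS` data, the linear score, the use of F1 as the fitness function, and the PSO parameters (`inertia`, `c1`, `c2`).

```python
import random

# Hypothetical toy data: (features, is_content). The feature choice is an
# assumption: [text_density, 1 - link_density, punctuation_density], in [0, 1].
BLOCKS = [
    ([0.9, 0.8, 0.7], True),
    ([0.8, 0.9, 0.2], True),
    ([0.3, 0.9, 0.8], True),
    ([0.7, 0.6, 0.6], True),
    ([0.6, 0.1, 0.1], False),
    ([0.2, 0.3, 0.1], False),
    ([0.5, 0.2, 0.3], False),
    ([0.1, 0.1, 0.0], False),
]


def classify(params, features):
    """Label a block as content when its weighted score reaches the threshold."""
    *weights, threshold = params
    score = sum(w * x for w, x in zip(weights, features))
    return score >= threshold


def precision_recall(params, blocks=BLOCKS):
    """Return (precision, recall) of the classifier on the labeled blocks."""
    tp = fp = fn = 0
    for features, is_content in blocks:
        predicted = classify(params, features)
        if predicted and is_content:
            tp += 1
        elif predicted:
            fp += 1
        elif is_content:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fitness(params):
    """F1 score; an assumed objective, the paper's actual one is not given here."""
    p, r = precision_recall(params)
    return 2 * p * r / (p + r) if p + r else 0.0


def pso(fitness, dim, n_particles=20, iterations=60, seed=0,
        inertia=0.7, c1=1.5, c2=1.5):
    """Basic global-best particle swarm optimization that maximizes `fitness`."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # each particle's best position
    best_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=best_fit.__getitem__)
    gbest, gbest_fit = best_pos[g][:], best_fit[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


if __name__ == "__main__":
    # Three feature weights plus one threshold give a 4-dimensional search space.
    params, score = pso(fitness, dim=4)
    print("weights:", params[:3], "threshold:", params[3], "F1:", score)
    print("precision, recall:", precision_recall(params))
```

On this toy data, using only the first feature with a threshold of 0.55 misses one content block and admits one noise block, which is the kind of error that unequal feature weights are meant to reduce. A fitness function closer to the abstract's stated result would reward recall subject to a floor on precision; F1 is used here only to keep the sketch short.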