
Relevance-based content extraction of HTML documents

Abstract: Content extraction from HTML pages is the basis of web page clustering and information retrieval, so cluttered information must be eliminated and page content extracted accurately. A novel and accurate solution for extracting the content of HTML pages is proposed. First, the HTML page is parsed into a DOM object and IDs are generated for all leaf nodes. Second, a score is calculated for each leaf node and adjusted according to the node's relationship with its neighbors. Finally, information blocks are found according to their definition, and a universal classification algorithm is used to identify the content blocks. The experimental results show that the algorithm extracts content effectively and accurately, with recall and precision of 96.5% and 93.8%, respectively.
Source: Journal of Central South University (中南大学学报(英文版)), indexed in SCIE, EI, CAS; 2012, Issue 7, pp. 1921-1926 (6 pages)
Funding: Project (2012BAH18B05) supported by the Supporting Program of Ministry of Science and Technology of China
Keywords: HTML document; extraction; HTML page; relevance; information retrieval; web page content; classification algorithm; DOM; content extraction; DOM node; information block
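
The three steps named in the abstract (leaf-node IDs, neighbor-adjusted scores, content-block identification) can be illustrated with a minimal Python sketch. The abstract does not give the paper's ID scheme, scoring formula, block definition, or classifier, so everything concrete here is an assumption made for illustration: the path-style IDs, the word-count score with a penalty for link text, the `link_weight` and `self_weight` parameters, and the fixed `threshold` that stands in for the classification algorithm. The names `LeafCollector`, `score_leaves`, and `extract_content` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or text content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is never page content.
SKIP = {"script", "style"}


class LeafCollector(HTMLParser):
    """Collects text leaves with a path-style ID (an assumed ID scheme)."""

    def __init__(self):
        super().__init__()
        self.tags = []      # names of currently open elements
        self.parts = []     # matching path segments, e.g. "div[2]"
        self.counts = [0]   # element children seen so far at each depth
        self.leaves = []    # tuples of (leaf_id, text, inside_link)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.counts[-1] += 1
        self.tags.append(tag)
        self.parts.append(f"{tag}[{self.counts[-1]}]")
        self.counts.append(0)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return  # stray closing tag, ignore it
        while self.tags:
            closed = self.tags.pop()
            self.parts.pop()
            self.counts.pop()
            if closed == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or SKIP & set(self.tags):
            return
        self.leaves.append(("/".join(self.parts), text, "a" in self.tags))


def score_leaves(leaves, link_weight=0.2, self_weight=0.7):
    """Assumed score: word count, damped for link text, blended with neighbors."""
    raw = [len(text.split()) * (link_weight if in_link else 1.0)
           for _, text, in_link in leaves]
    adjusted = []
    for i, own in enumerate(raw):
        neighbours = raw[max(i - 1, 0):i] + raw[i + 1:i + 2]
        context = sum(neighbours) / len(neighbours) if neighbours else own
        adjusted.append(self_weight * own + (1 - self_weight) * context)
    return adjusted


def extract_content(html, threshold=5.0):
    """Groups consecutive high-scoring leaves into blocks.

    The fixed threshold is a stand-in for the paper's classifier.
    """
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    blocks, current = [], []
    for (_, text, _), score in zip(parser.leaves, score_leaves(parser.leaves)):
        if score >= threshold:
            current.append(text)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

Calling `extract_content(page_source)` returns a list of blocks, each a list of text strings from adjacent leaves. Blending each score with its neighbors lets a short sentence inside an article body survive, while an isolated link next to long text stays below the threshold; the weights that achieve this in practice would need tuning on real pages.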

