Relevance-based content extraction of HTML documents


Abstract: Content extraction from HTML pages underlies web page clustering and information retrieval, so cluttered information must be eliminated and page content extracted accurately. A novel and accurate solution for extracting the content of HTML pages is proposed. First, the HTML page is parsed into a DOM object and IDs are generated for all leaf nodes. Second, a score is calculated for each leaf node and adjusted according to its relationship with neighboring nodes. Finally, information blocks are identified according to their definition, and a universal classification algorithm is used to identify the content blocks. Experimental results show that the algorithm extracts content effectively and accurately, with recall and precision of 96.5% and 93.8%, respectively.
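The three stages named in the abstract (leaf IDs, neighbor-adjusted leaf scores, block identification) can be illustrated with a minimal sketch. The abstract does not give the paper's scoring formula, its definition of an information block, or its classifier, so everything concrete here is an assumption made for illustration: the path-style leaf ID, the word-count score with a link penalty (`raw_score`), the blending weight in `adjust`, and the fixed `threshold` that stands in for the classification step. The sample page `PAGE` is also hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}
SKIP_TAGS = {"script", "style"}


class LeafCollector(HTMLParser):
    """Collects text leaves, each with a path-style ID (an assumed ID scheme)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.leaves = []  # tuples of (leaf_id, text, inside_link)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or SKIP_TAGS.intersection(self.stack):
            return
        leaf_id = "/".join(self.stack) + "#" + str(len(self.leaves))
        self.leaves.append((leaf_id, text, "a" in self.stack))


def raw_score(text, inside_link):
    # Hypothetical score: word count, with link text heavily penalised.
    return len(text.split()) * (0.2 if inside_link else 1.0)


def adjust(scores, weight=0.25):
    # Blend each leaf's score with the mean score of its immediate neighbours.
    adjusted = []
    for i, score in enumerate(scores):
        neighbours = scores[max(0, i - 1):i] + scores[i + 1:i + 2]
        mean_nb = sum(neighbours) / len(neighbours) if neighbours else score
        adjusted.append((1 - weight) * score + weight * mean_nb)
    return adjusted


def content_blocks(html, threshold=5.0):
    # A fixed threshold replaces the paper's classification algorithm here.
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    scores = adjust([raw_score(text, link) for _, text, link in parser.leaves])
    blocks, current = [], []
    for (_, text, _), score in zip(parser.leaves, scores):
        if score >= threshold:
            current.append(text)
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks


PAGE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div id="main">
<p>Content extraction removes navigation bars and advertisements so that only the article text remains.</p>
<p>Why it matters:</p>
<p>Each leaf node is scored and the score is smoothed using the scores of its neighbours.</p>
</div>
<div id="footer"><a href="/terms">Terms</a></div>
</body></html>"""

for block in content_blocks(PAGE):
    print(block)
```

In this sketch the short paragraph "Why it matters:" would fall below the threshold on its own word count, but the adjustment step lifts it because both neighbors score highly, while the navigation and footer links stay below the threshold. This is one plausible reading of adjusting a score "according to the relationship with neighbors", not the paper's actual method.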

Authors: 吴麒 (WU Qi), 陈兴蜀 (CHEN Xing-shu), 朱锴 (ZHU Kai), 王春晖 (WANG Chun-hui)

Affiliation: Network and Trusted Computing Institute

Source: Journal of Central South University (中南大学学报（英文版）), indexed in SCIE, EI, CAS; 2012, Issue 7, pp. 1921-1926 (6 pages)

Funding: Project (2012BAH18B05) supported by the Supporting Program of Ministry of Science and Technology of China

Keywords: HTML document extraction; HTML page association; information retrieval; web page content; classification algorithm; DOM; content extraction; DOM node; relevance; information block

Classification code: TP393.092 [Automation and Computer Technology: Computer Application Technology]



