Web Information Extraction Based on Tree Structure (基于树形结构的Web信息抽取)
Abstract: To address the inadequacies of existing methods, this paper proposes a tree-structure-based algorithm for extracting structured data from Web pages. The algorithm builds on the hierarchical tree structure of HTML and consists of four parts: HTML tree construction, data region mining, data record mining, and data record schema generation. It uses information such as the layout position of page elements to clean the page, mines data regions by hierarchical partitioning, and generates a record schema through tree matching, which completes the extraction of the final data items. Experimental results show that the method extracts structured Web data effectively, improving the accuracy and efficiency of Web data extraction.
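Source: Journal of Fujian Normal University (Natural Science Edition), 2009, No. 3, pp. 39-46 (8 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (50474033); Natural Science Foundation of Fujian Province (A0310008); Fujian Provincial Key Science and Technology Project (2003H043)
Keywords: Web data extraction; Web mining; information extraction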
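
The abstract says that the record schema is generated through tree matching, but it does not name the matching algorithm. As a rough illustration only, the sketch below assumes the widely used Simple Tree Matching dynamic program as a stand-in. It parses a small HTML fragment into a tag tree and scores how similar sibling subtrees are, which is the kind of comparison that can separate repeated data records from unrelated blocks. All names here (`Node`, `TreeBuilder`, `build_tree`, `simple_tree_matching`, `similarity`) and the sample page are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for the demo).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the HTML tag tree (text content is ignored)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def size(self):
        """Number of nodes in the subtree rooted here."""
        return 1 + sum(child.size() for child in self.children)


class TreeBuilder(HTMLParser):
    """Hypothetical helper: builds a tag tree from reasonably nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def simple_tree_matching(a, b):
    """Size of the maximum order-preserving, top-down matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best score using the first i children of a and first j of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def similarity(a, b):
    """Match score normalised to the range [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (a.size() + b.size())


# Made-up page: two product-like records and one unrelated block.
page = """
<ul>
  <li><a>Item A</a><span>10</span></li>
  <li><a>Item B</a><span>12</span></li>
  <li><b>Advertisement</b></li>
</ul>
"""
ul = build_tree(page).children[0]
first, second, third = ul.children
print(similarity(first, second))  # 1.0: identical tag structure
print(similarity(first, third))   # 0.4: only the <li> roots match
```

In a sketch like this, sibling subtrees whose similarity exceeds some threshold would be grouped as records of one data region. The threshold, and how the paper actually combines matching with layout information, are not stated in the abstract.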