
Automatically extracting web data based on expanded DOM tree (cited by: 1)
Abstract: Web data extraction is an active research topic, but no uniform and effective extraction method exists yet. This paper proposes an approach. First, the DOM (Document Object Model) tree of a Web page is expanded with visual features and link features. Next, the novelty degree of nodes and subtrees is computed across the expanded DOM trees of several similar pages. Object data are then identified from the novelty degree, and the data are extracted according to the roles of the data items. Finally, the object data are saved as XML documents. Experimental analysis shows that the method achieves good extraction results.
Author: Chen Yuanbin (陈远斌)
Source: Applied Science and Technology (《应用科技》), CAS, 2009, No. 8, pp. 52-55 (4 pages)
Keywords: Web data extraction; expanded DOM tree; novelty degree
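
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's definition of novelty degree, so the formula here is an assumption: a text node's novelty is the share of the other similar pages that lack the same (tag path, text) pair. Visual features are omitted because they need a rendering engine, and the only link feature kept is whether a node sits inside an `<a>` element. All names (`TextNodeCollector`, `novelty`, `to_xml`) are hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collects (tag_path, text, inside_link) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text, "a" in self.stack))


def collect(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def novelty(pages):
    """Assumed novelty: fraction of the *other* pages lacking the same node."""
    if len(pages) < 2:
        raise ValueError("need at least two similar pages to compare")
    per_page = [collect(p) for p in pages]
    keys_per_page = [{(path, text) for path, text, _ in nodes} for nodes in per_page]
    scored = []
    for nodes in per_page:
        page_scores = []
        for path, text, is_link in nodes:
            seen = sum(1 for keys in keys_per_page if (path, text) in keys)
            score = 1 - (seen - 1) / (len(pages) - 1)
            page_scores.append((path, text, is_link, score))
        scored.append(page_scores)
    return scored


def to_xml(scored_nodes, threshold=0.5):
    """Keeps nodes whose novelty exceeds the threshold and serializes them as XML."""
    root = ET.Element("object")
    for path, text, is_link, score in scored_nodes:
        if score > threshold:
            item = ET.SubElement(root, "item", path=path, link=str(is_link).lower())
            item.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    template = "<html><body><div>Shop</div><h1>{}</h1><p>Price: {}</p><a href='/'>Home</a></body></html>"
    pages = [template.format("Red kettle", 20),
             template.format("Blue mug", 8),
             template.format("Green bowl", 12)]
    print(to_xml(novelty(pages)[0]))
```

In this toy input, text shared by every page (the shop name and the navigation link) scores 0 and is dropped, while the page-specific title and price score 1 and are kept. The sketch does not assign roles to data items, which the abstract names as a separate step.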

References (8)

  • 1 SAHUGUET A, AZAVANT F. WysiWyg Web Wrapper Factory (W4F) [C]// Proceedings of WWW Conference. Colorado, 1999: 32-45.
  • 2 LIU L, PU C. An XML-enabled wrapper construction system for Web information sources [C]// Proceedings of the 16th International Conference on Data Engineering. San Diego, USA, 2000: 122-135.
  • 3 CRESCENZI V, MECCA G. RoadRunner: towards automatic data extraction from large Web site [C]// 27th VLDB. Roma, Italy, 2001: 222-235.
  • 4 FINN A, KUSHMERICK A, SMYTH B. Fact or fiction: Content classification for digital libraries [C]// The 2nd DELOS Network of Excellence Workshop on Personalisation and Recommender Systems in Digital Libraries. Dublin, Ireland, 2001.
  • 5 KAASINEN E, AALTONEN M, KOLARI J, et al. Two approaches to bringing Internet services to WAP devices [C]// Proc of the 9th Int'l World Wide Web Conf on Computer Networks. Amsterdam: North-Holland Publishing Co, 2000: 231-246.
  • 6 BUYUKKOKTEN O, GARCIA-MOLINA H, PAEPCKE A. Seeing the whole in part: Text summarization for Web browsing on handheld devices [C]// Proc of the 10th Int Conf on World Wide Web. New York: ACM Press, 2001: 652-662.
  • 7 HU Dongdong (胡东东), MENG Xiaofeng (孟小峰). An automatic Web data extraction method based on tree structure [J]. Journal of Computer Research and Development (计算机研究与发展), 2004, 41(10): 1607-1613. Cited by: 21.
  • 8 ZHANG Shuyu (张树瑜), DU Guoning (杜国宁), ZHU Zhongying (朱仲英). Research on Web-based semi-structured information extraction technology [J]. Systems Engineering and Electronics (系统工程与电子技术), 2004, 26(5): 610-612. Cited by: 6.

