Abstract
Web data extraction is an active research topic, yet no uniform and effective extraction method exists so far. This paper proposes an approach. First, the DOM (Document Object Model) tree of a Web page is extended with visual features and link features. Next, the novelty of the nodes and subtrees in the extended DOM trees of several similar pages is computed. Object data are then identified by novelty and extracted according to the roles of their data items. Finally, the object data are saved as XML documents. Experimental analysis shows that the method achieves good extraction results.
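The pipeline in the abstract can be illustrated with a minimal sketch. The abstract does not define how novelty is computed, so the code below assumes a simple stand-in: a node's novelty is one minus the share of similar pages that contain the same (path, text) pair. Visual features, link features, and data item roles are omitted; the node path is used as a rough placeholder for a role. The function names and the threshold of 0.5 are hypothetical and not taken from the paper, and the input pages are assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def node_signatures(root, prefix=""):
    """Yield a (path, text) signature for every element in the tree."""
    path = f"{prefix}/{root.tag}"
    text = (root.text or "").strip()
    yield path, text
    for child in root:
        yield from node_signatures(child, path)


def novelty_scores(pages):
    """Assumed definition: novelty = 1 - (share of pages containing the signature)."""
    roots = [ET.fromstring(page) for page in pages]
    counts = Counter()
    for root in roots:
        # Count each signature at most once per page.
        counts.update(set(node_signatures(root)))
    total = len(roots)
    return roots, {sig: 1 - count / total for sig, count in counts.items()}


def extract_objects(pages, threshold=0.5):
    """Collect the high-novelty text nodes of each page into an XML tree."""
    roots, scores = novelty_scores(pages)
    result = ET.Element("objects")
    for root in roots:
        obj = ET.SubElement(result, "object")
        for path, text in node_signatures(root):
            # Skip empty nodes and template text shared across pages.
            if text and scores[(path, text)] >= threshold:
                item = ET.SubElement(obj, "item", {"path": path})
                item.text = text
    return result


if __name__ == "__main__":
    pages = [
        "<html><body><h1>Shop</h1><div><span>Camera</span><b>$100</b></div></body></html>",
        "<html><body><h1>Shop</h1><div><span>Phone</span><b>$200</b></div></body></html>",
        "<html><body><h1>Shop</h1><div><span>Laptop</span><b>$900</b></div></body></html>",
    ]
    # Serialize the extracted object data as an XML document.
    print(ET.tostring(extract_objects(pages), encoding="unicode"))
```

In this sketch, text shared by every page (the "Shop" heading) scores a novelty of zero and is dropped as template content, while text unique to one page is kept as object data. A real implementation would parse arbitrary HTML and use the paper's own novelty measure and role assignment.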
Source
《应用科技》
CAS
2009, Issue 8, pp. 52-55 (4 pages)
Applied Science and Technology