Abstract
This paper studies and implements a fully automatic wrapper generation algorithm. The system works on the tree structures of two pages at a time: it discovers patterns by comparing the similarities and differences between the two trees, and it derives the wrapper from the mismatches between their nodes. Experiments on real HTML pages from live Web sites show that this tree-automata-based approach compares favorably with some other approaches in discovering optional and iterative structures in the data.
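The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes pages are already parsed into simple tag trees, uses Python's `difflib.SequenceMatcher` as a stand-in for sibling alignment, and treats unmatched nodes as optional and runs of same-tag siblings as iterators. The names `Node`, `collapse`, `generalize`, and `show` are hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    optional: bool = False  # assumption: node may be absent on some pages
    repeated: bool = False  # assumption: node may occur several times in a row


def collapse(children):
    """Merge runs of adjacent same-tag siblings into one iterator node."""
    out = []
    for child in children:
        if out and out[-1].tag == child.tag:
            merged = generalize(out[-1], child)
            merged.repeated = True
            out[-1] = merged
        else:
            out.append(child)
    return out


def generalize(a, b):
    """Build a pattern node covering two nodes that share the same tag."""
    left, right = collapse(a.children), collapse(b.children)
    matcher = SequenceMatcher(
        None, [n.tag for n in left], [n.tag for n in right], autojunk=False
    )
    merged = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Matching siblings: recurse to compare their subtrees.
            merged += [generalize(x, y) for x, y in zip(left[i1:i2], right[j1:j2])]
        else:
            # Mismatch: nodes present on only one page become optional.
            for n in left[i1:i2] + right[j1:j2]:
                pattern = generalize(n, n)  # normalizes the subtree, copies it
                pattern.optional = True
                merged.append(pattern)
    return Node(a.tag, merged, a.optional or b.optional, a.repeated or b.repeated)


def show(n):
    """Render a pattern in a regex-like notation: ? optional, + iterator."""
    text = n.tag
    if n.children:
        text += "(" + " ".join(show(c) for c in n.children) + ")"
    if n.repeated:
        text += "*" if n.optional else "+"
    elif n.optional:
        text += "?"
    return text


page1 = Node("body", [Node("h1"),
                      Node("ul", [Node("li"), Node("li"), Node("li")]),
                      Node("p")])
page2 = Node("body", [Node("ul", [Node("li")]), Node("p")])
print(show(generalize(page1, page2)))  # body(h1? ul(li+) p)
```

In this toy run, the `h1` element appears on only one page and is therefore marked optional, while the repeated `li` items are folded into a single iterator. A real wrapper generator would also need to compare text content and attributes, which this sketch ignores.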
Source
Journal of Hebei University of Technology (《河北工业大学学报》)
CAS
2007, No. 6, pp. 41-46 (6 pages)
Keywords
web data extraction
wrapper
tree structure
matching algorithm
automatic