Abstract
An improved tree matching algorithm is proposed. It refines the tree edit distance method by taking HTML features into account: each HTML tree node is assigned a weight according to the importance of the data it displays in the browser. The algorithm builds a node-weighted HTML tree from HTML data objects, and pattern recognition is performed by computing the maximum mapping value between two such trees. Experiments on commercial websites confirm the effectiveness of the algorithm.
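The idea of a maximum mapping value over node-weighted trees can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a weighted variant of the classic simple tree matching recurrence, and the `Node` class, the `TAG_WEIGHTS` table, and its numeric values are hypothetical, since the abstract does not state which weights the paper uses.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical weights: tags that carry displayed data count more.
# The actual values used in the paper are not given in the abstract.
TAG_WEIGHTS = {"td": 2.0, "a": 2.0, "img": 2.0}
DEFAULT_WEIGHT = 1.0


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def max_mapping(a: Node, b: Node) -> float:
    """Weighted maximum mapping value between two ordered trees."""
    # Roots with different tags cannot be mapped to each other.
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best order-preserving mapping between the first i
    # children of a and the first j children of b.
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + max_mapping(
                a.children[i - 1], b.children[j - 1]
            )
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    # The matched root pair contributes its own tag weight.
    return TAG_WEIGHTS.get(a.tag, DEFAULT_WEIGHT) + best[m][n]


# Two small table fragments that share one row of two cells.
page_a = Node("table", [Node("tr", [Node("td"), Node("td")]), Node("tr", [Node("td")])])
page_b = Node("table", [Node("tr", [Node("td"), Node("td")])])
score = max_mapping(page_a, page_b)
```

Dividing `score` by the self-matching value of the larger tree would give a normalized similarity; that normalization is likewise an assumption rather than something the abstract specifies.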
Source
Computer Era (《计算机时代》)
2010, No. 3, pp. 49-51 (3 pages)
Keywords
information extraction
DOM
tree edit distance
pattern recognition