
Research on Node-Weighted Tree Edit Distance Matching in Web Information Extraction (cited by: 2)
Abstract: An improved tree matching algorithm is proposed. It refines the tree edit distance method by taking HTML characteristics into account: each HTML tree node is assigned a weight according to how much the node contributes to displaying the relevant data in the browser. The algorithm constructs node-weighted HTML trees from HTML data objects, and pattern recognition is performed by obtaining the maximum mapping value between two constructed trees. Experiments on commercial websites verify the effectiveness of the algorithm.
Source: Computer Era (《计算机时代》), 2010, No. 3, pp. 49-51 (3 pages)
Keywords: information extraction; DOM; tree edit distance; pattern recognition
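
The abstract gives no implementation details, so the following is only a minimal sketch of how a maximum mapping value between two node-weighted trees might be computed. It assumes a weighted variant of simple tree matching, which may differ from the paper's actual procedure. The names `Node`, `TAG_WEIGHTS`, `max_mapping`, and `similarity`, as well as every weight value, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical weights: the abstract does not state the values the paper uses.
TAG_WEIGHTS = {"table": 3.0, "tr": 2.0, "td": 2.0, "a": 1.5, "img": 1.5}
DEFAULT_WEIGHT = 1.0


@dataclass
class Node:
    """A node of an HTML tree, identified only by its tag name."""
    tag: str
    children: List["Node"] = field(default_factory=list)

    @property
    def weight(self) -> float:
        return TAG_WEIGHTS.get(self.tag, DEFAULT_WEIGHT)


def max_mapping(a: Node, b: Node) -> float:
    """Largest total weight of an order-preserving, top-down mapping
    between trees a and b (assumed weighted simple tree matching)."""
    if a.tag != b.tag:
        return 0.0  # different roots cannot be mapped to each other
    m, n = len(a.children), len(b.children)
    # best[i][j]: best value using the first i children of a and first j of b
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + max_mapping(
                a.children[i - 1], b.children[j - 1]
            )
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return a.weight + best[m][n]


def total_weight(t: Node) -> float:
    """Sum of the weights of all nodes in the tree."""
    return t.weight + sum(total_weight(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Mapping value normalized to [0, 1]; 1.0 means identical trees."""
    return 2 * max_mapping(a, b) / (total_weight(a) + total_weight(b))
```

Under these assumptions, two table rows such as `Node("tr", [Node("td", [Node("a")]), Node("td", [Node("img")])])` and `Node("tr", [Node("td", [Node("a")]), Node("td")])` would be compared with `similarity(row1, row2)`. A threshold on that score, also an assumption here, could then decide whether the two rows follow the same pattern.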

References (8)

  • 1. A. Hemnani and S. Bressan, Information Extraction-Tree Alignment Approach to Pattern Discovery in Web Documents, DEXA 2002, Lecture Notes in Computer Science, 2002, 2453: 789-798.
  • 2. D. Buttler, L. Liu, and C. Pu, A Fully Automated Object Extraction System for the World Wide Web, ICDCS 01, 2001: 361-370.
  • 3. D. Reis, P. Golgher, A. Silva, and A. Laender, Automatic Web News Extraction Using Tree Edit Distance, World Wide Web-04, 2004: 502-511.
  • 4. D. Embley, Y. Jiang, and Y. Ng, Record-Boundary Discovery in Web Documents, SIGMOD, 1999: 467-478.
  • 5. C.-H. Chang and S.-C. Kuo, OLERA: Semisupervised Web-Data Extraction with Visual Support, IEEE Intelligent Systems, 2004, 19(6): 56-64.
  • 6. Gao Qiang, Zhang Jingzhi, Geng Hua, Pan Jingui. Web Information Extraction Based on Repeated Patterns [J]. Computer Science (《计算机科学》), 2007, 34(4): 210-212. Cited by: 6
  • 7. K. Tai, The Tree-to-Tree Correction Problem, Journal of the ACM, 1979, 26(3): 422-433.
  • 8. W. Yang, Identifying Syntactic Differences Between Two Programs, Software-Practice and Experience, 1991, 21(7): 739-755.
