Research of a Tree Structure Based Fully Automatic Wrapper Method (基于树结构的包装器全自动生成方法的研究)

Cited by: 1
Abstract: This paper investigates the wrapper generation problem from a new perspective and implements an algorithm that generates wrappers fully automatically. The system works on the tree structures of two pages at a time: it discovers patterns by comparing the similarities and differences between the two trees, and it derives the wrapper from the points at which nodes of the two trees fail to match. Extensive experiments on HTML pages from real Web sites show that this approach, which is based on tree automata, compares favorably with some other approaches at finding structured data that contains optional and iterative structures.
Source: Journal of Hebei University of Technology (《河北工业大学学报》), CAS, 2007, Issue 6, pp. 41-46 (6 pages in total)
Keywords: Web data extraction; wrapper; tree structure; matching algorithm; automatic
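
The abstract describes the method only in outline, so the following is a minimal Python sketch of the general idea rather than the authors' algorithm. It assumes pages are already parsed into simple tag trees, aligns sibling lists with the standard-library `difflib.SequenceMatcher` (a stand-in for whatever matching algorithm the paper uses), marks nodes found in only one tree as optional, and folds runs of adjacent same-tag siblings into an iterator. The names `Node`, `fold`, `merge`, `generalize`, and `show` are hypothetical.

```python
from dataclasses import dataclass, field, replace
from difflib import SequenceMatcher


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    optional: bool = False  # present in only one of the compared trees
    repeated: bool = False  # seen as a run of adjacent same-tag siblings


def merge(a, b):
    """Merge two already-folded nodes with the same tag into one pattern node."""
    tags_a = [c.tag for c in a.children]
    tags_b = [c.tag for c in b.children]
    matcher = SequenceMatcher(None, tags_a, tags_b, autojunk=False)
    kids = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Matching siblings: recurse to compare their subtrees.
            kids += [merge(x, y) for x, y in zip(a.children[i1:i2], b.children[j1:j2])]
        else:
            # Mismatch: whatever appears on only one side becomes optional.
            unmatched = a.children[i1:i2] + b.children[j1:j2]
            kids += [replace(c, optional=True) for c in unmatched]
    return Node(a.tag, kids, a.optional or b.optional, a.repeated or b.repeated)


def fold(node):
    """Collapse runs of adjacent same-tag siblings into one repeated node."""
    out = []
    for child in map(fold, node.children):
        if out and out[-1].tag == child.tag:
            out[-1] = merge(out[-1], child)
            out[-1].repeated = True
        else:
            out.append(child)
    return Node(node.tag, out, node.optional, node.repeated)


def generalize(page_a, page_b):
    """Derive one pattern tree from two page trees with the same root tag."""
    return merge(fold(page_a), fold(page_b))


def show(node):
    """Render a pattern: '?' optional, '+' iterator, '*' optional iterator."""
    suffix = {(False, False): "", (True, False): "?",
              (False, True): "+", (True, True): "*"}[(node.optional, node.repeated)]
    inner = "(" + ", ".join(map(show, node.children)) + ")" if node.children else ""
    return node.tag + inner + suffix


page1 = Node("body", [Node("h1"),
                      Node("ul", [Node("li", [Node("b")]),
                                  Node("li", [Node("b"), Node("i")])])])
page2 = Node("body", [Node("h1"), Node("p"),
                      Node("ul", [Node("li", [Node("b")])])])

print(show(generalize(page1, page2)))  # prints body(h1, p?, ul(li(b, i?)+))
```

In this toy example the `p` element appears on only one page and is therefore marked optional, while the list items collapse into a single repeated `li` pattern whose `i` child is optional. A real wrapper generator would also have to handle text nodes, attributes, and repetitions that span more than one sibling, none of which this sketch attempts.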

