基于树结构的包装器全自动生成方法的研究 (cited by: 1)

Research of a Tree Structure Based Fully Automatic Wrapper Method


Abstract: This paper studies and implements an algorithm for fully automatic wrapper generation. The system works on the tree structures of two pages at a time: it discovers patterns by comparing the similarities and differences between the two trees, and it derives the wrapper from the node mismatches between them. Experiments on real HTML pages from real web sites show that this tree-automata-based approach compares favorably with some other approaches in finding structured data with optional and iterative structures.
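The idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the record does not describe in detail. It is a simplified stand-in that assumes pages are already parsed into tag trees, aligns sibling lists with Python's `difflib.SequenceMatcher`, marks nodes found in only one tree as optional, and folds runs of adjacent same-tag siblings into an iterator. The names `Node`, `collapse`, `merge`, and `show` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    optional: bool = False  # seen in only one of the two pages
    repeated: bool = False  # occurs one or more times in a row


def collapse(children):
    """Fold runs of adjacent same-tag siblings into one node flagged as repeated."""
    out = []
    for child in children:
        if out and out[-1].tag == child.tag:
            out[-1] = merge(out[-1], child)
            out[-1].repeated = True
        else:
            out.append(child)
    return out


def merge(a, b):
    """Generalize two nodes that share a tag into one template node."""
    ca, cb = collapse(a.children), collapse(b.children)
    tags_a = [c.tag for c in ca]
    tags_b = [c.tag for c in cb]
    matcher = SequenceMatcher(None, tags_a, tags_b, autojunk=False)
    merged = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Matching siblings: recurse to generalize their subtrees.
            merged += [merge(x, y) for x, y in zip(ca[i1:i2], cb[j1:j2])]
        else:
            # Mismatch: siblings present in only one tree become optional.
            for n in ca[i1:i2] + cb[j1:j2]:
                merged.append(Node(n.tag, n.children, True, n.repeated))
    return Node(a.tag, merged, a.optional or b.optional, a.repeated or b.repeated)


def show(n):
    """Render a template: '+' marks an iterator, '?' marks an optional node."""
    s = n.tag
    if n.children:
        s += "(" + " ".join(show(c) for c in n.children) + ")"
    if n.repeated:
        s += "+"
    if n.optional:
        s += "?"
    return s


# Two toy pages: the second has an extra <img> and one more <li>.
page1 = Node("html", [Node("body", [Node("h1"),
                                    Node("ul", [Node("li"), Node("li")])])])
page2 = Node("html", [Node("body", [Node("h1"), Node("img"),
                                    Node("ul", [Node("li"), Node("li"), Node("li")])])])
template = merge(page1, page2)
print(show(template))  # html(body(h1 img? ul(li+)))
```

In this toy case the extra `img` is reported as optional and the `li` run as an iterator, the two kinds of structure the abstract says the method targets. A real wrapper generator would also need to handle text nodes, attributes, and nested repetition, which this sketch leaves out.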

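Authors: Li Yaqiao (李亚桥), Wang Xiaodong (王晓东), Li Zhi (李智)

Affiliations: School of Transportation Engineering, Civil Aviation University of China; School of Continuing Education, Hebei University of Technology; School of Computer Science and Software, Hebei University of Technology

Source: Journal of Hebei University of Technology (《河北工业大学学报》), CAS, 2007, No. 6, pp. 41-46 (6 pages)

Keywords: web data extraction; wrapper; tree structure; matching algorithm; automatic

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]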
