Unregulated table structure Web information extraction based on finite state automata (基于有限状态自动机提取不规范表结构Web信息)

Abstract: The large volume of irregularly structured tabular information on the Web is a problem that Web information extraction must currently solve. Building on existing methods, this paper gives an algorithm for inductively learning the set of contextual rules between adjacent attributes, and proposes the concepts of an attribute transducer and a finite state automaton wrapper whose granularity is the Web page. Finally, it describes an algorithm that uses the finite state automaton wrapper to extract irregular table-structured Web information.
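The idea of a finite state automaton wrapper driven by contextual rules between adjacent attributes can be illustrated with a minimal sketch. This is not the paper's algorithm: the rules here are written by hand rather than induced from examples, the wrapper works on one table row rather than a whole Web page, and the attribute names (`name`, `price`, `stock`), the markup, and the function `extract` are all hypothetical. Each state is an attribute, and each transition fires on a pattern that separates one attribute from the next, so rows with missing attributes simply follow a different path through the automaton.

```python
import re

# Hypothetical contextual rules: (current state, next state) -> separator pattern.
# In the paper such rules are learned; here they are hand-written assumptions.
RULES = {
    ("START", "name"): re.compile(r"<td><b>"),
    ("name", "price"): re.compile(r"</b></td><td>\$"),
    ("name", "stock"): re.compile(r"</b></td><td>qty:"),  # row without a price
    ("name", "END"): re.compile(r"</b></td></tr>"),
    ("price", "stock"): re.compile(r"</td><td>qty:"),
    ("price", "END"): re.compile(r"</td></tr>"),           # row without a stock
    ("stock", "END"): re.compile(r"</td></tr>"),
}

def extract(row, rules):
    """Run the automaton over one row and return the attributes it finds."""
    state, pos, record = "START", 0, {}
    while state != "END":
        best = None
        # Among the transitions leaving the current state, take the one
        # whose separator appears earliest in the remaining text.
        for (src, dst), pattern in rules.items():
            if src != state:
                continue
            m = pattern.search(row, pos)
            if m and (best is None or m.start() < best[1].start()):
                best = (dst, m)
        if best is None:
            raise ValueError(f"no rule applies in state {state!r}")
        dst, m = best
        if state != "START":
            # Text between the previous separator and this one is the value.
            record[state] = row[pos:m.start()].strip()
        state, pos = dst, m.end()
    return record

ROWS = [
    "<tr><td><b>Widget</b></td><td>$9.50</td><td>qty:3</td></tr>",
    "<tr><td><b>Gadget</b></td><td>qty:7</td></tr>",
    "<tr><td><b>Gizmo</b></td><td>$4.00</td></tr>",
]

for row in ROWS:
    print(extract(row, RULES))
```

The second and third rows each omit one attribute, yet the same rule table handles them because the automaton allows transitions that skip a state.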

Authors: Li Shijun (李石君), Ou Weijie (欧伟杰), Jian Wei (简伟), Huang He (黄河)

Affiliation: School of Computer Science, Wuhan University

Source: Engineering Journal of Wuhan University (《武汉大学学报（工学版）》), 2005, No. 6, pp. 128-132 (5 pages). Indexed in CAS, CSCD, and the Peking University core journal list.

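Funding: Supported by the National Natural Science Foundation of China (No. 60273072) and the National High Technology Research and Development Program of China (863 Program) (No. 2002AA423450).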

Keywords: information extraction; contextual rule set; finite state automata (FSA); automaton wrapper

Classification number: TP393.09 [Automation and Computer Technology: Computer Application Technology]



