Analysis and Comparison of Web Information Extraction Technologies (网络信息抽取技术分析与比较)

Cited by: 3

Abstract: Over the past two decades, the explosive growth and spread of the World Wide Web have made web data a valuable information resource. The sheer volume of web data has opened a new era for data analysis and mining systems, and more and more web applications need to extract, mine, and integrate structured data from many different data sources. However, because web pages are semi-structured, the data they present usually cannot be extracted and understood automatically by machines. Web information extraction therefore aims to extract structured data from web pages, a problem made very challenging by the large scale and high heterogeneity of web data. This paper surveys recent work on web information extraction, analyzes the advantages and limitations of each approach, and categorizes and compares the existing approaches.

Authors: Song Xinying (宋鑫莹), Zhao Tiejun (赵铁军)

Affiliation: School of Computer Science and Technology, Harbin Institute of Technology

Source: Intelligent Computer and Applications (《智能计算机与应用》), 2013, No. 5, pp. 24-27, 30 (5 pages)

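Funding: National High Technology Research and Development Program of China (863 Program) (2011AA01A207); National Natural Science Foundation of China (61073130)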

Keywords: web information extraction; wrapper; template

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

