网络信息抽取技术分析与比较 (cited by: 3)

Analysis and Comparison of Web Information Extraction Technologies
Abstract: Over the past two decades, the explosive growth and spread of the World Wide Web have made it an important information resource. The tremendous amount of web data has opened a new era for data analysis and mining systems, and more and more web applications need to extract, mine, and integrate structured data from many different data sources. However, because web pages are semi-structured, the data they present usually cannot be extracted and understood by machines automatically. Web information extraction therefore aims to extract structured data from web pages, a very challenging problem given the large scale and high heterogeneity of web data. This paper surveys recent web information extraction studies, analyzes the advantages and limitations of each method, and categorizes and compares the existing approaches.
Source: Intelligent Computer and Applications (《智能计算机与应用》), 2013, No. 5, pp. 24-27, 30 (5 pages)
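Funding: National High Technology Research and Development Program of China (863 Program) (2011AA01A207); National Natural Science Foundation of China (61073130)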
Keywords: Web Information Extraction; Wrapper; Template