
The HTML Information Extraction Based on Regular Expressions (基于正则表达式的HTML信息提取)
Cited by: 4
Abstract: Practical applications often require analyzing the source code of Web pages and parsing HTML tags to extract useful data. This paper studies how to use regular expressions to obtain the content of common HTML tags, implements customized extraction of HTML information, and illustrates the implementation process with the example of scraping the data in a student score table.
Source: Computer Development & Applications (《电脑开发与应用》), 2012, No. 4, pp. 44-46 (3 pages)
Keywords: regular expressions, HTML, information extraction
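
The abstract describes extracting the contents of common HTML tags with regular expressions, using a student score table as the worked example. The paper's own code and sample page are not reproduced in this record, so the following is only a minimal Python sketch of that general idea. The sample HTML, the column names, and the function name `extract_table_rows` are all hypothetical and chosen for illustration.

```python
import re
from html import unescape

# Hypothetical sample page (an assumption): the paper's actual score table
# is not available in this record.
SAMPLE_HTML = """
<table id="scores">
  <tr><th>Student ID</th><th>Name</th><th>Course</th><th>Score</th></tr>
  <tr><td>2012001</td><td>Zhang San</td><td>Databases</td><td>86</td></tr>
  <tr><td>2012002</td><td>Li Si</td><td><b>Databases</b></td><td> 92 </td></tr>
</table>
"""

# Non-greedy matches so each pattern stops at the nearest closing tag.
# DOTALL lets a row or cell span several lines; IGNORECASE accepts <TR>, <TD>.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")  # any remaining inline markup such as <b>


def extract_table_rows(html):
    """Return each table row as a list of plain-text cell values."""
    rows = []
    for row_html in ROW_RE.findall(html):
        cells = [
            unescape(TAG_RE.sub("", cell)).strip()
            for cell in CELL_RE.findall(row_html)
        ]
        if cells:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    header, *records = extract_table_rows(SAMPLE_HTML)
    for record in records:
        print(dict(zip(header, record)))
```

A sketch like this only suits flat, regularly structured markup. Nested tables or malformed tags would defeat the non-greedy patterns, and a real HTML parser would likely be the safer choice in those cases.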


