
A Web Text Extraction Technique Based on HTML5 Semantic Tags (cited by: 3)
Abstract: Building on a study of the data structure of web pages written to the new Web standards, and on existing information extraction and page cleanup techniques, this paper proposes a technique for extracting the main text of a web page based on HTML5 semantic tags. The technique effectively separates the page's content from noise that is unrelated to the page's topic and filters that noise out, so that the valuable text can be extracted. The method is therefore of practical value and has good prospects for application.
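The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the author's method. It assumes that text inside `article` or `main` elements counts as main content and that anything nested in `nav`, `aside`, `header`, `footer`, `script`, or `style` counts as noise. Both tag sets, the class name `SemanticTextExtractor`, and the function `extract_main_text` are illustrative assumptions.

```python
from html.parser import HTMLParser

# Assumed tag sets (not taken from the paper): which HTML5 elements are
# treated as main content and which are treated as noise.
CONTENT_TAGS = {"article", "main"}
NOISE_TAGS = {"nav", "aside", "header", "footer", "script", "style"}


class SemanticTextExtractor(HTMLParser):
    """Collect text found inside content tags but outside any noise tag."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # how many content elements are currently open
        self.noise_depth = 0    # how many noise elements are currently open
        self.chunks = []        # extracted text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.content_depth += 1
        elif tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.content_depth > 0:
            self.content_depth -= 1
        elif tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside a content element and not inside noise.
        if text and self.content_depth > 0 and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the main text of an HTML5 page, one fragment per line."""
    parser = SemanticTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><body>"
        "<nav><a href='/'>Home</a></nav>"
        "<article><h1>Title</h1><p>Body text.</p><aside>Ad</aside></article>"
        "<footer>Copyright</footer>"
        "</body></html>"
    )
    print(extract_main_text(sample))
```

One consequence of this choice of tag sets is that a `header` nested inside an `article` is also discarded. A page that puts its headline there would need `header` removed from the noise set.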
Author: WEI Jia-Jia (韦佳佳), Department of Information Engineering, Anhui Technical College of Mechanical and Electrical, Wuhu, Anhui 241002, China
Source: Journal of Guiyang University: Natural Sciences (《贵阳学院学报(自然科学版)》), 2017, Issue 3, pp. 25-28 (4 pages)
Funding: 2015 college-level Young Teacher Development Support Program, teaching and research project (Project No. 2015yjjy022)
Keywords: web page; text extraction; HTML5; semantic tags
