
Data Extraction and Integration from HTML Documents (HTML数据内容的抽取与集成)

Cited by: 8
Abstract: Building on XML, the HTML Tidy tool set enables a lightweight method of Web data mining and transformation. The transformation mainly addresses separating the content of HTML documents, and of collections of such documents, from the schema information they express. The process has three steps: clean the HTML documents with the standard class library provided by HTML Tidy, analyze the structure of the HTML elements further through the DOM tree, and finally extract and transform the data automatically with XSL and XPath.
Source: Journal of East China University of Science and Technology (Natural Science Edition), 2003, No. 6, pp. 613-616 (4 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Keywords: XML; HTML; data extraction
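
The three steps named in the abstract (clean the HTML, build a tree, extract with path expressions) can be illustrated with a minimal Python sketch. This is not the paper's implementation: HTML Tidy is replaced here by a small stand-in built on the standard library's `html.parser`, the XSL step is omitted, and extraction uses the limited XPath subset of `xml.etree.ElementTree`. The names `TreeBuildingParser`, `extract_rows`, `IMPLIED_END`, `SCOPE`, and the sample table with class `books` are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content, so they are not pushed on the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Start tag -> open tags it implicitly closes (a small subset of HTML's rules).
IMPLIED_END = {"li": {"li"}, "td": {"td", "th"}, "th": {"td", "th"}, "tr": {"tr"}}
# Containers that stop the search for an implicitly closed tag.
SCOPE = {"table", "ul", "ol"}


class TreeBuildingParser(HTMLParser):
    """Builds a well-formed element tree from sloppy HTML.

    Assumption: this is a simplified stand-in for the cleanup HTML Tidy performs.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        closes = IMPLIED_END.get(tag)
        if closes:
            # Close the nearest open sibling-like element, e.g. <td> before a new <td>.
            for i in range(len(self.stack) - 1, 0, -1):
                open_tag = self.stack[i].tag
                if open_tag in closes:
                    del self.stack[i:]
                    break
                if open_tag in SCOPE:
                    break
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything up to the matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def extract_rows(html):
    """Returns the text of each data row in tables marked class="books"."""
    parser = TreeBuildingParser()
    parser.feed(html)
    parser.close()
    rows = []
    for tr in parser.root.findall(".//table[@class='books']/tr"):
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        if cells:  # header rows contain only <th> cells and are skipped
            rows.append(cells)
    return rows


# Sample input with unclosed <tr>, <th>, and <td> tags, as is common in hand-written HTML.
SAMPLE = """
<html><body>
<table class="books">
<tr><th>Title<th>Year
<tr><td>XML Basics<td>2001
<tr><td><b>DOM in Practice</b><td>2002
</table>
</body></html>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

The point of the cleanup step is that path expressions only behave predictably on a well-formed tree. Once the unclosed cells are repaired, the same expression can be reused across every page that shares the layout, which is the content/schema separation the abstract describes.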