Web Information Extraction Based on HTML Pattern Algebra (基于HTML模式代数的Web信息提取方法)

Cited by: 8
Abstract: Efficiently generating wrappers that extract Web information has broad application prospects, yet it remains a difficult problem without an effective solution. To address it, a pattern algebra for HTML documents is proposed; it comprises key concepts such as the consistent pattern set, together with an addition operation on patterns. On this basis, a new method for extracting Web information is presented. The method learns, over the whole set of training examples, a consistent pattern set that represents the extraction rule of each attribute, and then extracts data with that consistent pattern set, which consists of multiple patterns. It is applicable to tabular Web pages with missing attributes, multi-valued attributes, or attributes appearing in several different orders, as well as to hierarchically structured Web pages. Its effectiveness has been validated experimentally in a prototype system.
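The abstract does not define the pattern algebra itself, so the following is only a rough illustration of the general idea of extracting each attribute with a set of several patterns rather than a single rule. Everything in it is an assumption made for illustration: the regular expressions, the attribute names, the sample rows, and the helper `add_patterns` (a plain order-preserving union) are hypothetical stand-ins and are not the paper's patterns or its addition operation.

```python
import re

# Hypothetical pattern sets: one list of regexes per attribute, each with a
# single capture group. These are illustrative stand-ins, not the paper's
# HTML patterns.
PATTERN_SETS = {
    "title": [r"<b>(.*?)</b>", r"<strong>(.*?)</strong>"],
    "author": [r"<i>(.*?)</i>"],
    "price": [r'<span class="price">(.*?)</span>'],
}


def add_patterns(first, second):
    """Combine two pattern lists, keeping order and dropping duplicates.

    This union is an assumed simplification, not the paper's addition.
    """
    return list(dict.fromkeys(first + second))


def extract(record_html, pattern_sets):
    """Apply every pattern of every attribute to one record.

    A missing attribute yields an empty list, a multi-valued attribute
    yields several values, and attribute order in the markup is irrelevant.
    Values are grouped by pattern, not by position in the document.
    """
    result = {}
    for attribute, patterns in pattern_sets.items():
        values = []
        for pattern in patterns:
            values.extend(re.findall(pattern, record_html))
        result[attribute] = values
    return result


# Two made-up table rows: the first has two authors, the second swaps the
# attribute order and marks the title up differently. Neither has a price.
rows = [
    "<tr><td><b>Web Mining</b></td><td><i>A. Author</i><i>B. Author</i></td></tr>",
    "<tr><td><i>C. Author</i></td><td><strong>Wrappers</strong></td></tr>",
]

for row in rows:
    print(extract(row, PATTERN_SETS))
```

Using a list of patterns per attribute is what lets one rule set cover rows whose markup or ordering differs; how the paper actually learns and combines its patterns is described only in the full text.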
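Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2006, No. 9, pp. 1644-1650 (7 pages)
Funding: National Natural Science Foundation of China (60573095); Natural Science Foundation of Hubei Province (2005ABA238).
Keywords: Web information extraction; wrapper induction; Web mining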