
Web content information extraction method based on rule model (基于规则模型的网页主题文本提取方法)  Cited by: 3
Abstract: Based on an analysis of the structured and semi-structured information in web pages, this paper proposes a rule-model-based method for extracting the main text of a web page. The method summarizes how different HTML tags are used and what structural features page layouts share, then defines a series of filtering, extraction, and merging rules to build a general-purpose main-text extraction model, with the aim of extracting the topical text of a page effectively. Experimental results show that the method achieves high accuracy in extracting topical text from various types of web pages and generalizes well.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2009, Issue 20, pp. 4665-4667 (3 pages)
Keywords: rule model; web information extraction; main body extraction; data gathering; web mining
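
The abstract names three kinds of rules (filtering, extraction, merging) but this record does not list them. As a rough illustration only, the sketch below shows one way such a pipeline could look in Python using the standard-library `html.parser`. Everything specific in it is an assumption and not taken from the paper: the tag sets `FILTER_TAGS` and `BLOCK_TAGS`, the `min_chars` and `max_link_density` thresholds, and the names `BlockCollector` and `extract_main_text` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical filter rule: tags whose content is never main text.
FILTER_TAGS = {"script", "style", "noscript", "iframe", "form"}
# Hypothetical extraction rule: tags that delimit candidate text blocks.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._skip_depth = 0    # > 0 while inside a filtered tag
        self._link_depth = 0    # > 0 while inside an <a> element
        self._parts = []
        self._link_chars = 0

    def flush(self):
        """Close the current block and store it if it holds any text."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in FILTER_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in FILTER_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return  # filter rule: ignore text inside filtered tags
        self._parts.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_chars=20, max_link_density=0.3):
    """Filter, extract, and merge: return the kept blocks joined by newlines."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [
        text
        for text, link_chars in parser.blocks
        # Assumed extraction rule: keep long blocks that are not mostly links.
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    # Assumed merging rule: concatenate surviving blocks in document order.
    return "\n".join(kept)
```

Calling `extract_main_text(page_html)` on a page with a navigation bar, two body paragraphs, and a short footer would be expected to return only the two paragraphs, since the navigation block is almost entirely link text and the footer is shorter than `min_chars`. The thresholds would need tuning per site, and a real rule model would likely use layout features that this sketch ignores.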

