
A Method for Extracting Topic Information from Web Pages Based on Information Entropy (cited by: 6)

Extracting topic information of Web page based on entropy
Abstract: This paper proposes an information extraction method that prunes nodes showing a large increase in information entropy. An HTML document is first parsed to construct a DOM tree. Content that does not need to be processed is then filtered out according to the configuration, and a semantic model (STU) tree is built. Finally, nodes whose entropy increase exceeds a threshold are pruned, and the page containing the extracted topic information is output. Preliminary experimental results support the effectiveness of the method for Web page information extraction. The mathematical model is simple and reliable, and topic information extraction requires essentially no manual intervention. The method can be applied to Web data mining systems and to information acquisition on mobile devices such as PDAs.
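The pipeline described in the abstract (parse, filter, build a tree, prune by entropy increase) can be sketched roughly as follows. The abstract does not give the paper's entropy definition or the structure of the STU tree, so everything here is an assumption made for illustration: `Node`, `TreeBuilder`, `node_entropy`, `prune`, `extract_topic`, the `SKIP_TAGS` filter configuration, and the threshold value are hypothetical names and choices. In particular, the sketch substitutes a simple stand-in measure, the Shannon entropy of how a node's words are distributed among its child subtrees, and treats the "entropy increase" of a child as its entropy minus its parent's.

```python
import math
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # assumed filter configuration
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # no closing tag


class Node:
    """One element of the simplified tree (hypothetical structure)."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

    def words(self):
        """All words in this subtree, own text first."""
        out = self.text.split()
        for child in self.children:
            out.extend(child.words())
        return out


class TreeBuilder(HTMLParser):
    """Builds the tree and drops everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.skip = 0  # nesting depth inside filtered tags

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif not self.skip and tag not in VOID_TAGS:
            node = Node(tag, self.cur)
            self.cur.children.append(node)
            self.cur = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif not self.skip and self.cur.tag == tag and self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        if not self.skip:
            self.cur.text += " " + data


def node_entropy(node):
    """Shannon entropy (bits) of how the subtree's words are split
    between the node's own text and each child subtree (assumed measure)."""
    parts = [len(node.text.split())] + [len(c.words()) for c in node.children]
    total = sum(parts)
    if total == 0:
        return 0.0
    return -sum(p / total * math.log2(p / total) for p in parts if p)


def prune(node, threshold):
    """Top-down pass: drop children whose entropy exceeds the parent's
    entropy by more than the threshold, then recurse into the rest."""
    base = node_entropy(node)  # measured before any child is removed
    kept = []
    for child in node.children:
        if node_entropy(child) - base <= threshold:
            prune(child, threshold)
            kept.append(child)
    node.children = kept


def extract_topic(html, threshold=1.0):
    """Parse, filter, prune, and return the remaining text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, threshold)
    return " ".join(builder.root.words())


page = ("<html><head><script>var x = 1;</script></head><body>"
        "<ul><li>Home</li><li>News</li><li>Sports</li><li>Contact</li></ul>"
        "<div><p>Entropy based pruning keeps the main article text "
        "of the page intact.</p></div></body></html>")
print(extract_topic(page))
```

Under this stand-in measure, a navigation list whose words are spread evenly over many small children scores a high entropy relative to its parent and is pruned, while a block dominated by one long paragraph scores low and is kept. The paper's actual model may differ in both the entropy definition and the traversal order.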
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2007, No. 4, pp. 164-166 (3 pages)
Keywords: Web extraction; STU-DOM tree; information entropy
