DOM-Based Extraction of Topical Information from Web Pages (基于DOM的网页主题信息的抽取) Cited by: 19
Abstract As the Internet has grown, Web pages carry ever more information at ever higher density. The topical information on a page, however, is usually not clearly marked, which makes it hard to extract. To address this, the paper proposes an algorithm in four steps. First, it builds a Document Object Model (DOM) tree for the page. Second, to compensate for the weak semi-structured nature of HTML, it annotates the DOM with display and semantic attributes (number of links, number of non-link text characters, height, width). Third, it partitions the tree into blocks using a proposed clustering rule. Finally, it prunes the tree to delete useless information and extracts the topical information. Experiments show that the method extracts topical information accurately.
Authors 刘军 (Liu Jun), 张净 (Zhang Jing)
Source Computer Applications and Software (《计算机应用与软件》), CSCD, 2010, No. 5, pp. 188-190 (3 pages)
Keywords DOM; topic; information extraction; partition; pruning
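
The pipeline outlined in the abstract (build a DOM tree, annotate nodes, partition into blocks, prune) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the clustering rule or any thresholds, so the block tags in `BLOCK_TAGS`, the `max_link_ratio` threshold, and all function and class names below are assumptions made for illustration. Height and width are omitted because they require a rendering engine, and treating container elements as blocks is only a stand-in for the paper's clustering rule.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Assumption: these container elements stand in for the paper's "blocks".
BLOCK_TAGS = {"div", "table", "td", "ul", "ol"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""        # only set on "#text" leaves
        self.links = 0        # number of <a> elements in the subtree
        self.link_chars = 0   # characters of text inside links
        self.plain_chars = 0  # characters of text outside links


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (no error recovery beyond stray tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            leaf = Node("#text", self.current)
            leaf.text = text
            self.current.children.append(leaf)


def annotate(node, inside_link=False):
    """Fill in link count and link / non-link character counts bottom-up."""
    inside_link = inside_link or node.tag == "a"
    node.links = 1 if node.tag == "a" else 0
    chars = len(node.text)
    node.link_chars = chars if inside_link else 0
    node.plain_chars = 0 if inside_link else chars
    for child in node.children:
        annotate(child, inside_link)
        node.links += child.links
        node.link_chars += child.link_chars
        node.plain_chars += child.plain_chars


def is_noise(node, max_link_ratio=0.5):
    """Hypothetical rule: a block dominated by link text is navigation noise."""
    total = node.link_chars + node.plain_chars
    return node.links > 0 and total > 0 and node.link_chars / total > max_link_ratio


def prune(node):
    """Drop noisy blocks top-down, then recurse into the survivors."""
    node.children = [c for c in node.children
                     if not (c.tag in BLOCK_TAGS and is_noise(c))]
    for child in node.children:
        prune(child)


def collect_text(node):
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topic(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    prune(builder.root)
    return " ".join(collect_text(builder.root))


if __name__ == "__main__":
    sample = ('<body><div><a href="/a">Home</a><a href="/b">News</a></div>'
              '<div><p>Main article text about DOM pruning.</p></div></body>')
    print(extract_topic(sample))
```

The link-to-text ratio is used here only because the abstract lists link count and non-link text count among the node attributes; a faithful reproduction would need the rule and parameters from the full paper. The sketch also does not filter `script` or `style` content.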