
An RSS-Level Web Page Main Content Extraction Method and System
Abstract This paper proposes an RSS-level method and system for extracting the main content of web pages. The method trains a main-content template from the information in a small number of entries in an RSS feed, then uses that template to extract the main content of every web page under the feed. It can extract a page's title, body text, and category separately. It also has a self-adaptation mechanism that detects template changes in real time. The experimental results show that the method and system achieve high recall and precision.
Author Zhang Yan (张艳)
Source Library and Information Service (《图书情报工作》), CSSCI, PKU Core (北大核心), 2010, No. 14, pp. 107-110, 130 (5 pages in total)
Funding One of the research outputs of the Nanjing University of Information Science and Technology research fund project "Research and Implementation of a Digital Library Based on the Semantic Web" (project no. SK20080153)
Keywords web page main content extraction; RSS; template; self-adaptation mechanism
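
The abstract describes learning a template from a few RSS entries, applying it to every page under the feed, and detecting when the template changes. The record does not say how the paper represents or trains its template, so the following Python sketch is only an illustration under stated assumptions, not the author's method. It assumes that a "template" is simply the tag path whose text best matches the RSS entry text (by Jaccard word overlap), and that a template change shows up as a drop in that overlap. All names (`learn_template`, `extract`, `template_changed`, `page`) and the 0.3 threshold are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class PathText(HTMLParser):
    """Collect the text found under every tag path, e.g. 'html/body/div.content'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.texts = {}   # tag path -> list of text fragments under it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            # Credit the text to every open ancestor path.
            for depth in range(1, len(self.stack) + 1):
                path = "/".join(self.stack[:depth])
                self.texts.setdefault(path, []).append(data.strip())


def path_texts(html):
    parser = PathText()
    parser.feed(html)
    parser.close()
    return {p: " ".join(parts) for p, parts in parser.texts.items()}


def similarity(a, b):
    """Jaccard similarity between the word sets of two strings."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def best_path(html, entry_text):
    """Path whose text best matches the entry; ties go to the deepest path."""
    texts = path_texts(html)
    return max(texts, key=lambda p: (similarity(texts[p], entry_text), p.count("/")),
               default=None)


def learn_template(samples):
    """samples: (page_html, rss_entry_text) pairs; returns the most common best path."""
    votes = Counter(best_path(html, text) for html, text in samples)
    votes.pop(None, None)
    return votes.most_common(1)[0][0] if votes else None


def extract(html, template):
    return path_texts(html).get(template, "")


def template_changed(html, entry_text, template, threshold=0.3):
    """Assumed adaptation check: output no longer resembles the RSS entry text."""
    return similarity(extract(html, template), entry_text) < threshold


def page(body, content_class="content"):
    """Hypothetical page layout used only for this demonstration."""
    return (f'<html><body><div class="nav">Home News Sports</div>'
            f'<div class="{content_class}"><p>{body}</p></div>'
            f'<div class="footer">Copyright contact us</div></body></html>')


samples = [
    (page("Rain expected across the region tomorrow"), "Rain expected across the region tomorrow"),
    (page("Local team wins the cup final"), "Local team wins the cup final"),
]
template = learn_template(samples)
print(extract(page("New library opens downtown"), template))
```

In this sketch, two sample entries are enough to vote for a path, which is then reused on a page that was not part of training. Calling `template_changed` on each newly published entry would be one way to trigger retraining when a site redesign moves the content to a different path.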

