An RSS-Level Web Page Main Content Extraction Method and System


Abstract: This paper proposes an RSS-level method and system for extracting the main content of web pages. The method uses the metadata of a small number of entries in an RSS feed to train a main-content template, and then applies that template to extract the main content of every web page under the feed. It can extract a page's title, body, and category separately. It also includes a self-adaptation mechanism that detects template changes in real time. Experimental results show that the method and system achieve high recall and precision.
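The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, under stated assumptions: a page's text nodes are indexed by their tag path, the RSS entry title is matched exactly against a text node to learn a path "template", and that path is reused on other pages of the same feed. The names `PathTextCollector`, `learn_template`, and `extract` are hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # labels of currently open elements
        self.texts = []   # list of (path, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        elif classes:
            label += "." + classes[0]
        self.stack.append(label)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def collect(html):
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def learn_template(html, rss_title):
    """Return the tag path of the first text node equal to the RSS entry title."""
    wanted = rss_title.strip()
    for path, text in collect(html):
        if text == wanted:
            return path
    return None  # assumption: no exact match means no template can be learned


def extract(html, path):
    """Return the text found at the learned path, or "" if the path is absent."""
    return " ".join(text for p, text in collect(html) if p == path)
```

In this sketch, an empty result from `extract` on a new page can serve as a crude signal that the site's layout has changed and the template should be relearned from fresh RSS entries; how the paper's self-adaptation mechanism actually detects changes is not stated in the abstract.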

Author: Zhang Yan

Affiliation: Library, Nanjing University of Information Science and Technology

Source: Library and Information Service (《图书情报工作》; indexed in CSSCI and the Peking University Core Journals list), 2010, No. 14, pp. 107-110, 130 (5 pages in total)

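Funding: This work is one of the research outcomes of the Nanjing University of Information Science and Technology research fund project "Research and Implementation of a Digital Library Based on the Semantic Web" (project no. SK20080153).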

Keywords: web page main content extraction; RSS; template; self-adaptation mechanism

Classification number: TP393.092 [Automation and Computer Technology - Computer Application Technology]


References (10)

[1] Bar-Yossef Z, Rajagopalan S. Template detection via data mining and its application//Proceedings of the 11th Conference on WWW. Hawaii: ACM, 2002: 580-591.
[2] Gibson D, Punera K, Tomkins A. The volume and evolution of web page templates//Proceedings of the 14th Conference on WWW. New York: ACM, 2005: 830-839.
[3] Cai Deng, He Xiaofei, Wen Jirong. Block-level link analysis//Proceedings of the 27th Annual International ACM-SIGIR. New York: ACM, 2004: 440-447.
[4] Yi Lan, Liu Bing. Web page cleaning for web mining through feature weighting//Proceedings of the 18th International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers, 2003: 43-50.
[5] Rupesh R, Madaan A. Web page sectioning using regex-based template//Proceedings of the 17th Conference on WWW. New York: ACM, 2008: 1151-1152.
[6] Cao Donglin, Liao Xiangwen, Xu Hongbo, Bai Shuo. A blog post and comment extraction model based on the information quantity of web page format [J]. Journal of Software, 2009, 20(5): 1282-1291. (in Chinese)
[7] Debnath S, Mitra P, Pal N. Automatic identification of informative sections of Web-pages//IEEE Transactions on Knowledge and Data Engineering. Piscataway: IEEE Educational Activities Department, 2005: 1233-1246.
[8] Chakrabarti D, Kumar R, Punera K. Page-level template detection via isotonic smoothing//Proceedings of the 16th Conference on WWW. New York: ACM, 2007: 61-70.
[9] Berkman Center. RSS 2.0 Specification. [2010-04-18]. http://cyber.law.harvard.edu/rss/rss.html.
[10] Li Q C, Li Y M. Extracting content from web pages based on RSS//Proceedings of the International Conference on Computer Science and Software Engineering. Washington: IEEE Computer Society, 2008: 218-221.


