Deep directional collection of Web data (Web数据的深度定向采集)

Cited by: 1

Abstract: By simulating how people browse Web pages, the method extracts a set of directional crawling sub-pages that restricts the direction in which the crawler may proceed. It introduces inheritance relationships between pages and uses property inheritance between crawl entries to associate the data of compound objects that span several pages. A general crawl process that supports deep directional collection is then designed and implemented. Public-opinion collection experiments on hot posts from the Tianya site show that the method can collect complex objects without changing the overall processing flow, and that it achieves high collection efficiency.
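The two ideas in the abstract, restricting the crawl direction to extracted sub-pages and passing properties from a parent crawl entry to its children, can be illustrated with a minimal sketch. It is not the author's implementation: the in-memory `SITE`, the URL patterns in `RULES`, and the function names are all hypothetical, and a real crawler would fetch and parse pages rather than read a dictionary.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> (fields on the page, outgoing links).
SITE = {
    "/hot": ({}, ["/post/1", "/post/2", "/ads/9"]),
    "/post/1": ({"title": "First post"}, ["/post/1/page/2", "/user/7"]),
    "/post/1/page/2": ({"reply": "A reply on page 2"}, []),
    "/post/2": ({"title": "Second post"}, []),
    "/ads/9": ({"title": "Advert"}, []),
    "/user/7": ({"name": "someone"}, []),
}

# Hypothetical direction rules: on a page whose URL matches the first
# pattern, only links matching one of the listed patterns are followed.
RULES = [
    (r"/hot", [r"/post/\d+"]),
    (r"/post/\d+", [r"/post/\d+/page/\d+"]),
]


def allowed_links(url, links):
    """Return the links of `url` that the direction rules permit."""
    for page_pattern, child_patterns in RULES:
        if re.fullmatch(page_pattern, url):
            return [
                link for link in links
                if any(re.fullmatch(p, link) for p in child_patterns)
            ]
    return []  # pages without a rule are leaves of the directed crawl


def crawl(seed):
    """Breadth-first directed crawl; children inherit the parent's properties."""
    records = []
    queue = deque([(seed, {})])  # (url, properties inherited from the parent)
    seen = {seed}
    while queue:
        url, inherited = queue.popleft()
        fields, links = SITE[url]
        props = {**inherited, **fields}  # the page's own fields override inherited ones
        if fields:
            records.append({"url": url, **props})
        for link in allowed_links(url, links):
            if link not in seen:
                seen.add(link)
                queue.append((link, props))
    return records
```

Calling `crawl("/hot")` in this sketch visits only the post pages and their continuation pages. The record for `/post/1/page/2` carries the `title` inherited from `/post/1`, which is how a reply on a later page stays associated with its thread, while `/ads/9` and `/user/7` are never visited.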

Author: XIA Tian (夏天)

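Affiliation: Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education; School of Information Resource Management, Renmin University of China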

Source: Journal of Shandong University (Natural Science) (《山东大学学报（理学版）》), 2011, No. 5, pp. 34-38 (5 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.

Funding: National Social Science Fund of China (09CTQ027)

Keywords: deep collection; directional web crawler; public web opinion

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]; G350 [Culture and Science: Information Science]

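References (12)

1. LIU Bing. Web Data Mining [M]. Beijing: Tsinghua University Press, 2009.
2. QIAN Aibing. A topic-based network public opinion analysis model and its implementation [J]. New Technology of Library and Information Service, 2008(4): 49-55. Cited by: 72
3. WANG Wei, XU Xin. Clustering-based discovery and analysis of network public opinion hotspots [J]. New Technology of Library and Information Service, 2009(3): 74-79. Cited by: 62
4. Cho Junghoo, Hector Garcia-Molina, Lawrence Page. Efficient crawling through URL ordering [J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 161-172.
5. Soumen Chakrabarti, Martin van den Berg, Byron Dom. Focused crawling: a new approach to topic-specific Web resource discovery [J]. Computer Networks, 1999, 31(11-16): 1623-1640.
6. GONG Jin, HU Changjun, ZENG Guangping. Design and implementation of a directional Internet information collection system [J]. Journal of Computer Applications, 2007, 27(B06): 16-17. Cited by: 7
7. XU Jian, ZHANG Zhixiong. A Nutch-based directional collection system for Web sites [J]. New Technology of Library and Information Service, 2009(4): 1-6. Cited by: 10
8. ZHANG Xialiang, CHEN Jiajun. Web page main-text extraction based on logical lines and maximum admissible distance [J]. Computer Engineering and Applications, 2009, 45(25): 125-128. Cited by: 5
9. WANG Li, LIU Zongtian, WANG Yanhua, LIAO Tao. Web page main-text extraction based on content similarity [J]. Computer Engineering, 2010, 36(6): 102-104. Cited by: 20
10. XIA Tian. Extracting multi-records from Web pages [C]//Proceedings of the 4th International Conference on Semantics, Knowledge and Grid (SKG 08). Washington: IEEE Computer Society, 2008: 396-399.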