
Research on Incremental Collection for Internet Website Archiving (互联网网站存档增量采集研究)
Abstract: As the Internet has spread, the volume of archived website data has grown rapidly every year, to the point where server storage space, operating load, and network bandwidth can no longer keep pace with the growth in collection volume. Incremental collection, in which the collection system filters out documents that are duplicated within a collection cycle, is therefore key to solving these problems. This paper first discusses strategies and tools for incremental collection, then selects suitable tools according to the collection strategy and runs actual crawls to verify the effect of incremental collection. Incremental collection for Internet website archiving is implemented by adding auxiliary tools to the collection system, and the collected results are analyzed and discussed. The approach meets the goals of reducing server operating load, reducing network bandwidth usage, reducing the storage space required by the website archive, and improving the display quality of the collected resources.
Author: YANG YunPeng (杨云鹏), National Library of China, Beijing 100081, China
Affiliation: National Library of China (国家图书馆)
Source: Digital Library Forum (《数字图书馆论坛》), CSSCI, 2020, No. 12, pp. 17-21 (5 pages)
Keywords: Internet website archiving; incremental collection; collection strategy; web crawling
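
The abstract describes incremental collection as filtering out documents that repeat within a collection cycle, but this record does not say how the paper's tools detect repeats. The sketch below illustrates one common way to do it, comparing a content digest against the copy stored in the previous cycle. It is an assumption for illustration, not the author's implementation: the names `DigestIndex` and `crawl_cycle`, the use of SHA-1, and the example URLs are all hypothetical.

```python
import hashlib


class DigestIndex:
    """Remembers the content digest last stored for each URL (hypothetical helper)."""

    def __init__(self):
        self._seen = {}  # url -> hex digest of the last stored payload

    def is_duplicate(self, url, payload):
        """Return True if payload is byte-identical to the stored copy of url."""
        digest = hashlib.sha1(payload).hexdigest()  # SHA-1 is an assumed choice
        if self._seen.get(url) == digest:
            return True
        # New or changed document: remember its digest for the next cycle.
        self._seen[url] = digest
        return False


def crawl_cycle(index, fetched):
    """Return only the (url, payload) pairs that changed since the last cycle."""
    return [(url, body) for url, body in fetched if not index.is_duplicate(url, body)]


index = DigestIndex()
first = crawl_cycle(index, [
    ("http://example.org/", b"v1"),
    ("http://example.org/a", b"page a"),
])
second = crawl_cycle(index, [
    ("http://example.org/", b"v2"),       # changed since the first cycle
    ("http://example.org/a", b"page a"),  # unchanged, so it is filtered out
])
print(len(first), len(second))  # prints "2 1"
```

Digest comparison of this kind only saves storage, because each document still has to be downloaded before it can be hashed. Saving bandwidth as well would need a check made before the download, for example an HTTP conditional request.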
