
The Determination of the Earliest News Reporting Time on the Web
Abstract: Accurately determining the earliest time at which a news report was published on the Web is of considerable value to computer-aided social science research. Data show that for 40% of news reports the publication time cannot be extracted directly from the Web page. In those cases, estimating the publication time solely from the crawl time and the last-modified time (LMT) supplied in the HTTP header can produce large errors. Two methods for improving accuracy are proposed: link analysis and copy (replica) analysis. Large-scale experiments show that both methods have a very low probability of error and are practical. Link analysis reduces the estimation error to some extent, while copy analysis plays the decisive role: when multiple copies (reposts) of a report can be found on the Web, the earliest publication time of the report can be inferred accurately with high probability.
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, 2009, No. 1, pp. 51-59 (9 pages)
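Funding: National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program); Guangdong Provincial Key Laboratory Fund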
Keywords: publication time detection; Web information mining; Web page content analysis; text deduplication (replica detection)
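
The idea behind copy analysis, as the abstract describes it, can be illustrated with a minimal sketch. This is not the authors' implementation: the class `PageEvidence`, its field names, and the function `estimate_earliest` are hypothetical, and the sketch assumes that every timestamp available for a copy of a report (crawl time, HTTP LMT, or a time extracted from the page) is an upper bound on when the report first appeared online. Under that assumption, the earliest timestamp across all known copies is the tightest available estimate.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class PageEvidence:
    """Timestamps known for one crawled copy of a report (hypothetical structure)."""
    url: str
    crawl_time: datetime                        # when the crawler fetched the page
    last_modified: Optional[datetime] = None    # HTTP Last-Modified header, if present
    extracted_time: Optional[datetime] = None   # publication time parsed from the page, if any

    def upper_bound(self) -> datetime:
        """Earliest timestamp available for this copy.

        Assumption: each available timestamp is no earlier than the moment
        the report first went online, so the minimum is the tightest bound.
        """
        candidates = [self.crawl_time]
        if self.last_modified is not None:
            candidates.append(self.last_modified)
        if self.extracted_time is not None:
            candidates.append(self.extracted_time)
        return min(candidates)


def estimate_earliest(target: PageEvidence,
                      replicas: Iterable[PageEvidence] = ()) -> datetime:
    """Estimate the earliest publication time of a report.

    With no replicas this falls back to the page's own crawl time and LMT;
    each additional copy can only tighten (lower) the estimate.
    """
    return min(page.upper_bound() for page in [target, *replicas])
```

With no replicas, the estimate is only as good as the page's own crawl time and LMT. Each additional copy found on the Web can only move the estimate earlier, which is consistent with the abstract's observation that reports with many reposts can be dated most reliably.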