Abstract
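Accurately extracting the earliest publication time of online news reports is of significant value for computer-aided social science research. Data show that for 40% of reports the publication time cannot be extracted directly from the Web page. In such cases, estimating the publication time solely from the crawl time and the last-modified time of the page file supplied by HTTP leads to large errors. Two methods that improve accuracy are proposed: link analysis and replica (copy) analysis. Large-scale experiments show that both methods have a very low probability of error and are practical. Link analysis reduces the estimation error to some extent, while replica analysis plays the decisive role. When multiple copies (reprints) of a report can be found on the Web, the earliest time at which the report was published on the Web can be inferred accurately with high probability.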
Determining the earliest time at which an event is reported on the Web is of particular use for computer-aided social science research. Statistics show that 40% of Web pages carry no evidence of publication time in their contents. In those cases, the crawling time or the LMT (last-modified time) from the HTTP header is often far from the real publication time. Two methods for achieving better accuracy are therefore proposed: the first is based on link analysis and the second on replica analysis. Experiments show that combining these two methods often yields quite accurate results.
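The replica idea can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not specify. It assumes only that each copy of a report yields an upper bound on when the report first appeared (a date stated on the page if there is one, otherwise the earlier of the crawl time and the HTTP last-modified time), and that the earliest bound across all copies is the best estimate. The names `PageCopy`, `upper_bound`, and `earliest_publication_estimate` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class PageCopy:
    """One copy (reprint) of a report; field names are illustrative assumptions."""
    crawl_time: datetime                       # when the crawler fetched the page
    last_modified: Optional[datetime] = None   # HTTP Last-Modified value, if present
    extracted_time: Optional[datetime] = None  # date stated in the page, if any


def upper_bound(copy: PageCopy) -> datetime:
    """Return a time by which this copy is assumed to have existed."""
    if copy.extracted_time is not None:
        # Assumption: a date stated on the page is trusted over header data.
        return copy.extracted_time
    candidates = [copy.crawl_time]
    if copy.last_modified is not None:
        candidates.append(copy.last_modified)
    return min(candidates)


def earliest_publication_estimate(copies: Iterable[PageCopy]) -> datetime:
    """Estimate the first publication time as the earliest bound over all copies."""
    bounds = [upper_bound(c) for c in copies]
    if not bounds:
        raise ValueError("at least one copy is required")
    return min(bounds)
```

With more copies, the chance that at least one carries a date close to the true first publication increases, which is consistent with the abstract's claim that replica analysis is decisive when several reprints exist.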
Source
《计算机科学与探索》
CSCD
2009, Issue 1, pp. 51-59 (9 pages)
Journal of Frontiers of Computer Science and Technology
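Funding
National Natural Science Foundation of China
National High Technology Research and Development Program of China (863 Program)
Guangdong Provincial Key Laboratory Fund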