Funding: Supported by the National Natural Science Foundation of China (No. 60673001) and the State Key Development Program of Basic Research of China (No. 2004CB318203).
Abstract: Where web latency comes from is a widely discussed question. In this paper, we propose a novel chunk-level latency dependence model to better characterize web latency. Based on the fact that web content is delivered as a sequence of chunks, and that clients care most about whole-page retrieval latency, this paper presents a detailed study of how the chunk sequence and inter-chunk relations affect web retrieval latency. A series of thorough experiments is conducted and the resulting data are analyzed. The results are useful for further study of how to reduce web latency.
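To make the chunk-level dependence idea concrete, the following is a minimal Python sketch, not the paper's actual model: chunks and their dependencies form a DAG, a chunk can start only after the chunks it depends on have arrived, and whole-page retrieval latency is the critical path through that graph. The chunk names, latencies, and dependency structure below are hypothetical.

# Illustrative sketch (assumed model): page latency from a chunk dependency DAG.
from functools import lru_cache

# Hypothetical per-chunk transfer latencies in milliseconds.
chunk_latency = {"html": 40, "css": 25, "js": 60, "image": 80}

# Hypothetical dependencies: chunk -> chunks that must arrive first.
depends_on = {"html": [], "css": ["html"], "js": ["html"], "image": ["css", "js"]}

@lru_cache(maxsize=None)
def finish_time(chunk: str) -> int:
    """Earliest time at which `chunk` is fully retrieved."""
    start = max((finish_time(dep) for dep in depends_on[chunk]), default=0)
    return start + chunk_latency[chunk]

# Whole-page retrieval latency = latest-finishing chunk (critical path).
page_latency = max(finish_time(c) for c in chunk_latency)
print(f"Estimated page retrieval latency: {page_latency} ms")  # 180 ms

Under these assumed numbers, shortening an individual chunk only reduces page latency if that chunk lies on the critical path, which is the kind of dependence effect the abstract's model studies.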
Abstract: Most existing data deduplication techniques are based on content-defined chunking (CDC) algorithms and do not take into account the content characteristics of different file types. Such methods determine chunk boundaries in an essentially random manner and apply them uniformly to all file types; they have proven well suited to text and simple content, but not to compound files composed of unstructured data. This paper analyzes the properties of compound files conforming to the OpenXML standard, presents a basic method for object extraction, and proposes an algorithm for determining deduplication granularity based on object distribution and object structure. The goal is, for compound files composed of unstructured data, to effectively detect identical objects across different files and at different positions within the same file, and to deduplicate effectively even when a file's physical layout changes. Simulation experiments on typical unstructured data sets show that, overall, object-level deduplication improves the deduplication ratio for unstructured data by about 10% compared with the CDC approach.
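The object-extraction step can be illustrated with a short Python sketch. OpenXML compound files (.docx, .pptx, .xlsx) are ZIP containers whose entries ("parts") correspond to embedded objects such as images and XML streams; hashing each part allows identical objects to be detected across files even when the files' physical layouts differ. This is only an assumed illustration of object-level fingerprinting; the paper's granularity-selection algorithm based on object distribution and structure is not reproduced, and the function and file names are hypothetical.

# Minimal sketch: fingerprint OpenXML parts to find duplicate objects.
import hashlib
import zipfile

def object_fingerprints(path: str) -> dict:
    """Map each OpenXML part name to the SHA-1 digest of its content."""
    digests = {}
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            digests[name] = hashlib.sha1(zf.read(name)).hexdigest()
    return digests

def shared_objects(path_a: str, path_b: str) -> set:
    """Digests of objects appearing in both files (deduplication candidates)."""
    a = set(object_fingerprints(path_a).values())
    b = set(object_fingerprints(path_b).values())
    return a & b

# Example usage with hypothetical file names:
# common = shared_objects("report_v1.docx", "report_v2.docx")
# print(f"{len(common)} duplicate objects detected")

Because the fingerprints are computed per object rather than per byte range, the matches survive changes to the container's physical layout (e.g., parts being reordered or re-compressed), which is where byte-oriented CDC boundaries tend to break.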