Funding: Supported by the National Natural Science Foundation of China (No. 60673001) and the State Key Development Program of Basic Research of China (No. 2004CB318203).
Abstract: Based on variable-sized chunking, this paper proposes a content-aware chunking scheme, called CAC, that does not assume fully random file contents but considers the characteristics of the file types. CAC uses a candidate anchor histogram and file-type-specific knowledge to refine how anchors are determined when performing deduplication of file data, and it enforces the selected average chunk size. CAC finds more duplicate chunks, which in turn yields smaller average chunks and a better reduction in data. We present a detailed evaluation of CAC; the experimental results show that the scheme improves the compression ratio of chunking for file types whose bytes are not randomly distributed (by 11.3% to 16.7% depending on the dataset) and improves write throughput by 9.7% on average.
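For context, below is a minimal Python sketch of the baseline variable-sized (content-defined) chunking that CAC builds on. The gear-style rolling hash, 13-bit mask, and size bounds are illustrative assumptions rather than the paper's parameters, and the sketch omits CAC's refinement, in which a candidate anchor histogram and file-type knowledge decide which candidate anchors are accepted.

```python
import random

# Gear table: one pseudo-random 64-bit value per byte value, seeded only
# for reproducibility. Table, mask width, and size bounds are illustrative.
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1  # 13 low bits -> anchors fire every ~8 KiB on average


def chunk_boundaries(data: bytes, min_size: int = 2048,
                     max_size: int = 65536) -> list[int]:
    """Split data at content-defined anchors: a gear rolling hash is
    updated per byte, and a boundary is declared when the hash's low
    bits are all zero (or the chunk reaches max_size)."""
    boundaries = []
    h, start = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF  # roll the hash
        if i + 1 - start < min_size:
            continue  # enforce the minimum chunk size
        if (h & MASK) == 0 or i + 1 - start >= max_size:
            boundaries.append(i + 1)  # anchor found: close the chunk here
            start, h = i + 1, 0
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries
```

Deduplication then fingerprints each chunk (e.g., with a cryptographic hash) and stores only chunks whose fingerprints have not been seen before; smaller average chunks expose more duplicates at the cost of more chunk metadata, which is why controlling the average chunk size matters.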
Funding: Supported in part by a grant from the Research Grants Council of Hong Kong and by the National Natural Science Foundation of China (Grant No. 11101157).
Abstract: Linear mixed models are widely used to fit continuous longitudinal data, and the random effects are commonly assumed to be normally distributed. This assumption needs to be tested, however, before further analysis can proceed reliably. In this paper, we consider the Baringhaus-Henze-Epps-Pulley (BHEP) tests, which are based on an empirical characteristic function. Unlike the classical setting, we check normality for the random effects, which are unobservable, so the test must be based on their predictors. The test is consistent against global alternatives and is sensitive to local alternatives converging to the null at a rate arbitrarily close to 1/√n, where n is the sample size. Furthermore, to overcome the problem that the limiting null distribution of the test is not tractable, we suggest a new method: use a conditional Monte Carlo test (CMCT) to approximate the null distribution and then simulate p-values. The test is compared with existing methods, its power is examined, and several examples are given to illustrate the usefulness of our test in the analysis of longitudinal data.
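As a concrete illustration, the following Python sketch computes the BHEP statistic through its standard closed form and approximates a p-value by plain Monte Carlo. This is a simplification: the paper applies the test to predicted random effects and uses a conditional Monte Carlo test that conditions on the mixed-model structure, whereas the sketch tests a directly observed sample; `beta` is the usual smoothing parameter.

```python
import numpy as np


def bhep_statistic(x: np.ndarray, beta: float = 1.0) -> float:
    """BHEP statistic: a weighted L2 distance between the empirical
    characteristic function of the standardized sample and the standard
    normal one, evaluated through its closed form."""
    n, d = x.shape
    xc = x - x.mean(axis=0)
    # Whiten with the sample covariance so the statistic is affine invariant.
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    y = xc @ np.linalg.cholesky(np.linalg.inv(cov))
    d2_pairs = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    d2_self = (y ** 2).sum(axis=-1)
    b2 = beta ** 2
    t1 = np.exp(-0.5 * b2 * d2_pairs).sum() / n
    t2 = 2.0 * (1 + b2) ** (-d / 2) * np.exp(-b2 * d2_self / (2 * (1 + b2))).sum()
    t3 = n * (1 + 2 * b2) ** (-d / 2)
    return t1 - t2 + t3


def mc_pvalue(x: np.ndarray, n_sim: int = 999, beta: float = 1.0,
              seed: int = 0) -> float:
    """Approximate the p-value by recomputing the statistic on simulated
    standard normal samples of the same shape (the null distribution is
    not tractable in closed form)."""
    rng = np.random.default_rng(seed)
    t_obs = bhep_statistic(x, beta)
    t_sim = [bhep_statistic(rng.standard_normal(x.shape), beta)
             for _ in range(n_sim)]
    return (1 + sum(t >= t_obs for t in t_sim)) / (n_sim + 1)
```

Because the statistic is affine invariant (the data are whitened with the sample mean and covariance), simulating standard normal samples suffices for this simplified unconditional null; the paper's CMCT instead simulates conditionally to account for the predicted, rather than observed, random effects.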