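Real world research (RWR), as a complement to randomized controlled trials, is receiving increasing attention. Using high-quality real world data effectively to generate reliable real world evidence presents both opportunities and challenges. This paper summarizes and reviews recent research from two perspectives, data management and utilization, and techniques for obtaining evidence, in order to provide a reference for RWR and its applications.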
The parametric temporal data model captures a real world entity in a single tuple, which reduces query language complexity. Such a data model, however, is difficult to implement on top of conventional databases because its attribute sizes are not fixed. XML is a mature technology and can be an elegant solution to this challenge. Representing data in XML, however, raises a question about storage efficiency. The goal of this work is to provide a straightforward answer to that question. To this end, we compare three different storage models for the parametric temporal data model and show that XML is no worse than the other approaches. Furthermore, XML outperforms the other storage models under certain conditions. Our simulation results therefore indicate that the myth that XML storage is inefficient does not hold for the parametric temporal data model.
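To make the idea of one tuple per entity with variable-size attributes concrete, the following is a minimal sketch in Python. It assumes each attribute is a list of (start, end, value) pieces over half-open time intervals. The element names `tuple`, `attribute`, and `piece`, and the helper functions, are hypothetical illustrations and are not the schema or storage models evaluated in the paper.

```python
import xml.etree.ElementTree as ET


def entity_to_xml(key, attributes):
    """Build one <tuple> element holding the full history of every attribute.

    `attributes` maps an attribute name to a list of (start, end, value)
    triples, where each triple covers the half-open interval [start, end).
    """
    tup = ET.Element("tuple", {"key": key})
    for name, history in attributes.items():
        attr = ET.SubElement(tup, "attribute", {"name": name})
        for start, end, value in history:
            piece = ET.SubElement(
                attr, "piece", {"start": str(start), "end": str(end)}
            )
            piece.text = str(value)
    return tup


def value_at(tup, name, t):
    """Return the value (as a string) of attribute `name` at time t, or None."""
    for attr in tup.findall("attribute"):
        if attr.get("name") != name:
            continue
        for piece in attr.findall("piece"):
            if int(piece.get("start")) <= t < int(piece.get("end")):
                return piece.text
    return None


# One entity, two attributes with histories of different lengths (made-up data).
employee = entity_to_xml(
    "emp-1",
    {
        "salary": [(0, 5, 30000), (5, 9, 35000)],
        "dept": [(0, 9, "Toys")],
    },
)
xml_text = ET.tostring(employee, encoding="unicode")
```

Because every attribute carries its own list of pieces, the tuple grows or shrinks with the history of the entity, which is the property that is awkward to map onto fixed-width relational columns.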
In standard canonical correlation analysis (CCA), data from definite datasets are used to estimate their canonical correlation. In real applications, for example in bilingual text retrieval, there may be a large portion of data for which we do not know the set it belongs to. This part of the data is called unlabeled data, while the rest, drawn from definite datasets, is called labeled data. We propose a novel method called regularized canonical correlation analysis (RCCA), which makes use of both labeled and unlabeled samples. Specifically, we learn to approximate the canonical correlation as if all data were labeled. We then describe a generalization of RCCA to the multi-set situation. Experiments on four real world datasets, Yeast, Cloud, Iris, and Haberman, demonstrate that incorporating the unlabeled data points can improve the accuracy of the correlation coefficients by over 30%.
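For orientation, here is a minimal sketch of the standard (fully labeled) CCA baseline that the abstract starts from. It is not the RCCA method proposed in the paper. It assumes NumPy is available, that the rows of `X` and `Y` are paired samples, and that both centered matrices have full column rank. The function name `canonical_correlations` is a hypothetical choice, and the QR-plus-SVD route is one common way to compute the correlations.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Return the canonical correlations between two paired views.

    X has shape (n, p) and Y has shape (n, q); row i of X and row i of Y
    describe the same sample. Assumes n > max(p, q) and full column rank.
    """
    # Center each view so that inner products correspond to covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered views.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # which are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    # Guard against tiny floating-point excursions outside [0, 1].
    return np.clip(s, 0.0, 1.0)
```

With one column per view, the single value returned is the absolute Pearson correlation, which gives a quick sanity check on the sketch.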
Funding (parametric temporal data model paper): Supported by the National Research Foundation in Korea through contract N-12-NM-IR05.
Funding (regularized canonical correlation analysis paper): Project (No. 5959438) supported by Microsoft (China) Co., Ltd.