Funding: Funded in part by the Israeli Science Foundation under Grant No. 1227/09 and by a grant to Amichai Painsky from the Israeli Center for Absorption in Science.
Abstract: The availability of large microarray datasets has led to a growing interest in biclustering methods over the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclustering problem (also known as projected clustering) for gene expression, in which each row can be a member of only a single bicluster, while columns can participate in multiple biclusters. This type of biclustering may be appropriate, for example, for clustering groups of cancer patients, where each patient (row) is expected to carry only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). We present a novel method to identify these exclusive row biclusters in the spirit of the optimal set cover problem. We present our algorithmic solution as a combination of existing biclustering algorithms and combinatorial auction techniques. Furthermore, we devise an approach for tuning the threshold of our algorithm based on comparison with a null model, inspired by the Gap statistic approach. We demonstrate our approach on both synthetic and real-world gene expression data and show its power in identifying large-span submatrices with non-overlapping rows, while accounting for their unique nature.
Funding: Israel Science Foundation under Grant No. 1487/12 and a Returning Scientist Fellowship from the Israeli Ministry of Immigration to Amichai Painsky.
Abstract: Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber-based environment, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees and, at the same time, is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.