Identifying rare patterns for medical diagnosis is a challenging task due to heterogeneity and the volume of data.Data summarization can create a concise version of the original data that can be used for effective dia...Identifying rare patterns for medical diagnosis is a challenging task due to heterogeneity and the volume of data.Data summarization can create a concise version of the original data that can be used for effective diagnosis.In this paper,we propose an ensemble summarization method that combines clustering and sampling to create a summary of the original data to ensure the inclusion of rare patterns.To the best of our knowledge,there has been no such technique available to augment the performance of anomaly detection techniques and simultaneously increase the efficiency of medical diagnosis.The performance of popular anomaly detection algorithms increases significantly in terms of accuracy and computational complexity when the summaries are used.Therefore,the medical diagnosis becomes more effective,and our experimental results reflect that the combination of the proposed summarization scheme and all underlying algorithms used in this paper outperforms the most popular anomaly detection techniques.展开更多
In this paper,we propose a correlationaware probabilistic data summarization technique to efficiently analyze and visualize large-scale multi-block volume data generated by massively parallel scientific simulations.Th...In this paper,we propose a correlationaware probabilistic data summarization technique to efficiently analyze and visualize large-scale multi-block volume data generated by massively parallel scientific simulations.The core of our technique is correlation modeling of distribution representations of adjacent data blocks using copula functions and accurate data value estimation by combining numerical information,spatial location,and correlation distribution using Bayes’rule.This effectively preserves statistical properties without merging data blocks in different parallel computing nodes and repartitioning them,thus significantly reducing the computational cost.Furthermore,this enables reconstruction of the original data more accurately than existing methods.We demonstrate the effectiveness of our technique using six datasets,with the largest having one billion grid points.The experimental results show that our approach reduces the data storage cost by approximately one order of magnitude compared to state-of-the-art methods while providing a higher reconstruction accuracy at a lower computational cost.展开更多
文摘Identifying rare patterns for medical diagnosis is a challenging task due to heterogeneity and the volume of data.Data summarization can create a concise version of the original data that can be used for effective diagnosis.In this paper,we propose an ensemble summarization method that combines clustering and sampling to create a summary of the original data to ensure the inclusion of rare patterns.To the best of our knowledge,there has been no such technique available to augment the performance of anomaly detection techniques and simultaneously increase the efficiency of medical diagnosis.The performance of popular anomaly detection algorithms increases significantly in terms of accuracy and computational complexity when the summaries are used.Therefore,the medical diagnosis becomes more effective,and our experimental results reflect that the combination of the proposed summarization scheme and all underlying algorithms used in this paper outperforms the most popular anomaly detection techniques.
基金supported by the Chinese Postdoctoral Science Foundation(2021M700016).
文摘In this paper,we propose a correlationaware probabilistic data summarization technique to efficiently analyze and visualize large-scale multi-block volume data generated by massively parallel scientific simulations.The core of our technique is correlation modeling of distribution representations of adjacent data blocks using copula functions and accurate data value estimation by combining numerical information,spatial location,and correlation distribution using Bayes’rule.This effectively preserves statistical properties without merging data blocks in different parallel computing nodes and repartitioning them,thus significantly reducing the computational cost.Furthermore,this enables reconstruction of the original data more accurately than existing methods.We demonstrate the effectiveness of our technique using six datasets,with the largest having one billion grid points.The experimental results show that our approach reduces the data storage cost by approximately one order of magnitude compared to state-of-the-art methods while providing a higher reconstruction accuracy at a lower computational cost.