13 articles found
Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning Techniques
1
Authors: 刘鹏举, 李翠平, 陈红. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2024, No. 2, pp. 346-368, 23 pages
Data partitioning techniques are pivotal for optimal data placement across storage devices, thereby enhancing resource utilization and overall system throughput. However, the design of effective partition schemes faces multiple challenges, including considerations of the cluster environment, storage device characteristics, optimization objectives, and the balance between partition quality and computational efficiency. Furthermore, dynamic environments necessitate robust partition detection mechanisms. This paper presents a comprehensive survey structured around partition deployment environments, outlining the distinguishing features and applicability of various partitioning strategies while delving into how these challenges are addressed. We discuss partitioning features pertaining to database schema, table data, workload, and runtime metrics. We then delve into the partition generation process, segmenting it into initialization and optimization stages. A comparative analysis of partition generation and update algorithms is provided, emphasizing their suitability for different scenarios and optimization objectives. Additionally, we illustrate the applications of partitioning in prevalent database products and suggest potential future research directions and solutions. This survey aims to foster the implementation, deployment, and updating of high-quality partitions for specific system scenarios.
Keywords: data partitioning; survey; partitioning feature; partition generation; partition update
A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis (cited by: 13)
2
Authors: Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum, Tamer Z. Emara, Kuanishbay Sadatdiynov. Big Data Mining and Analytics, 2020, No. 2, pp. 85-101, 17 pages
Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
Keywords: big data analysis; data partitioning; data sampling; distributed and parallel computing; approximate computing
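Illustration: the three classical horizontal partitioning schemes named in this abstract (range, hash, and random) can be sketched in a few lines of Python. This is a minimal sketch and is not taken from the surveyed paper; the function names, the use of CRC32 as the hash function, and the fixed random seed are all assumptions made for the example.

```python
import bisect
import random
import zlib


def range_partition(keys, boundaries):
    """Range partitioning: `boundaries` is a sorted list of split points.
    Partition i receives keys k with boundaries[i-1] <= k < boundaries[i]."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        parts[bisect.bisect_right(boundaries, k)].append(k)
    return parts


def hash_partition(keys, n):
    """Hash partitioning: a key always maps to the same one of n partitions.
    CRC32 is an arbitrary deterministic hash chosen for this sketch."""
    parts = [[] for _ in range(n)]
    for k in keys:
        parts[zlib.crc32(str(k).encode("utf-8")) % n].append(k)
    return parts


def random_partition(keys, n, seed=0):
    """Random partitioning: each record goes to a uniformly chosen partition."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    parts = [[] for _ in range(n)]
    for k in keys:
        parts[rng.randrange(n)].append(k)
    return parts
```

Range partitioning keeps neighbouring keys together, which suits range scans. Hash partitioning spreads distinct keys evenly while keeping equal keys together. Random partitioning ignores key values entirely.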
Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering for Noisy Data
3
Authors: Pham Huy Thong, Florentin Smarandache, Phung The Huan, Tran Manh Tuan, Tran Thi Ngan, Vu Duc Thai, Nguyen Long Giang, Le Hoang Son. Computer Systems Science & Engineering (SCIE, EI), 2023, No. 8, pp. 1981-1997, 17 pages
Clustering is a crucial method for deciphering data structure and producing new information. Due to its significance in revealing fundamental connections between the human brain and events, it is essential to utilize clustering for cognitive research. Dealing with noisy data caused by inaccurate synthesis from several sources or misleading data production processes is one of the most intriguing clustering difficulties. Noisy data can lead to incorrect object recognition and inference. This research aims to develop a novel clustering approach, named Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM), to solve the clustering problem with noisy data using neutral and refusal degrees in the definition of Picture Fuzzy Set (PFS) and Neutrosophic Set (NS). Our contribution is to propose a new optimization model with four essential components: clustering, outlier removal, safe semi-supervised fuzzy clustering, and partitioning with labeled and unlabeled data. The effectiveness and flexibility of the proposed technique are estimated and compared with the state-of-the-art methods, standard Picture fuzzy clustering (FC-PFS) and Confidence-weighted safe semi-supervised clustering (CS3FCM), on benchmark UCI datasets. The experimental results show that our method outperforms the compared methods on at least 10 of 15 datasets in terms of clustering quality and computational time.
Keywords: safe semi-supervised fuzzy clustering; picture fuzzy set; neutrosophic set; data partition with noises; fuzzy clustering
Hybrid Graph Partitioning with OLB Approach in Distributed Transactions
4
Authors: Rajesh Bharati, Vahida Attar. Intelligent Automation & Soft Computing (SCIE), 2023, No. 7, pp. 763-775, 13 pages
Online Transaction Processing (OLTP) relies on data partitioning to achieve better performance and scalability. The primary objective of database and application developers is to provide scalable and reliable database systems. This research presents a novel method for data partitioning and load balancing for scalable transactions. Data is efficiently partitioned using the hybrid graph partitioning method. An optimized load balancing (OLB) approach is applied to calculate the weight factor, average workload, and partition efficiency. The presented approach is appropriate for various online data transaction applications. The quality of the proposed approach is examined using an OLTP database benchmark. The proposed methodology shows significantly better performance on metrics such as throughput, response time, and CPU utilization.
Keywords: data partitioning; scalability; optimization; throughput
ADC-DL:Communication-Efficient Distributed Learning with Hierarchical Clustering and Adaptive Dataset Condensation
5
Authors: Zhipeng Gao, Yan Yang, Chen Zhao, Zijia Mo. China Communications (SCIE, CSCD), 2022, No. 12, pp. 73-85, 13 pages
The rapid growth of modern mobile devices leads to a large amount of distributed data, which is extremely valuable for learning models. Unfortunately, model training by collecting all these original data on a centralized cloud server is not applicable due to data privacy and communication cost concerns, hindering artificial intelligence from empowering mobile devices. Moreover, these data are not independently and identically distributed (Non-IID) because of their different contexts, which deteriorates the performance of the model. To address these issues, we propose a novel Distributed Learning algorithm based on hierarchical clustering and Adaptive Dataset Condensation, named ADC-DL, which learns a shared model by collecting the synthetic samples generated on each device. To tackle the heterogeneity of data distribution, we propose an entropy-TOPSIS comprehensive tiering model for hierarchical clustering, which distinguishes clients in terms of their data characteristics. Subsequently, synthetic dummy samples are generated based on the hierarchical structure utilizing adaptive dataset condensation. The procedure of dataset condensation can be adjusted adaptively according to the tier of the client. Extensive experiments demonstrate that ADC-DL outperforms existing algorithms in prediction accuracy and communication costs.
Keywords: distributed learning; Non-IID data partition; hierarchical clustering; adaptive dataset condensation
Multidimensional Data Querying on Tree-Structured Overlay
6
Authors: XU Lizhen, WANG Shiyuan. Wuhan University Journal of Natural Sciences (CAS), 2006, No. 5, pp. 1367-1372, 6 pages
Multidimensional data querying has been gaining much interest in the database research community in recent years, yet many of the existing studies focus mainly on centralized systems. A solution to querying in a Peer-to-Peer (P2P) environment is proposed to achieve both a low processing cost, in terms of the number of peers accessed and search messages, and balanced query loads among peers. The system is based on a balanced tree-structured P2P network. By partitioning the query space intelligently, the amount of query forwarding is effectively controlled, and the number of peers involved and search messages are also limited. Dynamic load balancing can be achieved during space partitioning and query resolving. Extensive experiments confirm the effectiveness and scalability of our algorithms on P2P networks.
Keywords: range query; skyline query; P2P indexing; multi-dimensional data partition
Multi-authority proxy re-encryption based on CPABE for cloud storage systems (cited by: 7)
7
Authors: Xiaolong Xu, Jinglan Zhou, Xinheng Wang, Yun Zhang. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2016, No. 1, pp. 211-223, 13 pages
The dissociation between data management and data ownership makes it difficult to protect data security and privacy in cloud storage systems. Traditional encryption technologies are not suitable for data protection in cloud storage systems. A novel multi-authority proxy re-encryption mechanism based on ciphertext-policy attribute-based encryption (MPRE-CPABE) is proposed for cloud storage systems. MPRE-CPABE requires the data owner to split each file into two blocks, one big block and one small block. The small block is used as the private key to encrypt the big one, and the encrypted big block is then uploaded to the cloud storage system. Even if the uploaded big block of a file is stolen, illegal users cannot easily get the complete information of the file. Ciphertext-policy attribute-based encryption (CPABE) is often criticized for its heavy overhead and security issues when distributing keys or revoking users' access rights. MPRE-CPABE applies CPABE to the multi-authority cloud storage system and solves the above issues. The weighted access structure (WAS) is proposed to support a variety of fine-grained threshold access control policies in multi-authority environments and to reduce the computational cost of key distribution. Meanwhile, MPRE-CPABE uses proxy re-encryption to reduce the computational cost of access revocation. Experiments are conducted on the Ubuntu and CloudSim platforms. Experimental results show that MPRE-CPABE can greatly reduce the computational cost of generating key components and revoking users' access rights. MPRE-CPABE is also proved secure under the decisional bilinear Diffie-Hellman (DBDH) security model.
Keywords: cloud storage; data partition; multi-authority; security; proxy re-encryption; attribute-based encryption (ABE)
An Efficient Multidimensional Fusion Algorithm for IoT Data Based on Partitioning (cited by: 3)
8
Authors: Jin Zhou, Liang Hu, Feng Wang, Huimin Lu, Kuo Zhao. Tsinghua Science and Technology (SCIE, EI, CAS), 2013, No. 4, pp. 369-378, 10 pages
The Internet of Things (IoT) implies a worldwide network of interconnected objects that are uniquely addressable via standard communication protocols. The prevalence of IoT is bound to generate large amounts of multisource, heterogeneous, dynamic, and sparse data. However, IoT offers inconsequential practical benefits without the ability to integrate, fuse, and glean useful information from such massive amounts of data. Accordingly, to prepare us for the imminent invasion of things, a tool called data fusion can be used to manipulate and manage such data in order to improve process efficiency and provide advanced intelligence. In order to determine an acceptable quality of intelligence, diverse and voluminous data have to be combined and fused. Therefore, it is imperative to improve the computational efficiency of fusing and mining multidimensional data. In this paper, we propose an efficient multidimensional fusion algorithm for IoT data based on partitioning. The basic concept involves the partitioning of dimensions (attributes), i.e., a big data set with higher dimensions can be transformed into a certain number of relatively smaller data subsets that can be easily processed. Then, based on the partitioning of dimensions, the discernibility matrices of all data subsets in rough set theory are computed to obtain their core attribute sets. Furthermore, a global core attribute set can be determined. Finally, the attribute reduction and rule extraction methods are used to obtain the fusion results. By means of proving a few theorems and simulation, the correctness and effectiveness of this algorithm are illustrated.
Keywords: Internet of Things; data fusion; multidimensional data; partitioning; rough set theory
Unequal decoding power allocation for efficient video transmission
9
Authors: 王永芳, 余松煜, 杨小康, 张兆杨. Journal of Shanghai University (English Edition), 2010, No. 1, pp. 60-65, 6 pages
We present an unequal decoding power allocation (UDPA) approach for minimizing receiver power consumption subject to a given quality of service (QoS), by exploiting data partitioning and turbo decoding. We assign unequal forward error correction (FEC) decoding power to data partitions with different priorities by jointly considering source coding, channel coding, and receiver power consumption. The proposed scheme is applied to H.264 video over an additive white Gaussian noise (AWGN) channel. It achieves an excellent tradeoff between video delivery quality and power consumption, and yields significant power savings compared with the conventional equal decoding power allocation (EDPA) approach in wireless video transmission.
Keywords: power allocation; turbo code; data partition; joint source and channel code
A Partition Checkpoint Strategy Based on Data Segment Priority
10
Authors: LIANG Ping, LIU Yunsheng. Wuhan University Journal of Natural Sciences (CAS), 2012, No. 2, pp. 109-113, 5 pages
A partition checkpoint strategy based on data segment priority is presented to meet the timing constraints of the data and the transactions in embedded real-time main memory database systems (ERTMMDBS), as well as to reduce the number of transactions missing their deadlines and the recovery time. The partition checkpoint strategy takes into account the characteristics of the data and the transactions associated with it. Moreover, it partitions the database according to the data segment priority and sets a corresponding checkpoint frequency for each partition so that each partition is checkpointed independently. The simulation results show that the partition checkpoint strategy decreases the ratio of transactions missing their deadlines.
Keywords: embedded real-time main memory database systems; database recovery; partition checkpoint; data segment priority
Approaches for Scaling DBSCAN Algorithm to Large Spatial Databases (cited by: 11)
11
Authors: 周傲英, 周水庚, 曹晶, 范晔, 胡运发. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2000, No. 6, pp. 509-526, 18 pages
The huge amount of information stored in databases owned by corporations (e.g., retail, financial, telecom) has spurred tremendous interest in the area of knowledge discovery and data mining. Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, image processing, and other business applications. Although researchers have been working on clustering algorithms for decades, and many clustering algorithms have been developed, there is still no efficient algorithm for clustering very large databases and high-dimensional data. As an outstanding representative of clustering algorithms, the DBSCAN algorithm shows good performance in spatial data clustering. However, for large spatial databases, DBSCAN requires a large volume of memory and could incur substantial I/O costs because it operates directly on the entire database. In this paper, several approaches are proposed to scale the DBSCAN algorithm to large spatial databases. To begin with, a fast DBSCAN algorithm is developed, which considerably speeds up the original DBSCAN algorithm. Then a sampling-based DBSCAN algorithm, a partitioning-based DBSCAN algorithm, and a parallel DBSCAN algorithm are introduced consecutively. Following that, based on the above-proposed algorithms, a synthetic algorithm is also given. Finally, some experimental results are given to demonstrate the effectiveness and efficiency of these algorithms.
Keywords: spatial database; clustering; fast DBSCAN algorithm; data sampling; data partitioning; parallel
FrepJoin:an efficient partition-based algorithm for edit similarity join
12
Authors: Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2017, No. 10, pp. 1499-1510, 12 pages
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit statistics such as the frequencies of characters. We develop a partition-based algorithm that uses such statistics. The frequency vectors are used to partition datasets into data chunks such that the dissimilarity between chunks is caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
Keywords: string similarity join; edit distance; filter and refine; data partition; combined frequency vectors
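Illustration: the filter-and-refine idea with character frequencies mentioned in this abstract can be sketched with a standard character-count lower bound on edit distance. This is a minimal sketch of that generic filter only. It is not FrepJoin's partitioning scheme or its combined frequency vectors, and all function names are assumptions made for the example.

```python
from collections import Counter


def freq_lower_bound(s, t):
    """Lower bound on edit distance from character counts.
    One edit operation changes the surplus and the deficit by at most 1 each,
    so at least max(surplus, deficit) operations are needed."""
    fs, ft = Counter(s), Counter(t)
    surplus = sum((fs - ft).values())  # characters s has in excess of t
    deficit = sum((ft - fs).values())  # characters t has in excess of s
    return max(surplus, deficit)


def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance (the refine step)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def within_threshold(s, t, tau):
    """Filter first with the cheap bound, then refine with the exact distance."""
    if freq_lower_bound(s, t) > tau:
        return False  # pruned without computing the edit distance
    return edit_distance(s, t) <= tau
```

The filter never discards a true match because the bound cannot exceed the real edit distance. It only saves work on pairs whose character counts already differ too much.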
RDF partitioning for scalable SPARQL query processing
13
Authors: Xiaoyan WANG, Tao YANG, Jinchuan CHEN, Long HE, Xiaoyong DU. Frontiers of Computer Science (SCIE, EI, CSCD), 2015, No. 6, pp. 919-933, 15 pages
The volume of RDF data has increased dramatically in recent years, while cloud computing platforms like Hadoop are supposed to be a good choice for processing queries over huge data sets owing to their excellent scalability. Previous work on evaluating SPARQL queries with Hadoop mainly focuses on reducing the number of joins through careful splitting of HDFS files and algorithms for generating Map/Reduce jobs. However, the way RDF data is partitioned could also affect system performance. Specifically, a good partitioning solution would greatly reduce or even totally avoid cross-node joins, and significantly cut down the cost of query evaluation. Based on HadoopDB, this work processes SPARQL queries in a hybrid architecture, where Map/Reduce takes charge of the computing tasks, and RDF query engines like RDF-3X store the data and execute join operations. According to the analysis of query workloads, this work proposes a novel algorithm for automatically partitioning RDF data and an approximate solution to physically place the partitions in order to reduce data redundancy. It also discusses how to make a good trade-off between query evaluation efficiency and data redundancy. All of the proposed approaches have been evaluated by extensive experiments over large RDF data sets.
Keywords: RDF data; data partitioning; SPARQL query