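Abstract: The financial industry needs big data technology to support its digital transformation, but the complexity of the big data technology stack, the rapid evolution of its components, and the difficulty of operating and maintaining products effectively are increasingly becoming bottlenecks for that transformation. This paper describes the problems the financial industry faces in data processing and, in light of banking business requirements, analyzes the system architecture, technical challenges, and main features of Golden Data HD, a commercial big data platform based on open-source technology. Concrete application cases show the platform's effect in practice: it increased data storage capacity, sped up query processing, and supported business innovation in applications, providing strong support for the digital transformation of financial enterprises.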
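Abstract: Remote direct memory access (RDMA) is being adopted more and more widely in the big data field. It allows a host to read and write remote memory without involving the remote host's CPU, and it provides high-bandwidth, high-throughput, low-latency data transfer, which can greatly improve the performance of distributed storage systems. RDMA-based distributed storage systems therefore bring new opportunities for meeting the demands of time-critical big data processing and storage. This survey first analyzes why simply replacing the network transport module of a distributed storage system with RDMA cannot fully exploit RDMA's advantages in semantics and performance, and identifies the factors that call for changes to storage system architecture. It then argues that using RDMA efficiently depends mainly on two aspects. The first is efficient management of hardware resources, including sensible use of the NIC cache and the CPU cache, parallel acceleration on multi-core CPUs, and memory resource management. The second is tightly coupled software-hardware design, which uses RDMA's semantic and performance characteristics to restructure data organization and indexing and to optimize distributed protocols. Taking distributed file systems, distributed key-value stores, and distributed transaction systems as typical application scenarios, the survey reviews related research on each in terms of hardware resource management and software restructuring. Finally, it gives a summary and an outlook.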
Abstract: The Very Fast Decision Tree (VFDT) algorithm is a classification algorithm for data streams. When processing large amounts of data, VFDT requires less time than traditional decision tree algorithms. However, when fewer training samples are available, the label values of VFDT leaf nodes contain more errors, and the classification ability of a single VFDT decision tree is limited. The Random Forest algorithm is an ensemble classifier with high prediction accuracy and good noise tolerance. It consists of multiple decision trees and can compensate for the shortcomings of a single decision tree. In this paper, to improve classification accuracy on data streams, the Random Forest algorithm is integrated into the tree-building process of the VFDT algorithm, and a new Random Forest Based Very Fast Decision Tree algorithm named RFVFDT is designed. The RFVFDT algorithm adopts the decision tree building criterion of a Random Forest classifier and extends the Random Forest algorithm with a sliding window to cope with the unboundedness of data streams and to avoid processing delay and data loss. Experimental results on the KDD CUP data sets show that the classification accuracy of RFVFDT is higher than that of VFDT. The fewer the samples, the more pronounced the advantage. RFVFDT is fast when running in multithreaded mode.
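The abstract does not spell out the split rule or the voting scheme. As a minimal sketch, assuming the standard VFDT formulation (a leaf is split only when the gain difference between the two best attributes exceeds the Hoeffding bound) and a plain majority vote over the trees, the two ingredients might look as follows. All function names and parameter values are hypothetical and are not taken from the RFVFDT paper.

```python
import math
from collections import Counter, deque


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean lies
    within epsilon of the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Standard VFDT test (assumed here): split once the best attribute
    beats the runner-up by more than the Hoeffding bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


def majority_vote(predictions):
    """Combine per-tree predictions the way a Random Forest usually does."""
    return Counter(predictions).most_common(1)[0][0]


# A bounded deque is one simple way to model a sliding window over a stream:
# once it is full, appending a new sample silently evicts the oldest one.
window = deque(maxlen=3)
for sample in range(5):
    window.append(sample)
```

With few samples the bound is wide, so a leaf stays unsplit and its label rests on little evidence, which is consistent with the weakness of a single VFDT tree that the abstract describes.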
Funding: Supported by ZTE Industry-Academia-Research Cooperation Funds
Abstract: Data layout in a file system is the organization of data stored in external storage. The data layout has a huge impact on the performance of storage systems. We survey three main kinds of data layout in traditional file systems: in-place update file systems, log-structured file systems, and copy-on-write file systems. Each kind has its own strengths and weaknesses under different circumstances. We also include a recent use of persistent layout in a file system that combines both flash memory and byte-addressable non-volatile memory. With this survey, we conclude that persistent data layout in file systems may evolve dramatically in the era of emerging non-volatile memory.
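To make the three layouts named in the abstract concrete, here is a toy in-memory model of how each one handles an update. It is only an illustrative sketch: the class and method names are hypothetical, and real file systems add allocation, garbage collection, and crash consistency that are omitted here.

```python
class InPlaceStore:
    """In-place update: new data overwrites the old data at the same location."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, data):
        self.blocks[key] = data  # the previous contents are lost

    def read(self, key):
        return self.blocks[key]


class LogStructuredStore:
    """Log-structured: every write is appended; an index points at the newest copy."""

    def __init__(self):
        self.log = []
        self.index = {}

    def write(self, key, data):
        self.log.append((key, data))
        self.index[key] = len(self.log) - 1  # stale entries remain until cleaned

    def read(self, key):
        return self.log[self.index[key]][1]


class CopyOnWriteStore:
    """Copy-on-write: data goes to a fresh block and a new root mapping is
    published; the old root still describes a consistent earlier snapshot."""

    def __init__(self):
        self.blocks = []
        self.root = {}

    def write(self, key, data):
        self.blocks.append(data)
        previous_root = self.root
        new_root = dict(previous_root)  # copy the mapping, never mutate it
        new_root[key] = len(self.blocks) - 1
        self.root = new_root
        return previous_root  # caller may keep this as a snapshot

    def read(self, key, root=None):
        mapping = self.root if root is None else root
        return self.blocks[mapping[key]]
```

Writing the same key twice leaves one block in `InPlaceStore`, two log entries in `LogStructuredStore`, and two blocks plus a readable older snapshot in `CopyOnWriteStore`, which is the kind of trade-off between space, write pattern, and versioning that the survey compares.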