Purpose: Big data offer a huge challenge. Their very existence leads to the contradiction that the more data we have, the less accessible they become, as the particular piece of information one is searching for may be buried among terabytes of other data. In this contribution we discuss the origin of big data and point to three challenges that arise with big data: data storage, data processing, and generating insights.
Design/methodology/approach: Computer-related challenges can be expressed by the CAP theorem, which states that a distributed application can simultaneously provide only two of the following three properties: consistency (C), availability (A), and partition tolerance (P). As an aside we mention Amdahl's law and its application to scientific collaboration. We further discuss data mining in large databases and knowledge representation for handling the results of data mining exercises. We also offer a short informetric study of the field of big data and point to the ethical dimension of the big data phenomenon.
Findings: There are still serious problems to overcome before the field of big data can deliver on its promises.
Implications and limitations: This contribution offers a personal view, focusing on the information science aspects, but much more can be said about software aspects.
Originality/value: We express the hope that information scientists, including librarians, will be able to play their full role within the knowledge discovery, data mining, and big data communities, leading to exciting developments, the reduction of scientific bottlenecks, and really innovative applications.
There is a great thrust in industry toward the development of more feasible and viable tools for storing data of fast-growing volume, velocity, and diversity, termed 'big data'. The structural shift of the storage mechanism from traditional data management systems to NoSQL technology is driven by the need to fulfill big data storage requirements. However, the available big data storage technologies do not efficiently provide consistent, scalable, and available solutions for continuously growing heterogeneous data. Storage is the preliminary process of big data analytics for real-world applications such as scientific experiments, healthcare, social networks, and e-business. So far, Amazon, Google, and Apache are some of the industry standards in providing big data storage solutions, yet the literature does not report an in-depth survey of the storage technologies available for big data that investigates the performance and magnitude gains of these technologies. The primary objective of this paper is to conduct a comprehensive investigation of state-of-the-art storage technologies available for big data. A well-defined taxonomy of big data storage technologies is presented to assist data analysts and researchers in understanding and selecting a storage mechanism that better fits their needs. To evaluate the performance of different storage architectures, we compare and analyze the existing approaches using Brewer's CAP theorem. The significance and applications of storage technologies and their support for other categories are discussed. Several future research challenges are highlighted with the intention of expediting the deployment of a reliable and scalable storage system.