The issue of privacy protection for mobile social networks is a frontier topic in the field of social network applications. Existing research on user privacy protection in mobile social networks mainly focuses on privacy-preserving data publishing and access control. There is little research on the association of user privacy information, which makes it difficult to design personalized privacy protection strategies and also increases the complexity of user privacy settings. Therefore, this paper concentrates on the association of user privacy information using big data analysis tools, so as to provide data support for the design of personalized privacy protection strategies.
Quantitative analysis of digital images requires detection and segmentation of the borders of the object of interest. Accurate segmentation is required for volume determination, 3D rendering, radiation therapy, and surgery planning. In medical images, segmentation has traditionally been done by human experts. Substantial computational and storage requirements become especially acute when object orientation and scale have to be considered. Therefore, automated or semi-automated segmentation techniques are essential if these software applications are ever to gain widespread clinical use. Many methods have been proposed to detect and segment 2D shapes, most of which involve template matching. Advanced segmentation techniques called snakes, or active contours, based on deformable models or templates, have also been used. The main purpose of this work is to apply segmentation techniques to the definition of 3D organs (anatomical structures) when big data must be stored and organized by doctors for medical diagnosis. The processes are applied to CT images from patients with COVID-19.
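As a minimal illustration of the segmentation idea (far simpler than snakes or active contours), a fixed-intensity threshold can separate a bright structure from the background in a 2D slice; the pixel values below are hypothetical, not real CT data:

```python
def threshold_segment(image, threshold):
    """Return a binary mask marking pixels at or above the threshold."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]

def mask_area(mask):
    """Count of segmented pixels, a crude 2D proxy for volume."""
    return sum(sum(row) for row in mask)

# A tiny made-up "slice": bright object pixels (~200) on a dark background (~10).
ct_slice = [
    [10, 12, 200, 210],
    [11, 198, 205, 14],
    [13, 15, 190, 12],
]
mask = threshold_segment(ct_slice, 100)
print(mask_area(mask))  # -> 5
```

Real pipelines would refine such a coarse mask with deformable models, but the thresholded mask is often the starting contour.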
In the United States, the buildings sector consumes about 76% of electricity use and 40% of all primary energy use and associated greenhouse gas emissions. Occupant behavior has drawn increasing research interest due to its impact on building energy consumption. However, occupant behavior study at urban scale remains a challenge, and very limited studies have been conducted. As an effort to couple big data analysis with human mobility modeling, this study explored urban-scale human mobility utilizing three months of Global Positioning System (GPS) data from 93,000 users in the Phoenix Metropolitan Area. This research extracted stay points from the raw data and identified users' home, work, and other locations with a density-based spatial clustering algorithm. Then, daily mobility patterns were constructed using the different types of locations. We propose a novel approach to predict urban-scale daily human mobility patterns with a 12-hour prediction horizon, using a Long Short-Term Memory (LSTM) neural network model. Results show that the developed models achieved around 85% average accuracy and about 86% mean precision. The developed models can be further applied to analyze urban-scale occupant behavior and building energy demand and flexibility, and can contribute to urban planning.
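The density-based clustering step can be sketched as follows — a toy, quadratic-time version of the DBSCAN idea used to group stay points into locations. The coordinates and parameters are hypothetical, not the paper's GPS data:

```python
import math

def dbscan(points, eps, min_pts):
    """Naive DBSCAN: label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in nbrs if j != i]
        while seeds:                 # expand the cluster density-reachably
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            elif labels[j] is None:
                labels[j] = cluster
                j_nbrs = neighbors(j)
                if len(j_nbrs) >= min_pts:
                    seeds.extend(j_nbrs)
    return labels

# Two invented stay-point clusters (say, "home" and "work") plus one outlier.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (20, 20)]
print(dbscan(pts, eps=0.5, min_pts=2))  # -> [1, 1, 1, 2, 2, 2, -1]
```

Production work would use an indexed implementation (e.g. scikit-learn's) on haversine distances, but the labeling logic is the same.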
Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or in the cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks, which often require running complex analytical algorithms on extremely big data sets measured in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limits, and a limited set of analytical algorithms, because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to overcome these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.
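For intuition, the MapReduce computing model discussed above can be sketched as a single-process word count — no cluster, no framework, purely illustrative of the map, shuffle, and reduce phases:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Mapper: emit (word, 1) pairs for each word in one input record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one key."""
    return key, sum(values)

records = ["big data analysis", "big data frameworks"]
pairs = chain.from_iterable(map_phase(r) for r in records)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # -> {'big': 2, 'data': 2, 'analysis': 1, 'frameworks': 1}
```

The constraint the survey highlights is visible even here: every algorithm must be forced into these stateless per-key phases, which many iterative or serial analytical algorithms resist.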
Electrocardiogram (ECG) is a low-cost, simple, fast, and non-invasive test. It can reflect the heart's electrical activity and provide valuable diagnostic clues about the health of the entire body. Therefore, ECG has been widely used in various biomedical applications such as arrhythmia detection, disease-specific detection, mortality prediction, and biometric recognition. In recent years, ECG-related studies have been carried out using a variety of publicly available datasets, with many differences in the datasets used, data preprocessing methods, targeted challenges, and modeling and analysis techniques. Here we systematically summarize and analyze ECG-based automatic analysis methods and applications. Specifically, we first review 22 commonly used public ECG datasets and provide an overview of data preprocessing processes. Then we describe some of the most widely used applications of ECG signals and analyze the advanced methods involved in these applications. Finally, we elucidate some of the challenges in ECG analysis and provide suggestions for further research.
Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
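Reservoir sampling, one of the classical sampling methods surveyed, can be sketched as follows (Algorithm R; the stream and seed below are arbitrary):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: keep a uniform random sample of k items from a stream
    of unknown length, using O(k) memory and one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1),
            # which keeps every item's inclusion probability uniform.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10, random.Random(42))
print(len(sample))  # -> 10
```

This single-pass property is exactly why reservoir sampling suits record-level sampling over distributed data that cannot be held in memory.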
Air quality is a critical concern for public health and environmental regulation. The Air Quality Index (AQI), widely adopted by the US Environmental Protection Agency (EPA), serves as a crucial metric for reporting site-specific air pollution levels. Accurately predicting air quality, as measured by the AQI, is essential for effective air pollution management. In this study, we aim to identify the most reliable classification model among linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), logistic regression, and K-nearest neighbors (KNN). We conducted four different analyses using a machine learning approach to determine the model with the best performance. By employing the confusion matrix and error percentages, we selected the best-performing model; the prediction error rates were 22%, 23%, 20%, and 27%, respectively, for the LDA, QDA, logistic regression, and KNN models. The logistic regression model outperformed the other three statistical models in predicting AQI. Understanding these models' performance can help address an existing gap in air quality research and contribute to the integration of such techniques in AQI studies, ultimately benefiting stakeholders like environmental regulators, healthcare professionals, urban planners, and researchers.
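The confusion-matrix-based error rate used to compare the models can be computed in a few lines; the AQI category labels and predictions below are hypothetical, not the study's data:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows = true class, columns = predicted class."""
    index = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

def error_rate(matrix):
    """Share of off-diagonal entries: misclassified / total."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

# Invented AQI category ground truth and predictions for illustration.
truth = ["Good", "Moderate", "Good", "Unhealthy", "Moderate"]
preds = ["Good", "Good",     "Good", "Unhealthy", "Moderate"]
m = confusion_matrix(truth, preds, ["Good", "Moderate", "Unhealthy"])
print(error_rate(m))  # -> 0.2
```

Computing one such error rate per candidate model and keeping the minimum is the selection procedure the abstract describes.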
Metastasis is the greatest contributor to cancer-related death. In the era of precision medicine, it is essential to predict and to prevent the spread of cancer cells to significantly improve patient survival. Thanks to the application of a variety of high-throughput technologies, accumulating big data enables researchers and clinicians to identify aggressive tumors as well as patients with a high risk of cancer metastasis. However, there have been few large-scale gene collection studies to enable metastasis-related analyses. In the last several years, emerging efforts have identified pro-metastatic genes in a variety of cancers, providing us the ability to generate a pro-metastatic gene cluster for big data analyses. We carefully selected 285 genes with in vivo evidence of promoting metastasis reported in the literature. These genes have been investigated in different tumor types. We used two datasets downloaded from The Cancer Genome Atlas database, specifically datasets of clear cell renal cell carcinoma and hepatocellular carcinoma, for validation tests, and excluded any genes for which an elevated expression level correlated with longer overall survival in either dataset. Ultimately, 150 pro-metastatic genes remained in our analyses. We believe this collection of pro-metastatic genes will be helpful for big data analyses and eventually will accelerate anti-metastasis research and clinical intervention.
In view of the frequent fluctuation of garlic prices under the market economy and the current state of garlic prices, this paper analyzes price fluctuation in the circulation link of the garlic industry chain and discusses the application of a multidisciplinary approach in the agricultural industry. On the basis of the big data platform of the garlic industry chain, a GARCH model is constructed to analyze the fluctuation law of garlic prices in the circulation link, and garlic industry services are provided from the angle of price fluctuation combined with economic analysis. The research shows that the return series of garlic prices exhibits volatility "clustering" and cyclical behavior, is left-skewed and non-normally distributed, and that the fitted values of the GARCH model are very close to the true values. Finally, the paper looks into industrial service forms from the perspective of garlic price fluctuation.
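A minimal sketch of the GARCH(1,1) conditional-variance recursion underlying such a model; the returns and parameter values are illustrative assumptions, not the paper's fitted estimates:

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance recursion of a GARCH(1,1) model:
    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]."""
    # Initialise with the sample variance of the return series.
    mean = sum(returns) / len(returns)
    sigma2 = [sum((r - mean) ** 2 for r in returns) / len(returns)]
    for t in range(1, len(returns)):
        sigma2.append(omega + alpha * returns[t - 1] ** 2
                      + beta * sigma2[t - 1])
    return sigma2

# Hypothetical daily garlic-price returns; omega/alpha/beta are made up.
r = [0.01, -0.03, 0.05, -0.02, 0.04]
var_path = garch11_variance(r, omega=1e-5, alpha=0.1, beta=0.85)
print(len(var_path))  # -> 5
```

The alpha term is what lets a large shock raise next-period variance, producing the clustering effect the paper reports; in practice the parameters are obtained by maximum likelihood rather than fixed by hand.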
Monitoring, understanding, and predicting origin-destination (OD) flows in a city is an important problem for city planning and human activity analysis. Taxi GPS traces, a typical kind of crowd-sensed data, can be used to mine the semantics of OD flows. In this paper, we first construct and analyze a complex network of OD flows based on large-scale GPS taxi traces of a city in China. The spatiotemporal analysis of the OD-flow complex network shows that there are distinctive patterns in OD flows. Then, based on a novel complex network model, a semantics mining method for OD flows is proposed by compounding a Points of Interest (POI) network and a public transport network with the OD-flow network. The proposed method offers a novel way to accurately predict location characteristics and future traffic conditions.
In recent years, China has successfully set up multiple single-product big data platforms. As an indigenous and unique plant in China, the peony offers immense economic returns, strong social benefits, and profound cultural heritage. Its seed oil, an emerging edible oil, has attracted much attention. Heze city is one of the optimal places for cultivating peonies. In this context, a study of the big data of the peony industry in Heze city bears practical significance. This paper begins with a literature review of big data platforms for entire industries. Referring to established single-product big data platforms, it reports the results of a case study of the peony industry in Heze city that identifies potential difficulties and problems regarding the building of a big data platform for the peony industry that incorporates the five dimensions of service, management, application, resource, and technology.
The term sentiment analysis deals with sentiment classification based on reviews made by users in a social network. Sentiment classification accuracy is evaluated using various selection methods, especially those that deal with algorithm selection. In this work, every sentiment received through user expressions is ranked in order to categorize sentiments as informative or non-informative. To do so, the work focuses on the Query Expansion Ranking (QER) algorithm, which takes user text as input, processes it for sentiment analysis, and finally produces results as informative or non-informative. The challenge is to convert non-informative into informative content using classifiers such as multinomial naive Bayes and entropy modeling, along with traditional sentiment analysis algorithms such as Support Vector Machines (SVM) and decision trees. The work also combines simulated annealing with QER to classify data based on sentiment analysis. As the input arrives at high volume and velocity, the work also addresses big data concepts for information retrieval and processing. The result comparison shows that the QER algorithm proved to be versatile when compared with the results of SVM. This work uses Twitter user comments for evaluating sentiment analysis.
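One of the classifiers mentioned, multinomial naive Bayes, can be sketched in pure Python with add-one smoothing; the tiny labelled corpus is invented for illustration and does not come from the paper's Twitter data:

```python
import math
from collections import Counter

def train_nb(docs):
    """Train multinomial naive Bayes. docs: list of (token_list, label)."""
    labels = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in labels}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    return labels, word_counts, vocab

def classify(tokens, labels, word_counts, vocab):
    """Pick the label maximizing log prior + smoothed log likelihoods."""
    total_docs = sum(labels.values())
    best, best_score = None, float("-inf")
    for label, n_docs in labels.items():
        n_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            # Add-one (Laplace) smoothing over the shared vocabulary.
            score += math.log((word_counts[label][w] + 1)
                              / (n_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("great product love it".split(), "informative"),
        ("lol random stuff".split(), "non-informative"),
        ("love this great service".split(), "informative")]
model = train_nb(docs)
print(classify("love great".split(), *model))  # -> informative
```

In the paper's pipeline, such a classifier would be one of several voters alongside SVM and decision trees rather than the sole decision maker.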
This paper proposes a method for improving the data security of wireless sensor networks based on blockchain technology. Blockchain technology is applied to data transfer to build a highly secure wireless sensor network. In this network, the relay stations use microcontrollers and embedded devices, and embedded boards such as the Raspberry Pi and Arduino Yun serve as mobile databases. The proposed system uses microcontrollers to facilitate the connection of various sensor devices. By adopting blockchain encryption, the security of the sensing data can be effectively improved. A blockchain is a concatenated transaction record protected by cryptography. Each block contains the encrypted hash of the previous block, the corresponding timestamp, and transaction data; here, the transaction data are the sensing data of the wireless sensor network. The proposed system uses a hash value computed by the Merkle-tree algorithm, which makes the transferred data difficult to tamper with. Moreover, the proposed system can serve as a private cloud data center. In this study, the system visualizes the data uploaded by the sensors and creates relevant charts based on big data analysis. Since the web server of the proposed system is built on an embedded operating system, it is easy to model and visualize the corresponding graphics using the Python or JavaScript programming languages. Finally, this study creates an embedded-system mobile database and web server, which utilize the JavaScript programming language and the Node.js runtime environment to apply blockchain technology to mobile databases. The proposed method is verified by an experiment using about 1600 data records. The results show that the probability of the data being tampered with is almost zero.
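The Merkle-tree hashing that makes tampering detectable can be sketched as follows; the sensor readings are hypothetical:

```python
import hashlib

def merkle_root(leaves):
    """Compute a Merkle root over raw byte leaves. Any change to any leaf
    changes the root, which is how tampering is detected."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])      # duplicate last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

readings = [b"23.5C", b"24.1C", b"22.9C", b"23.8C"]
root = merkle_root(readings)
tampered = merkle_root([b"99.9C", b"24.1C", b"22.9C", b"23.8C"])
print(root != tampered)  # -> True
```

Storing only the root per block keeps verification cheap on a microcontroller: a relay station can detect a modified reading by recomputing the root and comparing it with the one recorded on the chain.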
Cryptocurrency, as a typical application of blockchain, has attracted broad interest from both industrial and academic communities. With its rapid development, cryptocurrency transaction network embedding (CTNE) has become a hot topic. It embeds transaction nodes into a low-dimensional feature space while effectively maintaining the network structure, thereby discovering desired patterns that reveal involved users' normal and abnormal behaviors. Based on a wide investigation into the state of the art of CTNE, this survey makes the following efforts: 1) categorizing recent progress in CTNE methods, 2) summarizing the publicly available cryptocurrency transaction network datasets, 3) evaluating several widely adopted methods to show their performance under several typical evaluation protocols, and 4) discussing future trends of CTNE. By doing so, it strives to provide a systematic and comprehensive overview of existing CTNE methods from static to dynamic perspectives, thereby promoting further research into this emerging and important field.
In response to the limitations of the traditional education and teaching model, this article proposes a smart education model based on ChatGPT. The model actively breaks the constraints of time and space and the design patterns of traditional education, providing smart education services including personalized learning, smart tutoring and evaluation, educational content creation support, and education big data analysis. By constructing an open and inclusive learning space and creating flexible and diverse educational models, ChatGPT can help meet students' needs for individualized and overall development, and assist teachers in keeping up with students' learning performance and developmental requirements in real time. This provides an important basis for optimizing teaching content, offering personalized and accurate cultivation, and planning students' development paths.
The proliferation of textual data in society is currently overwhelming; in particular, unstructured textual data are constantly generated via call centre logs, emails, documents on the web, blogs, tweets, customer comments, customer reviews, and so on. While the amount of textual data is increasing rapidly, users' ability to summarise, understand, and make sense of such data for making better business and living decisions remains challenging. This paper studies how to analyse textual data, based on layered software patterns, to extract insightful user intelligence from a large collection of documents and to use such information to improve user operations and performance.
Irritable bowel syndrome (IBS) is a common clinical label for medically unexplained gastrointestinal symptoms, recently described as a disturbance of the microbiota-gut-brain axis. Despite decades of research, the pathophysiology of this highly heterogeneous disorder remains elusive. However, a dramatic change in the understanding of the underlying pathophysiological mechanisms surfaced when the importance of the gut microbiota entered the scientific picture. Are we getting any closer to understanding the etiology of IBS, or are we drowning in unspecific, conflicting data because we possess limited tools to unravel the cluster of secrets our gut microbiota is concealing? In this comprehensive review we discuss some of the major features of IBS and their interaction with the gut microbiota, clinical microbiota-altering treatments such as the low-FODMAP diet and fecal microbiota transplantation, neuroimaging and methods in microbiota analyses, and current and future challenges of big data analysis in IBS.
Based on the CiteSpace software, a big data bibliometric analysis was carried out on the keywords of papers on photocatalytic materials published in 2020. Tracking the hotspots and directions can help young scholars understand the latest progress. In the Web of Science, 4147 related papers were retrieved with "photocatalytic materials" as the main topic. Cluster analysis showed that the hotspots were g-C_(3)N_(4), MXene, metal-organic frameworks (MOFs), and titanium dioxide (TiO_(2)).
The World Health Organization (WHO) terms dengue a serious illness that impacts almost half of the world's population and has no specific treatment. Early and accurate detection of its spread in affected regions can save precious lives. Despite the severity of the disease, few noticeable works involve sentiment analysis to mine accurate intuitions from social media text streams. However, the massive data explosion in recent years has led to difficulties in storing and processing large amounts of data, as reliable mechanisms to gather the data and suitable techniques to extract meaningful insights from it are required. This research study proposes a sentiment analysis polarity approach for collecting data and extracting relevant information about dengue via Apache Hadoop. The method consists of two main parts: the first collects data from social media using Apache Flume, while the second focuses on querying and extracting relevant information via a hybrid filtration-polarity algorithm using Apache Hive. To overcome the noisy and unstructured nature of the data, the process of extracting information is characterized by pre- and post-filtration phases. As a result, only by integrating Flume and Hive with filtration and polarity analysis can a reliable sentiment analysis technique be offered to collect and process large-scale data from the social network. We introduce how the Apache Hadoop ecosystem components Flume and Hive can provide a sentiment analysis capability by storing and processing large amounts of data. An important finding of this paper is that developing efficient sentiment analysis applications for detecting diseases can be more reliable through the use of Hadoop ecosystem components than through the use of ordinary single machines.
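A single-machine sketch of the filtration-polarity idea: keyword pre-filtration, then lexicon-based polarity scoring. The word lists and tweets are invented stand-ins, not the paper's Flume/Hive implementation:

```python
# Hypothetical word lists; a production system would use a trained lexicon.
POSITIVE = {"recovered", "improving", "safe", "better"}
NEGATIVE = {"outbreak", "fever", "worse", "death", "spreading"}
KEYWORDS = {"dengue", "fever", "mosquito"}

def filter_relevant(tweets):
    """Pre-filtration: keep only posts mentioning the disease keywords."""
    return [t for t in tweets if KEYWORDS & set(t.split())]

def polarity(tokens):
    """Score = (#positive - #negative) / #tokens; the sign gives polarity."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / max(len(tokens), 1)

tweets = ["dengue outbreak spreading fast", "lovely weather today",
          "dengue patient recovered and improving"]
relevant = filter_relevant(tweets)
scores = [polarity(t.split()) for t in relevant]
print(len(relevant), [s > 0 for s in scores])  # -> 2 [False, True]
```

In the proposed pipeline, the filtration step would be a Hive query over Flume-ingested tables and the scoring a distributed job, but the per-record logic is the same.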
Funding: We thank the anonymous reviewers and editors for their very constructive comments. Supported by the National Social Science Foundation Project of China under Grant 16BTQ085.
Funding: Supported by the U.S. National Science Foundation (Award No. 1949372 and No. 2125775), and in part through computational resources provided by Syracuse University.
Funding: Supported by the National Natural Science Foundation of China (No. 61972261) and the Basic Research Foundations of Shenzhen (Nos. JCYJ20210324093609026 and JCYJ20200813091134001).
Funding: Supported by the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1909208), the Science and Technology Major Project of Changsha (kh2202004), and the Changsha Municipal Natural Science Foundation (kq2202106).
Funding: Supported in part by the National Natural Science Foundation of China (No. 61972261) and the National Key R&D Program of China (No. 2017YFC0822604-2).
Abstract: Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
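Of the classical sampling methods listed above, reservoir sampling is the one designed for single-pass streams of unknown length; a minimal sketch of Algorithm R, with an assumed fixed seed purely for reproducibility:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Algorithm R: maintain a uniform random sample of size k from a
    stream of unknown length in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # each item survives with prob. k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
print(len(sample))  # 10
```

Each item in the stream ends up in the sample with equal probability k/n, which is what makes the method suitable for record-level sampling over data too large to hold in memory.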
Abstract: Air quality is a critical concern for public health and environmental regulation. The Air Quality Index (AQI), an index widely adopted by the US Environmental Protection Agency (EPA), serves as a crucial metric for reporting site-specific air pollution levels. Accurately predicting air quality, as measured by the AQI, is essential for effective air pollution management. In this study, we aim to identify the most reliable model among linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), logistic regression, and K-nearest neighbors (KNN). We conducted four different analyses using a machine learning approach to determine the model with the best performance. By employing the confusion matrix and error percentages, we selected the best-performing model; the prediction error rates were 22%, 23%, 20%, and 27%, respectively, for the LDA, QDA, logistic regression, and KNN models. The logistic regression model outperformed the other three statistical models in predicting AQI. Understanding these models' performance can help address an existing gap in air quality research and contribute to the integration of such techniques in AQI studies, ultimately benefiting stakeholders like environmental regulators, healthcare professionals, urban planners, and researchers.
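The error percentages reported above are derived from confusion matrices; a minimal sketch of that computation, using a hypothetical 2x2 matrix whose numbers are illustrative and not the study's data:

```python
def error_rate(confusion):
    """Prediction error from a confusion matrix:
    1 - (sum of the diagonal) / (sum of all cells)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 1 - correct / total

# Hypothetical confusion matrix (rows = actual class, columns = predicted class)
cm = [[40, 10],
      [10, 40]]
print(round(error_rate(cm), 2))  # 0.2, i.e. a 20% error rate
```

Comparing this single number across LDA, QDA, logistic regression, and KNN is exactly the selection procedure the abstract describes.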
Funding: Supported by grants from the National Natural Science Foundation of China (No. 81272340, No. 81472386, No. 81672872), the National High Technology Research and Development Program of China (863 Program) (No. 2012AA02A501), the Science and Technology Planning Project of Guangdong Province, China (No. 2014B020212017, No. 2014B050504004, and No. 2015B050501005), and the Natural Science Foundation of Guangdong Province, China (No. 2016A030311011).
Abstract: Metastasis is the greatest contributor to cancer-related death. In the era of precision medicine, it is essential to predict and to prevent the spread of cancer cells to significantly improve patient survival. Thanks to the application of a variety of high-throughput technologies, accumulating big data enables researchers and clinicians to identify aggressive tumors as well as patients with a high risk of cancer metastasis. However, there have been few large-scale gene collection studies to enable metastasis-related analyses. In the last several years, emerging efforts have identified pro-metastatic genes in a variety of cancers, providing us the ability to generate a pro-metastatic gene cluster for big data analyses. We carefully selected 285 genes with in vivo evidence of promoting metastasis reported in the literature. These genes have been investigated in different tumor types. We used two datasets downloaded from The Cancer Genome Atlas database, specifically, datasets of clear cell renal cell carcinoma and hepatocellular carcinoma, for validation tests, and excluded any genes for which elevated expression level correlated with longer overall survival in any of the datasets. Ultimately, 150 pro-metastatic genes remained in our analyses. We believe this collection of pro-metastatic genes will be helpful for big data analyses, and eventually will accelerate anti-metastasis research and clinical intervention.
Abstract: In view of the frequent fluctuation of garlic prices under the market economy and the current state of garlic prices, this paper analyzes the fluctuation of garlic prices in the circulation link of the garlic industry chain and discusses the application of multidisciplinary methods in the agricultural industry. On the basis of the big data platform of the garlic industry chain, this paper constructs a GARCH model to analyze the fluctuation law of garlic prices in the circulation link and provides garlic industry services from the angle of price fluctuation combined with economic analysis. The research shows that the return rate of garlic prices exhibits "agglomeration" (volatility clustering) and cyclical phenomena, with the characteristics of fragility, left skewness, and a non-normal distribution, and that the fitted values of the GARCH model are very close to the true values. Finally, the paper looks into the industrial service form from the perspective of garlic price fluctuation.
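The GARCH model used above rests on a simple conditional-variance recursion; a minimal sketch of a GARCH(1,1) variance path, with assumed illustrative parameters rather than values fitted to the garlic price data:

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance recursion of a GARCH(1,1) model:
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}.
    Large shocks raise the next period's variance, producing the
    volatility clustering ("agglomeration") the study observes."""
    # initialize with the unconditional variance omega / (1 - alpha - beta)
    sigma2 = [omega / (1 - alpha - beta)]
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2

returns = [0.01, -0.03, 0.02, 0.05, -0.01]       # hypothetical daily returns
var_path = garch11_variance(returns, omega=1e-5, alpha=0.1, beta=0.85)
print(len(var_path) == len(returns))  # True
print(all(v > 0 for v in var_path))   # True
```

In practice the parameters omega, alpha, and beta are estimated by maximum likelihood with a dedicated library; this sketch only shows the recursion those estimates feed into.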
Funding: This work is supported by the Shandong Provincial Natural Science Foundation, China, under Grant No. ZR2017MG011, and by the Key Research and Development Program of Shandong Province (2017GGX90103).
Abstract: Monitoring, understanding, and predicting origin-destination (OD) flows in a city is an important problem for city planning and human activity analysis. Taxi GPS traces, a typical kind of crowd-sensed data, can be used to mine the semantics of OD flows. In this paper, we first construct and analyze a complex network of OD flows based on large-scale GPS taxi traces of a city in China. The spatiotemporal analysis of the OD-flow complex network showed that there were distinctive patterns in OD flows. Then, based on a novel complex network model, a semantics mining method for OD flows is proposed by compounding the Points of Interest (POI) network and the public transport network with the OD-flow network. The proposed method offers a novel way to predict location characteristics and future traffic conditions accurately.
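The first step above, building a weighted, directed OD-flow network from trips, can be sketched as follows; the zone labels and trips are hypothetical stand-ins for the origin-destination pairs extracted from GPS pickup and drop-off points:

```python
from collections import Counter

def build_od_network(trips):
    """Build a weighted, directed OD-flow network: each key is an
    (origin zone, destination zone) edge and its count is the edge weight,
    i.e. the number of taxi trips observed between the two zones."""
    return Counter((origin, dest) for origin, dest in trips)

# Hypothetical trips between city zones A, B, C
trips = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]
od = build_od_network(trips)
print(od[("A", "B")])  # 2
print(len(od))         # 3 distinct OD pairs
```

The semantic-mining step described in the abstract would then attach POI and public-transport attributes to these weighted edges, which this sketch does not attempt.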
Abstract: In recent years, China has successfully set up multiple single-product big data platforms. As an indigenous and unique plant in China, the peony offers immense economic returns, strong social benefits, and profound cultural heritage. Its seed oil, as an emerging edible oil, has attracted much attention. Heze city is one of the places optimal for cultivating peonies. In this context, a study of the big data of the peony industry in Heze city bears practical significance. This paper begins with a literature review of big data platforms for the entire industry. Referring to established single-product big data platforms, it reports the results of a case study of the peony industry in Heze city that identifies potential difficulties and problems regarding the building of a big data platform for the peony industry that incorporates the five dimensions of service, management, application, resource, and technology.
Abstract: The term sentiment analysis deals with sentiment classification based on reviews made by users in a social network. Sentiment classification accuracy is evaluated using various selection methods, especially those that deal with algorithm selection. In this work, every sentiment received through user expressions is ranked in order to categorise sentiments as informative or non-informative. To do so, the work focuses on the Query Expansion Ranking (QER) algorithm, which takes user text as input, processes it for sentiment analysis, and finally produces the results as informative or non-informative. The challenge is to convert non-informative into informative using classifiers such as multinomial Bayes and entropy modelling, along with traditional sentiment analysis algorithms like Support Vector Machine (SVM) and decision trees. The work also applies simulated annealing along with QER to classify data based on sentiment analysis. As the input arrives at high volume and velocity, the work also addresses the concept of big data for information retrieval and processing. The result comparison shows that the QER algorithm proved to be versatile when compared with the results of SVM. This work uses Twitter user comments for evaluating sentiment analysis.
基金supported by the Department of Electrical Engineering,National Chin-Yi University of Technologythe National Chin-Yi University of Technology,Takming University of Science and Technology,Taiwan,for supporting this research.
Abstract: This paper proposes a method for improving the data security of wireless sensor networks based on blockchain technology. Blockchain technology is applied to data transfer to build a highly secure wireless sensor network. In this network, the relay stations use microcontrollers and embedded devices, and the microcontrollers, such as Raspberry Pi and Arduino Yun, represent mobile databases. The proposed system uses microcontrollers to facilitate the connection of various sensor devices. By adopting blockchain encryption, the security of sensing data can be effectively improved. A blockchain is a concatenated transaction record that is protected by cryptography. Each block contains the cryptographic hash of the previous block, the corresponding timestamp, and transaction data. The transaction data denote the sensing data of the wireless sensor network. The proposed system uses a hash value representation calculated by the Merkle-tree algorithm, which makes the transfer data of the system difficult to tamper with. Moreover, the proposed system can serve as a private cloud data center. In this study, the system visualizes the data uploaded by sensors and creates relevant charts based on big data analysis. Since the webpage server of the proposed system is built on an embedded operating system, it is easy to model and visualize the corresponding graphics using the Python or JavaScript programming language. Finally, this study creates an embedded-system mobile database and web server, which utilize the JavaScript programming language and the Node.js runtime environment to apply blockchain technology to mobile databases. The proposed method is verified by an experiment using about 1600 data records. The results show that the probability of data being tampered with is almost zero.
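The Merkle-tree hashing that the system relies on can be sketched as follows; the sensor readings are hypothetical, and this is a generic illustration of the technique rather than the paper's implementation:

```python
import hashlib

def merkle_root(leaves):
    """Compute a Merkle-tree root over a list of byte strings by hashing
    the leaves and then repeatedly hashing adjacent pairs of nodes.
    Any change to a single record changes the root, which is what makes
    the stored sensing data tamper-evident."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

readings = [b"sensor1:23.5C", b"sensor2:24.1C", b"sensor3:22.9C"]
root = merkle_root(readings)
tampered = merkle_root([b"sensor1:99.9C", b"sensor2:24.1C", b"sensor3:22.9C"])
print(root != tampered)  # True: altering one reading changes the root
```

Storing only the root in each block lets a verifier detect tampering with any of the underlying sensor records without rehashing them individually against a full copy.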
Funding: Supported in part by the National Natural Science Foundation of China (62272078), the CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-035A), and the Doctoral Student Talent Training Program of Chongqing University of Posts and Telecommunications (BYJS202009).
Abstract: Cryptocurrency, as a typical application of blockchain, has attracted broad interest from both industrial and academic communities. With its rapid development, cryptocurrency transaction network embedding (CTNE) has become a hot topic. It embeds transaction nodes into a low-dimensional feature space while effectively maintaining the network structure, thereby discovering desired patterns that reveal involved users' normal and abnormal behaviors. Based on a wide investigation into the state-of-the-art CTNE, this survey has made the following efforts: 1) categorizing recent progress of CTNE methods, 2) summarizing the publicly available cryptocurrency transaction network datasets, 3) evaluating several widely adopted methods to show their performance under several typical evaluation protocols, and 4) discussing the future trends of CTNE. By doing so, it strives to provide a systematic and comprehensive overview of existing CTNE methods from static to dynamic perspectives, thereby promoting further research into this emerging and important field.
Funding: Supported by the Ministry of Education New Engineering Project Research and Practice (No. E-AQGABQ20202704), the Undergraduate Teaching Reform and Innovation Project of Beijing Higher Education (No. 202110018002), the First-Class Discipline Construction Project of Beijing Electronic Science and Technology Institute (No. 20210064Z0401, No. 20210056Z0402), the Fundamental Research Funds for the Central Universities (No. 328202205, No. 328202271, No. 328202269), the Research on Graphical Development Platform of Reconfigurable Cryptographic Chip Based on Model Driven (No. 20220153Z0114), and the National Key Research and Development Program Funded Project (No. 2017YFB0801803).
Abstract: In response to the limitations of the traditional education and teaching model, this article proposes a smart education model based on ChatGPT. The model actively breaks the constraints of time and space and the design patterns of traditional education, providing smart education services including personalized learning, smart tutoring and evaluation, educational content creation support, and education big data analysis. Through constructing an open and inclusive learning space and creating flexible and diverse educational models, ChatGPT can help meet students' needs for individualized and overall development, as well as assist teachers in keeping up with students' learning performance and developmental requirements in real time. This provides an important basis for optimizing teaching content, offering personalized and accurate cultivation, and planning students' development paths.
Abstract: The proliferation of textual data in society is currently overwhelming; in particular, unstructured textual data are being constantly generated via call centre logs, emails, documents on the web, blogs, tweets, customer comments, customer reviews, etc. While the amount of textual data is increasing rapidly, users' ability to summarise, understand, and make sense of such data for making better business/living decisions remains challenging. This paper studies how to analyse textual data, based on layered software patterns, to extract insightful user intelligence from a large collection of documents and to use such information to improve user operations and performance.
Funding: Supported by the Spanish Ministry of Science and Innovation (MICINN, Spain), No. AGL2017-88801-P (to Sanz Y); the Miguel Servet grant from the Spanish "Carlos III" Health Institute (ISCIII), No. CP19/00132 (to Benitez-Paez A); the Norwegian Research Council (funding a postdoc position for Bharath Halandur Nagaraja), No. FRIMEDBIO 276010; Helse Vest's Research Funding, No. HV912243; and ERC H2020-MSCA-IF-2019, No. 895219 (to Haleem N).
Abstract: Irritable bowel syndrome (IBS) is a common clinical label for medically unexplained gastrointestinal symptoms, recently described as a disturbance of the microbiota-gut-brain axis. Despite decades of research, the pathophysiology of this highly heterogeneous disorder remains elusive. However, a dramatic change in the understanding of the underlying pathophysiological mechanisms surfaced when the importance of the gut microbiota entered the scientific picture. Are we getting any closer to understanding the etiology of IBS, or are we drowning in unspecific, conflicting data because we possess limited tools to unravel the cluster of secrets our gut microbiota is concealing? In this comprehensive review we discuss some of the major features of IBS and their interaction with the gut microbiota, clinical microbiota-altering treatments such as the low-FODMAP diet and fecal microbiota transplantation, neuroimaging and methods in microbiota analyses, and current and future challenges with big data analysis in IBS.
Funding: Supported by the Open Foundation of the State Key Laboratory of Structural Chemistry (20190027) and the Youth Program of the National Natural Science Foundation of China (51702053).
Abstract: Based on CiteSpace software, a big data bibliometric analysis was carried out on the keywords of papers on photocatalytic materials published in 2020. Tracking the hotspots and research directions can help young scholars understand the latest progress. In the Web of Science, 4147 related papers were retrieved with "photocatalytic materials" as the main topic. Cluster analysis showed that the hotspots were g-C_(3)N_(4), MXene, metal-organic frameworks (MOF), and titanium dioxide (TiO_(2)).
Funding: Taif University Researchers Supporting Project number (TURSP-2020/98).
Abstract: The World Health Organization (WHO) terms dengue a serious illness that impacts almost half of the world's population and carries no specific treatment. Early and accurate detection of its spread in affected regions can save precious lives. Despite the severity of the disease, few noticeable works can be found that involve sentiment analysis to mine accurate intuitions from social media text streams. However, the massive data explosion in recent years has led to difficulties in storing and processing large amounts of data, as reliable mechanisms to gather the data and suitable techniques to extract meaningful insights from it are required. This research study proposes a sentiment analysis polarity approach for collecting data and extracting relevant information about dengue via Apache Hadoop. The method consists of two main parts: the first part collects data from social media using Apache Flume, while the second part focuses on querying and extracting relevant information via the hybrid filtration-polarity algorithm using Apache Hive. To overcome the noisy and unstructured nature of the data, the process of extracting information is characterized by pre- and post-filtration phases. As a result, only with the integration of Flume and Hive with filtration and polarity analysis can a reliable sentiment analysis technique be offered to collect and process large-scale data from the social network. We introduce how the Apache Hadoop ecosystem components, Flume and Hive, can provide a sentiment analysis capability by storing and processing large amounts of data. An important finding of this paper is that developing efficient sentiment analysis applications for detecting diseases can be more reliable through the use of the Hadoop ecosystem components than through the use of normal machines.
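The filtration-polarity step runs on Hive in the proposed system; as a hedged, stdlib-only illustration of the underlying idea, the sketch below scores posts against small assumed sentiment lexicons (the lexicons and example posts are hypothetical, not the paper's data or algorithm):

```python
# Hypothetical dengue-related sentiment lexicons (assumptions for illustration)
POSITIVE = {"recovered", "safe", "improving"}
NEGATIVE = {"outbreak", "fever", "hospitalized", "spreading"}

def polarity(text):
    """A simple polarity step: count lexicon hits in the post and label it
    positive, negative, or neutral by the sign of the score."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = [
    "dengue outbreak spreading in my area, two neighbours hospitalized",
    "my brother recovered from dengue, feeling safe now",
]
print(polarity(posts[0]))  # negative
print(polarity(posts[1]))  # positive
```

In the actual system this kind of scoring would be expressed as Hive queries over data that Flume has already landed in HDFS, with pre- and post-filtration removing noisy, irrelevant posts before and after the polarity computation.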