Meteorological model tasks require a considerable amount of basic meteorological data to support their execution. However, if a task and the meteorological datasets it needs are located on different clouds, the cost, execution time, and energy consumption of executing the task all increase. Therefore, data layout and task scheduling should work together in the meteorological cloud so that tasks and their data are not placed in different locations. To the best of our knowledge, this is the first paper that tries to schedule meteorological tasks with the help of the meteorological dataset layout. First, we use the FP-Growth-M (frequent-pattern growth for meteorological model datasets) method to mine the relationship between meteorological models and datasets. Second, based on this relationship, we propose a heuristic algorithm for laying out the meteorological datasets and scheduling tasks. Finally, we use simulation results to compare our proposed method with other methods. The simulation results show that our method reduces the number of clouds involved, the size of files fetched from outer clouds, and the file transmission time.
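As a rough illustration of mining which datasets tend to be used together, the sketch below counts co-occurring dataset pairs by brute force and keeps those above a support threshold. It is not FP-Growth-M and does not build an FP-tree; the function name `frequent_pairs`, the dataset names, and the run records are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of transactions holding both) >= min_support."""
    counts = Counter()
    for items in transactions:
        # Sort and de-duplicate so each pair has one canonical form per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical records: each list holds the datasets read by one model run.
runs = [
    ["temp", "wind", "pressure"],
    ["temp", "wind"],
    ["temp", "humidity"],
    ["wind", "pressure"],
]
pairs = frequent_pairs(runs, min_support=0.5)
```

Pairs that survive the threshold are natural candidates for being placed on the same cloud, since tasks tend to request them together.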
In this study, an observation points-based positive-unlabeled learning algorithm (hereafter called OP-PUL) is proposed to deal with positive-unlabeled learning (PUL) tasks by judiciously assigning highly credible labels to unlabeled samples. The proposed OP-PUL algorithm has three components. First, an observation point classifier ensemble (OPCE) algorithm is constructed to divide unlabeled samples into two categories: temporary positive samples and permanent negative samples. Second, a temporary OPC (TOPC) is trained on the combination of original positive samples and permanent negative samples, and the temporary positive samples that TOPC classifies correctly are retained as permanent positive samples. Third, a permanent OPC (POPC) is trained on the combination of original positive samples, permanent positive samples, and permanent negative samples. An exhaustive experimental evaluation on 30 benchmark PU data sets is conducted to validate the feasibility, rationality, and effectiveness of the OP-PUL algorithm. Results show that (1) the OP-PUL algorithm is stable and robust as the numbers of unlabeled samples and positive samples in the unlabeled data sets increase and (2) the permanent positive samples have a probability distribution consistent with that of the original positive samples. Moreover, a statistical analysis reveals that POPC in the OP-PUL algorithm yields better PUL performance on the 30 data sets than four well-known PUL algorithms. This demonstrates that OP-PUL is a viable algorithm for PUL tasks.
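With the rapid development of information technology, the volume of data people have accumulated has grown sharply, and extracting useful knowledge from massive data has become an urgent task. Data mining is a data processing technology that emerged and developed in response to this need. Its main tasks include association analysis, classification, time-series pattern prediction, and deviation analysis, and it is a key step in knowledge discovery in databases. Data mining technology is the result of long-term research and development in database technology. At first, various kinds of business data were stored in computer databases; this then developed into querying and accessing databases, and further into real-time traversal of databases. Data mining has brought database technology to a more advanced stage: it can not only query and traverse past data but also find latent relationships among past data, thereby facilitating the transfer of information.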
Data mining has proven to be a reliable technique for analyzing road accidents and producing useful results. Most road accident data analyses use data mining techniques and focus on identifying the factors that affect the severity of an accident. However, any damage resulting from road accidents is unacceptable in terms of health, property damage, and other economic factors. Road accidents are sometimes found to occur more frequently at certain specific locations. Analyzing these locations can help identify the road accident features that make accidents occur frequently there. Association rule mining is one of the popular data mining techniques for identifying correlations among the various attributes of road accidents. In this paper, we first applied the k-means algorithm to group the accident locations into three categories: high-frequency, moderate-frequency, and low-frequency accident locations. The k-means algorithm takes the accident frequency count as the parameter for clustering the locations. We then used association rule mining to characterize these locations. The rules revealed different factors associated with road accidents at locations with varying accident frequencies. The association rules for high-frequency accident locations disclosed that intersections on highways are more dangerous for every type of accident. High-frequency accident locations mostly involved two-wheeler accidents in hilly regions. In moderate-frequency accident locations, colonies near local roads and intersections on highways were found to be dangerous for pedestrian-hit accidents. Low-frequency accident locations are scattered throughout the district, and most of the accidents at these locations were not critical. Although the data set was limited to some selected attributes, our approach extracted useful hidden information from the data that can be used to guide preventive efforts at these locations.
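The grouping of locations by accident frequency count can be sketched as one-dimensional k-means with k = 3. This is a minimal illustration using Lloyd's algorithm with a deterministic start, not the paper's implementation; the function name `kmeans_1d` and the accident counts are made up for the example.

```python
def kmeans_1d(values, k=3, iters=100):
    """Lloyd's algorithm on scalar values (assumes k >= 2); returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: pick initial centers spread evenly over the sorted data.
    centers = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Hypothetical accident counts per location.
counts = [2, 3, 1, 20, 22, 19, 60, 65, 58]
centers, labels = kmeans_1d(counts)
```

With this initialization the centers stay in ascending order, so label 0 can be read as low-frequency, 1 as moderate-frequency, and 2 as high-frequency locations.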
Understanding customer incidents and behaviour is crucial to the success of any organization. Evidence from the literature shows a pattern of predicting products for customers. These studies predicted product characteristics while leaving out customer characteristics. To address this gap, this study aims to design a data mining system and implement it on the website of an electronic commerce organization. Customer information and history (clickstreams) from the electronic commerce website were used to predict customers' behaviour. This will give organizations meaningful and usable data patterns. The Python programming language was used to design the data mining system, while PHP, HTML, and JavaScript were used for the e-commerce website. The background of e-commerce and data mining and previous work by researchers on data mining in e-commerce settings were briefly reviewed, and the relationship between their findings and this work was established. The data mining system uses a consensus clustering technique and a clustering algorithm with a graphical approach. Furthermore, the interaction between the data mining system and the customer dataset on an e-commerce website was defined. Quantitative evidence for determining the number and membership of possible customer behavioural clusters within the dataset was generated.
In this paper, the authors present three different algorithms for data clustering: the Self-Organizing Map (SOM), Neural Gas (NG), and Fuzzy C-Means (FCM) algorithms. The SOM and NG algorithms are based on competitive learning. An important property of these algorithms is that they preserve the topological structure of the data: data points that are close in the input distribution are mapped to nearby locations in the network. The FCM algorithm is based on soft clustering, which means that the clusters are not necessarily distinct but may overlap. This clustering method may be very useful in many biological problems, for instance in genetics, where a gene may belong to different clusters. The algorithms are compared in terms of their visualization of the clustering of proteomic data.
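To make the idea of overlapping clusters concrete, the sketch below computes the standard FCM membership degrees of a single scalar point for fixed cluster centers, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)). It covers only the membership step, not the full iterative algorithm or the authors' code; the function name `fcm_memberships` and the fuzzifier default m = 2 are assumptions for illustration.

```python
def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership degrees of one scalar point for fixed centers (m > 1)."""
    dists = [abs(point - c) for c in centers]
    zeros = dists.count(0.0)
    if zeros:
        # A point lying exactly on a center belongs only to that center
        # (shared equally if several centers coincide with it).
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)); the degrees sum to 1.
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# A point midway between two centers belongs equally to both.
u = fcm_memberships(2.0, centers=[0.0, 4.0])
# A point closer to the first center gets a larger, but not exclusive, share of it.
v = fcm_memberships(1.0, centers=[0.0, 4.0])
```

Unlike a hard assignment, every point keeps a nonzero degree in each cluster, which is what lets a gene be treated as a member of several clusters at once.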
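With the development of computers and network technology, obtaining material on a given problem is no longer very difficult. However, for data that are large in volume and broad in scope, manual compilation of summary reports cannot cope, and statistical methods that rely on simple aggregation and analysis according to predefined patterns are also ill-suited to analyzing such data. Consequently, intelligent software that can apply a variety of statistical methods in combination to analyze huge bodies of data has emerged. This is the market demand and the technical background behind data mining, currently the hottest topic in statistics internationally. So how should the term data mining be understood? What is it for? Which fields does data mining technology involve? How does it differ from statistical analysis? What functions does data mining have, and what are its concrete applications? To address these questions, and to help more and more readers understand and master data mining technology, the author recently interviewed Mr. Wang Jili (王吉利), Director of the Statistical Education Center of the National Bureau of Statistics and Vice President of the China Statistical Education Society, and Professor Zhang Yaoting (张尧庭) of Shanghai University of Finance and Economics, a well-known figure in statistics.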
Funding (meteorological task scheduling paper): funded in part by the Major Projects of the National Social Science Fund of China (16ZDA054), the Postgraduate Research & Practice Innovation Program of Jiangsu Province of China (No. KYCX18_0999), and the Engineering Research Center for Software Testing and Evaluation of Fujian Province of China (ST2018004).
Funding (OP-PUL paper): National Natural Science Foundation of China, Grant/Award Number: 61972261; Natural Science Foundation of Guangdong Province, Grant/Award Number: 2314050006683; Key Basic Research Foundation of Shenzhen, Grant/Award Number: JCYJ20220818100205012; Basic Research Foundations of Shenzhen, Grant/Award Number: JCYJ20210324093609026.