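Abstract: With the rapid development of information technology, the volume of accumulated data has grown sharply, and extracting useful knowledge from massive data sets has become an urgent need. Data mining is a data processing technology that emerged in response to this need. Its main tasks include association analysis, classification, prediction, time-series pattern analysis, and deviation analysis, and it is the key step in knowledge discovery in databases (KDD). Data mining is the result of long-term research and development in database technology. At first, commercial data were simply stored in computer databases; later, databases could be queried and accessed; eventually they could be traversed in real time. Data mining takes database technology to a more advanced stage: it can not only query and traverse historical data but also uncover latent relationships among those data, thereby facilitating the transfer of information.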
Abstract: Data mining has proven to be a reliable technique for analyzing road accidents and producing useful results. Most road accident analyses that use data mining focus on identifying the factors that affect accident severity. However, any damage resulting from road accidents is unacceptable in terms of health, property, and other economic factors. Road accidents are sometimes found to occur more frequently at certain locations, and analyzing these locations can help identify the features that make accidents recur there. Association rule mining is a popular data mining technique that identifies correlations among the attributes of road accidents. In this paper, we first apply the k-means algorithm to group accident locations into three categories: high-frequency, moderate-frequency, and low-frequency accident locations. The k-means algorithm uses the accident frequency count as the parameter for clustering the locations. We then use association rule mining to characterize these locations. The rules reveal different factors associated with road accidents at locations with different accident frequencies. The association rules for high-frequency accident locations show that intersections on highways are more dangerous for every type of accident. High-frequency accident locations mostly involve two-wheeler accidents in hilly regions. In moderate-frequency accident locations, colonies near local roads and intersections on highways are found to be dangerous for pedestrian-hit accidents. Low-frequency accident locations are scattered throughout the district, and most of the accidents at these locations were not critical. Although the data set was limited to selected attributes, our approach extracted useful hidden information from the data that can be used to take preventive measures at these locations.
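The two-stage idea in this abstract (cluster locations by accident count, then mine rules per cluster) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the location counts, the attribute names, the thresholds, and the restriction to single-item rules are hypothetical and are not taken from the paper, whose data and settings are not given here. The k-means routine is plain Lloyd's algorithm on scalar counts and assumes `k >= 2`.

```python
from collections import Counter
from itertools import combinations


def kmeans_1d(values, k=3, iters=100):
    """Lloyd's k-means on scalar values (assumes k >= 2).

    Returns one cluster index per value; clusters are numbered by
    ascending centroid, so 0 is the lowest-frequency group.
    """
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly over the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def pair_rules(transactions, min_support=0.5, min_conf=0.8):
    """Single-item rules lhs -> rhs as (lhs, rhs, support, confidence)."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in set(t))
    pair_count = Counter(
        p for t in transactions for p in combinations(sorted(set(t)), 2)
    )
    found = []
    for (a, b), cnt in pair_count.items():
        if cnt / n < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            conf = cnt / item_count[lhs]
            if conf >= min_conf:
                found.append((lhs, rhs, cnt / n, conf))
    return sorted(found)


# Hypothetical accident counts per location (not from the paper).
counts = [42, 39, 18, 15, 3, 2, 4]
labels = kmeans_1d(counts, k=3)

# Hypothetical accident records for the high-frequency group.
high_freq_records = [
    {"road=highway", "site=intersection", "victim=two_wheeler"},
    {"road=highway", "site=intersection", "victim=pedestrian"},
    {"road=highway", "site=intersection", "victim=two_wheeler"},
    {"road=local", "site=colony", "victim=pedestrian"},
]
high_freq_rules = pair_rules(high_freq_records)

print(labels)
for rule in high_freq_rules:
    print(rule)
```

In practice a library implementation of k-means and of Apriori or FP-Growth would replace both helpers; the sketch only shows how the cluster labels select which records feed the rule miner.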
Funding: Funded in part by the Major Projects of the National Social Science Fund of China (16ZDA054), the Postgraduate Research & Practice Innovation Program of Jiangsu Province of China (No. KYCX18_0999), and the Engineering Research Center for Software Testing and Evaluation of Fujian Province of China (ST2018004).
Abstract: Meteorological model tasks require a considerable amount of basic meteorological data to support their execution. However, if a task and its meteorological datasets are located on different clouds, the cost, execution time, and energy consumption of executing the task all increase. Data layout and task scheduling can therefore be coordinated in the meteorological cloud so that tasks and their datasets are not spread across different locations. To the best of our knowledge, this is the first paper that tries to schedule meteorological tasks with the help of the meteorological dataset layout. First, we use the FP-Growth-M (frequent-pattern growth for meteorological model datasets) method to mine the relationship between meteorological models and datasets. Second, based on this relationship, we propose a heuristic algorithm for laying out the meteorological datasets and scheduling tasks. Finally, we use simulation results to compare our proposed method with other methods. The simulation results show that our method reduces the number of clouds involved, the size of the files fetched from outer clouds, and the file transmission time.
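Two of the quantities this abstract reports, the number of clouds involved and the size of the files fetched from outer clouds, can be made concrete with a small sketch. It is not the paper's heuristic or FP-Growth-M: the greedy placement rule, the function names, the dataset names, and the sizes are all hypothetical, chosen only to show why co-locating a task with its data lowers both numbers.

```python
def transfer_cost(task_cloud, needed, layout, size):
    """Return (clouds involved, data volume pulled from other clouds)."""
    clouds = {task_cloud} | {layout[d] for d in needed}
    remote = sum(size[d] for d in needed if layout[d] != task_cloud)
    return len(clouds), remote


def schedule(needed, layout, size):
    """Greedy stand-in: pick the cloud already holding most of the input data."""
    held = {}
    for d in needed:
        held[layout[d]] = held.get(layout[d], 0) + size[d]
    # sorted() makes ties resolve to the alphabetically first cloud.
    return max(sorted(held), key=held.get)


# Hypothetical layout (dataset -> cloud) and sizes in GB.
layout = {"terrain": "cloud_a", "sst": "cloud_a", "radar": "cloud_b"}
size = {"terrain": 40, "sst": 25, "radar": 10}
needed = ["terrain", "sst", "radar"]  # inputs of one hypothetical model task

best = schedule(needed, layout, size)
print(best, transfer_cost(best, needed, layout, size))
print("cloud_c", transfer_cost("cloud_c", needed, layout, size))
```

Running the task on a cloud that holds none of its inputs touches three clouds and moves all 75 GB, while the greedy choice touches two clouds and moves 10 GB.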
Funding: National Natural Science Foundation of China, Grant/Award Number: 61972261; Natural Science Foundation of Guangdong Province, Grant/Award Number: 2314050006683; Key Basic Research Foundation of Shenzhen, Grant/Award Number: JCYJ20220818100205012; Basic Research Foundations of Shenzhen, Grant/Award Number: JCYJ20210324093609026.
Abstract: In this study, an observation points-based positive-unlabeled learning algorithm (hereafter called OP-PUL) is proposed to deal with positive-unlabeled learning (PUL) tasks by judiciously assigning highly credible labels to unlabeled samples. The proposed OP-PUL algorithm has three components. First, an observation point classifier ensemble (OPCE) algorithm is constructed to divide unlabeled samples into two categories: temporary positive samples and permanent negative samples. Second, a temporary OPC (TOPC) is trained on the combination of original positive samples and permanent negative samples, and the temporary positive samples that TOPC classifies correctly are retained as permanent positive samples. Third, a permanent OPC (POPC) is trained on the combination of original positive samples, permanent positive samples, and permanent negative samples. An exhaustive experimental evaluation is conducted on 30 benchmark PU data sets to validate the feasibility, rationality, and effectiveness of the OP-PUL algorithm. The results show that (1) the OP-PUL algorithm is stable and robust as the numbers of unlabeled samples and positive samples in the unlabeled data sets increase, and (2) the permanent positive samples have a probability distribution consistent with that of the original positive samples. Moreover, a statistical analysis reveals that POPC in the OP-PUL algorithm yields better PUL performance on the 30 data sets than four well-known PUL algorithms. This demonstrates that OP-PUL is a viable algorithm for PUL tasks.
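The three-stage flow described in this abstract can be sketched as a pipeline skeleton. The sketch below does not implement observation point classifiers, whose details are not given here: a nearest-centroid rule stands in for OPC, TOPC, and POPC, a distance threshold stands in for the OPCE split, and the `slack` parameter and all sample points are hypothetical. It assumes the split leaves at least one permanent negative sample.

```python
from math import dist


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit_nearest_centroid(pos, neg):
    """Return predict(x) -> bool, where True means 'positive'."""
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: dist(x, cp) <= dist(x, cn)


def pu_pipeline(positives, unlabeled, slack=12.0):
    # Stage 1 (stand-in for the OPCE split): unlabeled points far from the
    # labeled positives become permanent negatives, the rest temporary positives.
    cp = centroid(positives)
    limit = slack * max(dist(p, cp) for p in positives)
    temp_pos = [u for u in unlabeled if dist(u, cp) <= limit]
    perm_neg = [u for u in unlabeled if dist(u, cp) > limit]
    # Stage 2 (TOPC role): train on positives vs. permanent negatives and keep
    # only the temporary positives that this classifier labels positive.
    topc = fit_nearest_centroid(positives, perm_neg)
    perm_pos = [u for u in temp_pos if topc(u)]
    # Stage 3 (POPC role): final classifier on the enlarged positive set.
    # Temporary positives rejected in stage 2 are simply left out here.
    popc = fit_nearest_centroid(positives + perm_pos, perm_neg)
    return popc, perm_pos, perm_neg


# Hypothetical 2-D samples (not from the paper's benchmark data sets).
positives = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1)]
unlabeled = [(1.1, 1.0), (0.9, 0.9), (3.0, 3.2), (5.0, 5.0), (5.2, 4.8), (4.9, 5.3)]

popc, perm_pos, perm_neg = pu_pipeline(positives, unlabeled)
print(perm_pos)
print(perm_neg)
print(popc((1.0, 1.2)), popc((4.5, 4.5)))
```

The borderline point `(3.0, 3.2)` passes the loose stage-1 split but is rejected in stage 2, which is the filtering role the abstract assigns to TOPC.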