The development of technologies such as big data and blockchain has brought convenience to life, but at the same time, privacy and security issues are becoming more and more prominent. The K-anonymity algorithm is an effective privacy-preserving algorithm with low computational complexity that can safeguard users' privacy by anonymizing big data. However, the algorithm currently focuses only on improving user privacy while ignoring data availability. In addition, ignoring the impact of quasi-identifier attributes on sensitive attributes reduces the usability of the processed data for statistical analysis. Based on this, we propose a new K-anonymity algorithm to solve the privacy security problem in the context of big data while guaranteeing improved data usability. Specifically, we construct a new information loss function based on information quantity theory. Considering that different quasi-identifier attributes have different impacts on sensitive attributes, we set a weight for each quasi-identifier attribute when designing the information loss function. In addition, to reduce information loss, we improve K-anonymity in two ways. First, we make the information loss smaller than in the original table while guaranteeing privacy, based on common artificial intelligence algorithms, i.e., the greedy algorithm and the 2-means clustering algorithm. Second, we improve the 2-means clustering algorithm by designing a mean-center method to select the initial centroids. We then design the K-anonymity algorithm of this scheme based on the constructed information loss function, the improved 2-means clustering algorithm, and the greedy algorithm, which reduces information loss. Finally, we experimentally demonstrate the effectiveness of the algorithm in improving the effect of 2-means clustering and reducing information loss.
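A minimal sketch of the kind of per-attribute weighted information loss the abstract describes, using the common normalized-range formulation for numeric quasi-identifiers; the toy table, the weights, and the 2-anonymous partition are illustrative assumptions, not the paper's exact loss function.

```python
import numpy as np

def weighted_info_loss(group, global_ranges, weights):
    """Normalized-range information loss for one anonymized group.

    group: (n, d) array of numeric quasi-identifier values.
    global_ranges: length-d array, max - min of each attribute over the table.
    weights: length-d array reflecting each quasi-identifier's assumed
             influence on the sensitive attribute (the paper derives its
             own weights from information quantity).
    """
    span = group.max(axis=0) - group.min(axis=0)   # width after generalization
    per_attr = span / global_ranges                # in [0, 1] per attribute
    return float(np.dot(weights, per_attr)) / weights.sum()

# toy table: age, zip-prefix, income as quasi-identifiers
table = np.array([[34, 210, 52], [36, 214, 58], [51, 390, 91], [49, 385, 88.0]])
ranges = table.max(axis=0) - table.min(axis=0)
w = np.array([0.5, 0.2, 0.3])                      # assumed attribute weights
g1, g2 = table[:2], table[2:]                      # a 2-anonymous partition
print(weighted_info_loss(g1, ranges, w), weighted_info_loss(g2, ranges, w))
```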
Classical survival analysis assumes all subjects will experience the event of interest, but in some cases a portion of the population may never encounter the event. These survival methods further assume independent survival times, which is not valid for honey bees, which live in nests. This study introduces a semiparametric marginal proportional hazards mixture cure (PHMC) model with an exchangeable correlation structure, using generalized estimating equations for survival data analysis. The model was tested on clustered right-censored bee survival data with a cured fraction, where two bee species were subjected to different entomopathogens to test the effect of the entomopathogens on the survival of the bee species. The Expectation-Solution algorithm is used to estimate the parameters. The study notes a weak positive association between cure statuses (ρ1 = 0.0007) and between survival times for uncured bees (ρ2 = 0.0890), emphasizing the importance of modeling these correlations. The odds of being uncured are higher for A. mellifera than for M. ferruginea; that is, A. mellifera is more susceptible to the entomopathogens icipe 7, icipe 20, and icipe 69. The Cox-Snell residuals show that the proposed semiparametric PH model generally fits the data better than a model that assumes an independent correlation structure. Thus, the semiparametric marginal proportional hazards mixture cure model is a parsimonious model for correlated bee survival data.
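The core of any mixture cure model is the population survival decomposition S_pop(t|x) = (1 − p(x)) + p(x)·S_u(t|x), where p(x) is the probability of being uncured. The toy sketch below uses a logistic incidence part and, purely for illustration, an exponential latency; the paper's latency is a semiparametric PH model fitted with GEE, which this does not reproduce.

```python
import numpy as np

def pop_survival(t, x, beta_inc, lam):
    """Mixture-cure population survival:
    S_pop(t|x) = (1 - p(x)) + p(x) * S_u(t|x),
    with p(x) a logistic probability of being uncured and, purely for
    illustration, an exponential latency S_u(t) = exp(-lam * t)."""
    uncured = 1.0 / (1.0 + np.exp(-(x @ beta_inc)))   # P(uncured | x)
    s_uncured = np.exp(-lam * t)
    return (1.0 - uncured) + uncured * s_uncured

x = np.array([1.0, 1.0])          # intercept + species indicator (assumed)
beta = np.array([0.3, 0.8])       # assumed incidence coefficients
for t in (0.0, 5.0, 20.0):
    print(t, pop_survival(t, x, beta, lam=0.15))
```

Note how the survival curve plateaus at the cure fraction 1 − p(x) instead of decaying to zero, which is what distinguishes cure models from classical survival analysis.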
A convective and stratiform cloud classification method for weather radar is proposed based on the density-based spatial clustering of applications with noise (DBSCAN) algorithm. To identify convective and stratiform clouds in different developmental phases, two-dimensional (2D) and three-dimensional (3D) models are proposed by applying reflectivity factors at the 0.5° elevation angle and at the 0.5°, 1.5°, and 2.4° elevation angles, respectively. According to the thresholds of the algorithm, which include echo intensity, the echo top height of 35 dBZ (ET), the density threshold, and the ε-neighborhood, cloud clusters can be classified into four types: deep convective cloud (DCC), shallow convective cloud (SCC), hybrid convective-stratiform cloud (HCS), and stratiform cloud (SFC). Each cloud cluster is further divided into a core area and a boundary area, which provides richer cloud structure information. The algorithm is verified using volume scan data observed with new-generation S-band weather radars in Nanjing, Xuzhou, and Qingdao. The results show that, with the improved DBSCAN algorithm, cloud clusters can be intuitively identified as core and boundary points whose areas change continuously during convective evolution. Therefore, the occurrence and disappearance of convective weather can be estimated in advance by observing changes in the classification. Because the density thresholds differ and multiple elevations are utilized in the 3D model, the identified echo types and areas differ between the 2D and 3D models. The 3D model identifies larger convective and stratiform clouds than the 2D model. However, developing convective clouds of small area at lower heights cannot be identified with the 3D model because they are covered by thick stratiform clouds. In addition, the 3D model can avoid the influence of the melting layer and better identify convective clouds in the developmental stage.
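A simplified 2D analogue of the approach: treat echo cells above a reflectivity threshold as points, cluster them with DBSCAN (ε-neighborhood plus density threshold), and label each cluster convective or stratiform with a 35-dBZ rule. The grid, thresholds, and synthetic field are assumptions for illustration, not the paper's operational settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# synthetic 0.5-degree reflectivity field (dBZ) on a 100 x 100 grid
field = rng.normal(15, 8, (100, 100))
field[40:55, 40:55] += 30                      # an embedded convective core

pts = np.argwhere(field >= 25)                 # echo cells above 25 dBZ
# eps is the ε-neighborhood radius, min_samples the density threshold
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(pts)

for k in sorted(set(labels) - {-1}):
    cells = pts[labels == k]
    peak = field[cells[:, 0], cells[:, 1]].max()
    kind = "convective" if peak >= 35 else "stratiform"   # simplified 35-dBZ rule
    print(f"cluster {k}: {len(cells)} cells, peak {peak:.0f} dBZ -> {kind}")
```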
Due to the development of E-Commerce, the collaborative filtering (CF) recommendation algorithm has become popular in recent years. It has some limitations, such as cold start, data sparseness, and low operating efficiency. In this paper, a CF recommendation algorithm is proposed based on the latent factor model and improved spectral clustering (CFRALFMISC) to improve forecasting precision. The latent factor model is first adopted to predict the missing scores. Then, a cluster validity index is used to determine the number of clusters. Finally, spectral clustering is improved by using the FCM algorithm to replace K-means in the spectral clustering step. The simulation results show that CFRALFMISC can effectively improve the recommendation precision compared with other algorithms.
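A hedged sketch of the final step, replacing K-means with fuzzy c-means (FCM) on a spectral embedding; the FCM implementation and the make_moons data are illustrative, and the paper's validity-index step for choosing the cluster number is omitted here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.manifold import SpectralEmbedding

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means on the spectral embedding (replaces K-means)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))          # fuzzy memberships
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]    # weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))                  # update memberships
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
emb = SpectralEmbedding(n_components=2, affinity="nearest_neighbors",
                        random_state=0).fit_transform(X)
U, _ = fcm(emb, c=2)
print(U.argmax(axis=1)[:20])    # hard labels derived from fuzzy memberships
```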
Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, and image processing. We combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and two sampling-based DBSCAN (SDBSCAN) algorithms are developed. One algorithm introduces the sampling technique inside DBSCAN, and the other uses a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
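A sketch of the "sampling outside DBSCAN" idea: cluster a random sample, then label the full dataset by its nearest sampled core point within eps. The sample size and parameters are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=20000, centers=5, cluster_std=0.6, random_state=0)
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=2000, replace=False)     # cluster a 10% sample
sample = X[idx]

db = DBSCAN(eps=0.5, min_samples=5).fit(sample)
cores = sample[db.core_sample_indices_]
core_labels = db.labels_[db.core_sample_indices_]

# label every point by its nearest sampled core point (noise if too far)
nn = NearestNeighbors(n_neighbors=1).fit(cores)
dist, nbr = nn.kneighbors(X)
labels = np.where(dist[:, 0] <= 0.5, core_labels[nbr[:, 0]], -1)
print(np.bincount(labels[labels >= 0]))
```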
As we enter the year 2011, the 2009 H1N1 pandemic influenza virus is in the news again. At least 20 people have died of this virus in China since the beginning of 2011, and it is now the predominant flu strain in the country. Although this novel virus was quite stable during its run in the flu season of 2009-2010, a genetic variant of this virus was found in Singapore in early 2010, and then in Australia and New Zealand during their 2010 winter influenza season. Several critical mutations in the HA protein of this variant were uncovered in the strains collected from January 2010 to April 2010. Moreover, a structural homology model of HA from the A/Brisbane/10/2010 (H1N1) strain was made based on the structure of A/California/04/2009 (H1N1). The purpose of this study was to investigate mutations in the HA protein of 2009 H1N1 from sequence data collected worldwide from May 2010 to February 2011. A fundamental problem in bioinformatics and biology is to find the gene sequences similar to a given gene sequence of interest. Here we pose the inverse problem, i.e., finding the exemplars from a group of related gene sequences. With the clustering algorithm affinity propagation, six exemplars of the HA sequences were identified to represent six clusters. One of the clusters contained strain A/Brisbane/12/2010 (H1N1), which differed from A/Brisbane/10/2010 in the HA sequence only at position 449. Based on the sequence identity of the six exemplars, nine mutations in HA were located that could be used to distinguish these six clusters. Finally, we discovered a change in the correlation patterns of the HA and NA of 2009 H1N1 as a result of the HA receptor binding specificity switch, revealing the balanced interplay between these two surface proteins of the virus.
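The exemplar-finding step can be illustrated with scikit-learn's affinity propagation on a precomputed similarity matrix; here negative Hamming distance between toy HA-like fragments stands in for the paper's sequence similarities.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

seqs = ["MKAILVVLLY", "MKAILVVLLF", "MKTILVVLLY",
        "MQAIFVVMLY", "MQAIFVVMLF", "MQTIFVVMLY"]   # toy HA-like fragments

n = len(seqs)
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # similarity = negative Hamming distance between sequences
        S[i, j] = -sum(a != b for a, b in zip(seqs[i], seqs[j]))

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
for k in ap.cluster_centers_indices_:
    print("exemplar:", seqs[k])    # one representative sequence per cluster
print("labels:", ap.labels_)
```

Unlike k-means, affinity propagation picks actual data points as cluster centers, which is exactly what makes it suitable for finding exemplar sequences.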
Energy efficiency is the prime concern in Wireless Sensor Networks (WSNs), since unchecked energy consumption limits energy stability and network lifetime. Clustering is a significant approach for minimizing unnecessary transmission energy consumption while sustaining network lifetime. The clustering process is a Non-deterministic Polynomial (NP)-hard optimization problem, which is most likely to be solved well by metaheuristic algorithms. The adoption of a hybrid metaheuristic algorithm concentrates on identifying optimal or near-optimal solutions, which aids energy stability during Cluster Head (CH) selection. In this paper, a Hybrid Seagull and Whale Optimization Algorithm-based Dynamic Clustering Protocol (HSWOA-DCP) is proposed, combining the exploitation benefits of WOA and the exploration merits of SEOA for optimal CH selection, maintaining energy stability with prolonged network lifetime. HSWOA-DCP adopts a modified version of the SEagull Optimization Algorithm (SEOA) to handle the premature convergence and computational accuracy problems that commonly arise during CH selection. The inclusion of SEOA in WOA improves the global searching capability during CH selection and prevents worst-fitness nodes from being selected as CH, since the spiral attacking behavior of SEOA is similar to the bubble-net characteristics of WOA. The CH selection integrates the spiral attacking principles of SEOA and the contraction surrounding mechanism of WOA to improve computational accuracy and prevent frequent re-elections. It also incorporates a Lévy flight strategy into SEOA to avoid premature convergence and attain a better trade-off between exploration and exploitation. The simulation results of the proposed HSWOA-DCP confirmed better network survivability rate, network residual energy, and overall network throughput compared with competitive CH selection schemes under different numbers of data transmission rounds. A statistical ANOVA test also confirmed the energy stability of the proposed HSWOA-DCP scheme.
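For orientation, a stripped-down sketch of the WOA update the protocol builds on: each candidate either encircles the best solution or follows a logarithmic spiral around it (the bubble-net behaviour the paper couples with SEOA's spiral attack). This omits WOA's exploration-around-a-random-whale branch and all of the SEOA/Lévy hybridization, and the CH fitness is a toy surrogate.

```python
import numpy as np

def woa_step(pop, best, t, t_max, rng):
    """One simplified Whale Optimization Algorithm update: shrinking
    encircling (exploitation) or a logarithmic spiral around the best
    solution found so far."""
    a = 2.0 * (1 - t / t_max)                     # decreases from 2 to 0
    new = np.empty_like(pop)
    for i, x in enumerate(pop):
        if rng.random() < 0.5:                    # encircling prey
            A = a * (2 * rng.random(x.size) - 1)
            C = 2 * rng.random(x.size)
            new[i] = best - A * np.abs(C * best - x)
        else:                                     # spiral (bubble-net) update
            l = rng.uniform(-1, 1)
            D = np.abs(best - x)
            new[i] = D * np.exp(l) * np.cos(2 * np.pi * l) + best
    return new

rng = np.random.default_rng(0)
fit = lambda x: np.sum(x ** 2)    # toy stand-in for a CH-selection fitness
pop = rng.uniform(-5, 5, (20, 4))
for t in range(50):
    best = pop[np.argmin([fit(x) for x in pop])]
    pop = woa_step(pop, best, t, 50, rng)
print(fit(min(pop, key=fit)))     # converges toward the optimum at 0
```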
In Wireless Sensor Networks (WSNs), the clustering process is widely utilized for increasing lifespan with sustained energy stability during data transmission. Several clustering protocols have been devised for extending network lifetime, but most of them fail to handle the problems of fixed clustering, static rounds, and inadequate Cluster Head (CH) selection criteria, which consume more energy. In this paper, a Stochastic Ranking Improved Teaching-Learning and Adaptive Grasshopper Optimization Algorithm (SRITL-AGOA)-based clustering scheme is proposed for energy stabilization and extending network lifespan. SRITL-AGOA selects the CH depending on the weightage of factors such as node mobility degree, neighbour density, distance to sink, single-hop or multi-hop communication, and Residual Energy (RE), which directly influence the energy consumption of sensor nodes. Specifically, the Grasshopper Optimization Algorithm (GOA) is improved through a tangent-based nonlinear strategy to enhance its global optimization ability. In addition, stochastic ranking (SR) and violation constraint handling (VCH) strategies are embedded into the Teaching-Learning-based Optimization Algorithm (TLOA) to improve its exploitation tendencies. The SR- and VCH-improved TLOA is then embedded into the exploitation phase of AGOA to select better CHs by maintaining a better balance between exploration and exploitation. Simulation results confirmed that the proposed SRITL-AGOA improved throughput by 21.86%, network stability by 18.94%, and load balancing by 16.14%, with energy depletion minimized by 19.21%, compared to competitive CH selection approaches.
To address the problem that it is hard to determine the clustering number and the abnormal points using a clustering validity function, an effective clustering partition model based on the genetic algorithm is built in this paper. A solution is formed by combining the clustering partition with the encoded samples, and the fitness function is defined by the distances among and within clusters. The clustering number and the samples in each cluster are determined, and the abnormal points are distinguished, by implementing a triple random crossover operator and mutation. Based on known sample data, the results of the novel method and the clustering validity function are compared. Numerical experiments are given, and the results show that the novel method is more effective.
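A sketch of the distance-based fitness such a genetic algorithm might evaluate per chromosome (an encoded label vector): mean between-cluster centroid distance over mean within-cluster spread. The exact fitness in the paper may differ; this is a common stand-in.

```python
import numpy as np

def partition_fitness(X, labels):
    """Fitness of an encoded clustering: ratio of mean between-cluster
    centroid distance to mean within-cluster spread (higher is better)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    within = np.mean([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                      for i, k in enumerate(ks)])
    between = np.mean([np.linalg.norm(cents[i] - cents[j])
                       for i in range(len(ks)) for j in range(i + 1, len(ks))])
    return between / (within + 1e-12)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
good = np.repeat([0, 1], 30)               # the true two-cluster chromosome
bad = rng.integers(0, 2, 60)               # a random chromosome
print(partition_fitness(X, good), partition_fitness(X, bad))
```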
To make the quantitative results of nuclear magnetic resonance (NMR) transverse relaxation (T2) spectra reflect the type and pore structure of a reservoir more directly, an unsupervised clustering method was developed to obtain quantitative pore structure information from NMR T2 spectra based on the Gaussian mixture model (GMM). First, we conducted principal component analysis on the T2 spectra to reduce the data dimensionality and the dependence among the original variables. Second, the dimension-reduced data was fitted using the GMM probability density function, and the model parameters and optimal clustering number were obtained according to the expectation-maximization algorithm and the change in the Akaike information criterion. Finally, the T2 spectrum features and pore structure types of the different clustering groups were analyzed and compared with the T2 geometric mean and T2 arithmetic mean. The effectiveness of the algorithm has been verified by numerical simulation and field NMR logging data. The research shows that the clustering results based on the GMM method correlate well with the shape and distribution of the T2 spectrum, pore structure, and petroleum productivity, providing a new means for quantitative identification of pore structure, reservoir grading, and oil and gas productivity evaluation.
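The pipeline maps directly onto standard tooling; a minimal sketch with synthetic stand-ins for the T2 spectra, using PCA for dimension reduction and the AIC to pick the number of GMM components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# stand-in for T2 spectra: 200 samples x 64 relaxation-time bins
spectra = np.vstack([rng.normal(m, 1, (100, 64)) for m in (0.0, 3.0)])

Z = PCA(n_components=3).fit_transform(spectra)      # dimension reduction

# fit GMMs (EM under the hood) and pick the cluster number by the AIC
fits = {k: GaussianMixture(n_components=k, random_state=0).fit(Z)
        for k in range(1, 7)}
best_k = min(fits, key=lambda k: fits[k].aic(Z))
labels = fits[best_k].predict(Z)
print("optimal clusters:", best_k, "cluster sizes:", np.bincount(labels))
```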
Many existing product family design methods assume a given platform. However, it is not an intuitive task to select the platform and unique variables within a product family. Meanwhile, most approaches are single-platform methods, in which design variables are either shared across all product variants or not at all. In multiple-platform design, platform variables can take specific values with regard to a subset of product variants within the product family, offering opportunities for superior overall designs. An information-theoretical approach incorporating fuzzy clustering and Shannon's entropy is proposed for platform variable selection in multiple-platform product families. A 2-level chromosome genetic algorithm (2LCGA) is proposed and developed for optimizing the corresponding product family in a single stage, simultaneously determining the optimal settings for the product platform and unique variables. The single-stage approach can yield improvements in the overall performance of the product family compared with two-stage approaches, in which the first stage determines the best settings for the platform and values of the unique variables are found for each product in the second stage. The design of a family of universal motors was used to verify the proposed method.
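One way to read the information-theoretic ingredient: compute the Shannon entropy of each design variable's values across variants, where low entropy flags platform candidates. The variable names and values below are hypothetical, and the paper's fuzzy-clustering step is not shown.

```python
import numpy as np

def variable_entropy(values):
    """Shannon entropy of one design variable across product variants.
    Low entropy means the values barely vary, so the variable is a good
    candidate for the shared platform; high entropy suggests a unique
    (scaling) variable."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# rows: motor variants; keys: hypothetical design variables
variants = {
    "stack_length": [20, 25, 30, 35, 40, 45],   # varies per product
    "wire_gauge":   [18, 18, 18, 18, 18, 18],   # identical -> platform
    "num_turns":    [54, 54, 60, 60, 54, 60],   # shared by subsets ->
}                                               #   multi-platform candidate
for name, vals in variants.items():
    print(f"{name}: H = {variable_entropy(vals):.2f} bits")
```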
The study aims to assess how efficiently Educational Data Mining (EDM) integrates with Artificial Intelligence (AI) to predict students' performance. The study used a survey questionnaire and collected data from 300 undergraduate students of Al Neelain University. In the first step, initial population placements were created using Particle Swarm Optimization (PSO). Then, using adaptive feature space search, Educational Grey Wolf Optimization (EGWO) was employed to choose the optimal attribute combination. The second stage uses an SVM classifier to forecast classification accuracy. Different classifiers were utilized to evaluate the performance of students. According to the results, AI could forecast the final grades of students with an accuracy rate of 97% on the test dataset. Furthermore, the present study showed that successful students could be selected by the Decision Tree model with an efficiency rate of 87.50% and could be categorized as having equal information gain ratio after the semester, while the random forest provided an accuracy of 28%. These findings indicate a higher accuracy rate when these models are implemented on the dataset, compared with a linear regression model (12% accuracy). The study concludes that the methodology used here can help students and teachers upgrade academic performance, reduce chances of failure, and take appropriate steps at the right time to raise the standard of education. The study also motivates academics to assess and explore EDM at other universities.
With the rapid development of technology, processing the explosive growth of meteorological data on traditional standalone computers has become increasingly time-consuming and cannot meet the demands of scientific research and business. Therefore, this paper implements the parallel Clustering Large Applications based upon RANdomized Search (CLARANS) clustering algorithm on the Spark cloud computing platform to cluster China's climate regions using meteorological data from 1988 to 2018, addressing the challenge of applying clustering algorithms to large datasets. In this paper, the morphological similarity distance is adopted as the similarity measure instead of the Euclidean distance, which improves clustering accuracy. Furthermore, the issue of local optima caused by an improper selection of initial clustering centers is addressed by utilizing the max-distance criterion. Compared to the k-means clustering algorithm already implemented on the Spark platform, the proposed algorithm is robust, reduces the interference of outliers on clustering results, and has higher parallel performance than the frequently used serial algorithms, thus improving the efficiency of big data analysis. The experiment compares the clustered centroid data with the annual average meteorological data of representative cities in the five typical meteorological regions of China, and the results show that the clustering results are in good agreement with the meteorological data obtained from the National Meteorological Science Data Center. This algorithm has a positive effect on the clustering analysis of massive meteorological data and deserves attention in scientific research activities.
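A sketch of the max-distance criterion used to avoid poor initial centers: seed with the two farthest points, then greedily add the point farthest from all chosen centers. This is the serial kernel only; the Spark parallelization and the morphological similarity distance are not shown (plain Euclidean distance is used here).

```python
import numpy as np

def max_distance_centers(X, k):
    """Max-distance criterion: start from the two farthest points, then
    repeatedly add the point whose minimum distance to the chosen centers
    is largest, spreading the initial medoids apart."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)
    centers = [int(i), int(j)]
    while len(centers) < k:
        nearest = d[:, centers].min(axis=1)   # distance to closest chosen center
        nearest[centers] = -1                 # never re-pick a chosen center
        centers.append(int(nearest.argmax()))
    return centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in ((0, 0), (4, 0), (2, 4))])
print(max_distance_centers(X, k=3))           # indices of well-separated seeds
```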
For a city, analyzing its advantages, disadvantages, and level of economic development within a country is important, especially for the rapidly developing cities in China. The existing literature on Chinese cities has not considered detailed indicators of economy and industry. In this paper, based on multiple indicators of economy and industry, the urban hierarchical structure of 285 cities at or above the prefecture level in China is investigated. Indicators of economy, industry, infrastructure, medical care, population, education, culture, and employment are selected to establish a new indicator system for analyzing urban hierarchical structure. The factor analysis method is used to investigate the relationships among the selected indicator variables and to obtain the score of each common factor as well as comprehensive scores and rankings for the 285 cities. According to the comprehensive scores, the 285 cities are clustered into 15 levels using the K-means clustering algorithm. The hierarchical structure system of the cities at or above the prefecture level in China is thereby obtained, and corresponding policy implications are proposed. The results and implications can be applied to urban planning and development in China and also offer a reference for other developing countries. The methodologies used in this paper can likewise be applied to study urban hierarchical structure in other countries.
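A compact sketch of the scoring-and-clustering pipeline with random stand-in data: factor analysis produces common-factor scores, a variance-weighted composite ranks the cities, and K-means groups the composite scores into 15 levels. The variance weighting is an assumption; the paper derives its weights from the factor analysis itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# stand-in for 285 cities x 8 indicator groups (economy, industry, ...)
X = StandardScaler().fit_transform(rng.normal(size=(285, 8)))

scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(X)

# composite city score: factors weighted by their score variance (assumed)
w = scores.var(axis=0)
composite = scores @ (w / w.sum())

levels = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(
    composite.reshape(-1, 1))
print("cities per level:", np.bincount(levels))
```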
The Gobi spans a large area of China, surpassing the combined expanse of mobile dunes and semi-fixed dunes. Its presence significantly influences the movement of sand and dust. However, the complex origins and diverse materials constituting the Gobi result in notable differences in saltation processes across various Gobi surfaces, and it is challenging to describe these processes with a uniform morphology. Therefore, it becomes imperative to articulate surface characteristics through parameters such as the three-dimensional (3D) size and shape of gravel. Collecting morphology information for Gobi gravels is essential for studying its genesis and sand saltation. To enhance the efficiency and information yield of gravel parameter measurements, this study conducted field experiments in March 2023 in the Gobi region across Dunhuang City, Guazhou County, and Yumen City (administered by Jiuquan City), Gansu Province, China. A research framework and methodology for measuring 3D parameters of gravel using point clouds were developed, alongside improved calculation formulas for 3D parameters including gravel grain size, volume, flatness, roundness, sphericity, and equivalent grain size. Leveraging multi-view geometry for 3D reconstruction allowed an optimal data acquisition scheme to be established, characterized by high reconstruction efficiency and clear point cloud quality. Additionally, the proposed methodology incorporates point cloud clustering, segmentation, and filtering techniques to isolate individual gravel point clouds. Advanced point cloud algorithms, including the Oriented Bounding Box (OBB), the point cloud slicing method, and point cloud triangulation, are then deployed to calculate the 3D parameters of individual gravels. These systematic processes allow precise and detailed characterization of individual gravels. For gravel grain size and volume, the correlation coefficients between point cloud and manual measurements all exceeded 0.9000, confirming the feasibility of the proposed methodology for measuring 3D parameters of individual gravels. The proposed workflow yields accurate calculations of relevant parameters for Gobi gravels, providing essential data support for subsequent studies of Gobi environments.
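A minimal sketch of the OBB-based grain-size step: the principal axes of a gravel's point cloud give the box orientation, and the extents along them give length ≥ width ≥ height. The synthetic ellipsoid and the Cailleux-style flatness index (a+b)/(2c) are illustrative assumptions, not the paper's improved formulas.

```python
import numpy as np

def obb_dimensions(points):
    """Oriented-bounding-box edge lengths of one gravel point cloud:
    an SVD of the centered points gives the box axes, and the extent
    along each axis gives the three grain-size dimensions."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    extents = centered @ vt.T                     # rotate into the box frame
    dims = extents.max(axis=0) - extents.min(axis=0)
    return np.sort(dims)[::-1]                    # a >= b >= c

rng = np.random.default_rng(0)
u = rng.normal(size=(2000, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)
pts = u * np.array([15.0, 10.0, 5.0])             # ellipsoid, semi-axes in mm
theta = 0.7                                       # rotate so axes are not trivial
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
a, b, c = obb_dimensions(pts @ R.T)
print(f"a={a:.1f} b={b:.1f} c={c:.1f} mm, flatness={(a + b) / (2 * c):.2f}")
```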
In this letter, a new method is proposed for unsupervised classification of terrain types and man-made objects using POLarimetric Synthetic Aperture Radar (POLSAR) data. This technique combines the use of the polarimetric information in SAR images with an unsupervised classification method based on fuzzy set theory. Image quantization and image enhancement are used to preprocess the POLSAR data. Then, the polarimetric information and the Fuzzy C-Means (FCM) clustering algorithm are used to classify the preprocessed images. The advantages of this algorithm are automated classification, high classification accuracy, fast convergence, and high stability. The effectiveness of this algorithm is demonstrated by experiments using SIR-C/X-SAR (Spaceborne Imaging Radar-C/X-band Synthetic Aperture Radar) data.
To construct a highly efficient text clustering algorithm, the multilevel graph model and the refinement algorithm used in the uncoarsening phase are discussed, and the model is applied to text clustering. The performance of the clustering algorithm is improved by applying the refinement algorithm. The experimental results demonstrate that the multilevel graph text clustering algorithm is effective. Key words: text clustering; multilevel coarsened graph model; refinement algorithm; high-dimensional clustering.
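A sketch of the coarsening half of the multilevel scheme: one heavy-edge matching pass that pairs each vertex with its heaviest unmatched neighbour, shrinking the document-similarity graph before partitioning and refinement. The graph is a toy example, and the refinement phase itself is not shown.

```python
import numpy as np

def heavy_edge_coarsen(adj):
    """One coarsening pass: greedily match each unmatched vertex with its
    heaviest-weight unmatched neighbour; matched pairs are merged into
    single vertices of the next, coarser graph."""
    n = len(adj)
    match = -np.ones(n, dtype=int)
    for v in np.argsort(-adj.sum(axis=1)):        # visit heavy vertices first
        if match[v] != -1:
            continue
        nbrs = np.where((adj[v] > 0) & (match == -1))[0]
        nbrs = nbrs[nbrs != v]
        match[v] = nbrs[adj[v, nbrs].argmax()] if len(nbrs) else v
        match[match[v]] = v
    return match

# toy document-similarity graph (symmetric edge weights)
A = np.array([[0, 5, 1, 0], [5, 0, 0, 2], [1, 0, 0, 4], [0, 2, 4, 0.0]])
print(heavy_edge_coarsen(A))    # e.g. [1 0 3 2]: pairs (0,1) and (2,3)
```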
Due to our increased dependence on the Internet and the growing number of intrusion incidents, building effective intrusion detection systems (IDS) is essential for protecting Internet resources, and yet it is a great challenge. In the literature, many researchers have successfully utilized Artificial Neural Networks (ANN) in supervised-learning-based intrusion detection. Here, the ANN maps network traffic into predefined classes, i.e., normal or a specific attack type, based on training from a labeled dataset. However, for ANN-based IDS, the detection rate (DR) and false positive rate (FPR) still need to be improved. In this study, we propose an ensemble approach, called MANNE, for ANN-based IDS that evolves ANNs with a Multi-Objective Genetic Algorithm (MOGA) to solve the problem. It helps the IDS achieve a high DR, a low FPR, and in turn high intrusion detection capability. The procedure of MANNE is as follows: first, a Pareto front consisting of a set of non-dominated ANN solutions is created using MOGA, which forms the base classifiers. Subsequently, based on this pool of non-dominated ANN solutions as base classifiers, another Pareto front consisting of a set of non-dominated ensembles is created, which exhibits classification trade-offs. Finally, prediction aggregation is performed to obtain the final ensemble prediction from the predictions of the base classifiers. Experimental results on the KDD CUP 1999 dataset show that our proposed ensemble approach, MANNE, outperforms an ANN trained by back propagation and its ensembles using bagging and boosting methods in terms of the defined performance metrics. We also compared our approach with other well-known methods, such as the decision tree and its ensembles using bagging and boosting methods.
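The Pareto-front construction reduces to a non-dominated filter over (DR, FPR) pairs; a small sketch with toy classifier scores (maximize DR, minimize FPR), independent of how the ANNs themselves are trained.

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated solutions when maximizing detection rate
    (column 0) and minimizing false positive rate (column 1) - the kind
    of filter MOGA applies when forming the base-classifier front."""
    keep = []
    for i, (dr, fpr) in enumerate(points):
        dominated = any(d >= dr and f <= fpr and (d > dr or f < fpr)
                        for j, (d, f) in enumerate(points) if j != i)
        if not dominated:
            keep.append(i)
    return keep

# (detection rate, false positive rate) of candidate ANNs - toy numbers
anns = np.array([[0.91, 0.08], [0.95, 0.12], [0.89, 0.05],
                 [0.93, 0.04], [0.90, 0.10]])
print(pareto_front(anns))    # ANNs worth keeping as base classifiers
```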
Energy conservation is a significant task in the Internet of Things (IoT) because IoT involves highly resource-constrained devices. Clustering is an effective technique for saving energy by reducing duplicate data. In a clustering protocol, the selection of a cluster head (CH) plays a key role in prolonging the lifetime of a network. However, most cluster-based protocols, including routing protocols for low-power and lossy networks (RPLs), have used fuzzy logic and probabilistic approaches to select the CH node; consequently, early battery depletion occurs near the sink. To overcome this issue, a lion optimization algorithm (LOA) for selecting the CH in RPL is proposed in this study. LOA-RPL comprises three processes: cluster formation, CH selection, and route establishment. A cluster is formed using the Euclidean distance, CH selection is performed using LOA, and route establishment is implemented using residual energy information. An extensive simulation is conducted in the network simulator ns-3 on various parameters, such as network lifetime, power consumption, packet delivery ratio (PDR), and throughput. The performance of LOA-RPL is also compared with those of RPL, fuzzy rule-based energy-efficient clustering and immune-inspired routing (FEEC-IIR), and the routing scheme for IoT that uses the shuffled frog-leaping optimization algorithm (RISA-RPL). The proposed LOA-RPL increases network lifetime by 20% and PDR by 5%-10% compared with RPL, FEEC-IIR, and RISA-RPL, and is also highly energy-efficient compared with other similar routing protocols.