Funding: Supported by the National Natural Science Foundation of China (60473085).
Abstract: This paper proposes a new way of indexing and processing twig patterns in XML documents. Every path in an XML document can be transformed into a sequence of labels by Structure-Encoded, which constructs a one-to-one correspondence between the XML tree and the sequence. Based on the identifying characteristics of nodes in the XML tree, the elements are classified and clustered. During query processing, the twig pattern is also transformed into its Structure-Encoded sequence. By performing subsequence matching on the set of sequences in the XML documents, all occurrences of the path in the XML documents are identified. Using the index minimizes the number of elements retrieved. The search results, returned in a pertinent format, provide more structural information without any false dismissals or false alarms. The index also supports keyword search. Experimental results indicate that the index is significantly efficient, with high precision.
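As a rough illustration of the path-to-sequence idea in the abstract above, the following Python sketch turns each root-to-leaf path of a small XML document into a sequence of element labels and then tests a query by plain subsequence matching. It is not the paper's Structure-Encoded scheme, whose details the abstract does not give; the function names `root_to_leaf_paths` and `is_subsequence` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return every root-to-leaf path as a list of element labels (tags)."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + [node.tag]
        children = list(node)
        if not children:
            # A leaf closes one complete path.
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, [])
    return paths


def is_subsequence(query, sequence):
    """True if the labels of `query` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    # `label in remaining` consumes the iterator up to the first match,
    # so later labels can only match later positions.
    return all(label in remaining for label in query)


# Hypothetical sample document, used only for illustration.
DOC = (
    "<lib><book><title/><author><name/></author></book>"
    "<journal><title/></journal></lib>"
)

sequences = root_to_leaf_paths(DOC)
matches = [seq for seq in sequences if is_subsequence(["book", "name"], seq)]
```

Plain subsequence matching on label paths only covers ancestor-descendant queries along a single path. Branching twig patterns, and the guarantee of no false alarms claimed in the abstract, would need the additional structural encoding that this sketch leaves out.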
Funding: Supported by the National Natural Science Foundation of China (61139002).
Abstract: Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions generated by shadowed-set-based clustering have different impacts on the prototype of each cluster. To integrate feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of clusters. A convergence analysis of the weighted algorithms is given, and feasible cluster validity indices from data mining applications are used. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm outperforms the unweighted algorithms.
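The abstract does not state its weight formula, so the sketch below substitutes a well-known alternative, the feature-weight update used in W-k-means-style algorithms, to show how a weight exponent and per-feature dispersions can interact. It is an assumption for illustration, not the shadowed-set method itself; `feature_dispersions`, `update_weights`, `weighted_sq_distance` and the exponent name `beta` are hypothetical.

```python
def feature_dispersions(points, labels, centers):
    """Sum of squared deviations from the assigned center, per feature."""
    n_features = len(points[0])
    disp = [0.0] * n_features
    for point, label in zip(points, labels):
        center = centers[label]
        for j in range(n_features):
            disp[j] += (point[j] - center[j]) ** 2
    return disp


def update_weights(disp, beta):
    """W-k-means-style update: w_j = 1 / sum_t (D_j / D_t) ** (1 / (beta - 1)).

    Assumes beta > 1 and strictly positive dispersions; the weights sum to 1
    and features with smaller within-cluster dispersion get larger weights.
    """
    if beta <= 1:
        raise ValueError("this sketch assumes a weight exponent beta > 1")
    if any(d <= 0 for d in disp):
        raise ValueError("this sketch assumes strictly positive dispersions")
    exponent = 1.0 / (beta - 1.0)
    return [1.0 / sum((dj / dt) ** exponent for dt in disp) for dj in disp]


def weighted_sq_distance(x, center, weights, beta):
    """Squared Euclidean distance with each feature scaled by w_j ** beta."""
    return sum((w ** beta) * (a - b) ** 2 for w, a, b in zip(weights, x, center))


# Feature 0 is four times tighter than feature 1, so it receives more weight.
weights = update_weights([1.0, 4.0], beta=2.0)
```

In an iterative scheme of this kind, assignments, prototypes and weights are recomputed in turn after each partition, which matches the abstract's remark that the weights are updated iteratively.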
Funding: Project supported by the National Natural Science Foundation of China (Grant Nos. 41005043 and 41105033), the National Basic Research Program of China (Grant No. 2012CB955901), and the National Science and Technology Ministry, China (Grant Nos. 2007BAC29B01 and 2007BAC03A01).
Abstract: The pick-up algorithm based on the k-th order cluster for the closest distance is applied to weather and climatic events, and the technical terms "clustered index" and "high clustered region" are defined to investigate the temporal and spatial distribution characteristics of these events in China during the past 50 years. The results show that the contribution of extreme high-temperature event clusters changed in the period from the 1960s to the 1970s, and its strength was enhanced. On the other hand, the decreasing trend in the clusters of low-temperature extremes can be taken as a signal of warmer winters to follow on the decadal time scale. Torrential rain and heavy rainfall clusters have both weakened in the past 50 years, and they have different cluster characteristics because of their definitions. Regions with high clustered indexes are concentrated in southern China. The spatial evolution of the heavy rainfall clusters reveals that clustered heavy rainfall has played an important role in the rain-belt pattern over China during the last 50 years.
Abstract: In this paper, a multilevel secure relation hierarchical data model for multilevel secure databases is extended from the relation hierarchical data model of the single-level environment. Based on the model, an upper-lower layer relational integrity is presented after the covert channels caused by database integrity are analyzed and eliminated. Two SQL statements are extended to process polyinstantiation in the multilevel secure environment. A system based on the multilevel secure relation hierarchical data model is capable of storing and manipulating, in an integrated way, complicated objects (e.g., multilevel spatial data) and conventional data (e.g., integers, real numbers and character strings) in a multilevel secure database.
Funding: Supported by the Open Project of Xiangjiang Laboratory (No. 22XJ02003) and the National Natural Science Foundation of China (No. 62122093).
Abstract: Time series clustering is a challenging problem due to the large-volume, high-dimensional, and warping characteristics of time series data. Traditional clustering methods often use a single criterion or distance measure, which may not capture all the features of the data. This paper proposes a novel method for time series clustering based on evolutionary multi-tasking optimization, termed i-MFEA, which uses an improved multifactorial evolutionary algorithm to optimize multiple clustering tasks simultaneously, each with a different validity index or distance measure. As a result, i-MFEA can produce diverse and robust clustering solutions that satisfy various preferences of decision-makers. Experiments on two artificial datasets show that i-MFEA outperforms single-objective evolutionary algorithms and traditional clustering methods in terms of convergence speed and clustering quality. The paper also discusses how i-MFEA can address two long-standing issues in time series clustering: the choice of an appropriate similarity measure and the number of clusters.
Abstract: Although the distance between binary codes can be computed quickly in Hamming space, linear search is not practical for large-scale datasets. Attention has therefore been paid to efficient approximate nearest neighbor search, in which hierarchical clustering trees (HCT) are widely used. However, HCT select cluster centers randomly and build indexes with the entire binary code, which degrades search performance. In this paper, we first propose a new clustering algorithm, which chooses cluster centers on the basis of relative distances and uses a more homogeneous partition of the dataset than HCT does to build the hierarchical clustering trees. Then, we present an algorithm that compresses binary codes by extracting distinctive bits according to the standard deviation of each bit. Finally, a new index is proposed that uses the compressed binary codes and is based on hierarchical decomposition of binary spaces. Experiments conducted on reference datasets and a dataset of one billion binary codes demonstrate the effectiveness and efficiency of our method.
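The bit-selection step described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' algorithm: it keeps the `k` bit positions with the largest population standard deviation across the dataset and compares the shortened codes by Hamming distance. The names `select_distinctive_bits`, `compress` and `hamming`, the parameter `k`, and the toy codes are hypothetical.

```python
from statistics import pstdev


def select_distinctive_bits(codes, k):
    """Return the (sorted) indices of the k bit positions with the largest spread."""
    n_bits = len(codes[0])
    spreads = [pstdev([code[i] for code in codes]) for i in range(n_bits)]
    # Rank by descending standard deviation; ties fall back to bit position.
    ranked = sorted(range(n_bits), key=lambda i: (-spreads[i], i))
    return sorted(ranked[:k])


def compress(code, positions):
    """Keep only the selected bit positions of one binary code."""
    return tuple(code[i] for i in positions)


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy dataset: bits 0 and 1 are constant, so they carry no information.
codes = [(0, 1, 0, 1), (0, 1, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1)]
positions = select_distinctive_bits(codes, k=2)
short_codes = [compress(code, positions) for code in codes]
```

Bits that are nearly constant across the dataset have a standard deviation close to zero and contribute little to separating codes, which is the intuition behind dropping them before building an index.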
Abstract: The density-based notion of clustering is widely used because it is easy to implement and can detect arbitrarily shaped clusters in the presence of noisy data points without requiring prior knowledge of the number of clusters to be identified. Density-based spatial clustering of applications with noise (DBSCAN) is the first algorithm proposed in the literature that uses the density-based notion for cluster detection. Since most real data sets today contain feature spaces with adjacent nested clusters, DBSCAN is clearly not suitable for detecting adjacent clusters of varying density, because it uses global density parameters: the neighborhood radius and the minimum number of points in a neighborhood. The efficiency of DBSCAN therefore depends on these initial parameter settings. For DBSCAN to work properly, the neighborhood radius must be less than the distance between two clusters; otherwise the algorithm merges the two clusters and detects them as a single cluster. In this paper: 1) We propose an improved version of the DBSCAN algorithm that detects adjacent clusters of varying density by using the concept of neighborhood difference together with the density-based notion, without adding much computational complexity to the original DBSCAN algorithm. 2) We validate our experimental results using the space density indexing (SDI) internal cluster measure, recently proposed by one of the authors, to demonstrate the quality of the proposed clustering method. Our experimental results also suggest that the proposed method is effective in detecting adjacent nested clusters of varying density.
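The merging behavior mentioned in the abstract above can be reproduced with a small, plain implementation of the original DBSCAN (not the improved variant the paper proposes, whose details the abstract does not give). The sketch below is a minimal Python version under assumptions; the names `region_query`, `dbscan`, `eps` and `min_pts` and the toy points are hypothetical.

```python
from math import dist


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Two unit squares whose nearest corners are 4 apart.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 0), (5, 1), (6, 0), (6, 1)]
separate = dbscan(points, eps=1.5, min_pts=3)  # radius below the gap
merged = dbscan(points, eps=4.5, min_pts=3)    # radius above the gap
```

With the radius below the gap between the two squares, each square forms its own cluster; once the radius exceeds the gap, every point becomes density-reachable from every other and a single cluster results. Because one global radius applies everywhere, no single setting can serve adjacent clusters of different densities, which is the limitation the abstract targets.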