Funding: Supported by the Open Research Fund Program of LIESMARS (WKL(00)0302).
Abstract: Clustering is a useful data mining technique for discovering interesting data distributions and patterns in the underlying data, and it has many application fields, such as statistical data analysis, pattern recognition, and image processing. We combine sampling with the DBSCAN algorithm to cluster large spatial databases, and develop two sampling-based DBSCAN (SDBSCAN) algorithms. One algorithm introduces sampling inside DBSCAN, and the other applies a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
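The abstract above does not give the details of either SDBSCAN variant, so the following is only a minimal sketch of the general idea of sampling outside DBSCAN: run a plain DBSCAN on a random sample, then give every point the label of its nearest clustered sample point within `eps`. The function names, the `sample_fraction` parameter, and the nearest-neighbour assignment rule are assumptions made for illustration, not the authors' method.

```python
import math
import random


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def sampled_dbscan(points, eps, min_pts, sample_fraction, seed=0):
    """Cluster a random sample with DBSCAN, then label every point by its
    nearest clustered sample point within eps (assumed assignment rule)."""
    rng = random.Random(seed)
    n_sample = max(1, int(len(points) * sample_fraction))
    sample = [points[i] for i in rng.sample(range(len(points)), n_sample)]
    sample_labels = dbscan(sample, eps, min_pts)

    labels = []
    for p in points:
        best_label, best_dist = -1, eps
        for q, lab in zip(sample, sample_labels):
            d = math.dist(p, q)
            if lab != -1 and d <= best_dist:
                best_label, best_dist = lab, d
        labels.append(best_label)
    return labels
```

Because a sample is sparser than the full dataset, `min_pts` would normally have to be lowered roughly in proportion to `sample_fraction`; the sketch leaves that choice to the caller.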
Funding: Funded by the National 973 Program of China (No. 2003CB415205), the National Natural Science Foundation of China (No. 40523005, No. 60573183, No. 60373019), and the Open Research Fund Program of LIESMARS (No. WKL(04)0303).
Abstract: Spatial objects have two types of attributes: geometrical attributes and non-geometrical attributes, which belong to two different attribute domains (the geometrical and non-geometrical domains). Although geometrically scattered in the geometrical domain, spatial objects may be similar to each other in the non-geometrical domain. Most existing clustering algorithms group spatial datasets into compact regions in the geometrical domain without considering the non-geometrical domain. However, many application scenarios require clustering results in which a cluster has not only high proximity in the geometrical domain, but also high similarity in the non-geometrical domain. This means that constraints are imposed on the clustering goal from both domains simultaneously. Such a clustering problem is called dual clustering. As distributed clustering applications become more and more popular, it is necessary to tackle the dual clustering problem in distributed databases. The DCAD algorithm is proposed to solve this problem. DCAD consists of two levels of clustering: local clustering and global clustering. First, clustering is conducted at each local site with a local clustering algorithm, and the features of the local clusters are extracted. Second, the local features from each site are sent to a central site, where the global clustering is obtained based on those features. Experiments on both artificial and real spatial datasets show that DCAD is effective and efficient.
Funding: This project was supported by the National High Tech Foundation of 863 (2001AA115123).
Abstract: Content-based retrieval of large image databases requires an efficient index and retrieval scheme. This paper proposes a clustering-based index algorithm called CMA, which supports fast retrieval of large image databases. CMA takes advantage of k-means and self-adaptive algorithms. It is simple and works without any user interaction. There are two main stages in this algorithm. In the first stage, it classifies the images in a database into several clusters and automatically obtains the parameters needed for the next stage, the k-means iteration. The CMA algorithm is tested on a large database of more than ten thousand images and compared with the k-means algorithm. Experimental results show that the algorithm is effective in both precision and retrieval time.
Funding: This work is supported by the University IT Research Center Project, Korea.
Abstract: A shared-nothing spatial database cluster is a system that provides continuous service even if a system failure happens in any node, so efficient recovery from system failure is very important. Generally, the existing method recovers the failed node by using both the cluster log and the local log. This method, however, causes several problems that increase the communication cost and the size of the cluster log. This paper proposes a novel recovery method that uses recently updated record information in a shared-nothing spatial database cluster. The proposed technique utilizes the update information of records and pointers to the actual data, which reduces the log size and the communication cost. Consequently, the recovery time of the failed node is reduced because fewer update operations have to be processed.
Funding: This work is supported by the University IT Research Center Project.
Abstract: The explosive growth of the Internet and of database applications has driven databases to be more scalable and available, and to support online scaling without interrupting service. To serve more client queries without downtime and without degrading the response time, more nodes have to be added while the database is running. This paper presents an overview of a scalable and available database that satisfies the above characteristics, and proposes a novel online scaling method. Our method improves on the existing online scaling method in response time and throughput. It reduces unnecessary network use, i.e., it decreases the number of data copies by reusing the backup data. In addition, our online scaling operation can be processed in parallel by selecting adequate nodes as new nodes. Our performance study shows that our method results in a significant reduction in data copy time.
Funding: This work is supported by the University IT Research Center Project in Korea.
Abstract: With the rapid advance of wireless communication, tracking the positions of moving objects is becoming increasingly feasible and necessary. Because a large number of people use mobile phones, we must handle a large moving object database as well as the following problem: how can we provide customers with high-quality service, that is, how can we deal with so many enquiries in as little time as possible? Because of the large amount of data, the gap between CPU speed and the size of main memory has increased considerably. One way to reduce the time needed to handle enquiries is to reduce the number of I/O operations between the buffer and the secondary storage. An effective clustering of the objects can minimize the I/O cost between them. In this paper, according to the characteristics of the moving object database, we analyze the objects in the buffer according to their mappings in the two-dimensional coordinate system, and then develop a density-based clustering method to effectively reorganize the clusters. This new mechanism leads to a lower I/O cost and a more efficient response to enquiries.
Abstract: A new secured database management system architecture using intrusion detection systems (IDS) is proposed in this paper for organizations with no previous role mapping for users. A simple representation of Structured Query Language queries is proposed to easily permit the use of the clustering algorithm. A new clustering algorithm that uses a tube search with adaptive memory is applied to database log files to create users' profiles. Then, queries issued for each user are checked against the related user profile using a classifier to determine whether or not each query is malicious. The IDS will stop query execution or report the threat to the responsible person if the query is malicious. A simple classifier based on the Euclidean distance is used, and the issued query is transformed to the proposed simple representation using a classifier, where the Euclidean distance between the centers and the profile's issued query is calculated. A synthetic data set is used for our experimental evaluations. Normal user access behavior in relation to the database is modelled using the data set. The false negative (FN) and false positive (FP) rates are used to compare our proposed algorithm with other methods. The experimental results indicate that our proposed method results in very small FN and FP rates.
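The Euclidean-distance check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not specify how a query is encoded or how the decision boundary is chosen, so the numeric feature vector, the `threshold` parameter, and the function name `is_malicious` are all hypothetical.

```python
import math


def is_malicious(query_vec, profile_centers, threshold):
    """Flag a query whose vector lies farther than `threshold` from every
    cluster center in the user's profile.

    query_vec       -- numeric encoding of the issued query (hypothetical)
    profile_centers -- cluster centers learned from the user's log entries
    threshold       -- maximum distance still treated as normal (assumed)
    """
    nearest = min(math.dist(query_vec, center) for center in profile_centers)
    return nearest > threshold


# Hypothetical 3-feature encoding, e.g. (command type, tables, attributes).
profile = [(1.0, 2.0, 3.0), (0.0, 1.0, 1.0)]
normal_query = (1.0, 2.0, 4.0)    # distance 1.0 from the first center
unusual_query = (5.0, 9.0, 9.0)   # far from both centers
```

With a threshold of 1.5, `normal_query` would be accepted and `unusual_query` flagged; in practice the threshold trades false positives against false negatives.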
Abstract: Based on an analysis of the characteristics and key technologies of the Online Registration System (ORS), this paper points out that the performance and stability of the ORS depend on the choice of the system database. Database clustering technology, which has merits such as concurrent processing, easy expansion, and high security, is proposed to implement the database subsystem of the ORS, and the design of the database cluster system framework is presented. Finally, we also explore database load balancing in the cluster system and heterogeneous database replication technology.
Abstract: The efficiency and performance of a Distributed Database Management System (DDBMS) are mainly determined by its design and by the network communication cost between sites. Fragmentation and distribution of data are the major design issues of a DDBMS. In this paper, we propose a new approach that integrates fragmentation and data allocation in one strategy based on a high-performance clustering technique and transaction processing cost functions. This new approach efficiently and effectively achieves the objectives of data fragmentation, data allocation, and network site clustering. The approach splits the data relations into pair-wise disjoint fragments and determines, using the high-performance clustering technique, whether each fragment should be allocated to a network site, which is the case where the allocation benefit outweighs the cost. To show the performance of the proposed approach, we performed experimental studies on a real database application at different levels of network connectivity. The results show that the approach achieves minimum total data transaction costs between sites, reduces the amount of redundant data accessed between these sites, and improves the overall DDBMS performance.
Abstract: The spatial orientation of the angular momentum vectors of galaxies in six dynamically unstable Abell clusters (S1171, S0001, A1035, A1373, A1474 and A4053) is studied. For this, two-dimensional observed parameters (e.g., positions, diameters and position angles) are converted into three-dimensional (3D) rotation axes of the galaxies using the 'position angle-inclination' method. The expected isotropic distribution curves for the angular momentum vectors are obtained by performing random simulations. The observed and expected distributions are compared using several statistical tests. No preferred alignment of the angular momentum vectors of galaxies is found in any of the six dynamically unstable clusters, supporting the hierarchy model of galaxy formation. These clusters have a larger value of velocity dispersion. However, local effects are noticed in the clusters that have substructures in the 1D-3D number density maps.
Abstract: The age of knowledge explosion requires us not only to be able to obtain useful information represented by data, but also to find knowledge in that information. The Human Genome Project produced a large amount of such biological data, and clustering has been found to be a promising approach for analyzing these data to uncover hidden knowledge. Research on biological data is gradually deepening, and so is research on clustering algorithms. This article mainly introduces currently widely used clustering algorithms, including their main ideas, improvements, key technologies, advantages and disadvantages, their applications in the biological field, and the problems they solve. In addition, this article briefly introduces some databases used in the biological field.
Funding: Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement. This research has made use of the VizieR catalogue access tool, CDS, Strasbourg, France (DOI: 10.26093/cds/vizier).
Abstract: The physical nature of a series of 20 new open clusters is confirmed employing existing data on putative star members, mainly from the second Gaia Data Release (DR2). The clusters were discovered as overdensities of stars by visual inspection of either photographic DSS plates or proper motion plots of random source fields. The reported objects are not present in the most comprehensive or recent catalogs of stellar clusters and associations. For all of them, clumps of comoving stars are revealed in the proper motion space. The parallaxes of the clumped stars are compatible with the real existence of open clusters over narrow ranges of distances. Surface density calculations, free of most noise from non-member sources, allow differentiating a cluster core and an extended cluster corona in some instances. Color-magnitude diagrams generally show a definite main sequence that allows confirmation of the physical existence of the clusters and some of their characteristics. Two of the new clusters seem to form a double system with a common origin. Several of the new clusters challenge the claim of near completeness of the known OC population in the distance range from 1.0 to 1.8 kpc from the Sun (Kharchenko et al.).
Abstract: We present the first paper in a series studying the astrophysical parameters of open clusters using the PPMXL* database, whose data are applied here to study Ruprecht 15. The astrophysical parameters of Ruprecht 15 have been estimated for the first time.
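Abstract: Taking the GDELT (Global Database of Events, Language, and Tone) database as an example, this paper discusses crawling the relevant news documents through their data-source paths. An improved Aho-Corasick (AC) automaton performs multi-pattern keyword matching to complete the preliminary data cleaning. The number of topics in the filtered documents is then estimated, and an LDA model is used for topic classification and keyword extraction. Based on the classification results, a spatial clustering model is built for the news data on the marine environment and climate topic and for the related indicators. The result is an analysis model that crawls, cleans, mines topics from, spatially clusters, and visualizes massive document data.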
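The multi-pattern keyword matching step can be illustrated with a textbook Aho-Corasick automaton. The abstract does not describe what the "improved" AC automaton changes, so the sketch below shows only the standard algorithm; the function names and the English keywords in the usage note are assumptions chosen for illustration.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto, failure, and output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Phase 2: breadth-first pass to set failure links and merge outputs.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # fall back to the longest matching suffix
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits
```

A document filter in this style would keep a news document only if `find_keywords` returns at least one hit for it; for example, `find_keywords("ushers", build_automaton(["he", "she", "his", "hers"]))` finds "she", "he", and "hers" in a single pass over the text.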