Funding: The National Natural Science Foundation of China (No. 60673060); the Natural Science Foundation of Jiangsu Province (No. BK2005047).
Abstract: A new algorithm for clustering multiple data streams is proposed. The algorithm can effectively cluster data streams that show similar behavior with unknown time delays. It uses the autoregressive (AR) modeling technique to measure correlations between data streams and exploits estimated frequency spectra to extract the essential features of the streams. Each stream is represented as a sum of spectral components, and the correlation is measured component-wise. Each spectral component is described by four parameters: amplitude, phase, damping rate and frequency. The ε-lag-correlation between two spectral components is calculated, and the algorithm uses this information as the similarity measure when clustering data streams. Based on a sliding window model, the algorithm can continuously report the most recent clustering results and adjust the number of clusters. Experiments on real and synthetic streams show that the proposed clustering method is faster and gives higher clustering quality than other similar methods.
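The idea of comparing streams that behave alike up to an unknown delay can be illustrated with a much simpler stand-in than the AR spectral method the abstract describes. The sketch below searches a range of lags for the one that maximises the plain Pearson correlation between two raw streams. The function names, the `max_lag` parameter and the synthetic sine data are assumptions made for illustration only.

```python
from math import sqrt, sin

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant stream carries no correlation information
    return cov / sqrt(vx * vy)

def best_lag_correlation(x, y, max_lag):
    """Return (lag, corr) maximising corr(x[t], y[t + lag]) for 0 <= lag <= max_lag."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        a = x[: len(x) - lag]   # earlier part of x
        b = y[lag:]             # y shifted back by `lag` samples
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best

# Hypothetical data: y is x delayed by 5 samples.
x = [sin(0.3 * t) for t in range(200)]
y = [0.0] * 5 + x[:-5]
lag, corr = best_lag_correlation(x, y, max_lag=10)
```

A clustering step could then treat `1 - corr` as a distance between streams. The abstract's method instead measures the correlation per spectral component, which this sketch does not attempt.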
Abstract: The public has shown great interest in the data factor and data transactions, but current attention is overly focused on personal behavioral data and on transactions at Data Exchanges. To give a complete picture of data flow and transactions, this paper presents a systematic overview of the flow and transaction of personal, corporate and public data, based on a classification of the data factor from various perspectives. Drawing on various sources of information, the paper estimates the volume of data generation and storage, and the volume and trend of data market transactions, for the world's major economies, with the following findings: (i) Data classification is diverse because of the broad variety of application scenarios, and data transactions and profit distribution are complex because of the heterogeneous entities, ownership, information density and other attributes of different data types. (ii) Global data transactions are characterized by productization, servitization and a platform-based mode. (iii) For major economies, there is a commonly observed disequilibrium between the scale of data generation and the scale of storage, which is particularly striking for China. (iv) The global data market is in a nascent stage of rapid development, with a transaction volume of about 100 billion US dollars; China's data market is even less developed and accounts for only some 10% of the world total. All sectors of society should be fully aware of the diversity and complexity of data factor classification and data transactions, as well as the arduous and long-term nature of developing and improving the relevant institutional systems. In keeping with these features, efforts should be made to improve data classification, enhance computing infrastructure development, foster professional data transaction and development institutions, and perfect the data governance system.
Funding: Supported by proposal No. OSD/BCUD/392/197, Board of Colleges and University Development, Savitribai Phule Pune University, Pune.
Abstract: Rapid developments in telecommunications, sensor data, financial applications, data stream analysis and similar fields have increased the rate of data arrival, which makes data mining a vital process. Data analysis consists of several tasks, among which data stream classification faces more challenges than other commonly used techniques. Because classification is a continuous process, it requires a design that can adapt the classification model to concept change or to changes in the boundary between the classes. Hence, we design a novel fuzzy classifier, THRFuzzy, to classify newly arriving data streams. Rough set theory, together with a tangential holoentropy function, is used to design the dynamic classification model. The approach uses kernel fuzzy c-means (FCM) clustering to generate the rules and the tangential holoentropy function to update the membership function. The performance of THRFuzzy is verified on three datasets (skin segmentation, localization and breast cancer) using accuracy and time as metrics, and is compared with the HRFuzzy and adaptive k-NN classifiers. The experimental results indicate that the THRFuzzy classifier gives better classification results, achieving the highest accuracy in the least time among the classifiers compared.
Abstract: Logistic regression is a fast classifier and can achieve high accuracy on small training sets. Moreover, it can work on both discrete and continuous attributes with nonlinear patterns. Based on these properties, this paper proposes an algorithm, called the evolutionary logistical regression classifier (ELRClass), for classifying evolving data streams. The algorithm applies logistic regression repeatedly to a sliding window of samples in order to update the existing classifier, to keep the classifier if its performance deteriorates only because of a burst of noise, or to construct a new classifier if a major concept drift is detected. Extensive experimental results demonstrate the effectiveness of the algorithm.
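The window-by-window decision described above can be sketched in a reduced form: fit a logistic regression on the first window, keep it while it still classifies each new window well, and refit when accuracy collapses. This is a simplified illustration, not ELRClass itself. It collapses the update/keep/rebuild choice into a single accuracy test, and the threshold of 0.7, the one-feature model and the synthetic data are all assumptions.

```python
from math import exp

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def fit_logistic(samples, epochs=200, lr=0.5):
    """Fit a one-feature logistic regression (w, b) by batch gradient descent."""
    w = b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def accuracy(model, samples):
    """Fraction of samples whose predicted class (p >= 0.5) equals the label."""
    w, b = model
    hits = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in samples)
    return hits / len(samples)

def process_windows(windows, threshold=0.7):
    """Keep the current model while it fits each new window; refit otherwise."""
    model, actions = None, []
    for window in windows:
        if model is None:
            model, action = fit_logistic(window), "build"
        elif accuracy(model, window) >= threshold:
            action = "keep"
        else:
            model, action = fit_logistic(window), "rebuild"
        actions.append(action)
    return model, actions

# Hypothetical stream: the label rule flips sign in the third window (concept drift).
xs = [((i * 37) % 20 - 9.5) / 10 for i in range(40)]
before = [(x, 1 if x > 0 else 0) for x in xs]
after = [(x, 1 if x < 0 else 0) for x in xs]
model, actions = process_windows([before, before, after])
```

Telling a short noise burst apart from genuine drift, as the abstract describes, would need more than one threshold, for example a check that the accuracy drop persists across several windows.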
Abstract: Fish were collected from 39 sites on the main channel and major tributaries of a highly erosive stream, Hotophia Creek. A total of 2,642 specimens representing 38 species were collected between 1986 and 2003. The bluntface shiner Cyprinella camura was the dominant species and, when grouped with other cyprinids, accounted for 38.0% of the total number collected. By weight, Lepisosteus oculatus, Lepomis megalotis, Ictiobus bubalus and Lepomis macrochirus were the dominant species, accounting for 49.9% of the total catch. Smaller species such as cyprinids, which might be subject to predation by large fish, were found more frequently in shallow channels. Fishes with specific habitat requirements, such as the pirate perch, were found in the middle group of sites, which were disturbed by erosion but still offered the necessary habitat. Sensitive or intolerant species such as the Yazoo darter, the creek chubsucker and cyprinids in general were found more frequently in undisturbed channels with complex habitat. This study supports the hypothesis that geomorphological stream stages are associated with specific communities of fishes.
Funding: Supported by Intel Corporation (No. 9078).
Abstract: Packet classification (PC) has become the main method for supporting quality of service and security in network applications, and two-dimensional prefix packet classification (PPC) is the most popular form. This paper analyzes the problem of rule conflict and then presents a TCAM-based two-dimensional PPC algorithm. The algorithm uses the parallelism of TCAM to look up the longest prefix in one instruction cycle. It then uses a memory image and associated data structures to eliminate conflicts between rules and performs fast two-dimensional PPC. Compared with other algorithms, it has the lowest time complexity and a lower space complexity.
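The two building blocks named above, longest-prefix lookup and rule conflict, can be illustrated in software. The sketch below is not the TCAM-based algorithm. It assumes that "rule conflict" means the usual two-dimensional case in which each of two (source, destination) prefix rules is strictly more specific than the other in one dimension, so neither rule is the better match for a packet matching both. Prefixes are written as bit strings, with the empty string as the wildcard, and all names are hypothetical.

```python
def covers(p, q):
    """True if bit-string prefix p covers prefix q, i.e. p is a prefix of q."""
    return q.startswith(p)

def longest_prefix_match(prefixes, key):
    """Return the longest prefix that matches the bit string `key`, or None."""
    matches = [p for p in prefixes if key.startswith(p)]
    return max(matches, key=len) if matches else None

def in_conflict(rule_a, rule_b):
    """Rules conflict when each is strictly more specific in one dimension."""
    (sa, da), (sb, db) = rule_a, rule_b
    a_src_wider = covers(sa, sb) and sa != sb
    b_dst_wider = covers(db, da) and da != db
    b_src_wider = covers(sb, sa) and sa != sb
    a_dst_wider = covers(da, db) and da != db
    return (a_src_wider and b_dst_wider) or (b_src_wider and a_dst_wider)

def resolving_rule(rule_a, rule_b):
    """Intersection of two conflicting rules: the longer prefix in each dimension."""
    (sa, da), (sb, db) = rule_a, rule_b
    return (max(sa, sb, key=len), max(da, db, key=len))

r1 = ("1", "01")   # wide source prefix, narrow destination prefix
r2 = ("10", "0")   # narrow source prefix, wide destination prefix
conflict = in_conflict(r1, r2)
extra = resolving_rule(r1, r2)
```

Adding the intersection rule `extra` with an explicit priority is one common way to remove the ambiguity. How the paper's memory image resolves conflicts is not stated in the abstract.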
Funding: This paper was supported by the National Natural Science Foundation of China under Grant No. 61003282; the Fundamental Research Funds for the Central Universities under Grant No. 2011RC0508; the National Basic Research Program of China under Grant No. 2009CB320505; and the National High Technology Research and Development Program of China under Grant No. 2011AA010704.
Abstract: Traditional packet classification for IPv4 examines the standard 5-tuple of a packet header: source address, destination address, source port, destination port and protocol. With the introduction of the IPv6 flow label field, which labels the packets belonging to the same flow, packet classification can be resolved on three dimensions: flow label, source address and destination address. In this paper, we propose a novel approach to 3-tuple packet classification based on the flow label. In addition, by introducing a conversion engine that converts source-destination pairs into compound address prefixes, we put forward an algorithm called Reducing Dimension (RD), which has dimension reduction capability and combines heuristic tree search with the use of buckets. We also provide an improved version of RD, called Improved RD (IRD), which uses two mechanisms, path compression and priority tags, to optimize performance. To evaluate our algorithms, extensive experiments have been conducted on a number of synthetically generated databases. In terms of memory, the two proposed algorithms consume only around 3% of the memory used by existing algorithms when the number of filters increases to 10 k. In terms of average search time, the two proposed algorithms are more than four times faster than the others when the number of filters is 10 k. The results show that the proposed algorithms work well and outperform many typical existing algorithms that have dimension reduction capability.
Funding: Project (60835005) supported by the National Natural Science Foundation of China.
Abstract: High dimensional data clustering, with the inherent sparsity of the data and the presence of noise, is a serious challenge for clustering algorithms. A new linear manifold clustering method is proposed to address this problem. The basic idea is to search for the line manifold clusters hidden in a dataset and then fuse some of them to construct higher dimensional manifold clusters. The orthogonal distance and the tangent distance are considered together as the linear manifold distance metrics. Spatial neighbor information is fully used to construct the original line manifold and to optimize line manifolds during the line manifold cluster search. Experiments on real and synthetic data sets demonstrate the superiority of the proposed method over some competing clustering methods in terms of accuracy and computation time. The method obtains high clustering accuracy on data sets with different sizes, manifold dimensions and noise ratios, which confirms its anti-noise capability and its clustering accuracy for high dimensional data.
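As a small illustration of measuring how far a point lies from a line manifold, the sketch below splits the offset between a point and a line into a component across the line and a component along it. The orthogonal part is the standard point-to-line distance. Treating the along-line part as a stand-in for the "tangent distance" is an assumption, since the abstract does not define that term, and the function name and sample values are hypothetical.

```python
from math import sqrt

def line_distances(point, anchor, direction):
    """Return (orthogonal, along) for a point relative to a line.

    The line passes through `anchor` with the given `direction`.
    `orthogonal` is the shortest distance from the point to the line;
    `along` is the absolute length of the offset projected onto the line.
    """
    norm = sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    diff = [p - a for p, a in zip(point, anchor)]
    along = sum(v * u for v, u in zip(diff, unit))
    residual = [v - along * u for v, u in zip(diff, unit)]
    orthogonal = sqrt(sum(r * r for r in residual))
    return orthogonal, abs(along)

# Point (3, 4, 0) relative to the x-axis through the origin.
orth, along = line_distances([3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [2.0, 0.0, 0.0])
```

A clustering procedure could assign a point to a line manifold only when both values are small, which keeps distant points on the same infinite line from being merged into one cluster.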
Abstract: Researchers in bioinformatics, biostatistics and other related fields seek biomarkers for many purposes, including risk assessment, disease diagnosis and prognosis, which can be formulated as patient classification problems. In this paper, a new method that uses a tree regression to improve a logistic classification model is introduced for biomarker data analysis. The numerical results show that the linear logistic model can be significantly improved by a tree regression on the residuals. Although only the classification of binary responses is discussed, the idea is easy to extend to the classification of multinomial responses.
Funding: Project (No. 2008AA01Z132) supported by the National High-Tech Research and Development Program of China.
Abstract: Image feature optimization is an important means of dealing with high-dimensional image data in image semantic understanding and its applications. We formulate image feature optimization as the establishment of a mapping between high- and low-dimensional space via a five-tuple model. Nonlinear dimensionality reduction based on manifold learning provides a feasible way to solve this problem. We propose a novel globular neighborhood based locally linear embedding (GNLLE) algorithm that uses neighborhood update and an incremental neighbor search scheme; it can handle sparse datasets and also has strong anti-noise capability and good topological stability. Given that the distance measure adopted in nonlinear dimensionality reduction is usually based on pairwise similarity calculation, we also present a globular neighborhood and path clustering based locally linear embedding (GNPCLLE) algorithm built on path-based clustering. Because it fully considers the correlations between image data, GNPCLLE can eliminate the distortion of the overall topological structure within the dataset on the manifold. Experimental results on two image sets show the effectiveness and efficiency of the proposed algorithms.