Abstract: Pattern discovery from time series is of fundamental importance. Most pattern discovery algorithms for time series compare the values of the series using some similarity measure. Because values are affected by scale and baseline, value-based methods run into problems when the objective is to capture shape. A shape-based similarity measure, the Sh measure, is therefore proposed, and its properties are given together with the corresponding proofs. A time series shape pattern discovery algorithm based on the Sh measure is then put forward. The proposed algorithm terminates in a finite number of iterations, with given computational and storage complexity. Finally, experiments on synthetic datasets and sunspot datasets demonstrate that the shape pattern discovery algorithm is valid.
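The scale and baseline problem described in this abstract can be illustrated with a small sketch. The abstract does not define the Sh measure, so the code below does not implement it; it uses z-normalization followed by Pearson correlation as a stand-in shape-based similarity, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_normalize(series):
    """Remove the baseline (mean) and the scale (standard deviation)."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series has no shape; map it to all zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def shape_similarity(a, b):
    """Pearson correlation of two equal-length series, in [-1, 1].

    Stand-in for a shape-based measure; NOT the Sh measure from the paper.
    """
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    za, zb = z_normalize(a), z_normalize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


def euclidean(a, b):
    """Plain value-based distance, sensitive to scale and baseline."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


base = [1.0, 2.0, 4.0, 2.0, 1.0]
# Same shape, but multiplied by 10 and shifted by 100.
scaled_shifted = [10.0 * x + 100.0 for x in base]

print(euclidean(base, scaled_shifted))         # large value-based distance
print(shape_similarity(base, scaled_shifted))  # close to 1.0
```

The two series have identical shape, so the correlation-based score is close to 1 even though the value-based distance is large.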
Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 71871188). The authors also acknowledge the Open Fund of Hubei Key Laboratory of Power System Design and Test for Electrical Vehicle and the support of the State Key Laboratory of Rail Traffic Control (Grant No. RCS2019K007). Finally, the authors are grateful for the useful contributions made by their project partners.
Abstract: This study presents a hybrid data-mining framework based on feature selection algorithms and clustering methods to discover patterns in high-speed railway train rescheduling strategies (RSs). The proposed model is composed of two stages. In the first stage, decision tree, random forest, gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost) models are used to investigate the importance of features, and the features that have a high influence on RSs are selected. In the second stage, a K-means clustering method is used to uncover the interdependences between RSs and the influencing features, based on the results of the first stage. The proposed method can determine the quantitative relationships between RSs and influencing factors. The results clearly show the influences of the factors on RSs, the possibilities of different train operation RSs under different situations, and some key time periods and key trains to which controllers should pay more attention. This research can help train traffic controllers better understand train operation patterns and provides direction for optimizing rail traffic RSs.
Funding: This research was supported by Fraunhofer Center for Machine Learning within the Fraunhofer Cluster for Cognitive Internet Technologies, by DFG within Priority Programme 1894 (SPP VGI), by EU in project SoBigData++, by SESAR in projects TAPAS and SIMBAD, and by Austrian Science Fund (FWF) project KnowVA (grant P31419-N31).
Abstract: The word 'pattern' frequently appears in the visualisation and visual analytics literature, but what do we mean when we talk about patterns? We propose a practicable definition of the concept of a pattern in a data distribution as a combination of multiple interrelated elements of two or more data components that can be represented and treated as a unified whole. Our theoretical model describes how patterns are made by relationships existing between data elements. Knowing the types of these relationships, it is possible to predict what kinds of patterns may exist. We demonstrate how our model underpins and refines the established fundamental principles of visualisation. The model also suggests a range of interactive analytical operations that can support visual analytics workflows where patterns, once discovered, are explicitly involved in further data analysis.
Funding: Funded by the Enterprise Ireland Innovation Partnership Programme with Ericsson under grant agreement IP/2011/0135 [6]; supported by the National Natural Science Foundation of China (Nos. 61373131, 61303039, 61232016, 61501247); and the PAPD and CICAEET funds.
Abstract: The rapid development of network technology and its evolution toward heterogeneous networks have increased the demand for automatic monitoring and management of heterogeneous wireless communication networks. This paper presents a multilevel pattern mining architecture to support automatic network management by discovering interesting patterns from telecom network monitoring data. This architecture leverages and combines existing techniques for frequent itemset discovery over data streams, association rule deduction, frequent sequential pattern mining, and frequent temporal pattern mining, while also making use of distributed processing platforms to achieve high-volume throughput.
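Two of the building blocks named in this abstract, frequent itemset discovery and association rule deduction, can be sketched on a small batch of monitoring events. This is a minimal brute-force illustration, not the paper's streaming or distributed architecture; the event names and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, max_size=2):
    """Count every itemset of up to max_size items across all transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return counts


def frequent_itemsets(counts, min_support):
    """Keep itemsets that occur in at least min_support transactions."""
    return {s: c for s, c in counts.items() if c >= min_support}


def rule_confidence(counts, antecedent, consequent):
    """Confidence of antecedent -> consequent: support(A and B) / support(A).

    The joint itemset must not exceed the max_size used when counting.
    """
    a = tuple(sorted(set(antecedent)))
    joint = tuple(sorted(set(antecedent) | set(consequent)))
    if counts[a] == 0:
        raise ValueError("antecedent never occurs")
    return counts[joint] / counts[a]


# Hypothetical alarm sets, one per monitoring interval.
events = [
    ["link_down", "high_latency"],
    ["link_down", "high_latency", "cpu_high"],
    ["cpu_high"],
    ["link_down"],
]
counts = count_itemsets(events)
print(frequent_itemsets(counts, min_support=2))
print(rule_confidence(counts, ["high_latency"], ["link_down"]))
```

Enumerating all combinations is only practical for small itemset sizes; a stream-oriented miner would maintain approximate counts instead of a full table.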
Funding: This work is supported by the National Natural Science Foundation of China under Grant No. 41471371.
Abstract: Discovering gradual moving object cluster patterns from trajectory streams makes it possible to characterize movement behavior in real time, which enables new applications and services. Because trajectory streams evolve rapidly, are created continuously, and cannot be stored indefinitely in memory, existing approaches designed for static trajectory datasets are not suitable for discovering gradual moving object cluster patterns from trajectory streams. This paper proposes a novel algorithm for discovering gradual moving object cluster patterns from trajectory streams using sliding window models. By processing the trajectory data in the current window, the mining algorithm can capture the trend and evolution of moving object cluster patterns. First, the density peaks clustering algorithm is used to identify the clusters in different snapshots. The stable relationship between a relatively small number of moving objects is used to improve clustering efficiency. Then, by intersecting clusters from different snapshots, the gradual moving object cluster pattern is updated. The relationship between clusters in adjacent snapshots and the gradual property are used to accelerate the updating process. Finally, experimental results on two real datasets demonstrate that the algorithm is effective and efficient.
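The step of intersecting clusters from different snapshots inside a sliding window can be sketched as follows. This is a simplified illustration under stated assumptions: clusters are given directly as sets of object ids (the density peaks clustering step is omitted), the paper's gradual property is not modeled, and the function names and the `min_objects` threshold are hypothetical.

```python
from collections import deque


def update_patterns(patterns, clusters, min_objects):
    """Intersect candidate patterns with the clusters of a newer snapshot."""
    updated = set()
    for pattern in patterns:
        for cluster in clusters:
            common = pattern & cluster
            if len(common) >= min_objects:
                updated.add(frozenset(common))
    return updated


def mine_window(snapshots, min_objects):
    """Return groups of objects clustered together in every snapshot.

    snapshots: iterable of lists of sets of object ids, oldest first.
    """
    iterator = iter(snapshots)
    first = next(iterator, None)
    if first is None:
        return set()
    patterns = {frozenset(c) for c in first if len(c) >= min_objects}
    for clusters in iterator:
        patterns = update_patterns(patterns, clusters, min_objects)
    return patterns


# A window holding the three most recent snapshots; older ones are dropped.
window = deque(maxlen=3)
stream = [
    [{1, 2, 3, 4}, {5, 6}],
    [{1, 2, 3}, {4, 5, 6}],
    [{1, 2}, {3, 5, 6}],
    [{1, 2, 5, 6}],
]
for clusters in stream:
    window.append(clusters)
    print(sorted(sorted(p) for p in mine_window(window, min_objects=2)))
```

This version recomputes the intersections for the whole window on every arrival; the abstract indicates that the actual algorithm updates patterns incrementally using relationships between adjacent snapshots.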
Funding: U.S. National Science Foundation; Purdue Research Foundation; the Italian Ministry of University and Research; the Research Program of the University of Padova; and Bourns College of Engineering, University of California, Riverside.
Abstract: Measures relating word frequencies and expectations have been constantly of interest in bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures has become conceivable, yet it poses a significant computational burden even when limited to words of bounded maximum length. In addition, displaying the huge tables that may result from these counts poses practical problems of visualization and inference. VERBUMCULUS is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences. The inner core of VERBUMCULUS rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words that make it possible to limit drastically, and a priori, the set of over- or under-represented candidate words of all lengths in a given sequence, thereby making it more feasible both to detect and to visualize such words in a fast and practically useful way. This paper describes the facility at the outset and reports experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements in the upstream regions of a set of yeast genes.
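The idea of scoring words by comparing observed counts with expectations can be sketched with a naive enumeration. This is not the VERBUMCULUS method: it assumes an independent-letter background model estimated from the sequence itself, uses a binomial variance approximation that ignores overlaps between occurrences, and enumerates every word of one fixed length, which is exactly the exhaustive approach the abstract says the tool avoids. The function name is hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def word_scores(seq, k):
    """Return {word: z-like score} for every length-k word over seq's alphabet.

    Positive scores suggest over-representation, negative scores
    under-representation, relative to an independent-letter model.
    """
    positions = len(seq) - k + 1
    if positions <= 0:
        return {}
    letter_freq = {ch: c / len(seq) for ch, c in Counter(seq).items()}
    observed = Counter(seq[i:i + k] for i in range(positions))
    scores = {}
    for letters in product(sorted(letter_freq), repeat=k):
        word = "".join(letters)
        p = 1.0
        for ch in word:
            p *= letter_freq[ch]
        expected = positions * p
        # Binomial approximation; overlapping occurrences are ignored.
        variance = positions * p * (1.0 - p)
        if variance > 0:
            scores[word] = (observed[word] - expected) / sqrt(variance)
        else:
            scores[word] = 0.0
    return scores


scores = word_scores("ATATATATATATATATATAT", k=2)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 2))
```

In the alternating sequence, "AT" and "TA" score positive while "AA" and "TT", which never occur, score negative.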
Abstract: Several experiments and observations have revealed that small, local, distinct structural features in RNA molecules are correlated with their biological function, for example, in post-transcriptional regulation of gene expression. Thus, finding similar structural features in a set of RNA sequences known to play the same biological function could provide substantial information about which parts of the sequences are responsible for the function itself. Unfortunately, finding common structural elements in RNA molecules is a very challenging task, even if limited to secondary structure. The main difficulty lies in the fact that in nearly all cases the structure of the molecules is unknown and has to be predicted somehow, and that sequences with little or no similarity can fold into similar structures. Although they differ in some details, the approaches proposed so far are usually based on a preliminary alignment of the sequences and attempt to predict common structures (either local or global, or for some selected regions) for the aligned sequences. These methods give good results when sequence and structure similarity are very high, but work less well when similarity is limited to small, local elements such as single stem-loop motifs. Instead of aligning the sequences, the algorithm we present directly searches for regions of the sequences that can fold into similar structures, where the degree of similarity can be defined by the user. Any information about sequence similarity in the motifs can be used either as a search constraint or a posteriori, by post-processing the output. The search for regions sharing structural similarity is implemented with the affix tree, a novel text-indexing structure that significantly accelerates the search for patterns having a symmetric layout, such as those forming stem-loop structures. Tests based on experimentally known structures have shown that the algorithm is able to identify functional motifs in the secondary structure of non-coding RNA, such as Iron Responsive Elements (IRE) in the untranslated regions of ferritin mRNA, and the domain IV stem-loop structure in SRP RNA.
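The symmetric layout of a stem-loop can be illustrated with a brute-force scan over a single sequence. This sketch does not use an affix tree and does not compare structures across sequences, so it is far simpler than the algorithm in the abstract; it assumes perfect stems with Watson-Crick and G-U wobble pairs, and the function name and parameters are hypothetical.

```python
# Base pairs allowed in a stem: Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def find_stem_loops(seq, stem_len, min_loop, max_loop):
    """Return (start, loop_length, subsequence) for each perfect stem-loop.

    A hit is a window made of stem_len bases, a loop of min_loop..max_loop
    bases, and stem_len bases that pair with the first stem in reverse order.
    """
    hits = []
    for start in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            end = start + 2 * stem_len + loop  # exclusive end of the window
            if end > len(seq):
                break
            left = seq[start:start + stem_len]
            right = seq[end - stem_len:end]
            # The i-th base of the left arm pairs with the i-th base
            # counted from the end of the right arm.
            if all((left[i], right[stem_len - 1 - i]) in PAIRS
                   for i in range(stem_len)):
                hits.append((start, loop, seq[start:end]))
    return hits


rna = "UUGGGCAAAAGCCCAA"
for hit in find_stem_loops(rna, stem_len=4, min_loop=3, max_loop=5):
    print(hit)
```

The scan costs time proportional to sequence length times the number of loop lengths times the stem length, which is the kind of cost a text index is meant to reduce.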
Funding: Supported by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the National IT Industry Promotion Agency (No. NIPA-2011-C1090-1111-0008); the Special Research Program of Chonnam National University, 2009; and the LG Yonam Culture Foundation.
Abstract: Missing values occur in bio-signal processing for various reasons, including technical problems or biological characteristics. These missing values are then either simply excluded or substituted with estimated values for further processing. When missing signal values are estimated for electroencephalography (EEG) signals, an example in which electrical signals arrive quickly and successively, rapid processing of high-speed data is required for immediate decision making. In this study, we propose an incremental expectation maximization principal component analysis (iEMPCA) method that automatically estimates missing values from multivariable EEG time series data without requiring a whole and complete data set. The proposed method solves the problem of a biased model, which inevitably results from simply removing incomplete data rather than estimating them, and thus reduces the loss of information by incorporating missing values in real time. By using an incremental approach, the proposed method also minimizes memory usage and processing time for continuously arriving data. Experimental results show that the proposed method assigns more accurate missing values than previous methods.