The weighted edit distance and the metaphone+ algorithm are combined to correct non-word errors. Speed is also optimized based on the observation that people rarely make mistakes in the initial letter of a word. A spelling checker is designed for an automatic detection and correction system for student essays. To evaluate the algorithm, it is compared with several well-known systems (MS Word 2000, Aspell, WinEdt). The results show that our approach is superior to the alternative approaches.
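As an illustration of the idea in the preceding abstract, the following is a minimal sketch of a weighted edit distance combined with the initial-letter restriction. It is not the authors' implementation: the phonetic metaphone+ step is omitted, and the `NEAR_KEYS` cost table, the function names, and the `max_cost` threshold are hypothetical choices made only for this example.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Levenshtein-style distance with a pluggable substitution cost."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical cost table: a few easily confused letter pairs cost less.
NEAR_KEYS = {frozenset("er"), frozenset("io"), frozenset("mn")}


def keyboard_cost(x, y):
    if x == y:
        return 0.0
    return 0.5 if frozenset((x, y)) in NEAR_KEYS else 1.0


def suggest(word, dictionary, max_cost=2.0):
    """Rank dictionary words, considering only those with the same initial letter."""
    candidates = [w for w in dictionary if w[:1] == word[:1]]
    scored = sorted((weighted_edit_distance(word, w, keyboard_cost), w)
                    for w in candidates)
    return [w for cost, w in scored if cost <= max_cost]
```

Restricting candidates to the same initial letter shrinks the search space before any distance is computed, which is where the speed gain described in the abstract would come from in a sketch like this.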
The increasing popularity of Android devices has given rise to a large number of feature-rich applications (apps) in various Android markets. Since adversaries can easily repackage malicious code into benign apps and spread them, it is urgent to detect repackaged apps to maintain healthy Android markets. In this paper we propose an efficient detection scheme based on twice context triggered piecewise hash (T-CTPH), in which the CTPH process is called twice so as to generate two fingerprints for each app to detect repackaged Android applications. We also optimize the similarity calculation algorithm to improve the matching efficiency. Experimental results show that about 5% of the 6438 pre-collected samples of 4 different types are repackaged apps. The proposed scheme improves the detection accuracy for repackaged apps and has positive and practical significance for the ecosystem of the Android markets.
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit statistics such as the frequencies of characters. We develop a partition-based algorithm that uses such statistics. The frequency vectors are used to partition datasets into data chunks so that the dissimilarity between chunks can be caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs which survive the existing filters. Our algorithm notably outperforms alternative methods on real datasets.
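To make the role of character frequencies concrete, here is a minimal sketch of a filter-and-refine join that uses a frequency-vector lower bound on edit distance. It relies on a standard bound (each edit operation can reduce the surplus or the deficit of character counts by at most one), not on the partitioning scheme or the new filter of the paper; all function names are assumptions made for this example.

```python
from collections import Counter


def frequency_lower_bound(s, t):
    """Lower bound on edit distance derived from character counts alone."""
    fs, ft = Counter(s), Counter(t)
    surplus = sum((fs - ft).values())  # characters s has in excess of t
    deficit = sum((ft - fs).values())  # characters t has in excess of s
    return max(surplus, deficit)


def edit_distance(s, t):
    """Plain unit-cost Levenshtein distance (the refine step)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def similarity_join(strings, tau):
    """Return all pairs whose edit distance is at most tau."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            # Filter: skip the quadratic DP when the cheap bound already exceeds tau.
            if frequency_lower_bound(s, t) > tau:
                continue
            if edit_distance(s, t) <= tau:
                pairs.append((s, t))
    return pairs
```

Because the bound never exceeds the true distance, the filter discards only pairs that cannot qualify, so the join result is unchanged.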
Increasing the accuracy of nucleotide sequence alignment is an essential issue in genomics research. Although classic dynamic programming (DP) algorithms (e.g., Smith–Waterman and Needleman–Wunsch) are guaranteed to produce the optimal result, their time complexity hinders their application to large-scale sequence alignment. Optimization efforts that aim to accelerate the alignment process generally come from three perspectives: redesigning data structures [e.g., diagonal or striped Single Instruction Multiple Data (SIMD) implementations], increasing the degree of parallelism in SIMD operations (e.g., difference recurrence relation), or reducing the search space (e.g., banded DP). However, no method combines all three aspects to build an ultra-fast algorithm. In this study, we developed a Banded Striped Aligner (BSAlign) library that delivers accurate alignment results at an ultra-fast speed by knitting together a series of novel methods that take advantage of all three perspectives, with highlights such as active F-loop in striped vectorization and striped move in banded DP. We applied our new acceleration design to both regular and edit distance pairwise alignment. BSAlign achieved a 2-fold speed-up over other SIMD-based implementations for regular pairwise alignment, and a 1.5-fold to 4-fold speed-up over edit distance-based implementations for long reads. BSAlign is implemented in the C programming language and is available at https://github.com/ruanjue/bsalign.
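Of the three perspectives named above, search-space reduction is the easiest to show in a few lines. The sketch below is a plain scalar banded edit distance in Python; it is not BSAlign's code and contains no SIMD or striped layout. The function name and the `band` parameter are assumptions made for this example.

```python
def banded_edit_distance(a, b, band):
    """Edit distance computed only over DP cells with |i - j| <= band.

    Returns None when the length difference alone exceeds the band.
    The result is exact whenever the true distance is at most `band`;
    otherwise it is an upper bound that is itself greater than `band`.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    INF = float("inf")
    # Row 0: only columns inside the band are reachable.
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            curr[0] = i  # first column, still inside the band
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[m]
```

Each row touches at most `2 * band + 1` cells, so the work drops from O(nm) to O(n · band), at the price of exactness only when the sequences differ by more than the band allows.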
Graphs have been widely used for complex data representation in many real applications, such as social networks, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, we use a preprocessing strategy to remove mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed, which groups the nodes that can be reached in k hops from each key node while conserving structure; this is the k-hop tree based indexing method. For each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases confirm that our method achieves good join quality in graph similarity join. In addition, the join process can be finished in polynomial time.
Haussler's convolution kernel provides an effective framework for engineering positive semidefinite kernels, and has a wide range of applications. The mapping kernel that we introduce in this paper is its natural generalization, and will enlarge the range of application significantly. Our main theorem with respect to positive semidefiniteness of the mapping kernel (1) implies Haussler's theorem as a corollary, (2) exhibits an easy-to-check necessary and sufficient condition for mapping kernels to be positive semidefinite, and (3) formalizes the mapping kernel so that significant flexibility is provided in engineering new kernels. As evidence of the effectiveness of our results, we present a framework to engineer tree kernels. The tree is a data structure widely used in many applications, and tree kernels provide an effective method to analyze tree-type data. Thus, the framework is important not only as an example but also as a practical research tool. The description of the framework accompanies a survey of the tree kernels in the literature, where we see that 18 out of the 19 surveyed tree kernels of different types are instances of the mapping kernel, and examples of novel interesting tree kernels.
Inexact graph matching algorithms have proved to be useful in many applications, such as character recognition, shape analysis, and image analysis. Inexact graph matching is, however, inherently an NP-hard problem with exponential computational complexity. Much of the previous research has focused on solving this problem using heuristics or estimations. Unfortunately, many of these techniques do not guarantee that an optimal solution will be found. The aim of the proposed algorithm is to reduce the complexity of the inexact graph matching process, while still producing an optimal solution for a known application. This is achieved by greatly simplifying each individual matching process, and compensating for lost robustness by producing a hierarchy of matching processes. The creation of each matching process in the hierarchy is driven by an application-specific criterion that operates at the subgraph scale. To our knowledge, this problem has never before been approached in this manner. Results show that the proposed algorithm is faster than two existing methods based on graph edit operations. The proposed algorithm produces accurate results in terms of matching graphs, and shows promise for the application of shape matching. The proposed algorithm can easily be extended to produce a sub-optimal solution if required.
Funding: Supported by National Science Foundation of China (61373071).
Abstract: We present the solid model edit distance (SMED), a powerful and flexible paradigm for exploiting shape similarities amongst CAD models. It is designed to measure the magnitude of distortions between two CAD models in boundary representation (B-rep). We give the formal definition by analogy with graph edit distance, one of the most popular graph matching methods. To avoid the expensive computational cost potentially caused by exact computation, an approximate procedure based on the alignment of local structure sets is also provided. To verify the flexibility, we investigate three typical applications in the manufacturing industry and describe how our method can be adapted to meet their various requirements. Furthermore, a multilevel method is proposed to further improve both the effectiveness and the efficiency of the presented algorithm, in which the models are hierarchically segmented into configurations of features. Experimental results show that SMED serves as a reasonable measurement of shape similarity for CAD models, and that the proposed approach provides remarkable performance on a real-world CAD model database.
Abstract: Human activity detection and recognition is a challenging task. Video surveillance can benefit greatly from advances in the Internet of Things (IoT) and cloud computing. Artificial intelligence IoT (AIoT) based devices form the basis of a smart city. The research presents intelligent dynamic gesture recognition (IDGR) using a convolutional neural network (CNN) empowered by edit distance for video recognition. The proposed system has been evaluated using AIoT-enabled devices for static and dynamic gestures of Pakistani sign language (PSL). However, the proposed methodology can work efficiently for any type of video. The research concludes that deep learning and convolutional neural networks give the most appropriate solution, retaining discriminative and dynamic information of the input action. The research proposes recognition of dynamic gestures using CNN-based image recognition of the keyframes extracted from the human activity. Edit distance is used to find the label of the word to which those sets of frames belong. The simulation results show that at 400 videos per human action, 100 epochs, and a 234×234 image size, the accuracy of the system is 90.79%, which is a reasonable accuracy for a relatively small dataset compared with previously published techniques.
Funding: This work was financially supported by the National Key Research & Development Program of China under Grant No. 2020YFC1511702 and the Beijing Municipal Natural Science Foundation under Grant No. L191003.
Abstract: Positioning technology based on wireless network signals in indoor environments has developed rapidly in recent years as the demand for location-based services continues to increase. Channel state information (CSI) can be used as location feature information in fingerprint-based positioning systems because it can reflect the characteristics of the signal on multiple subcarriers. However, the random noise contained in the raw CSI increases the likelihood of confusion when matching fingerprint data. In this paper, the Dynamic Fusion Feature (DFF), which combines the pre-processed amplitude and phase data, is proposed as a new fingerprint formation method to remove the noise and improve the feature resolution of the system. Then, the improved edit distance on real sequence (IEDR) is used as a similarity metric for fingerprint matching. Based on the above studies, we propose a new indoor fingerprint positioning method, named DFF-EDR, for improving positioning performance. During the experimental stage, data were collected and analyzed in two typical indoor environments. The results show that the proposed localization method effectively improves the feature resolution of the system in terms of both fingerprint features and similarity measures, has good anti-noise capability, and effectively reduces localization errors.
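For readers unfamiliar with edit distance on real sequences, the following sketch shows the standard EDR recurrence, in which two real values match when they differ by at most a tolerance. The paper's improved variant (IEDR) is not specified in the abstract and is not reproduced here; the function names, the `eps` tolerance, and the toy fingerprint database are hypothetical.

```python
def edr(r, s, eps):
    """Standard edit distance on real sequences.

    Two samples match at zero cost when |r_i - s_j| <= eps; a mismatch,
    an insertion, or a deletion each cost 1.
    """
    prev = list(range(len(s) + 1))
    for i, ri in enumerate(r, start=1):
        curr = [i]
        for j, sj in enumerate(s, start=1):
            subcost = 0 if abs(ri - sj) <= eps else 1
            curr.append(min(prev[j - 1] + subcost,  # match or mismatch
                            prev[j] + 1,            # skip a sample of r
                            curr[j - 1] + 1))       # skip a sample of s
        prev = curr
    return prev[-1]


def nearest_fingerprint(query, fingerprints, eps):
    """Return the location label whose stored sequence has the smallest EDR."""
    return min(fingerprints, key=lambda loc: edr(query, fingerprints[loc], eps))


# Toy database: location label -> stored feature sequence (made-up values).
DB = {
    "room_a": [1.0, 2.0, 3.0, 4.0],
    "room_b": [4.0, 3.0, 2.0, 1.0],
}
```

Because small deviations within `eps` cost nothing, a measure of this kind tolerates noise in individual samples, which is consistent with the anti-noise behaviour the abstract reports.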
Abstract: Traditional normalized tree edit distances do not satisfy the triangle inequality. We present a metric normalization method for tree edit distance, which results in a new normalized tree edit distance fulfilling the triangle inequality, under the condition that the weight function is a metric over the set of elementary edit operations with all costs of insertions/deletions having the same weight. We prove that the new distance, in the range [0, 1], is a genuine metric as a simple function of the sizes of two ordered labeled trees and the tree edit distance between them, which can be directly computed through tree edit distance with the same complexity. Based on an efficient algorithm to represent digits as ordered labeled trees, we show that the normalized tree edit metric can provide slightly better results than other existing methods in handwritten digit recognition experiments using the approximating and eliminating search algorithm (AESA).
Funding: Supported by the National Natural Science Foundation of China (No. 60772121) and the Natural Science Foundation of Anhui Provincial Education Department (No. KJ2008B024).
Abstract: An important aim in pattern recognition is to cluster the given shapes. This paper presents a shape recognition and retrieval algorithm. The algorithm first extracts the skeletal features using the medial axis transform. Then, the features are transformed into a string of symbols, with the similarity among those symbols computed based on the edit distance. Finally, the shapes are identified using dynamic programming. Two public datasets are analyzed to demonstrate that the present approach is better than previous approaches.
Abstract: Musical rhythms are represented as sequences of symbols. The sequences may be composed of binary symbols denoting either silent or monophonic sounded pulses, or ternary symbols denoting silent pulses and two types of sounded pulses made up of low-pitched (dum) and high-pitched (tak) sounds. Experiments are described that evaluate how well the many-to-many minimum-weight matching between two sequences serves as a measure of similarity that correlates with human judgements of rhythm similarity. This measure is also compared to the often used edit distance and to the one-to-one minimum-weight matching. New results are reported from experiments performed with three widely different datasets of real-world and artificially generated musical rhythms (including Afro-Cuban rhythms), and compared with results previously reported with a dataset of Middle Eastern dum-tak rhythms.
Funding: Funded by the State Grid Anhui Electric Power Co., Ltd. Science and Technology Project (52120021N00L) and the National Key Research and Development Program of China (2022YFB2400015).
Abstract: As an effective approach to achieving the "dual-carbon" goal, the grid-connected capacity of renewable energy is increasing constantly. Photovoltaics are the most widely used renewable energy sources and have been applied on various occasions. However, the inherent randomness, intermittency, and weak support of grid-connected equipment not only change the original flow characteristics of the grid but also result in complex fault characteristics. Traditional overcurrent and differential protection methods cannot respond accurately due to the effects of unknown renewable energy sources. Therefore, a longitudinal protection method based on virtual measurement of current restraint is proposed in this paper. The positive sequence current data and the network parameters are used to calculate the virtual measurement current, which compensates for the output current of the photovoltaic (PV) source. The waveform difference between the virtual measurement current and the terminal current for internal and external faults is used to construct the protection method. An improved edit distance algorithm is proposed to measure the similarity between the virtual measurement current and the terminal measurement current. Finally, the feasibility of the protection method is verified through PSCAD simulation.
基金supported by the National Natural Science Foundation of China(Nos.61174180,U1433125)the Jiangsu Province Science Foundation (No.BK20141413)the Chinese Postdoctoral Science Foundation (No.2014M550291)
Abstract: A high-precision nominal flight profile that involves controllers' intentions is critical for 4D trajectory estimation in modern automatic air traffic control systems. We proposed a novel method to effectively improve the accuracy of the nominal flight profile, including the nominal altitude profile and the speed profile. First, considering the characteristics of trajectory data, we developed an improved K-means algorithm. The approach was to measure the similarity between different altitude profiles by integrating the space warp edit distance algorithm, thereby acquiring several fitted nominal flight altitude profiles. This approach breaks the constraints of traditional K-means algorithms. Second, to eliminate the influence of meteorological factors, we introduced historical gridded binary data to determine the en-route wind speed and temperature via inverse distance weighted interpolation. Finally, we used the true airspeed determined by speed triangle relationships and the calibrated airspeed determined by the aircraft data model to extract a more accurate nominal speed profile from each cluster, so that we could describe the airspeed profiles above and below the airspeed transition altitude, respectively. Our experimental results showed that the proposed method could obtain a highly accurate nominal flight profile, which reflects the actual aircraft flight status.
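The inverse distance weighted interpolation step named in this abstract can be illustrated with a short sketch. It estimates a value at a query location as a weighted average of nearby samples, with weights proportional to 1/d^p. The two-dimensional coordinates, the exponent `power=2.0`, and the function name `idw` are assumptions for illustration; the abstract does not state which exponent, neighbourhood, or coordinate system the authors used.

```python
import math


def idw(samples, query, power=2.0):
    """Inverse distance weighted estimate at `query`.

    samples: iterable of (x, y, value) tuples, e.g. grid points with wind speed.
    query:   (x, y) location where a value is wanted.
    power:   distance exponent (assumed default, not from the paper).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in samples:
        d = math.hypot(x - query[0], y - query[1])
        if d == 0.0:
            return value  # query coincides with a sample point
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical grid samples: (x, y, wind speed).
grid = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(idw(grid, (1.0, 0.0)))
```

A larger `power` makes the estimate follow the nearest sample more closely, while a smaller one produces a smoother field.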
Funding: This work is supported by the Information Technology Department, College of Computer, Qassim University, 6633, Buraidah 51452, Saudi Arabia.
Abstract: Computing the similarity between process models has numerous application areas. These include finding similar models in a repository, controlling redundancy of process models, and finding corresponding activities between a pair of process models. The similarity between two process models is computed based on the similarity of their labels, structures, and execution behaviors. Several attempts have been made to develop similarity techniques for activity labels, as well as for execution behavior. However, a notable problem with process model similarity is that two process models can also be similar if there is a structural variation between them. Neither a benchmark dataset for the structural similarity between process models nor an effective technique to compute structural similarity exists. To that end, we have developed a large collection of process models in which structural changes are handcrafted while preserving the semantics of the models. Furthermore, we have used a machine learning-based approach to compute the similarity between a pair of process models having structural and label differences. Finally, we have evaluated the proposed approach using our generated collection of process models.
Abstract: The weighted edit distance and the metaphone+ algorithm are combined to correct non-word errors. The speed is also optimized based on the observation that people rarely make mistakes in the initial letter of a word. A spelling checker is designed for an automatic detection and correction system for student essays. To evaluate the algorithm, it is compared with some well-known systems (MS Word2000, Aspell, winEdt). The results show that our approach is superior to the alternative approaches.
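The two ideas in this abstract, a weighted edit distance and a first-letter restriction on candidates, can be sketched as follows. Everything specific here is an assumption made for illustration: the `CONFUSABLE` pairs, the 0.5 substitution discount, and the function names are hypothetical, and the metaphone+ phonetic component of the paper is not reproduced.

```python
# Hypothetical letter pairs that are cheap to confuse (assumption, not from the paper).
CONFUSABLE = {frozenset("ie"), frozenset("ao"), frozenset("mn")}


def sub_cost(x, y):
    """Substitution cost: free for a match, discounted for confusable pairs."""
    if x == y:
        return 0.0
    return 0.5 if frozenset((x, y)) in CONFUSABLE else 1.0


def weighted_edit_distance(a, b):
    """Edit distance with unit insert/delete costs and weighted substitutions."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                # delete x
                curr[j - 1] + 1.0,            # insert y
                prev[j - 1] + sub_cost(x, y), # weighted substitution
            ))
        prev = curr
    return prev[-1]


def suggest(word, dictionary):
    """Best correction among dictionary words sharing the first letter."""
    candidates = [w for w in dictionary if w[:1] == word[:1]]
    return min(candidates, key=lambda w: weighted_edit_distance(word, w), default=None)


print(suggest("recieve", ["receive", "recipe", "deceive"]))
```

Restricting candidates to the same initial letter shrinks the search before any distance is computed, which is where the speed gain described in the abstract would come from in a sketch like this.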
Funding: Supported by ZTE Industry-Academia-Research Cooperation Funds.
Abstract: The increasing popularity of Android devices has given rise to a large number of feature-rich applications (or apps) in various Android markets. Since adversaries can easily repackage malicious code into benign apps and spread them, it is urgent to detect the repackaged apps to maintain healthy Android markets. In this paper we propose an efficient detection scheme based on twice context triggered piecewise hash (T-CTPH), in which the CTPH process is called twice so as to generate two fingerprints for each app to detect the repackaged Android applications. We also optimize the similarity calculation algorithm to improve the matching efficiency. Experimental results show that about 5% of the 6438 pre-collected samples of 4 different types are repackaged apps. The proposed scheme improves the detection accuracy of the repackaged apps and has positive and practical significance for the ecological system of the Android markets.
Abstract: String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between string subsets, and do not fully exploit statistics such as the frequencies of characters. We develop a partition-based algorithm that uses such statistics. The frequency vectors are used to partition datasets into data chunks so that the dissimilarity between them can be captured easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive the existing filters. Our algorithm notably outperforms alternative methods on real datasets.
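A small sketch may help show why character frequencies can rule out pairs before any edit distance is computed. The bound used below is the standard frequency-distance lower bound, not the new filter proposed in the paper: one edit operation changes the surplus of characters on either side by at most one, so the edit distance is at least the larger of the two surpluses. The function names and the threshold parameter `tau` are illustrative assumptions.

```python
from collections import Counter


def frequency_lower_bound(s, t):
    """Lower bound on the edit distance of s and t from character counts."""
    cs, ct = Counter(s), Counter(t)
    surplus_s = sum((cs - ct).values())  # characters s has in excess of t
    surplus_t = sum((ct - cs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def may_be_similar(s, t, tau):
    """Filter step: False means the pair certainly exceeds the threshold tau."""
    return frequency_lower_bound(s, t) <= tau


# Pairs that fail the filter never reach the expensive verification step.
pairs = [("kitten", "sitting"), ("abc", "xyz"), ("abc", "cab")]
print([(s, t) for s, t in pairs if may_be_similar(s, t, tau=2)])
```

The filter is one-sided: a pair that passes still has to be verified with a real edit distance computation, as in the filter-and-refine framework the abstract describes.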
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 31822029 and 32200517) and the National Key R&D Project Program of China (Grant No. 2019YFE0109600).
Abstract: Increasing the accuracy of nucleotide sequence alignment is an essential issue in genomics research. Although classic dynamic programming (DP) algorithms (e.g., Smith–Waterman and Needleman–Wunsch) are guaranteed to produce the optimal result, their time complexity hinders their application to large-scale sequence alignment. Many optimization efforts that aim to accelerate the alignment process generally come from three perspectives: redesigning data structures [e.g., diagonal or striped Single Instruction Multiple Data (SIMD) implementations], increasing the degree of parallelism in SIMD operations (e.g., difference recurrence relation), or reducing the search space (e.g., banded DP). However, no existing method combines all three aspects to build an ultra-fast algorithm. In this study, we developed a Banded Striped Aligner (BSAlign) library that delivers accurate alignment results at an ultra-fast speed by knitting a series of novel methods together to take advantage of all three of the aforementioned perspectives, with highlights such as the active F-loop in striped vectorization and the striped move in banded DP. We applied our new acceleration design to both regular and edit distance pairwise alignment. BSAlign achieved a 2-fold speed-up over other SIMD-based implementations for regular pairwise alignment, and a 1.5-fold to 4-fold speed-up in edit distance-based implementations for long reads. BSAlign is implemented in the C programming language and is available at https://github.com/ruanjue/bsalign.
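The search-space reduction mentioned in this abstract (banded DP) can be sketched in plain scalar code. The idea is that if two sequences are within edit distance k, the optimal alignment never leaves the diagonal band |i - j| <= k, so cells outside it can be skipped. This is only the generic band idea under an assumed threshold parameter `k`; it is not BSAlign's implementation, which is written in C and relies on SIMD striped vectorization.

```python
def banded_edit_distance(a, b, k):
    """Return the edit distance of a and b if it is <= k, otherwise None.

    Only cells with |i - j| <= k are evaluated. For clarity each row is
    allocated in full; a tuned version would store just the 2k + 1 band cells.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the length difference alone already exceeds k
    inf = k + 1      # any value above k is treated as "too far"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            curr[0] = i  # first column is inside the band while i <= k
            lo = 1
        for j in range(lo, hi + 1):
            best = min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitute
                prev[j] + 1,                           # delete from a
                curr[j - 1] + 1,                       # insert into a
            )
            curr[j] = min(best, inf)  # cap so out-of-band values stay "too far"
        prev = curr
    return prev[m] if prev[m] <= k else None


print(banded_edit_distance("GATTACA", "GACTATA", 3))
```

With a fixed k the number of evaluated cells grows linearly with sequence length rather than quadratically, which is why banding is attractive for long reads.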
Abstract: Graphs have been widely used for complex data representation in many real applications, such as social networks, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, we use a preprocessing strategy to remove mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed, which groups the nodes that can be reached in k hops from each key node while conserving structure; this is the k-hop tree based indexing method. For each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases also confirm that our method can achieve good join quality in graph similarity join. In addition, the join process can be finished in polynomial time.
Abstract: Haussler's convolution kernel provides an effective framework for engineering positive semidefinite kernels, and has a wide range of applications. The mapping kernel that we introduce in this paper is its natural generalization, and will enlarge the range of application significantly. Our main theorem with respect to positive semidefiniteness of the mapping kernel (1) implies Haussler's theorem as a corollary, (2) exhibits an easy-to-check necessary and sufficient condition for mapping kernels to be positive semidefinite, and (3) formalizes the mapping kernel so that significant flexibility is provided in engineering new kernels. As evidence of the effectiveness of our results, we present a framework to engineer tree kernels. The tree is a data structure widely used in many applications, and tree kernels provide an effective method to analyze tree-type data. Thus, the framework is important not only as an example but also as a practical research tool. The description of the framework is accompanied by a survey of the tree kernels in the literature, where we see that 18 out of the 19 surveyed tree kernels of different types are instances of the mapping kernel, and by examples of novel, interesting tree kernels.
Abstract: Inexact graph matching algorithms have proved to be useful in many applications, such as character recognition, shape analysis, and image analysis. Inexact graph matching is, however, inherently an NP-hard problem with exponential computational complexity. Much of the previous research has focused on solving this problem using heuristics or estimations. Unfortunately, many of these techniques do not guarantee that an optimal solution will be found. The aim of the proposed algorithm is to reduce the complexity of the inexact graph matching process, while still producing an optimal solution for a known application. This is achieved by greatly simplifying each individual matching process, and compensating for lost robustness by producing a hierarchy of matching processes. The creation of each matching process in the hierarchy is driven by an application-specific criterion that operates at the subgraph scale. To our knowledge, this problem has never before been approached in this manner. Results show that the proposed algorithm is faster than two existing methods based on graph edit operations. The proposed algorithm produces accurate results in terms of matching graphs, and shows promise for the application of shape matching. The proposed algorithm can easily be extended to produce a sub-optimal solution if required.