Funding: National Natural Science Foundation of China (Grant No. 29767001).
Abstract: The molecular similarity of 139 organic compounds was calculated by the topological index method, and the flexible super-ball algorithm was used to scan for similar molecules and structures. The results show that the properties of organic compounds estimated by this method are reliable.
Abstract: The K-means algorithm is one of the most widely used algorithms in clustering analysis. To address the problems caused by the random selection of initial center points in the traditional algorithm, this paper proposes an improved K-means algorithm based on the similarity matrix. The improved algorithm avoids random selection of the initial center points and so provides effective initial points for the clustering process. This reduces the fluctuation in clustering results caused by the choice of initial points and yields better clustering quality. The experimental results also show that the F-measure of the improved K-means algorithm is greatly improved and that the clustering results are more stable.
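The abstract does not say how the similarity matrix is used to pick the initial centers, so the following is only a minimal sketch of one possible deterministic seeding, not the paper's method. It assumes Euclidean distance as the (dis)similarity, starts from the most dissimilar pair of points, and adds further seeds farthest-first. The function names `farthest_first_seeds` and `kmeans` are hypothetical.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances, used here as a dissimilarity matrix."""
    return [[math.dist(p, q) for q in points] for p in points]


def farthest_first_seeds(points, k):
    """Deterministic seeds: the most dissimilar pair, then farthest-first.

    Assumes at least two points. This is an illustrative choice only.
    """
    d = distance_matrix(points)
    n = len(points)
    pairs = ((a, b) for a in range(n) for b in range(a + 1, n))
    i, j = max(pairs, key=lambda ab: d[ab[0]][ab[1]])
    chosen = [i, j][:k]
    while len(chosen) < k:
        # Next seed: the point whose nearest chosen seed is farthest away.
        candidates = (c for c in range(n) if c not in chosen)
        chosen.append(max(candidates, key=lambda c: min(d[c][s] for s in chosen)))
    return [points[c] for c in chosen]


def kmeans(points, k, iterations=100):
    """Standard Lloyd iterations started from the deterministic seeds."""
    centers = farthest_first_seeds(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(points, 2)
```

Because no random number generator is involved, repeated runs on the same data return the same centers, which is the stability property the abstract is concerned with.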
Abstract: The fundamental problem of similarity studies, in the framework of data mining, is to examine and detect similar items in very large articles, papers, and books. In this paper, we are interested in the probabilistic, statistical, and algorithmic aspects of the study of texts. We use the approach of k-shinglings, a k-shingling being defined as a sequence of k consecutive characters extracted from a text (k ≥ 1). The main challenge in this field is to find accurate and fast algorithms that compute the similarity in a short time. This is achieved by using approximation methods. The first approximation method is statistical and is based on the Glivenko-Cantelli theorem. The second is the banding technique. The third is a modification of the algorithm proposed by Rajaraman et al. ([1]), denoted here as (RUM). The Jaccard index is the similarity measure used in this paper. Finally, we illustrate the results of the paper on the four Gospels. The results are very conclusive.
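As a small illustration of the two definitions used in this abstract, the sketch below computes character k-shingles and the exact Jaccard index of two shingle sets. It does not implement any of the three approximation methods; the function names and the convention that two empty sets have similarity 1.0 are assumptions made for the example.

```python
def shingles(text: str, k: int) -> set:
    """Return the set of k-shingles (runs of k consecutive characters)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| (assumed to be 1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


s1 = shingles("abcd", 2)      # {"ab", "bc", "cd"}
s2 = shingles("bcde", 2)      # {"bc", "cd", "de"}
similarity = jaccard(s1, s2)  # 2 shared shingles out of 4 distinct: 0.5
```

The exact computation compares full shingle sets, which is what becomes expensive on book-sized texts and motivates the approximation methods the abstract lists.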
Abstract: Pattern discovery from time series is of fundamental importance. Most pattern discovery algorithms for time series capture the values of the series using some kind of similarity measure. Because they are affected by scale and baseline, value-based methods cause problems when the objective is to capture shape. Thus, a similarity measure based on shape, the Sh measure, is proposed, and the properties of this similarity measure and the corresponding proofs are given. A time series shape pattern discovery algorithm based on the Sh measure is then put forward. The proposed algorithm terminates in a finite number of iterations, with given computational and storage complexity. Finally, experiments on synthetic datasets and sunspot datasets demonstrate that the time series shape pattern algorithm is valid.
Funding: Sponsored by the Program for New Excellent Talents in University (Grant No. NZCT2004-0332).
Abstract: Although the encryption of network packets significantly increases privacy, the density of the traffic can still provide useful information to an observer and may result in a breach of confidentiality. In this paper, we address issues related to hiding information in a self-similar network, which has been shown to resemble modern communication networks. A statistical hiding algorithm is proposed for traffic padding. The figures and the comparison of Hurst parameters before and after traffic padding show that the algorithm performs effectively.
Funding: Supported by the National Natural Science Foundation of China (No. 60872018), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20070293001), and the 973 Project (No. 2007CB310607).
Abstract: In this paper, we propose an improved hybrid semantic matching algorithm that combines Input/Output (I/O) semantic matching with text lexical similarity. It addresses a weakness of existing semantic matching algorithms in semantic web service discovery: because they perform only I/O-based service signature matching, they cannot distinguish services that have the same I/O. The improved algorithm consists of two steps. The first is logic-based I/O concept ontology matching, which produces the candidate service set. The second is service name matching by lexical similarity against the candidate service set, which produces the final precise matching result. We tested the hybrid algorithm on the Ontology Web Language for Services (OWL-S) test collection and compared it with OWL-S Matchmaker-X (OWLS-MX). The experimental results show that the proposed algorithm can pick out the advertised service that best matches the user's request from among very similar ones and that it provides better matching precision and efficiency than OWLS-MX.
Abstract: With the rapid development of the air transport industry, flight irregularities have become increasingly serious because of factors such as the gradual reduction of resources, adverse weather conditions, problems in air traffic control, and mechanical failures. To reduce losses, using optimization algorithms to study the recovery of irregular flights has become a major problem for airlines. By upgrading the passenger recovery engine, this paper aims to provide the optimal recovery scheme for passengers, so as to reduce the risk of transferring overseas flights and thus reduce the economic loss of airlines. The optimization model and algorithm presented here are based on network flow. Combined with actual business requirements, they consider multiple optimization objectives together to generate passenger recovery solutions quickly, balancing airline revenue against the acceptance rate of passenger recovery. The practicability and effectiveness of the proposed model and algorithm are demonstrated by concrete examples.
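Abstract: To study the effect of dust on photovoltaic (PV) power generation performance, an experimental platform was built to collect daily power generation data from clean and soiled PV strings while meteorological data were monitored, and the effects of dust accumulation and weather on the power generation performance of PV modules were analyzed. The results show that the rise in PM2.5 mass concentration in winter and the frequent sandstorms in spring cause more dust to accumulate on the surface of PV modules, so the cumulative power generation loss grows quickly. In summer, increased precipitation makes it difficult for dust to accumulate on the modules, so the cumulative power generation loss grows slowly. In addition, the DTW (dynamic time warping) algorithm is used to find similar days. First, the weight of each meteorological parameter is calculated by the entropy method. Then, working backward by date, the DTW value of each meteorological parameter is calculated for each historical day, multiplied by its weight, and summed to give the composite DTW value of that historical day. By comparing the composite DTW values of the historical days, the meteorologically similar day closest to the current day is selected. Excluding extreme weather, part of the dataset is chosen as a validation set and the criterion for finding similar days is optimized: data from 09:00 to 15:00 each day are divided into three time periods for analysis, and the average solar irradiance is required to be no less than 600 W/m². After optimization, the prediction model achieves a coefficient of determination of 0.83 and a root mean square error of 0.22, a significant improvement in prediction performance. Finally, the algorithm is used to formulate a cleaning strategy for the PV power station. Comparing the cumulative power generation loss with the cleaning cost shows that, under prolonged rain-free conditions, the station should be cleaned once every 28 days.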
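The similar-day search described in this abstract (entropy weights, a DTW value per meteorological parameter, and a weighted sum) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the textbook DTW recurrence with absolute difference as the local cost, a basic entropy weight formula on non-negative values with at least two samples, and hypothetical function names and toy data. The time-period split and the irradiance threshold from the abstract are not modelled.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def entropy_weights(rows):
    """Entropy weight method: rows are samples, columns are parameters.

    Assumes non-negative values and at least two samples.
    """
    n, m = len(rows), len(rows[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in rows]
        total = sum(column)
        entropy = 0.0
        for x in column:
            if x > 0:
                p = x / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)
        # Clamp tiny negative values caused by floating-point rounding.
        divergences.append(max(0.0, 1.0 - entropy))
    total_divergence = sum(divergences)
    if total_divergence == 0:
        return [1.0 / m] * m  # every parameter is equally uninformative
    return [d / total_divergence for d in divergences]


def composite_dtw(current, historical, weights):
    """Weighted sum of per-parameter DTW values between two days."""
    return sum(w * dtw_distance(c, h) for w, c, h in zip(weights, current, historical))


def most_similar_day(current, history, weights):
    """Return the label of the historical day with the smallest composite DTW."""
    return min(history, key=lambda day: composite_dtw(current, history[day], weights))


# Toy data: each day holds one series per meteorological parameter.
weights = [0.5, 0.5]
current_day = [[1, 2, 3], [5, 5, 5]]
history = {
    "day_a": [[1, 2, 3], [5, 5, 6]],
    "day_b": [[3, 2, 1], [9, 9, 9]],
}
best = most_similar_day(current_day, history, weights)
```

In this toy data `day_a` differs from the current day in a single sample, so it has the smaller composite value and is selected.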