Abstract: In this paper, we propose a new online system that quickly detects malicious spam emails and, by being updated daily, adapts to changes in email contents and in the Uniform Resource Locator (URL) links leading to malicious websites. We introduce an autonomous function that lets a server generate training examples: double-bounce emails are collected automatically, and their class labels are assigned by SPIKE, a crawler-type software that analyzes website maliciousness. Because spammers generally use botnets to spread large numbers of malicious emails within a short time, the distributed spam emails often have the same or similar contents. Therefore, it is not necessary to learn all spam emails. To adapt quickly to new malicious campaigns, only new types of spam emails should be selected for learning, which can be realized by introducing an active learning scheme into the classifier model. For this purpose, we adopt the Resource Allocating Network with Locality Sensitive Hashing (RAN-LSH) as a classifier model with a data selection function. In RAN-LSH, spam emails that are the same as or similar to those already learned are looked up quickly in a Locality Sensitive Hashing (LSH) hash table, and matched emails classed as "well-learned" are discarded without being used as training data. To analyze email contents, we adopt the Bag of Words (BoW) approach and generate feature vectors whose attributes are transformed by the normalized term frequency-inverse document frequency (TF-IDF). To evaluate the performance of the proposed system, we use a data set of double-bounce spam emails collected at the National Institute of Information and Communications Technology (NICT) in Japan from March 1st, 2013 until May 10th, 2013. The results confirm that the proposed spam email detection system achieves a high detection rate.
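The data selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' RAN-LSH: it assumes whitespace tokenization, a smoothed TF-IDF weighting, and random-hyperplane LSH signatures, and it treats every bucket that has already been seen as "well-learned". The names `tfidf_vectors`, `signature`, and `select_for_training` are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors (term -> weight dicts)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption; the abstract does not give the formula).
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def signature(vec, n_bits=16, seed=0):
    """Random-hyperplane LSH signature of a sparse vector.

    Each (bit, term) pair gets a reproducible Gaussian coefficient by
    seeding a generator with a string, so no dense matrix is needed.
    """
    bits = []
    for bit in range(n_bits):
        dot = sum(
            w * random.Random(f"{seed}:{bit}:{term}").gauss(0.0, 1.0)
            for term, w in vec.items()
        )
        bits.append(dot >= 0.0)
    return tuple(bits)


def select_for_training(vectors):
    """Return indices of emails whose LSH bucket has not been seen yet."""
    seen_buckets = set()
    selected = []
    for index, vec in enumerate(vectors):
        bucket = signature(vec)
        if bucket in seen_buckets:
            continue  # treated as "well-learned": skip as training data
        seen_buckets.add(bucket)
        selected.append(index)
    return selected


emails = ["cheap pills buy now", "cheap pills buy now", "meeting agenda for monday"]
print(select_for_training(tfidf_vectors(emails)))
```

In this sketch the second email is an exact duplicate of the first, so it lands in the same bucket and is skipped. A real system would also need a rule for deciding when a bucket counts as well-learned, which the abstract does not specify.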
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 61173143), the Special Public Sector Research Program of China (Grant No. GYHY201206030), and the Deanship of Scientific Research at King Saud University, which funded this work through research group No. RGP-VPP-264.
Abstract: In recent years, the nearest neighbor search (NNS) problem has arisen in a wide range of applications. Locality-sensitive hashing (LSH), a popular algorithm for the approximate nearest neighbor problem, has proven to be an efficient method for solving the NNS problem in high-dimensional, large-scale databases. This paper introduces a novel improved algorithm, randomness-based locality-sensitive hashing (RLSH), built on the p-stable LSH scheme. The proposed algorithm modifies the query strategy: during a nearest neighbor query, it randomly selects a hash table onto which to project the query point, instead of mapping the query point into all hash tables, and then reconstructs the candidate points for finding the nearest neighbors. This strategy ensures that RLSH spends less time searching for the nearest neighbors than the p-stable LSH algorithm while keeping a high recall. In addition, the strategy is shown to promote the diversity of the candidate points even with fewer hash tables. Experiments are carried out on a synthetic dataset and an open dataset. The results show that our method requires less time and less space than p-stable LSH at the same recall.
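The difference between the two query strategies can be sketched as follows. This is an illustrative sketch, not the paper's RLSH: it uses the standard p-stable hash h(v) = floor((a·v + b) / w) with Gaussian (2-stable) coefficients, picks one table uniformly at random at query time, and omits the candidate reconstruction step, which the abstract does not describe. The class name `PStableLSH` and all parameter names and defaults are assumptions.

```python
import math
import random


class PStableLSH:
    """Toy p-stable (Gaussian) LSH index with two query strategies."""

    def __init__(self, dim, n_tables=4, n_hashes=3, width=4.0, seed=0):
        self.rng = random.Random(seed)
        self.width = width
        self.tables = []
        for _ in range(n_tables):
            # Each hash function is a pair (a, b): a ~ N(0, 1)^dim, b ~ U[0, w).
            funcs = [
                (
                    [self.rng.gauss(0.0, 1.0) for _ in range(dim)],
                    self.rng.uniform(0.0, width),
                )
                for _ in range(n_hashes)
            ]
            self.tables.append((funcs, {}))

    def _key(self, funcs, point):
        # Concatenate floor((a . v + b) / w) over the table's hash functions.
        return tuple(
            math.floor(
                (sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / self.width
            )
            for a, b in funcs
        )

    def add(self, point):
        point = tuple(point)
        for funcs, buckets in self.tables:
            buckets.setdefault(self._key(funcs, point), []).append(point)

    def query_all_tables(self, query):
        """Baseline strategy: union of the matching buckets of every table."""
        candidates = set()
        for funcs, buckets in self.tables:
            candidates.update(buckets.get(self._key(funcs, query), []))
        return candidates

    def query_random_table(self, query):
        """Randomized strategy: probe a single, randomly chosen table."""
        funcs, buckets = self.rng.choice(self.tables)
        return set(buckets.get(self._key(funcs, query), []))


index = PStableLSH(dim=2, seed=1)
for p in [(0.0, 0.0), (0.1, 0.0), (9.0, 9.0)]:
    index.add(p)
print(len(index.query_all_tables((0.1, 0.0))))
print(len(index.query_random_table((0.1, 0.0))))
```

Probing one table hashes the query once per query rather than once per table, which is where the time saving would come from. The single-table candidate set is always a subset of the all-tables set, so recall has to be recovered by some other means, such as the reconstruction step mentioned in the abstract.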
Funding: Supported by the Guangdong Innovative Research Team Program (No. 201001N0104744201) and the State Key Program of the National Natural Science Foundation of China (No. 51437006).
Abstract: With the growing penetration of wind power in power systems, more accurate prediction of wind speed and wind power is required for real-time scheduling and operation. In this paper, a novel forecast model for short-term prediction of wind speed and wind power is proposed, based on singular spectrum analysis (SSA) and locality-sensitive hashing (LSH). To deal with the high volatility of the original time series, SSA is applied to decompose it into two components: the mean trend, which represents the mean tendency of the original time series, and the fluctuation component, which reveals its stochastic characteristics. Both components are reconstructed in a phase space to obtain mean trend segments and fluctuation component segments. LSH is then used to select similar segments among the mean trend segments, which are employed in local forecasting so that the accuracy and efficiency of prediction can be enhanced. Finally, support vector regression is adopted for prediction, where the training input is the synthesis of the similar mean trend segments and the corresponding fluctuation component segments. Simulation studies are conducted on wind speed and wind power time series from four databases, and the results demonstrate that the proposed model is more accurate and stable than other models.
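The SSA step that splits a series into a mean trend and a fluctuation component can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's procedure: the trend is taken as the leading SSA component only (the abstract does not say how components are grouped), the leading singular vector is found by power iteration rather than a full SVD, and the function name `ssa_split` and its parameters are hypothetical. The LSH segment selection and support vector regression stages are not shown.

```python
import math
import random


def ssa_split(series, window, n_iter=200, seed=0):
    """Split a series into (trend, fluctuation) using the leading SSA component."""
    n = len(series)
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and len(series)")
    k = n - window + 1
    # Trajectory (Hankel) matrix X: row i, column t holds series[i + t].
    traj = [[series[i + t] for t in range(k)] for i in range(window)]
    # S = X X^T; its leading eigenvector is the leading left singular vector.
    cov = [
        [sum(traj[i][t] * traj[j][t] for t in range(k)) for j in range(window)]
        for i in range(window)
    ]
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(window)]
    for _ in range(n_iter):  # power iteration
        u_next = [sum(cov[i][j] * u[j] for j in range(window)) for i in range(window)]
        norm = math.sqrt(sum(x * x for x in u_next))
        if norm == 0.0:
            break  # all-zero series: nothing to decompose
        u = [x / norm for x in u_next]
    # Rank-1 approximation u (u^T X), then diagonal averaging back to a series.
    proj = [sum(u[i] * traj[i][t] for i in range(window)) for t in range(k)]
    sums = [0.0] * n
    counts = [0] * n
    for i in range(window):
        for t in range(k):
            sums[i + t] += u[i] * proj[t]
            counts[i + t] += 1
    trend = [s / c for s, c in zip(sums, counts)]
    fluctuation = [x - m for x, m in zip(series, trend)]
    return trend, fluctuation


wind_speed = [5.0, 5.4, 4.9, 5.8, 6.1, 5.7, 6.4, 6.0, 6.8, 7.1, 6.6, 7.3]
trend, fluctuation = ssa_split(wind_speed, window=4)
print([round(x, 2) for x in trend])
```

By construction the two outputs sum back to the input series, so a downstream forecaster could model each part separately and add the predictions. The sample values in `wind_speed` are made up for illustration and do not come from the paper's databases.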