Funding: Supported by the National Program on Key Basic Research Project (No. 2013CB329502), the National Natural Science Foundation of China (No. 61202212), the Special Research Project of the Educational Department of Shaanxi Province of China (No. 15JK1038), and the Key Research Project of Baoji University of Arts and Sciences (No. ZK16047).
Abstract: In recent years, multimedia annotation has attracted significant research attention in multimedia and computer vision, especially automatic image annotation, whose purpose is to provide an efficient and effective search environment in which users can query their images more easily. This paper presents a semi-supervised learning based probabilistic latent semantic analysis (PLSA) model for automatic image annotation. Since labeled images are often hard to obtain or create in large quantities while unlabeled ones are easier to collect, a transductive support vector machine (TSVM) is exploited to enhance the quality of the training image data. Furthermore, image features with different magnitudes lead to different annotation performance. To this end, a Gaussian normalization method is used to normalize the different features extracted from effective image regions segmented by the normalized cuts algorithm, so as to preserve the intrinsic content of images as completely as possible. Finally, a PLSA model with asymmetric modalities is constructed based on the expectation maximization (EM) algorithm to predict a candidate set of annotations with confidence scores. Extensive experiments on the general-purpose Corel5k dataset demonstrate that the proposed model significantly improves on the performance of traditional PLSA for automatic image annotation.
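The normalization step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that "Gaussian normalization" means standardizing each feature dimension to zero mean and unit variance; the abstract does not give the exact formula, and the function name `gaussian_normalize` and the toy feature values are hypothetical.

```python
from statistics import mean, pstdev


def gaussian_normalize(rows):
    """Standardize each feature column to zero mean and unit variance.

    Assumption: this per-column z-score is one common reading of
    "Gaussian normalization"; the paper's exact variant is not stated.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero standard deviation; divide by 1.0 instead.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical region features on very different scales
# (e.g. a texture statistic and a pixel count).
features = [
    [0.2, 1500.0],
    [0.4, 2500.0],
    [0.6, 3500.0],
]
normalized = gaussian_normalize(features)
```

After this step both columns have comparable magnitudes, so neither dominates a distance or likelihood computation merely because of its units.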
Abstract: Distinguishing phishing websites from legitimate ones is considered a great challenge for web service providers because the users of such websites are indistinguishable. Phishing websites also create traffic across the entire network. Another issue is that phishing spreads malware throughout the network, which highlights the demand for detection while massive datasets (i.e., big data) are processed. Although boosting mechanisms have been applied to phishing detection, these methods are prone to significant errors in their output, specifically because all website features are combined in the training stage. Upcoming big data systems require MapReduce, a popular parallel programming model, to process massive datasets. To address these issues, a probabilistic latent semantic and greedy levy gradient boosting (PLS-GLGB) algorithm for website phishing detection using MapReduce is proposed. A feature selection-based model is provided using a probabilistic intersective latent semantic preprocessing model to minimize errors in website phishing detection. Here, the missing data in each URL are identified and discarded before further processing to ensure data quality. Subsequently, with the preprocessed features (URLs), feature vectors are updated by the greedy levy divergence gradient model, which selects the optimal features in the URL and accurately detects the websites. Thus, greedy levy efficiently differentiates between phishing websites and legitimate websites. Experiments are conducted using one of the largest public corpora of a website phish tank dataset. Results show that the PLS-GLGB algorithm outperforms state-of-the-art phishing detection methods. It also substantially reduces phishing detection time and errors.
Funding: Supported by the National Natural Science Foundation of China (No. 61872038).
Abstract: Human Activity Recognition (HAR) has become a subject of concern and plays an important role in daily life. HAR uses sensor devices to collect user behavior data, obtain human activity information, and identify the activities. Markov Logic Networks (MLN) are widely used in HAR as an effective combination of knowledge and data. MLN can handle complexity and uncertainty and has good knowledge expression ability. However, MLN structure learning is relatively weak and requires a lot of computing and storage resources. Essentially, the MLN structure is derived from sensor data in the current scene. Assuming that the sensor data can be effectively sliced and the sliced data can be converted into semantic rules, the MLN structure can be obtained. To this end, we propose a rulebase building scheme based on probabilistic latent semantic analysis to provide a semantic rulebase for MLN learning. Such a rulebase can reduce the time required for MLN structure learning. We apply the rulebase building scheme to single-person indoor activity recognition and show that the scheme can effectively reduce the MLN learning time. In addition, we evaluate the parameters of the rulebase building scheme to check its stability.
Abstract: Probabilistic latent semantic analysis (PLSA) is a topic model for text documents that has been widely used in text mining, computer vision, computational biology, and other fields. For batch PLSA inference algorithms, the required memory size grows linearly with the data size, which makes handling massive data streams very difficult. To process big data streams, we propose an online belief propagation (OBP) algorithm based on an improved factor graph representation for PLSA. The factor graph of PLSA facilitates the classic belief propagation (BP) algorithm. Furthermore, OBP splits the data stream into a set of small segments and uses the estimated parameters of previous segments to calculate the gradient descent of the current segment. Because OBP removes each segment from memory after processing, it is memory-efficient for big data streams. We examine the performance of OBP on four document data sets and demonstrate that OBP is competitive with online expectation maximization (OEM) for PLSA in both speed and accuracy, and can also give a more accurate topic evolution. Experiments on massive data streams from Baidu further confirm the effectiveness of the OBP algorithm.