Abstract: Recently, approaches that use spatial-temporal features to form Bag-of-Words (BoWs) models have achieved great success due to their simplicity and effectiveness. However, they still have difficulty distinguishing between actions with high inter-class ambiguity. The main reason is that they describe actions by an orderless bag of features and ignore the spatial and temporal structure of the visual words. To improve classification performance, we present a novel approach called sequential Bag-of-Words. It captures the temporal sequential structure by segmenting the entire action into sub-actions. We also pay more attention to the distinguishing parts of an action by classifying the sub-actions separately; the sub-action results are then used to vote for the final result. Extensive experiments are conducted on challenging datasets and real scenes to evaluate our method. Concretely, we compare our results with some state-of-the-art classification approaches and confirm the advantages of our approach in distinguishing similar actions. Results show that our approach is robust and outperforms most existing BoWs-based classification approaches, especially on complex datasets with interactive activities, cluttered backgrounds and inter-class action ambiguities.
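The segment-then-vote idea described in this abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the equal-length segmentation, the majority-vote rule, and the function names `segment_histograms` and `vote_action_label` are all assumptions made for the example.

```python
from collections import Counter


def segment_histograms(word_ids, n_segments, vocab_size):
    """Split a temporally ordered list of visual-word ids into contiguous
    segments (assumed equal length here) and build one histogram per segment."""
    n = len(word_ids)
    histograms = []
    for s in range(n_segments):
        start = s * n // n_segments
        end = (s + 1) * n // n_segments
        hist = [0] * vocab_size
        for w in word_ids[start:end]:
            hist[w] += 1
        histograms.append(hist)
    return histograms


def vote_action_label(sub_action_labels):
    """Return the label predicted most often across the sub-actions.
    Ties go to the label seen first (an arbitrary choice for this sketch)."""
    if not sub_action_labels:
        raise ValueError("need at least one sub-action prediction")
    return Counter(sub_action_labels).most_common(1)[0][0]
```

Each per-segment histogram would be passed to a sub-action classifier of the reader's choice; the list of predicted labels is then given to `vote_action_label`.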
Abstract: Due to advances in satellite and sensor technology, the number and size of Remote Sensing (RS) images continue to grow at a rapid pace. The continuous stream of sensor data from satellites poses major challenges for retrieving relevant information from those satellite data streams. The Bag-of-Words (BoW) framework is a leading image search approach that has been successfully applied to a broad range of computer vision problems and has therefore received much attention from the RS community. However, the recognition performance of a typical BoW framework becomes very poor in application scenarios where the appearance and texture of images are very similar. In this paper, we propose a simple method to improve the recognition performance of a typical BoW framework by representing images with local features extracted from base images. In addition, we propose a similarity measure for RS images that counts the number of identical words assigned to the images. We compare the performance of these methods with a typical BoW framework. Our experiments show that the proposed method achieves better recognition performance than BoW and requires less storage space for saving local invariant features.
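A similarity measure based on counting identical words can be sketched as below. The abstract does not say whether repeated words are counted once or with multiplicity; this sketch assumes multiplicity (a multiset intersection), and the name `shared_word_count` is hypothetical.

```python
from collections import Counter


def shared_word_count(words_a, words_b):
    """Count the visual words two images have in common.
    Assumption: a word occurring k times in both images counts k times."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Counter & Counter keeps the minimum count of each common key.
    return sum((counts_a & counts_b).values())
```

A larger count would indicate more similar images; how the score is normalized, if at all, is left open here.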
Funding: Supported by the National Natural Science Foundation of China (No. 61071135) and the National Science and Technology Support Program (No. 2013BAK02B04).
Abstract: Person re-identification (person re-id) aims to match observations of pedestrians from different cameras. It is a challenging task in real-world surveillance systems and draws extensive attention from the community. Most existing methods are based on supervised learning, which requires a large amount of labeled data. In this paper, we develop a robust unsupervised learning approach for person re-id. We propose an improved Bag-of-Words (iBoW) model to describe and match pedestrians under different camera views. The proposed descriptor does not require any re-id labels and is robust against pedestrian variations. Experiments show that the proposed iBoW descriptor outperforms other unsupervised methods. Combined with efficient metric learning algorithms, it achieves accuracy competitive with existing state-of-the-art methods on person re-identification benchmarks, including VIPeR, PRID450S, and Market1501.
Funding: Supported by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (61321002); Projects of Major International (Regional) Joint Research Program NSFC (61120106010); Beijing Education Committee Cooperation Building Foundation Project; Program for Changjiang Scholars and Innovative Research Team in University (IRT1208).
Abstract: Image classification based on bag-of-words (BOW) has broad application prospects in the pattern recognition field, but shortcomings such as reliance on a single feature and low classification accuracy are apparent. To deal with this problem, this paper proposes to combine two ingredients: (i) Three mutually complementary features are adopted to describe the images: pyramid histogram of words (PHOW), pyramid histogram of color (PHOC) and pyramid histogram of orientated gradients (PHOG). (ii) An adaptive feature-weight adjusted image categorization algorithm based on the SVM and decision-level fusion of multiple features is employed. Experiments carried out on the Caltech101 database confirm the validity of the proposed approach. The experimental results show that the classification accuracy of the proposed method is 7%-14% higher than that of traditional BOW methods. By fully using global, local and spatial information, the algorithm describes the feature information of the image more completely and flexibly through multi-feature fusion and the pyramid structure formed by spatial multi-resolution decomposition of the image. As a result, classification accuracy improves significantly.
Funding: Supported by the MOST, Taiwan under Grant No. 102-2221-E-468-013.
Abstract: This paper presents a human action recognition method. It analyzes the spatio-temporal grids along the dense trajectories and generates the histogram of oriented gradients (HOG) and the histogram of optical flow (HOF) to describe the appearance and motion of the human object. Then, HOG combined with HOF is converted to bag-of-words (BoWs) by the vocabulary tree. Finally, a random forest is applied to recognize the type of human action. In the experiments, the KTH database and the URADL database are used for performance evaluation. Compared with other approaches, our approach performs better on action videos with high inter-class and low inter-class variabilities.
Funding: Projects (41001260, 61173122, 61573380) supported by the National Natural Science Foundation of China; Project (11JJ5044) supported by the Hunan Provincial Natural Science Foundation of China.
Abstract: It is illegal to spread and transmit pornographic images over the Internet, in either real or artificial format. Traditional methods are designed to identify real pornographic images and are less efficient at dealing with artificial images. Therefore, criminals have turned to releasing artificial pornographic images in some specific settings, e.g., in social networks. To efficiently identify artificial pornographic images, a novel bag-of-visual-words based approach is proposed in this work. In the bag-of-words (BoW) framework, speeded-up robust features (SURF) are first adopted for feature extraction, then a visual vocabulary is constructed through K-means clustering and images are represented by an improved BoW encoding method, and finally the visual words are fed into a learning machine for training and classification. Unlike the traditional BoW method, the proposed method sets a weight on each visual word according to the number of features that each cluster contains. Moreover, a non-binary encoding method and a cross-matching strategy are used to improve the discriminative power of the visual words. Experimental results indicate that the proposed method outperforms the traditional method.
Abstract: A Deep Neural Sentiment Classification Network (DNSCN) is developed in this work to classify Twitter data unambiguously. It attempts to extract the negative and positive sentiments in the Twitter database. The main goal of the system is to find the sentiment behavior of tweets with minimum ambiguity. A well-defined neural network extracts deep features from the tweets automatically. Before the deep features are extracted, the text in each tweet is represented by Bag-of-Words (BoW) and Word Embeddings (WE) models. The effectiveness of the DNSCN architecture is analyzed using Twitter-Sanders-Apple2 (TSA2), Twitter-Sanders-Apple3 (TSA3), and Twitter-DataSet (TDS). TSA2 and TDS consist of positive and negative tweets, whereas TSA3 also has neutral tweets. Thus, the proposed DNSCN acts as a binary classifier for the TSA2 and TDS databases and as a multiclass classifier for TSA3. The performance of the DNSCN architecture is evaluated by F1 score, precision, and recall using 5-fold and 10-fold cross-validation. Results show that the DNSCN-WE model is more accurate than the DNSCN-BoW model for representing the tweets in the feature encoding. The F1 score of the DNSCN-BW based system is 0.98 on the TSA2 database (binary classification) and 0.97 on the TSA3 database (three-class classification). The system achieves a better F1 score of 0.99 on the TDS database.
Funding: This work is supported in part by the Jiangsu Basic Research Programs-Natural Science Foundation under grant number BK20181407; in part by the National Natural Science Foundation of China under grant numbers U1936118 and 61672294; in part by the Six Peak Talent Project of Jiangsu Province (R2016L13), the Qinglan Project of Jiangsu Province, and the "333" Project of Jiangsu Province; in part by the National Natural Science Foundation of China under grant numbers U1836208, 61702276, 61772283, 61602253, and 61601236; in part by the National Key R&D Program of China under grant 2018YFB1003205; in part by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund; and in part by the Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET) fund, China. Zhihua Xia is supported by the BK21+ program from the Ministry of Education of Korea.
Abstract: A privacy-preserving search model for JPEG images is proposed in this paper, which uses bag-of-encrypted-words based on QDCT (Quaternion Discrete Cosine Transform) encoding. A JPEG image is obtained by a series of steps such as DCT (Discrete Cosine Transform) transformation, quantization, and entropy coding. In this paper, we first transform the images from the spatial domain into the quaternion domain. By analyzing the algebraic relationship between QDCT and DCT, a QDCT quantization table and QDCT coding for color images are proposed. Then the compressed image data is encrypted through block permutation, intra-block permutation, single table substitution and a stream cipher. Finally, the similarity between the original image and the query image is measured on the cloud server side by the Manhattan distance between two feature vectors built with the bag-of-words model. The results show good performance in terms of resistance to security attacks and retrieval accuracy.
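The final matching step, comparing two bag-of-words feature vectors by Manhattan distance, can be sketched as below. This is a minimal illustration that omits the QDCT encoding and the encryption steps entirely; the function names and the dictionary layout of the database are assumptions.

```python
def manhattan_distance(hist_a, hist_b):
    """L1 (Manhattan) distance between two equal-length feature vectors."""
    if len(hist_a) != len(hist_b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def rank_by_distance(query_hist, database):
    """Return image names ordered from most to least similar to the query.
    `database` maps a (hypothetical) image name to its feature vector."""
    return sorted(database, key=lambda name: manhattan_distance(query_hist, database[name]))
```

A smaller distance means a closer match, so the first name returned by `rank_by_distance` is the best retrieval candidate.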
Abstract: Video understanding and content boundary detection are vital stages in video recommendation. However, previous content boundary detection methods require collecting information such as location, cast, action, and audio, and if any of these elements is missing, the results may be adversely affected. To address this issue and effectively detect transitions in video content, we introduce a video classification and boundary detection method named JudPriNet. The focus of this paper is on objects in videos along with their labels, enabling automatic scene detection in video clips and establishing semantic connections among local objects in the images. As a significant contribution, JudPriNet presents a framework that maps labels to a "Continuous Bag of Visual Words Model" to cluster labels and generate new standardized labels as video-type tags. This facilitates automatic classification of video clips. Furthermore, JudPriNet employs a Monte Carlo sampling method to classify video clips, treating the features of video clips as elements within the framework. The proposed method seamlessly integrates video and textual components without compromising training and inference speed. Through experimentation, we demonstrate that JudPriNet, with its semantic connections, is able to effectively classify videos alongside textual content. Our results indicate that, compared with several other detection approaches, JudPriNet excels at high-level content detection without disrupting the integrity of the video content, outperforming existing methods.
Funding: Project supported by the Fundamental Research Funds for the Central Universities, China (No. lzujbky-2013-41), the National Natural Science Foundation of China (No. 61201446), and the Basic Scientific Research Business Expenses of the Central University and Open Project of Key Laboratory for Magnetism and Magnetic Materials of the Ministry of Education, Lanzhou University (No. LZUMMM2015010).
Abstract: We propose a heterogeneous, mid-level feature based method for recognizing natural scene categories. The proposed feature introduces spatial information among the latent topics by means of a spatial pyramid, while the latent topics are obtained by using probabilistic latent semantic analysis (pLSA) based on the bag-of-words representation. The proposed feature always performs better than standard pLSA because the performance of pLSA is adversely affected in many cases by the loss of spatial information. By combining various interest point detectors and local region descriptors used in the bag-of-words model, the proposed feature can further improve diverse scene category recognition tasks. We also propose a two-stage framework for multi-class classification. In the first stage, for each possible detector/descriptor pair, adaptive boosting classifiers are employed to select the most discriminative topics and then compute posterior probabilities of an unknown image from those selected topics. The second stage uses the prod-max rule to combine information coming from multiple sources and assigns the unknown image to the scene category with the highest 'final' posterior probability. Experimental results on three benchmark scene datasets show that the proposed method exceeds most state-of-the-art methods.
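The second-stage combination can be sketched as below, reading the prod-max rule as "multiply the posteriors from all sources per class, then pick the class with the largest product". That reading, the function name `prod_max`, and the dictionary layout are assumptions made for illustration; the abstract gives no further detail.

```python
import math


def prod_max(posteriors_per_source):
    """Combine per-source class posteriors by product, then take the maximum.
    `posteriors_per_source` is a list of dicts mapping class name to posterior,
    one dict per detector/descriptor pair (all dicts share the same classes)."""
    if not posteriors_per_source:
        raise ValueError("need at least one source")
    classes = list(posteriors_per_source[0])
    scores = {c: math.prod(p[c] for p in posteriors_per_source) for c in classes}
    best = max(scores, key=scores.get)
    return best, scores
```

With two sources that disagree, the product favors the class that no source considers unlikely, which is the usual motivation for product-style fusion.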
Abstract: that are duplicate or near-duplicate to a query image. One of the most popular and practical methods in near-duplicate image retrieval is based on the bag-of-words (BoW) model. However, the fundamental deficiency of the current BoW method is the gap between a visual word and the image's semantic meaning. A similar problem also plagues existing text retrieval. A prevalent remedy in text retrieval is to eliminate text synonymy and polysemy and thereby improve overall performance. Our proposed approach borrows ideas from text retrieval and tries to overcome these deficiencies of the BoW model by treating the semantic gap problem as visual synonymy and polysemy issues. We use visual synonymy in a very general sense to describe the fact that many different visual words refer to the same visual meaning. By visual polysemy, we refer to the general fact that most visual words have more than one distinct meaning. To eliminate visual synonymy, we present an extended similarity function to implicitly extend the query visual words. To eliminate visual polysemy, we use visual patterns and prove that the most efficient way of using them is to merge the visual word vector with the visual pattern vector and obtain the similarity score by the cosine function. In addition, we observe that there is a high possibility that duplicate visual words occur in an adjacent area. Therefore, we modify the traditional Apriori algorithm to mine quantitative patterns, defined as patterns containing duplicate items. Experiments show that quantitative patterns improve mean average precision (MAP) significantly.
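The scoring step, merging the visual word vector with the visual pattern vector and taking the cosine, can be sketched as below. This sketch treats "merging" as plain concatenation with no relative weighting, which is an assumption; the function names are hypothetical, and pattern mining itself is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def merged_similarity(words_q, patterns_q, words_d, patterns_d):
    """Concatenate word and pattern vectors for query and database image,
    then score the pair with a single cosine."""
    return cosine_similarity(words_q + patterns_q, words_d + patterns_d)
```

Because both parts live in one vector, an image that matches the query on words but not on patterns scores lower than one that matches on both.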