Funding: Supported by the National Natural Science Foundation of China (61231015, 61170023), the National High Technology Research and Development Program of China (863 Program, 2015AA016306), the Internet of Things Development Funding Project of the Ministry of Industry in 2013 (25), the Technology Research Program of the Ministry of Public Security (2014JSYJA016), the Major Science and Technology Innovation Plan of Hubei Province (2013AAA020), and the Natural Science Foundation of Hubei Province (2014CFB712).
Abstract: Classic local space-time features are successful representations for action recognition in videos. However, these features tend to conflate object motions with camera motions, which seriously affects the accuracy of action recognition. In this paper, we propose an improved motion scale-invariant feature transform (iMoSIFT) algorithm to eliminate the negative effects caused by camera motions. Building on iMoSIFT, we consider the spatio-temporal structural relationships among iMoSIFT interest points and adopt locally weighted word context descriptors to encode these relationships. We then use a two-layer bag-of-words (BoW) representation for every video clip. The proposed approach is evaluated on three publicly available datasets, namely Weizmann, KTH, and UCF Sports. The experimental results clearly demonstrate the effectiveness of the proposed approach.
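The abstract does not detail how camera motion is separated from object motion. As a rough, illustrative sketch of this kind of compensation (not the authors' iMoSIFT implementation), the Python/OpenCV snippet below estimates the inter-frame camera motion as a RANSAC homography over matched SIFT keypoints and keeps only the points whose residual displacement exceeds a threshold; the homography model, the matching scheme, and the threshold `motion_thresh` are assumptions for illustration.

```python
import cv2
import numpy as np

def object_motion_keypoints(prev_gray, curr_gray, motion_thresh=2.0):
    """Illustrative sketch: keep keypoints whose motion deviates from the
    global (camera-induced) motion. The homography/RANSAC scheme and the
    threshold are assumptions, not the authors' exact iMoSIFT method."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return []

    # Match descriptors between consecutive frames.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(des1, des2)
    if len(matches) < 4:  # homography needs at least 4 correspondences
        return []

    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Model the camera motion as a homography estimated with RANSAC.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None:
        return []

    # Warp previous-frame points by the camera motion; any residual
    # displacement is attributed to genuine object motion.
    warped = cv2.perspectiveTransform(src.reshape(-1, 1, 2), H).reshape(-1, 2)
    residual = np.linalg.norm(dst - warped, axis=1)
    return [kp2[matches[i].trainIdx]
            for i in np.nonzero(residual > motion_thresh)[0]]
```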
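The two-layer BoW representation can likewise be read as two successive vector quantizations: one codebook over the local iMoSIFT descriptors and one over the word context descriptors, with the final video representation concatenating both histograms. The plain k-means pipeline and codebook sizes in the scikit-learn sketch below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(descriptors, k, seed=0):
    """Learn a visual vocabulary by k-means over stacked descriptors."""
    return KMeans(n_clusters=k, n_init=4, random_state=seed).fit(descriptors)

def bow_histogram(codebook, descriptors):
    """Quantize descriptors against the codebook and L1-normalize."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def two_layer_representation(cb_local, cb_context, local_desc, context_desc):
    """First layer: histogram over local iMoSIFT descriptors.
    Second layer: histogram over word context descriptors capturing the
    spatio-temporal layout of neighboring interest points.
    The video clip is represented by the concatenation of the two."""
    return np.concatenate([bow_histogram(cb_local, local_desc),
                           bow_histogram(cb_context, context_desc)])
```

The concatenated histograms would then feed a standard classifier (e.g., an SVM) for per-clip action recognition.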