Abstract
Gaze information is important for finding the region of interest (ROI), which indicates where the next action will happen. Supervised gaze estimation does not work on EPIC-Kitchens because the dataset lacks ground-truth gaze annotations. In this paper, we develop an unsupervised gaze estimation method that helps with egocentric action anticipation. We adopt the gaze map as a feature representation and feed it into a multi-modal network jointly with red-green-blue (RGB), optical flow, and object features. We explore the method on the EGTEA dataset. The estimated gaze map is further refined with dilation and a Gaussian filter, masked onto the original RGB frame, and encoded as the gaze modality. Our results outperform the strong Rolling-Unrolling LSTMs (RULSTM) baseline, with top-5 accuracy reaching 34.31% on the seen test set (S1) and 22.07% on the unseen test set (S2), improvements of 0.58% and 0.87%, respectively.
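The gaze-map post-processing described above (dilation, Gaussian smoothing, masking onto the RGB frame) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: the function name refine_gaze_map, the kernel sizes, and the sigma value are all hypothetical, and the gaze map is assumed to be a 2D heat map in [0, 1] at the frame's resolution.

```python
import cv2
import numpy as np

def refine_gaze_map(gaze_map, frame, dilate_ksize=15, blur_ksize=31, sigma=10.0):
    """Post-process an estimated gaze map and mask it onto the RGB frame.

    gaze_map: 2D float array in [0, 1], same spatial size as `frame`.
    frame:    HxWx3 uint8 RGB frame.
    Kernel sizes and sigma are illustrative assumptions, not the
    paper's reported hyper-parameters.
    """
    # Dilate to enlarge the high-response gaze region.
    kernel = np.ones((dilate_ksize, dilate_ksize), np.uint8)
    dilated = cv2.dilate(gaze_map.astype(np.float32), kernel)

    # Smooth the dilated map with a Gaussian filter.
    smoothed = cv2.GaussianBlur(dilated, (blur_ksize, blur_ksize), sigma)
    # Renormalize so the attended region keeps full intensity after blurring.
    smoothed = smoothed / (smoothed.max() + 1e-8)

    # Mask the smoothed map onto the original RGB frame to obtain
    # the gaze-modality input.
    masked = frame.astype(np.float32) * smoothed[..., None]
    return masked.astype(np.uint8)
```

Renormalizing after the blur keeps the peak of the mask at 1, so the fixated region is passed through undimmed while the periphery is attenuated.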
Funding
Supported by the National Natural Science Foundation of China (61772328).