Abstract
Self-supervised learning aims to learn a universal feature representation without labels. To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning framework that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. Specifically, we present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to supervised ImageNet pre-training and other self-supervised learning methods, our self-supervised DenseCL pre-training demonstrates consistently superior performance when transferred to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Specifically, our approach significantly outperforms the strong MoCo-v2 baseline by 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation, and 1.8% mIoU on Cityscapes semantic segmentation. The improvements are up to 3.5% AP and 8.8% mIoU over MoCo-v2, and 6.1% AP and 6.1% mIoU over the supervised counterpart, under the frozen-backbone evaluation protocol.
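To make the pixel-level objective concrete, the sketch below shows one plausible form of a dense contrastive (InfoNCE) loss in PyTorch. It assumes `f_q` and `f_k` are L2-normalized dense feature maps of shape (N, C, H, W) produced from two augmented views, and that `queue` (C, K) holds negative features from other images, as in a MoCo-style pipeline. The correspondence rule (pairing each query pixel with its most similar key pixel) follows the description above; all tensor names and the temperature value are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of a pixel-level contrastive loss under the assumptions
# stated above; names (f_q, f_k, queue, tau) are hypothetical.
import torch
import torch.nn.functional as F

def dense_contrastive_loss(f_q, f_k, queue, tau=0.2):
    n, c, h, w = f_q.shape
    q = f_q.flatten(2).permute(0, 2, 1)      # (N, HW, C) query pixel features
    k = f_k.flatten(2).permute(0, 2, 1)      # (N, HW, C) key pixel features

    # Dense correspondence: pair each query pixel with the most similar
    # key pixel across the two views (features are assumed L2-normalized,
    # so the dot product is the cosine similarity).
    sim = torch.bmm(q, k.transpose(1, 2))    # (N, HW, HW) pairwise similarities
    idx = sim.argmax(dim=2)                  # (N, HW) index of matched key pixel
    pos = torch.gather(k, 1, idx.unsqueeze(-1).expand(-1, -1, c))  # (N, HW, C)

    # Positive logit: similarity with the matched key pixel.
    l_pos = (q * pos).sum(dim=-1, keepdim=True)      # (N, HW, 1)
    # Negative logits: similarity with queued features from other images.
    l_neg = torch.einsum('npc,ck->npk', q, queue)    # (N, HW, K)

    # Standard InfoNCE: the positive is class 0 among (1 + K) candidates.
    logits = torch.cat([l_pos, l_neg], dim=2) / tau  # (N, HW, 1 + K)
    labels = torch.zeros(n * h * w, dtype=torch.long, device=f_q.device)
    return F.cross_entropy(logits.flatten(0, 1), labels)
```

In the full method this pixel-level loss would be combined with the usual image-level (global) contrastive loss and a momentum-updated key encoder; the sketch isolates only the dense term described in the abstract.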