dc.contributor.author: Chen, Jun
dc.contributor.author: Hu, Ming
dc.contributor.author: Li, Boyang
dc.contributor.author: Elhoseiny, Mohamed
dc.date.accessioned: 2022-06-05T13:11:38Z
dc.date.available: 2022-06-05T13:11:38Z
dc.date.issued: 2022-06-01
dc.identifier.uri: http://hdl.handle.net/10754/678605
dc.description.abstract: Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7×7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning the pretrained LoMaR on 384×384 images, it can reach 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 APbox on object detection and 0.5 APmask on instance segmentation. LoMaR is especially more computation-efficient on pretraining high-resolution images, e.g., it is 3.1× faster than MAE with 0.2% higher classification accuracy on pretraining 448×448 images. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code will be publicly available.
dc.publisher: arXiv
dc.relation.url: https://arxiv.org/pdf/2206.00790.pdf
dc.rights: Archived with thanks to arXiv
dc.title: Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction
dc.type: Preprint
dc.contributor.department: Computer Science Program
dc.contributor.department: Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division
dc.contributor.department: Visual Computing Center (VCC)
dc.eprint.version: Pre-print
dc.contributor.institution: Nanyang Technological University
dc.identifier.arxivid: 2206.00790
kaust.person: Chen, Jun
kaust.person: Hu, Ming
kaust.person: Elhoseiny, Mohamed
refterms.dateFOA: 2022-06-05T13:12:35Z


Files in this item

Name: 2206.00790.pdf
Size: 8.790Mb
Format: PDF
Description: Preprint