Method Details

Details for method 'InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions'

name	InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
challenge	pixel-level semantic labeling
details	We use Mask2Former as the segmentation framework and initialize our InternImage-H model with weights pre-trained on a 427M-sample joint dataset combining the public Laion-400M, YFCC-15M, and CC12M datasets. Following common practice, we first pre-train on Mapillary Vistas for 80k iterations and then fine-tune on Cityscapes for 80k iterations. The crop size is set to 1024×1024 in this experiment. With this setup, InternImage-H achieves 87.0 multi-scale mIoU on the validation set and 86.1 multi-scale mIoU on the test set.
publication	InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao. CVPR 2023. https://arxiv.org/abs/2211.05778
project page / code	https://github.com/OpenGVLab/InternImage
used Cityscapes data	fine annotations, coarse annotations
used external data	Laion-400M, YFCC-15M, CC12M, ImageNet, Mapillary
runtime	n/a
subsampling	no
submission date	November, 2022
previous submissions
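
The mIoU figures quoted in the details field are mean intersection-over-union scores: per-class IoU = TP / (TP + FP + FN), averaged over the evaluated classes. The following is a minimal sketch of that computation, not the official Cityscapes evaluation script and not the authors' code. The function names `confusion_matrix` and `mean_iou` are hypothetical, the ignore label of 255 is an assumption matching the common Cityscapes train-ID convention, and multi-scale testing is not reproduced here.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Build a (num_classes x num_classes) matrix; rows are ground truth, columns are predictions.

    Pixels whose ground-truth label equals ignore_index (assumed 255) are skipped.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index
    # Encode each (target, pred) pair as a single index, then count occurrences.
    idx = target[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur at all."""
    tp = np.diag(conf).astype(np.float64)
    # Column sum = TP + FP, row sum = TP + FN, so their sum minus TP is the union.
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0
    return float((tp[valid] / union[valid]).mean())


if __name__ == "__main__":
    # Toy two-class example; the last pixel is ignored.
    target = np.array([0, 0, 1, 1, 255])
    pred = np.array([0, 1, 1, 1, 0])
    conf = confusion_matrix(pred, target, num_classes=2)
    print(conf)
    print(round(mean_iou(conf), 4))
```

For a dataset-level score, the confusion matrices of all images would be summed before calling `mean_iou`, so that each class is weighted by its total pixel count rather than per image.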