Detailed Results


On this page, we provide detailed results covering the performance of all methods on all metrics, classes, and categories. Please refer to the Benchmark Suite for details on the evaluation and metrics. Jump to the individual tables via the following links:

Pixel-Level Semantic Labeling Task

Instance-Level Semantic Labeling Task

Panoptic Semantic Labeling Task

3D Vehicle Detection Task

Usage

Within each table, use the buttons in the first row to hide columns or to export the visible data to various formats. Use the widgets in the second row to filter the table by selecting values of interest (multiple selections possible). Click the numeric columns for sorting.

Pixel-Level Semantic Labeling Task

IoU on class-level
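The "average" column of the class-level table appears to be the unweighted mean of the 19 per-class IoU scores. The following is a minimal Python sketch of that relationship, assuming the usual intersection-over-union definition TP / (TP + FP + FN); the Benchmark Suite remains the authoritative description of the metric. The helper names `iou` and `mean_iou` are illustrative only and not part of any official tooling. The sample values are the per-class scores of the FCN 8s row in the table below.

```python
from typing import Sequence


def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection-over-union for one class from pixel counts (assumed definition)."""
    denom = tp + fp + fn
    # A class that never occurs and is never predicted has no defined IoU.
    return tp / denom if denom else float("nan")


def mean_iou(class_scores: Sequence[float]) -> float:
    """Unweighted mean over the per-class IoU scores."""
    return sum(class_scores) / len(class_scores)


# Per-class IoU values (road ... bicycle) of the FCN 8s row, in percent.
FCN8S_CLASS_IOU = [
    97.4, 78.4, 89.2, 34.9, 44.2, 47.4, 60.1, 65.0, 91.4, 69.3,
    93.9, 77.1, 51.4, 92.6, 35.3, 48.6, 46.5, 51.6, 66.8,
]

print(f"{mean_iou(FCN8S_CLASS_IOU):.1f}")  # 65.3, matching the "average" column
```

The same check can be repeated for any other row to confirm that its columns were read in the intended order.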

name | fine | coarse | 16-bit | depth | video | sub | code | title | authors | venue | description | Runtime [s] | average | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle
FCN 8syesyesnonononononononononoyesyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
0.5 | 65.3 | 97.4 | 78.4 | 89.2 | 34.9 | 44.2 | 47.4 | 60.1 | 65.0 | 91.4 | 69.3 | 93.9 | 77.1 | 51.4 | 92.6 | 35.3 | 48.6 | 46.5 | 51.6 | 66.8
RRR-ResNet152-MultiScaleyesyesyesyesnonononononononononoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
n/a | 75.8 | 98.3 | 84.0 | 92.0 | 50.8 | 54.5 | 62.6 | 67.7 | 73.7 | 92.8 | 70.8 | 95.0 | 82.6 | 60.6 | 95.0 | 65.3 | 83.1 | 76.6 | 63.3 | 71.3
Dilation10yesyesnonononononononononoyesyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
4.0 | 67.1 | 97.6 | 79.2 | 89.9 | 37.3 | 47.6 | 53.2 | 58.6 | 65.2 | 91.8 | 69.4 | 93.7 | 78.9 | 55.0 | 93.3 | 45.5 | 53.4 | 47.7 | 52.2 | 66.0
AdelaideyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
35.0 | 66.4 | 97.3 | 78.5 | 88.4 | 44.5 | 48.3 | 34.1 | 55.5 | 61.7 | 90.1 | 69.5 | 92.2 | 72.5 | 52.3 | 91.0 | 54.6 | 61.6 | 51.6 | 55.0 | 63.1
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono22yesyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
4.0 | 64.8 | 97.4 | 78.3 | 88.1 | 47.5 | 44.2 | 29.5 | 44.4 | 55.4 | 89.4 | 67.3 | 92.8 | 71.0 | 49.3 | 91.4 | 55.9 | 66.6 | 56.7 | 48.1 | 58.1
DeepLab LargeFOV Strongyesyesnononononononono22yesyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
4.0 | 63.1 | 97.3 | 77.7 | 87.7 | 43.6 | 40.5 | 29.7 | 44.5 | 55.4 | 89.4 | 67.0 | 92.7 | 71.2 | 49.4 | 91.4 | 48.7 | 56.7 | 49.1 | 47.9 | 58.6
DPNyesyesyesyesnononononono33nonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
n/a | 59.1 | 96.3 | 71.7 | 86.7 | 43.7 | 31.7 | 29.2 | 35.8 | 47.4 | 88.4 | 63.1 | 93.9 | 64.7 | 38.7 | 88.8 | 48.0 | 56.4 | 49.4 | 38.3 | 50.0
Segnet basicyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
0.06 | 57.0 | 96.4 | 73.2 | 84.0 | 28.5 | 29.0 | 35.7 | 39.8 | 45.2 | 87.0 | 63.8 | 91.8 | 62.8 | 42.8 | 89.3 | 38.1 | 43.1 | 44.2 | 35.8 | 51.9
Segnet extendedyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
0.06 | 56.1 | 95.6 | 70.1 | 82.8 | 29.9 | 31.9 | 38.1 | 43.1 | 44.6 | 87.3 | 62.3 | 91.7 | 67.3 | 50.7 | 87.9 | 21.7 | 29.0 | 34.7 | 40.5 | 56.6
CRFasRNNyesyesnononononononono22yesyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
0.7 | 62.5 | 96.3 | 73.9 | 88.2 | 47.6 | 41.3 | 35.2 | 49.5 | 59.7 | 90.6 | 66.1 | 93.5 | 70.4 | 34.7 | 90.1 | 39.2 | 57.5 | 55.4 | 43.9 | 54.6
Scale invariant CNN + CRFyesyesnonononoyesyesnonononoyesyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
n/a | 66.3 | 96.3 | 76.8 | 88.8 | 40.0 | 45.4 | 50.1 | 63.3 | 69.6 | 90.6 | 67.1 | 92.2 | 77.6 | 55.9 | 90.1 | 39.2 | 51.3 | 44.4 | 54.4 | 66.1
DPNyesyesnonononononononononononoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
n/a | 66.8 | 97.5 | 78.5 | 89.5 | 40.4 | 45.9 | 51.1 | 56.8 | 65.3 | 91.5 | 69.4 | 94.5 | 77.5 | 54.2 | 92.5 | 44.5 | 53.4 | 49.9 | 52.1 | 64.8
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a | 64.3 | 97.4 | 77.7 | 88.8 | 27.7 | 40.1 | 51.5 | 60.1 | 64.7 | 91.1 | 67.6 | 93.5 | 77.7 | 54.2 | 92.4 | 33.7 | 42.0 | 42.5 | 52.5 | 66.5
Adelaide_contextyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
n/a | 71.6 | 98.0 | 82.6 | 90.6 | 44.0 | 50.7 | 51.1 | 65.0 | 71.7 | 92.0 | 72.0 | 94.1 | 81.5 | 61.1 | 94.3 | 61.1 | 65.1 | 53.8 | 61.6 | 70.6
NVSegNetyesyesnonononononononononononoAnonymousAt inference, we use the image at 2 different scales. The same holds for training.
0.4 | 67.4 | 98.0 | 81.9 | 90.1 | 35.7 | 39.8 | 57.4 | 60.6 | 69.3 | 91.7 | 67.6 | 94.6 | 79.3 | 54.5 | 93.5 | 43.8 | 52.4 | 50.3 | 53.0 | 67.8
ENetyesyesnononononononono22yesyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
0.013 | 58.3 | 96.3 | 74.2 | 85.0 | 32.2 | 33.2 | 43.5 | 34.1 | 44.0 | 88.6 | 61.4 | 90.6 | 65.5 | 38.4 | 90.6 | 36.9 | 50.5 | 48.1 | 38.8 | 55.4
DeepLabv2-CRFyesyesnonononononononononoyesyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
n/a | 70.4 | 97.9 | 81.3 | 90.3 | 48.8 | 47.4 | 49.6 | 57.9 | 67.3 | 91.9 | 69.4 | 94.2 | 79.8 | 59.8 | 93.7 | 56.5 | 67.5 | 57.5 | 57.7 | 68.8
m-TCFsyesyesyesyesnonononononononononoAnonymousConvolutional Neural Network
1.0 | 71.8 | 98.2 | 83.6 | 91.2 | 48.4 | 53.2 | 55.8 | 64.3 | 70.3 | 92.2 | 70.2 | 94.5 | 79.9 | 59.2 | 94.1 | 56.0 | 69.1 | 58.2 | 56.7 | 68.4
DeepLab+DynamicCRFyesyesnonononononononononononoru.nl
n/a | 64.5 | 96.8 | 77.9 | 88.9 | 37.5 | 45.4 | 39.1 | 51.5 | 61.6 | 90.8 | 58.0 | 93.6 | 76.6 | 53.8 | 92.6 | 41.8 | 52.5 | 50.5 | 53.2 | 64.2
LRR-4xyesyesnonononononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
n/a | 69.7 | 97.7 | 79.9 | 90.7 | 44.4 | 48.6 | 58.6 | 68.2 | 72.0 | 92.5 | 69.3 | 94.7 | 81.6 | 60.0 | 94.0 | 43.6 | 56.8 | 47.2 | 54.8 | 69.7
LRR-4xyesyesyesyesnonononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
n/a | 71.8 | 97.9 | 81.5 | 91.4 | 50.5 | 52.7 | 59.4 | 66.8 | 72.7 | 92.5 | 70.1 | 95.0 | 81.3 | 60.1 | 94.3 | 51.2 | 67.7 | 54.6 | 55.6 | 69.6
Le_Selfdriving_VGGyesyesnonononononononononononoAnonymous
n/a | 65.9 | 97.5 | 78.7 | 88.9 | 42.7 | 44.2 | 46.4 | 53.4 | 61.1 | 90.2 | 68.6 | 93.4 | 74.1 | 48.5 | 91.9 | 44.9 | 62.4 | 52.3 | 51.2 | 61.3
SQyesyesnonononononononononononoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
0.06 | 59.8 | 96.9 | 75.4 | 87.9 | 31.6 | 35.7 | 50.9 | 52.0 | 61.7 | 90.9 | 65.8 | 93.0 | 73.8 | 42.6 | 91.5 | 18.8 | 41.2 | 33.3 | 34.0 | 59.9
SAITyesyesyesyesnonononononononononoAnonymousAnonymous
4.0 | 76.9 | 98.5 | 85.8 | 92.6 | 59.2 | 56.6 | 62.4 | 69.4 | 75.3 | 93.2 | 72.1 | 95.2 | 83.6 | 66.1 | 95.2 | 68.6 | 80.9 | 73.0 | 61.7 | 72.4
FoveaNetyesyesnonononononononononononoFoveaNetXin Li, Jiashi Feng1.caffe-master
2.resnet-101
3.single scale testing

Previously listed as "LXFCRN".
n/a | 74.1 | 98.2 | 83.2 | 91.5 | 44.4 | 51.2 | 63.2 | 70.8 | 75.5 | 92.7 | 70.1 | 94.5 | 83.3 | 64.2 | 94.6 | 60.8 | 70.7 | 63.3 | 63.0 | 73.2
RefineNetyesyesnonononononononononoyesyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
n/a | 73.6 | 98.2 | 83.3 | 91.3 | 47.8 | 50.4 | 56.1 | 66.9 | 71.3 | 92.3 | 70.3 | 94.8 | 80.9 | 63.3 | 94.5 | 64.6 | 76.1 | 64.3 | 62.2 | 70.0
SegModelyesyesnonononononononononononoAnonymousBoth train set (2975) and val set (500) are used to train model for this submission.
0.8 | 78.5 | 98.6 | 86.4 | 92.8 | 52.4 | 59.7 | 59.6 | 72.5 | 78.3 | 93.3 | 72.8 | 95.5 | 85.4 | 70.0 | 95.7 | 75.4 | 84.1 | 75.1 | 68.7 | 75.0
TuSimpleyesyesnonononononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
n/a | 77.6 | 98.5 | 85.5 | 92.8 | 58.6 | 55.5 | 65.0 | 73.5 | 77.9 | 93.3 | 72.0 | 95.2 | 84.8 | 68.5 | 95.4 | 70.9 | 78.8 | 68.7 | 65.9 | 73.8
Global-Local-RefinementyesyesnonononononononononononoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently published and similarly named approach, the submission name was updated.
n/a | 77.3 | 98.6 | 86.1 | 92.8 | 57.0 | 58.3 | 63.3 | 70.8 | 76.8 | 93.4 | 72.2 | 95.4 | 84.9 | 67.9 | 95.6 | 68.5 | 77.5 | 69.4 | 65.2 | 74.5
XPARSEyesyesnonononononononononononoAnonymous
n/a | 73.4 | 98.3 | 83.9 | 91.6 | 47.6 | 53.4 | 59.5 | 66.8 | 72.5 | 92.7 | 70.9 | 95.2 | 82.4 | 63.5 | 94.7 | 57.4 | 68.8 | 62.2 | 62.6 | 71.5
ResNet-38yesyesnonononononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
n/a | 78.4 | 98.5 | 85.7 | 93.1 | 55.5 | 59.1 | 67.1 | 74.8 | 78.7 | 93.7 | 72.6 | 95.5 | 86.6 | 69.2 | 95.7 | 64.5 | 78.8 | 74.1 | 69.0 | 76.7
SegModelyesyesyesyesnonononononononononoAnonymous
n/a | 79.2 | 98.6 | 86.2 | 93.0 | 53.7 | 60.4 | 64.2 | 73.5 | 78.5 | 93.4 | 72.2 | 95.5 | 85.3 | 68.6 | 95.8 | 77.9 | 87.0 | 78.0 | 68.0 | 75.1
Deep Layer Cascade (LC)yesyesnonononononononononononoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
n/a | 71.1 | 98.1 | 82.8 | 91.2 | 47.1 | 52.8 | 57.3 | 63.9 | 70.7 | 92.5 | 70.5 | 94.2 | 81.2 | 57.9 | 94.1 | 50.1 | 59.6 | 57.0 | 58.6 | 71.1
FRRNyesyesnononononononono22yesyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
n/a | 71.8 | 98.2 | 83.3 | 91.6 | 45.8 | 51.1 | 62.2 | 69.4 | 72.4 | 92.6 | 70.0 | 94.9 | 81.6 | 62.7 | 94.6 | 49.1 | 67.1 | 55.3 | 53.5 | 69.5
MNet_MPRGyesyesnonononononononononononoChubu University, MPRGwithout val dataset, external dataset (e.g. image net) and post-processing
0.6 | 71.9 | 98.1 | 82.9 | 91.8 | 43.6 | 50.5 | 64.3 | 71.4 | 74.6 | 92.7 | 70.3 | 94.7 | 82.4 | 60.9 | 94.1 | 50.9 | 62.5 | 57.2 | 53.8 | 70.0
ResNet-38yesyesyesyesnonononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
n/a | 80.6 | 98.7 | 86.9 | 93.3 | 60.4 | 62.9 | 67.6 | 75.0 | 78.7 | 93.7 | 73.7 | 95.5 | 86.8 | 71.1 | 96.1 | 75.2 | 87.6 | 81.9 | 69.8 | 76.7
FCN8s-QunjieYuyesyesnonononononononononononoAnonymous
n/a | 57.4 | 96.7 | 74.5 | 88.0 | 30.7 | 37.8 | 45.5 | 8.3 | 63.1 | 91.7 | 68.5 | 93.3 | 75.8 | 45.4 | 92.0 | 15.4 | 30.5 | 25.7 | 42.5 | 64.9
RGB-D FCNyesyesyesyesnonoyesyesnonononononoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
n/a | 67.4 | 97.9 | 81.2 | 90.7 | 41.0 | 44.8 | 56.8 | 65.3 | 69.4 | 91.9 | 68.7 | 94.7 | 78.9 | 52.9 | 93.1 | 38.8 | 53.1 | 43.7 | 51.0 | 67.0
MultiBoostyesyesyesyesnonoyesyesnono22nonoAnonymousBoosting based solution.
Publication is under review.
0.25 | 59.3 | 95.9 | 69.5 | 87.3 | 34.4 | 32.7 | 40.5 | 54.9 | 58.6 | 89.2 | 65.3 | 90.3 | 68.4 | 42.5 | 89.0 | 22.5 | 51.9 | 40.9 | 36.5 | 55.7
GoogLeNet FCNyesyesnonononononononononononoGoing Deeper with ConvolutionsChristian Szegedy , Wei Liu , Yangqing Jia , Pierre Sermanet , Scott Reed , Dragomir Anguelov , Dumitru Erhan , Vincent Vanhoucke , Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
n/a | 63.0 | 97.4 | 77.9 | 89.2 | 35.0 | 39.0 | 50.6 | 59.8 | 64.1 | 91.2 | 66.9 | 93.7 | 76.2 | 45.1 | 92.6 | 33.4 | 40.4 | 32.7 | 47.3 | 64.6
ERFNet (pretrained)yesyesnononononononono22yesyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


0.02 | 69.7 | 97.9 | 82.1 | 90.7 | 45.2 | 50.4 | 59.0 | 62.6 | 68.4 | 91.9 | 69.4 | 94.2 | 78.5 | 59.8 | 93.4 | 52.3 | 60.8 | 53.7 | 49.9 | 64.2
ERFNet (from scratch)yesyesnononononononono22yesyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining nor coarse labels
0.02 | 68.0 | 97.7 | 81.0 | 89.8 | 42.5 | 48.0 | 56.2 | 59.8 | 65.3 | 91.4 | 68.2 | 94.2 | 76.8 | 57.1 | 92.8 | 50.8 | 60.1 | 51.8 | 47.3 | 61.6
TuSimple_CoarseyesyesyesyesnonononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
n/a | 80.1 | 98.5 | 85.9 | 93.2 | 57.7 | 61.1 | 67.2 | 73.7 | 78.0 | 93.4 | 72.3 | 95.4 | 85.9 | 70.5 | 95.9 | 76.1 | 90.6 | 83.7 | 67.4 | 75.7
SAC-multipleyesyesnonononononononononononoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
n/a | 78.1 | 98.7 | 86.5 | 93.1 | 56.3 | 59.5 | 65.1 | 73.0 | 78.2 | 93.5 | 72.6 | 95.6 | 85.9 | 70.8 | 95.9 | 71.2 | 78.6 | 66.2 | 67.7 | 76.0
NetWarpyesyesyesyesnonononoyesyesnonononoAnonymous
n/a | 80.5 | 98.6 | 86.7 | 93.4 | 60.6 | 62.6 | 68.6 | 75.9 | 80.0 | 93.5 | 72.0 | 95.3 | 86.5 | 72.1 | 95.9 | 72.9 | 89.9 | 77.4 | 70.5 | 76.4
depthAwareSeg_RNN_ffyesyesnonononononononononoyesyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
n/a | 78.2 | 98.5 | 85.4 | 92.5 | 54.4 | 60.9 | 60.2 | 72.3 | 76.8 | 93.1 | 71.6 | 94.8 | 85.2 | 69.0 | 95.7 | 70.1 | 86.5 | 75.5 | 68.3 | 75.5
Ladder DenseNetyesyesnonononononononononoyesyesLadder-style DenseNets for Semantic Segmentation of Large Natural ImagesIvan Krešo, Josip Krapac, Siniša ŠegvićICCV 2017https://ivankreso.github.io/publication/ladder-densenet/
0.45 | 74.3 | 97.4 | 80.2 | 92.0 | 47.6 | 53.9 | 64.6 | 72.8 | 76.3 | 92.8 | 66.4 | 95.5 | 83.8 | 66.1 | 94.3 | 55.6 | 70.3 | 67.0 | 62.1 | 73.0
Real-time FCNyesyesyesyesnonononononononononoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
0.044 | 72.6 | 98.0 | 81.4 | 91.1 | 44.6 | 50.7 | 57.3 | 64.1 | 71.2 | 92.1 | 68.5 | 94.7 | 81.2 | 61.2 | 94.6 | 54.5 | 76.5 | 72.2 | 57.6 | 68.7
GridNetyesyesnonononononononononononoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
n/a | 69.5 | 98.0 | 82.8 | 90.8 | 41.8 | 48.3 | 59.3 | 65.4 | 69.4 | 92.4 | 69.2 | 93.8 | 81.8 | 62.3 | 93.1 | 41.8 | 56.2 | 49.0 | 55.2 | 69.1
PEARLyesyesnonononononoyesyesnonononoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
n/a | 75.4 | 98.4 | 84.5 | 92.1 | 54.1 | 56.6 | 60.4 | 69.0 | 74.0 | 92.9 | 70.9 | 95.2 | 83.5 | 65.7 | 95.0 | 61.8 | 72.2 | 69.6 | 64.8 | 72.8
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnonononononononoyesyesAnonymous
0.69 | 67.3 | 97.9 | 81.9 | 90.2 | 39.5 | 47.7 | 54.8 | 58.1 | 69.9 | 91.3 | 70.4 | 94.4 | 77.0 | 51.9 | 92.9 | 40.7 | 54.3 | 55.2 | 45.5 | 65.1
PSPNetyesyesyesyesnonononononononoyesyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
n/a | 81.2 | 98.7 | 86.9 | 93.5 | 58.4 | 63.7 | 67.7 | 76.1 | 80.5 | 93.6 | 72.2 | 95.3 | 86.8 | 71.9 | 96.2 | 77.7 | 91.5 | 83.6 | 70.8 | 77.5
motovisyesyesyesyesnonononononononononomotovis.com
n/a | 81.3 | 98.7 | 86.6 | 93.5 | 55.5 | 62.7 | 69.4 | 76.3 | 80.4 | 93.8 | 72.6 | 95.8 | 87.1 | 72.4 | 96.2 | 77.9 | 91.3 | 88.6 | 69.5 | 77.1
ML-CRNNyesyesnonononononononononononoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
n/a | 71.2 | 97.9 | 81.0 | 91.0 | 50.3 | 52.4 | 56.7 | 65.7 | 71.4 | 92.2 | 69.6 | 94.6 | 80.2 | 59.3 | 93.9 | 51.1 | 67.6 | 54.5 | 55.1 | 68.6
Hybrid ModelyesyesnonononononononononononoAnonymous
n/a | 65.8 | 97.5 | 78.5 | 89.0 | 39.0 | 46.1 | 48.6 | 58.7 | 64.0 | 91.2 | 68.3 | 91.8 | 76.8 | 51.9 | 92.2 | 40.0 | 50.6 | 44.9 | 54.3 | 66.6
tek-IflyyesyesnonononononononononononoIflytekIflytek-yinusing a fusion strategy of three single models, the best result of a single model is 80.01%,multi-scale
n/a | 81.1 | 98.6 | 86.3 | 93.5 | 61.2 | 64.1 | 66.0 | 75.6 | 79.1 | 93.7 | 72.8 | 95.6 | 86.3 | 69.9 | 96.0 | 76.8 | 90.7 | 86.8 | 71.0 | 77.1
GridNetyesyesnonononononononononoyesyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set without extra coarse annotated data (only 2975 images) and no pre-training (ImageNet) nor pre or post-processing.
n/a | 69.8 | 98.1 | 83.0 | 90.9 | 41.4 | 49.2 | 60.1 | 66.5 | 70.2 | 92.5 | 69.8 | 93.8 | 82.3 | 63.2 | 93.2 | 42.6 | 55.8 | 48.5 | 55.4 | 69.8
firenetyesyesnononononononono22nonoAnonymous
n/a | 68.2 | 94.1 | 74.2 | 87.4 | 40.1 | 44.6 | 54.2 | 65.4 | 65.1 | 90.0 | 66.5 | 92.1 | 76.7 | 61.8 | 92.8 | 45.0 | 64.9 | 59.3 | 54.4 | 67.5
DeepLabv3yesyesyesyesnonononononononononoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment ASPP module with image-level features encoding global context and further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
n/a | 81.3 | 98.6 | 86.2 | 93.5 | 55.2 | 63.2 | 70.0 | 77.1 | 81.3 | 93.8 | 72.3 | 95.9 | 87.6 | 73.4 | 96.3 | 75.1 | 90.4 | 85.1 | 72.1 | 78.3
EdgeSenseSegyesyesnonononononononononononoAnonymousDeep segmentation network with hard negative mining and other tricks.
n/a | 76.8 | 98.4 | 84.8 | 92.5 | 52.0 | 58.1 | 61.5 | 73.0 | 76.1 | 93.3 | 71.8 | 95.0 | 85.2 | 68.5 | 95.4 | 62.4 | 77.6 | 70.7 | 66.8 | 75.5
ScaleNetyesyesyesyesnonononononononononoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krauß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation. Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling field size accordingly. Our model is compact and can be extended easily to other research domains. Finally, the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
n/a | 75.1 | 98.3 | 84.8 | 92.4 | 50.1 | 59.6 | 62.8 | 71.8 | 76.8 | 93.2 | 71.4 | 94.6 | 83.6 | 65.2 | 95.1 | 56.0 | 71.6 | 59.9 | 66.3 | 73.6
K-netyesyesnonononononononononononoXinLiang Zhong
n/a | 76.0 | 98.3 | 84.3 | 92.1 | 52.3 | 56.5 | 59.0 | 69.8 | 73.2 | 92.7 | 70.4 | 94.6 | 83.0 | 66.3 | 95.2 | 68.5 | 79.9 | 74.0 | 62.5 | 72.1
MSNETyesyesnonononononononononononoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
0.2 | 76.8 | 98.3 | 85.0 | 92.5 | 48.6 | 56.7 | 67.5 | 75.5 | 78.4 | 93.3 | 72.8 | 94.9 | 85.8 | 71.1 | 95.3 | 59.8 | 73.3 | 65.9 | 68.5 | 76.4
Multitask LearningyesyesnonononononononononononoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
n/a | 78.5 | 98.4 | 85.2 | 92.8 | 54.2 | 60.8 | 62.4 | 73.4 | 77.5 | 93.3 | 71.5 | 95.1 | 84.9 | 69.5 | 95.3 | 68.5 | 86.2 | 80.0 | 67.8 | 75.6
DeepMotionyesyesnonononononononononononoAnonymousWe propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation.
n/a | 81.4 | 98.7 | 87.0 | 93.5 | 61.6 | 62.6 | 65.4 | 74.6 | 78.6 | 93.6 | 72.5 | 95.4 | 86.2 | 72.3 | 96.1 | 82.3 | 92.8 | 85.7 | 70.2 | 76.6
SR-AICyesyesyesyesnonononononononononoAnonymous
n/a | 81.9 | 98.7 | 87.2 | 93.7 | 62.6 | 64.7 | 69.0 | 76.4 | 80.8 | 93.7 | 73.3 | 95.5 | 86.8 | 72.2 | 96.2 | 77.9 | 90.6 | 87.9 | 71.2 | 77.3
Roadstar.ai_CV(SFNet)yesyesnonononononononononononoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi Chensame focus net (SFNet), based on only fine labels, with focus on the loss distribution and the same focus on every layer of the feature map
0.2 | 79.2 | 98.4 | 85.4 | 93.0 | 59.6 | 59.2 | 67.5 | 76.4 | 79.3 | 93.7 | 73.6 | 95.3 | 86.8 | 73.8 | 95.7 | 67.5 | 81.2 | 72.1 | 69.2 | 77.1
DFNyesyesyesyesnonononononononononoLearning a Discriminative Feature Network for Semantic SegmentationChangqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, Nong SangarxivMost existing methods of semantic segmentation still suffer from two aspects of challenges: intra-class inconsistency and inter-class indistinction. To tackle these two problems, we propose a Discriminative Feature Network (DFN), which contains two sub-networks: Smooth Network and Border Network. Specifically, to handle the intra-class inconsistency problem, we specially design a Smooth Network with Channel Attention Block and global average pooling to select the more discriminative features. Furthermore, we propose a Border Network to make the bilateral features of boundary distinguishable with deep semantic boundary supervision. Based on our proposed DFN, we achieve state-of-the-art performance 86.2% mean IOU on PASCAL VOC 2012 and 80.3% mean IOU on Cityscapes dataset.
n/a | 80.3 | 98.6 | 85.9 | 93.2 | 59.6 | 61.0 | 66.6 | 73.2 | 78.2 | 93.5 | 71.6 | 95.5 | 86.5 | 70.5 | 96.1 | 77.1 | 89.9 | 84.7 | 68.2 | 76.5
RelationNet_CoarseyesyesyesyesnonononononononononoRelationNet: Learning Deep-Aligned Representation for Semantic Image SegmentationYueqing ZhuangICPR Semantic image segmentation, which assigns labels in pixel level, plays a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning. However, one central problem of these methods is that deep convolution neural network gives little consideration to the correlation among pixels. To handle this issue, in this paper, we propose a novel deep neural network named RelationNet, which utilizes CNN and RNN to aggregate context information. Besides, a spatial correlation loss is applied to supervise RelationNet to align features of spatial pixels belonging to same category. Importantly, since it is expensive to obtain pixel-wise annotations, we exploit a new training method for combining the coarsely and finely labeled data. Separate experiments show the detailed improvements of each proposal. Experimental results demonstrate the effectiveness of our proposed method to the problem of semantic image segmentation.
more details
n/a82.498.887.994.067.764.470.277.181.193.973.595.887.873.496.475.389.488.172.078.3
ARSAITyesyesnonononononononononononoAnonymousanonymous
more details
1.073.698.282.791.948.551.360.767.773.492.869.995.282.661.994.958.866.766.762.371.4
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnonononononononoyesyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
more details
n/a82.098.485.093.661.763.967.777.480.893.771.995.686.772.895.779.993.189.772.678.2
EFBNETyesyesnonononononononononononoAnonymous
more details
n/a81.898.686.693.460.663.465.475.479.693.572.195.286.572.996.081.492.789.873.677.2
Ladder DenseNet v2yesyesnonononononononononononoJournal submissionAnonymousA DenseNet-121 model is used in the downsampling path, with a ladder-style skip-connection upsampling path on top of it.
more details
1.078.498.786.693.052.956.967.274.478.993.673.195.586.269.795.965.982.176.366.875.7
ESPNetyesyesnononononononono22yesyesESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh HajishirziWe introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.
more details
0.008960.395.773.386.632.836.447.146.955.489.866.092.568.545.889.940.047.740.736.454.9
ENet with the Lovász-Softmax lossyesyesnononononononono22yesyesThe Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networksMaxim Berman, Amal Rannen Triki, Matthew B. BlaschkoarxivThe Lovász-Softmax loss is a novel surrogate for optimizing the IoU measure in neural networks. Here we finetune the weights provided by the authors of ENet (arXiv:1606.02147) with this loss, for 10'000 iterations on the training dataset. The runtimes are unchanged with respect to the ENet architecture.
more details
0.01363.197.377.287.236.139.048.552.058.189.967.792.771.449.691.039.449.350.541.659.8
DRN_CRL_CoarseyesyesyesyesnonononononononoyesyesDense Relation Network: Learning Consistent and Context-Aware Representation For Semantic Image SegmentationYueqing ZhuangICIPDRN_CoarseSemantic image segmentation, which aims at assigning pixel-wise categories, is one of the challenging image understanding problems. Global context plays an important role in local pixel-wise category assignment. To make the best of global context, in this paper, we propose dense relation network (DRN) and context-restricted loss (CRL) to aggregate global and local information. DRN uses Recurrent Neural Network (RNN) with different skip lengths in spatial directions to get context-aware representations while CRL helps aggregate them to learn consistency. Compared with previous methods, our proposed method takes full advantage of hierarchical contextual representations to produce high-quality results. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Cityscapes and Pascal Context benchmarks, with mean-IoU of 82.8% and 49.0% respectively.
more details
n/a82.898.887.794.065.164.270.177.481.693.973.595.888.074.996.580.892.188.572.178.8
ShuffleSegyesyesyesyesnonononononononononoShuffleSeg: Real-time Semantic Segmentation NetworkMostafa Gamal, Mennatullah Siam, Mo'men Abdel-RazekUnder Review by ICIP 2018ShuffleSeg: An efficient realtime semantic segmentation network with skip connections and ShuffleNet units
more details
n/a58.395.672.085.131.933.739.444.051.188.763.892.564.438.589.137.051.140.935.952.8
SkipNet-MobileNetyesyesyesyesnonononononononononoRTSeg: Real-time Semantic Segmentation FrameworkMennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin JagersandUnder Review by ICIP 2018An efficient realtime semantic segmentation network with skip connections based on MobileNet.

more details
n/a61.595.873.986.236.739.444.547.254.389.566.092.969.345.189.935.653.945.644.858.2
ThunderNetyesyesnononononononono22nonoAnonymous
more details
0.010464.097.377.488.341.238.448.555.660.890.767.793.071.646.691.639.349.949.845.562.3
PAC: Perspective-adaptive ConvolutionsyesyesnonononononononononononoPerspective-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)Many existing scene parsing methods adopt Convolutional Neural Networks with receptive fields of fixed sizes and shapes, which frequently results in inconsistent predictions of large objects and invisibility of small objects. To tackle this issue, we propose perspective-adaptive convolutions to acquire receptive fields of flexible sizes and shapes during scene parsing. Through adding a new perspective regression layer, we can dynamically infer the position-adaptive perspective coefficient vectors utilized to reshape the convolutional patches. Consequently, the receptive fields can be adjusted automatically according to the various sizes and perspective deformations of the objects in scene images. Our proposed convolutions are differentiable to learn the convolutional parameters and perspective coefficients in an end-to-end way without any extra training supervision of object sizes. Furthermore, considering that the standard convolutions lack contextual information and spatial dependencies, we propose a context adaptive bias to capture both local and global contextual information through average pooling on the local feature patches and global feature maps, followed by flexible attentive summing to the convolutional results. The attentive weights are position-adaptive and context-aware, and can be learned through adding an additional context regression layer. Experiments on Cityscapes and ADE20K datasets well demonstrate the effectiveness of the proposed methods.
more details
n/a78.998.786.993.358.960.465.873.078.393.672.895.686.071.396.073.482.469.567.375.9
SU_NetnonononononononononononononoAnonymous
more details
n/a75.398.484.791.954.356.454.667.473.892.771.194.583.767.395.265.176.864.165.673.9
MobileNetV2PlusyesyesnonononononononononononoHuijun LiuMobileNetV2Plus
more details
n/a70.798.081.990.947.148.657.162.868.392.368.694.580.360.493.859.165.455.752.067.0
DeepLabv3+yesyesyesyesnonononononononoyesyes Encoder-Decoder with Atrous Separable Convolution for Semantic Image SegmentationLiang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig AdamarXivSpatial pyramid pooling module or encoder-decoder structure are used in deep neural networks for the semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We will provide more details in the coming update on the arXiv report.
more details
n/a82.198.787.093.959.563.771.478.282.294.073.095.888.073.396.478.090.983.973.878.9
RFMobileNetV2PlusyesyesnonononononononononononoHuijun LiuReceptive Field MobileNetV2Plus for Semantic Segmentation
more details
n/a70.797.880.690.944.347.059.565.672.392.568.694.181.660.694.351.361.257.154.069.5
GoogLeNetV1_ROByesyesnonononononononononononoAnonymousGoogLeNet-v1 FCN trained on Cityscapes, KITTI, and ScanNet, as required by the Robust Vision Challenge at CVPR'18 (http://robustvision.net/)
more details
n/a59.696.673.887.127.131.647.253.259.089.655.192.272.348.390.929.840.033.842.962.0
SAITv2yesyesyesyesnonononononononononoAnonymous
more details
0.02570.097.980.889.654.345.847.752.960.090.567.893.773.254.592.867.280.672.048.160.5
GUNetyesyesnononononononono22nonoGuided Upsampling Network for Real-Time Semantic SegmentationDavide MazziniarxivGuided Upsampling Network for Real-Time Semantic Segmentation
more details
0.0370.498.282.790.647.345.451.959.166.691.768.594.879.059.594.160.371.454.054.967.2
RMNetyesyesnonononononononononononoAnonymousA fast and light net for semantic segmentation.
more details
0.01464.597.579.388.537.040.252.655.758.290.567.293.372.851.291.442.553.752.741.958.5
ContextNetyesyesnonononononononononononoContextNet: Exploring Context and Detail for Semantic Segmentation in Real-timeRudra PK Poudel, Ujwal Bonde, Stephan Liwicki, Christopher ZacharXivModern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naive adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyze our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.3 frames per second at full (1024x2048) resolution.
more details
0.023866.197.679.288.843.842.937.952.058.990.066.992.072.253.991.754.066.558.448.961.1
RFLRyesyesyesyesyesyesnononono44nonoRandom Forest with Learned Representations for Semantic SegmentationByeongkeun Kang, Truong Q. NguyenIEEE Transactions on Image ProcessingRandom Forest with Learned Representations for Semantic Segmentation
more details
0.0330.082.032.958.99.97.110.111.917.974.043.684.626.72.164.98.910.64.42.417.7
DPCyesyesyesyesnonononononononoyesyesSearching for Efficient Multi-Scale Architectures for Dense Image PredictionLiang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon ShlensNIPS 2018In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that achieve state-of-the-art performance. Additionally, the resulting architecture (called DPC for Dense Prediction Cell) is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.
more details
n/a82.798.787.193.857.763.571.078.082.194.073.395.488.274.596.581.293.389.074.179.0
NV-ADLRyesyesyesyesnonononononononononoAnonymousNVIDIA Applied Deep Learning Research
more details
n/a83.298.787.694.163.566.772.078.882.494.375.396.187.873.396.478.993.090.472.778.6
Adaptive Affinity Field on PSPNetyesyesnonononononononononoyesyesAdaptive Affinity Field for Semantic SegmentationTsung-Wei Ke*, Jyh-Jing Hwang*, Ziwei Liu, Stella X. YuECCV 2018Existing semantic segmentation methods mostly rely on per-pixel supervision, unable to capture structural regularity present in natural images. Instead of learning to enforce semantic labels on individual pixels, we propose to enforce affinity field patterns in individual pixel neighbourhoods, i.e., the semantic label patterns of whether neighbouring pixels are in the same segment should match between the prediction and the ground-truth. The affinity fields characterize geometric relationships within the image, such as "motorcycles have round wheels". We further develop a novel method for learning the optimal neighbourhood size for each semantic category, with an adversarial loss that optimizes over worst-case scenarios. Unlike the common Conditional Random Field (CRF) approaches, our adaptive affinity field (AAF) method has no extra parameters during inference, and is less sensitive to appearance changes in the image.
more details
n/a79.198.585.693.053.859.065.975.078.493.772.495.686.470.595.973.982.776.968.776.4
APMoE_seg_ROByesyesnonononononononononoyesyesPixel-wise Attentional Gating for Parsimonious Pixel LabelingShu Kong, Charless FowlkesarxivThe Pixel-level Attentional Gating (PAG) unit is trained to choose, for each pixel, the pooling size used to aggregate the contextual region around it. There are multiple branches with different dilation rates for varied pooling sizes, and thus varying receptive fields. For this ROB challenge, PAG is expected to robustly aggregate information for the final prediction.

This is our entry for the Robust Vision Challenge 2018 workshop (ROB). The model is based on ResNet50, trained on a mixed dataset of Cityscapes, ScanNet and KITTI.
more details
0.956.596.173.987.630.729.545.448.060.290.566.892.369.920.691.227.835.619.430.257.9
BatMAN_ROByesyesyesyesnonononononononononoAnonymousbatch-normalized multistage attention network
more details
1.055.497.475.788.229.034.848.449.460.690.363.394.266.316.891.220.935.216.621.152.7
HiSS_ROByesyesnononononononono22nonoAnonymous
more details
0.0658.997.377.187.638.038.936.447.253.989.366.192.967.846.489.535.732.324.139.958.3
VENUS_ROByesyesnonononononononononononoAnonymousVENUS_ROB
more details
n/a66.496.977.989.648.145.744.155.062.591.167.293.875.247.192.752.661.851.646.863.0
VlocNet++_ROBnonononononononononononononoAnonymous
more details
n/a62.797.478.188.437.828.845.948.857.189.366.292.972.250.391.747.156.041.542.659.0
AHiSS_ROByesyesyesyesnononononono22nonoAnonymousAugmented Hierarchical Semantic Segmentation
more details
0.0670.698.081.689.956.552.343.055.260.890.668.193.773.555.493.167.579.166.453.962.1
IBN-PSP-SA_ROByesyesnonononononononononononoAnonymousIBN-PSP-SA_ROB
more details
n/a75.198.484.892.248.055.560.367.773.592.771.795.483.063.795.171.777.663.660.771.5
LDN2_ROByesyesnonononononononononononoAnonymousLadder DenseNet: https://ivankreso.github.io/publication/ladder-densenet/
more details
1.077.998.585.192.855.656.763.770.976.193.371.395.685.068.895.574.981.074.766.573.8
MiniNetyesyesnononononononono44nonoAnonymous
more details
0.00440.794.259.977.615.511.320.520.727.482.158.989.041.916.180.014.518.411.011.523.0
AdapNetv2_ROByesyesnonononononononononononoAnonymous
more details
n/a63.897.679.288.737.833.148.350.159.190.067.093.873.550.592.446.358.542.942.659.9
MapillaryAI_ROByesyesnonononononononononononoAnonymous
more details
n/a80.598.585.793.660.664.767.676.480.293.873.695.886.471.095.872.383.980.871.376.7
FCN101_ROByesyesnonononononononononononoAnonymous
more details
n/a30.491.651.875.014.07.60.90.00.976.640.576.431.80.174.98.90.015.40.011.2
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]Bosch autodrive challenge
more details
n/a73.997.780.490.249.051.760.960.169.892.265.488.382.064.594.571.984.272.356.871.2
EnsembleModel_BoschyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name was MaskRCNN_BOSH,firefly]We ensembled three models (erfnet, deeplab-mobilenet, tusimple) and gained a 0.57 improvement in the IoU Classes value. The best single model is 73.8549.
more details
n/a74.498.283.691.448.553.759.366.871.992.670.994.582.464.594.761.877.571.259.971.1
EVANetyesyesnonononononononononononoAnonymous
more details
n/a69.898.082.490.843.449.060.064.268.992.069.094.679.258.893.552.155.753.653.567.3
CLRCNetyesyesnonononononononononononoCLRCNet: Cascaded Low-Rank Convolutions for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method.
more details
0.01363.397.277.388.035.940.150.155.160.090.966.793.473.149.890.837.451.545.141.958.0
Edgenetyesyesnononononononono22nonoAnonymousA lightweight semantic segmentation network combined with edge information and channel-wise attention mechanism.
more details
0.0371.098.183.191.645.450.662.667.271.492.469.794.980.461.194.350.060.952.555.367.7
L2-SPyesyesyesyesnonononononononoyesyesExplicit Inductive Bias for Transfer Learning with Convolutional NetworksXuhong Li, Yves Grandvalet, Franck DavoineICML-2018With a simple variant of weight decay, L2-SP regularization (see the paper for details), we reproduced PSPNet based on the original ResNet-101 using the "train_fine + val_fine + train_extra" set (2975 + 500 + 20000 images), with a small batch size of 8. The sync batch normalization layer is implemented in TensorFlow (see the code).
more details
n/a81.298.786.893.664.763.467.474.579.493.673.195.686.572.596.176.090.783.170.576.5
ALV303yesyesnonononononononononononoAnonymous
more details
0.272.298.384.292.042.151.865.973.777.592.971.695.083.061.694.646.860.054.355.571.8
NCTU-ITRIyesyesnononononononono22nonoAnonymousFor the purpose of fast semantic segmentation, we design a CNN-based encoder-decoder architecture, which is called DSNet. The encoder part is constructed based on the concept of DenseNet, and a simple decoder is adopted to make the network more efficient without degrading the accuracy. We pre-train the encoder network on the ImageNet dataset. Then, only the fine-annotated Cityscapes dataset (2975 training images) is used to train the complete DSNet. The DSNet demonstrates a good trade-off between accuracy and speed. It can process 68 frames per second on 1024x512 resolution images on a single GTX 1080 Ti GPU.
more details
0.014769.198.082.190.442.445.856.461.166.691.770.094.377.259.193.249.559.456.353.565.8
ADSCNetyesyesnonononononononononononoADSCNet: Asymmetric Depthwise Separable Convolution for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method for mobile devices.
more details
0.01364.597.378.088.639.540.151.455.060.391.166.993.573.750.691.444.951.748.043.459.9
SRC-B-MachineLearningLabyesyesyesyesnonononononononononoJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoSamsung Research Center MachineLearningLab. The result is tested with multi-scale and flip. The paper is in preparation.
more details
n/a82.598.787.394.064.864.570.776.781.294.073.995.987.972.796.379.292.288.171.577.8
Tencent AI LabyesyesyesyesnonononononononononoAnonymous
more details
n/a82.998.686.994.163.563.070.777.780.294.073.195.987.874.596.382.894.390.474.077.5
ERINetyesyesnononononononono22nonoAnonymousEfficient residual inception networks for real-time semantic segmentation
more details
0.02369.898.082.490.441.747.759.063.067.792.069.694.778.357.393.450.560.761.553.365.7
PGCNet_Res101_fineyesyesnonononononononononononoAnonymousWe choose the ResNet101 pretrained on ImageNet as our backbone, then use both the train-fine and the val-fine data to train our model with a batch size of 8 for 80,000 iterations without any bells and whistles. We will release our paper later.
more details
n/a80.598.787.193.659.662.769.377.680.493.973.195.687.373.196.470.884.976.571.578.1
EDANetyesyesnononononononono22yesyesEfficient Dense Modules of Asymmetric Convolution for Real-Time Semantic SegmentationShao-Yuan Lo (NCTU), Hsueh-Ming Hang (NCTU), Sheng-Wei Chan (ITRI), Jing-Jhih Lin (ITRI)Training data: Fine annotations only (train+val. set, 2975+500 images) without any pretraining or coarse annotations.
For training on fine annotations (train set only, 2975 images), it attains a mIoU of 66.3%.

Runtime: (resolution 512x1024) 0.0092s on a single GTX 1080Ti, 0.0123s on a single Titan X.
more details
0.009267.397.880.689.542.046.052.359.865.091.468.793.675.754.392.440.958.756.050.464.0
OCNet_ResNet101_fineyesyesnonononononononononononoAnonymousContext is essential for various computer vision tasks.
The state-of-the-art scene parsing methods define the context as the prior of the scene categories (e.g., bathroom, bedroom, street).
Such scene context is not suitable for the street scene parsing tasks as most of the scenes are similar.

In this work, we propose the Object Context that captures the prior of the object's category that the pixel belongs to.
We compute the object context by aggregating all the pixels' features according to an attention map that encodes, for each pixel, the probability that it belongs to the same category as the associated pixel.
Specifically, we employ the self-attention method to compute the pixel-wise attention map.

We further propose the Pyramid Object Context and Atrous Spatial Pyramid Object Context to handle the problem of multi-scales.
more details
n/a81.298.787.193.759.462.369.678.080.893.972.695.887.573.596.473.688.280.671.978.3
Knowledge-AwareyesyesnonononononononononononoAnonymousKnowledge-Aware Semantic Segmentation
more details
n/a79.397.680.193.056.857.366.072.678.293.471.095.785.969.795.974.988.586.368.175.5
CASIA_IVA_DANet_NoCoarseyesyesnonononononononononoyesyesDual Attention Network for Scene SegmentationJun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing LuCVPR2019We address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps.
We sum the outputs of the two attention modules to further improve feature representation, which contributes to more precise segmentation results.
more details
n/a81.598.686.193.556.263.369.777.381.393.972.995.787.372.996.276.889.586.572.278.2
LDFNetyesyesnonononoyesyesnono22yesyesIncorporating Luminance, Depth and Color Information by Fusion-based Networks for Semantic SegmentationShang-Wei Hung, Shao-Yuan LoWe propose a preferred solution, which incorporates Luminance, Depth and color information by a Fusion-based network named LDFNet. It includes a distinctive encoder sub-network to process the depth maps and further employs the luminance images to assist the depth information in a process. LDFNet achieves very competitive results compared to the other state-of-the-art systems on the challenging Cityscapes dataset, while it maintains an inference speed faster than most of the existing top-performing networks. The experimental results show the effectiveness of the proposed information-fused approach and the potential of LDFNet for road scene understanding tasks.
more details
n/a71.398.183.591.145.949.562.067.170.792.570.794.881.062.993.952.661.055.154.568.0
CGNetyesyesnonononononononononoyesyesTianyi Wu et alWe propose a novel Context Guided Network for semantic segmentation on mobile devices. We first design a Context Guided (CG) block by considering the inherent characteristic of semantic segmentation. CG Block aggregates local feature, surrounding context feature and global context feature effectively and efficiently. Based on the CG block, we develop Context Guided Network (CGNet), which not only has a strong capacity of localization and recognition, but also has a low computational and memory footprint. Under a similar number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on Cityscapes and CamVid datasets verify the effectiveness of the proposed approach.
Specifically, without any post-processing, the proposed approach achieves 64.8% mean IoU on the Cityscapes test set with less than 0.5 M parameters, and has a frame rate of 50 fps on one NVIDIA Tesla K80 card for 2048 × 1024 high-resolution images.
more details
0.0264.895.973.989.943.946.052.955.963.891.768.394.176.754.291.341.356.032.841.160.9
SAITv2-lightyesyesyesyesnonononononononononoAnonymous
more details
0.02573.098.283.591.255.252.856.561.968.492.470.894.578.859.493.961.678.067.654.867.4
Deform_ResNet_BalancedyesyesnonononononononononononoAnonymous
more details
0.25851.394.869.083.017.418.638.528.036.588.656.088.263.029.389.423.636.831.428.454.6
NfS-SegyesyesyesyesnonoyesyesyesyesnonononoUncertainty-Aware Knowledge Distillation for Real-Time Scene Segmentation: 7.43 GFLOPs at Full-HD Image with 120 fpsAnonymous
more details
0.0083731273.198.283.791.355.653.157.262.668.792.470.994.678.959.794.060.776.367.355.467.8
Improving Semantic Segmentation via Video Propagation and Label RelaxationyesyesyesyesnonononoyesyesnonoyesyesImproving Semantic Segmentation via Video Propagation and Label RelaxationYi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan CatanzaroCVPR 2019Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018.
more details
n/a83.598.887.894.264.165.072.479.082.894.274.096.188.275.496.578.894.091.673.779.0
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as a post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and we test it against different datasets for outdoor scene understanding.
more details
0.0088468.997.981.290.247.447.850.957.364.691.467.994.077.157.393.552.669.053.151.064.1
SwiftNetRN-18yesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
0.024375.598.383.992.246.352.863.270.675.893.170.395.484.064.595.363.978.071.961.673.6
Fast-SCNNyesyesyesyesnonononononononononoFast-SCNN: Fast Semantic Segmentation NetworkRudra PK Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our 'learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.008168.097.981.689.746.448.648.353.160.590.767.294.374.054.693.057.465.558.250.061.2
Fast-SCNN (Half-resolution)yesyesyesyesnononononono22nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our 'learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.003562.897.477.887.439.741.835.039.450.588.563.392.765.746.491.056.970.356.540.952.6
Fast-SCNN (Quarter-resolution)yesyesnononononononono44nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.0020651.996.370.583.126.123.518.726.133.184.555.789.555.235.486.838.747.446.727.342.0
DSNetyesyesyesyesnononononono22yesyesDSNet for Real-Time Driving Scene Semantic SegmentationWenfu WangDSNet for Real-Time Driving Scene Semantic Segmentation
more details
0.02769.397.179.789.437.850.456.763.168.591.067.993.575.761.491.950.466.856.854.165.4
SwiftNetRN-18 pyramidyesyesnonononononononononononoAnonymous
more details
n/a74.498.484.792.048.250.962.769.174.592.669.895.383.866.295.464.875.455.960.772.4
DF-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019DF1-Seg-d8
more details
0.00771.497.981.690.644.850.453.262.768.091.768.393.978.858.993.450.571.477.456.466.4
DF-SegyesyesnonononononononononononoAnonymousDF2-Seg2
more details
0.01875.398.484.692.351.454.662.168.873.392.970.395.382.864.895.059.574.478.760.671.1
DDARyesyesyesyesnonononononononononoAnonymousDiDi Labs, AR Group
more details
n/a82.298.787.393.657.262.970.977.882.093.871.495.888.174.596.380.191.387.674.079.0
LDN-121yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-121 trained on train+val, fine labels only. Single-scale inference.
more details
0.04879.398.685.893.158.159.466.074.177.793.571.595.785.769.495.971.386.583.566.475.2
TKCNyesyesnonononononononononoyesyesTree-structured Kronecker Convolutional Network for Semantic SegmentationTianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, Jintao Li
more details
n/a79.598.485.893.051.761.767.675.880.093.672.795.486.970.995.964.586.981.869.677.6
RPNetyesyesnononononononono22yesyesResidual Pyramid Learning for Single-Shot Semantic SegmentationXiaoyu Chen, Xiaotian Lou, Lianfa Bai, Jing HanarXivwe put forward a method for single-shot segmentation in a feature residual pyramid network (RPNet), which learns the main and residuals of segmentation by decomposing the label at different levels of residual blocks.
more details
0.00868.397.981.289.840.245.756.361.667.891.768.094.578.257.492.948.357.856.149.662.2
naviyesyesnononononononononononoyuxbmulti-scale test
more details
n/a81.098.585.793.561.860.168.175.378.793.773.595.686.869.696.178.392.787.068.875.6
Auto-DeepLab-LyesyesyesyesnonononononononoyesyesAuto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image SegmentationChenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, Li Fei-FeiarxivIn this work, we study Neural Architecture Search for semantic image segmentation, an important computer vision task that assigns a semantic label to every pixel in an image. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network structure that controls the spatial resolution changes. This choice simplifies the search space, but becomes increasingly problematic for dense image prediction which exhibits a lot more network level architectural variations. Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space. We present a network level search space that includes many popular designs, and develop a formulation that allows efficient gradient-based architecture search (3 P100 GPU days on Cityscapes images). We demonstrate the effectiveness of the proposed method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Without any ImageNet pretraining, our architecture searched specifically for semantic image segmentation attains state-of-the-art performance. Please refer to https://arxiv.org/abs/1901.02985 for details.
more details
n/a82.198.887.693.861.464.471.277.680.994.172.796.087.872.896.578.290.988.469.077.6
LiteSeg-Darknet19yesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.010270.898.082.291.349.050.958.864.472.192.769.595.080.757.094.249.361.855.453.668.4
AdapNet++yesyesyesyesnonononononononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019In this work, we propose the AdapNet++ architecture for semantic segmentation that aims to achieve the right trade-off between performance and computational complexity of the model. AdapNet++ incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling (eASPP) module that has a larger effective receptive field with more than 10x fewer parameters compared to the standard ASPP, complemented with a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance while simultaneously being efficient in terms of both the number of parameters and inference time. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
more details
n/a81.398.686.293.357.862.067.375.079.693.672.395.386.472.296.281.592.488.071.276.6
SSMAyesyesyesyesnonoyesyesnonononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real-world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed SSMA fusion mechanism which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. Extensive experimental evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance in addition to providing exceptional robustness in adverse perceptual conditions. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
more details
n/a82.398.786.993.657.963.468.977.181.193.973.195.387.473.896.481.193.590.073.578.3
LiteSeg-MobilenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.006267.897.177.890.444.247.555.359.368.792.069.294.678.655.392.042.955.854.849.264.0
LiteSeg-ShufflenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.00751865.297.177.989.541.842.549.852.965.691.567.894.076.150.191.443.451.848.044.362.7
Fast OCNetyesyesnonononononononononononoAnonymous
more details
n/a82.198.787.393.759.962.670.578.481.093.873.495.687.675.196.475.890.188.971.978.6
ShuffleNet v2 + DPCyesyesyesyesnonononononononoyesyesAn efficient solution for semantic segmentation: ShuffleNet V2 with atrous separable convolutionsSercan Turkmen, Janne HeikkilaShuffleNet v2 with DPC at output_stride 16.
more details
n/a70.398.182.590.751.350.951.561.266.991.768.593.978.559.794.059.168.148.154.367.5
ERSNet-coarseyesyesyesyesnononononono44nonoAnonymous
more details
0.01267.697.981.389.845.043.954.456.563.391.568.094.575.153.692.950.266.955.044.460.3
MiniNet-v2-coarseyesyesyesyesnononononono22nonoAnonymous
more details
0.01267.898.081.590.044.644.154.657.763.991.668.294.575.454.593.250.166.853.144.061.3
SwiftNetRN-18 ensembleyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
n/a76.598.585.492.549.153.964.271.176.393.171.095.584.867.595.667.580.568.363.574.2
EFC_syncyesyesnonononononoyesyesnonononoAnonymous
more details
n/a80.298.686.293.358.062.966.474.479.293.673.095.686.571.596.075.888.677.469.476.7
PL-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019Following "partial order pruning", we conducted architecture search experiments on the Snapdragon 845 platform and obtained PL1A/PL1A-Seg.

1. Snapdragon 845
2. NCNN Library
3. Latency evaluated at 640x384

more details
0.019269.197.980.890.242.044.452.359.866.391.868.794.777.856.993.254.467.760.648.465.0
MiniNet-v2-pretrainedyesyesyesyesnononononono22nonoAnonymous
more details
0.01268.098.081.689.943.544.354.858.064.191.768.894.575.954.393.249.167.757.045.061.0
GALD-NetyesyesyesyesyesyesyesyesnonononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose the Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information around the position. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research)
more details
n/a83.398.887.794.265.066.773.179.382.494.272.996.088.476.296.579.889.687.774.179.9
GALD-netyesyesyesyesnonononononononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose the Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information surrounding the position.
more details
n/a83.198.887.694.264.666.572.879.082.294.273.196.088.375.296.579.389.787.573.880.0
ndnetyesyesnonononononononononononoAnonymous
more details
0.02464.897.779.589.039.641.247.051.858.790.766.893.774.653.392.448.261.943.344.058.5
HRNetV2yesyesnonononononononononoyesyesHigh-Resolution Representations for Labeling Pixels and RegionsKe Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, Jingdong WangThe high-resolution network (HRNet), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions.
more details
n/a81.898.887.993.961.363.172.179.382.494.073.496.088.575.196.572.588.179.973.179.2
SPGNetyesyesnonononononononononononoSPGNet: Semantic Prediction Guidance for Scene ParsingBowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, Honghui ShiICCV 2019Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance than their single-stage counterpart. However, few efforts have been attempted to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through the guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, in which our SPGNet attains 81.1% on the test set using only 'fine' annotations.
more details
n/a81.198.887.693.856.561.971.980.082.194.173.596.188.774.996.567.384.881.871.179.4
LDN-161yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-161 trained on train+val, fine labels only. Inference on multi-scale inputs.
more details
2.080.698.786.593.661.860.968.375.680.193.772.495.886.872.296.172.388.880.769.977.1
GGCFyesyesyesyesnonononononononononoAnonymous
more details
n/a83.298.887.894.166.066.171.178.482.294.174.595.888.174.096.579.992.490.871.878.4
GFF-NetyesyesnononononononononononoGFF: Gated Fully Fusion for Semantic SegmentationXiangtai Li, Houlong Zhao, Yunhai Tong, Kuiyuan YangWe proposed Gated Fully Fusion (GFF) to fuse features from multiple levels through gates in a fully connected way. Specifically, features at each level are enhanced by higher-level features with stronger semantics and lower-level features with more details, and gates are used to control the pass of useful information, which significantly reduces noise propagation during fusion. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research)
more details
n/a82.398.787.293.959.664.371.578.382.294.072.695.988.273.996.579.892.284.771.578.8
Gated-SCNNyesyesnonononononononononoyesyesGated-SCNN: Gated Shape CNNs for Semantic SegmentationTowaki Takikawa, David Acuna, Varun Jampani, Sanja Fidler
more details
n/a82.898.787.494.261.964.672.979.682.594.374.396.288.374.296.677.290.187.772.679.4
ESPNetv2yesyesnononononononono22yesyesESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural NetworkSachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh HajishirziCVPR 2019We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the Penn Treebank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network has better generalization properties than ShuffleNetv2 when tested on the MSCOCO multi-object classification task and the Cityscapes urban scene semantic segmentation task. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2
more details
n/a66.297.378.688.843.542.149.352.660.090.566.893.372.953.191.853.065.953.244.259.9
MRFMyesyesyesyesnononononononononoMulti Receptive Field Network for Semantic SegmentationJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoWACV 2020Semantic segmentation is one of the key tasks in computer vision, which is to assign a category label to each pixel in an image. Despite significant progress achieved recently, most existing methods still suffer from two challenging issues: 1) the size of objects and stuff in an image can be very diverse, demanding for incorporating multi-scale features into the fully convolutional networks (FCNs); 2) the pixels close to or at the boundaries of object/stuff are hard to classify due to the intrinsic weakness of convolutional networks. To address the first issue, we propose a new Multi-Receptive Field Module (MRFM), explicitly taking multi-scale features into account. For the second issue, we design an edge-aware loss which is effective in distinguishing the boundaries of object/stuff. With these two designs, our Multi Receptive Field Network achieves new state-of-the-art results on two widely-used semantic segmentation benchmark datasets. Specifically, we achieve a mean IoU of 83.0% on the Cityscapes dataset and 88.4% mean IoU on the Pascal VOC2012 dataset.
more details
n/a83.098.888.094.263.864.772.278.381.894.273.995.788.374.696.479.592.288.172.878.6
DGCNetyesyesnononononononononononoDual Graph Convolutional Network for Semantic SegmentationLi Zhang*, Xiangtai Li*, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, Philip H.S. TorrBMVC 2019We propose the Dual Graph Convolutional Network (DGCNet), which models the global context of the input feature by modelling two orthogonal graphs in a single framework. (Joint work: University of Oxford, Peking University and DeepMotion AI Research)
more details
n/a82.098.787.493.962.463.470.978.781.394.073.395.887.873.796.476.091.681.671.578.7
dpcan_trainval_os16_225yesyesnonononononononononononoAnonymous
more details
n/a81.698.786.993.556.661.569.976.880.993.871.595.787.873.096.377.690.689.671.278.1
Learnable Tree FilteryesyesnonononononononononoyesyesLearnable Tree Filter for Structure-preserving Feature TransformLin Song; Yanwei Li; Zeming Li; Gang Yu; Hongbin Sun; Jian Sun; Nanning ZhengNeurIPS 2019Learnable Tree Filter for Structure-preserving Feature Transform
more details
n/a80.898.786.893.453.660.269.677.181.193.771.795.887.572.296.272.787.086.472.678.4
FreeNetyesyesnonononononononononononoAnonymous
more details
n/a65.896.977.389.341.946.447.953.063.291.569.093.577.154.393.343.659.846.745.760.8
HRNetV2 + OCRyesyesyesyesnonononononononoyesyesHigh-Resolution Representations for Labeling Pixels and Regions; OCNet: Object Context Network for Scene ParsingHRNet Team; OCR TeamHRNetV2W48 + OCR. OCR is an extension of object context networks https://arxiv.org/pdf/1809.00916.pdf
more details
n/a83.398.888.294.267.665.372.279.182.494.173.896.088.175.096.476.992.390.972.878.9
Valeo DAR GermanyyesyesyesyesnonononononononononoAnonymousValeo DAR Germany, New Algo Lab

more details
n/a82.898.786.993.858.663.571.078.282.294.073.295.488.574.596.581.293.790.574.279.1
GLNet_fineyesyesnononononononononononoAnonymousThe proposed network architecture combines spatial information with multi-scale context information and repairs the boundaries and details of segmented objects through channel attention modules. (Uses the train-fine and val-fine data.)
more details
n/a80.898.786.793.456.960.568.375.579.893.772.695.987.071.696.073.590.585.771.177.4
MCDNyesyesnonononononononononononoAnonymous
more details
n/a81.198.787.093.560.865.769.877.180.893.872.495.887.171.996.277.688.777.470.476.6
AAF+GLRyesyesnonononononononononononoAnonymous
more details
n/a78.298.484.892.850.857.868.074.978.593.671.595.686.068.895.871.381.874.066.775.4
HRNetV2 + OCR (w/ ASP)yesyesyesyesnonononononononoyesyesopenseg-group (OCR team + HRNet team)Our approach is based on a single HRNet48V2 and an OCR module combined with ASPP. We apply depth-based multi-scale ensemble weights during testing (provided by DeepMotion AI Research).
more details
n/a83.798.888.394.366.966.773.380.283.094.274.196.088.575.896.578.591.890.173.479.3
CASIA_IVA_DRANet-101_NoCoarseyesyesnonononononononononononoAnonymous
more details
n/a82.998.887.694.161.762.772.980.083.094.273.896.088.876.196.676.589.888.073.880.0
Hyundai Mobis AD LabyesyesyesyesnonononononononononoHyundai Mobis AD Lab, DL-DB Group, AA (Automated Annotator) Team
more details
n/a83.898.988.494.365.265.972.879.583.094.374.396.188.675.996.679.393.891.574.879.7
EFRNet-13yesyesnonononononononononononoAnonymous
more details
0.014672.898.283.090.856.551.454.960.466.492.169.694.778.659.493.562.879.272.753.665.9
FarSee-Netyesyesnononononononono22nonoFarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolutionZhanpeng Zhang and Kaipeng ZhangIEEE International Conference on Robotics and Automation (ICRA) 2020FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution

Real-time semantic segmentation is desirable in many robotic applications with limited computation resources. One challenge of semantic segmentation is to deal with the object scale variations and leverage the context. How to perform multi-scale context aggregation within limited computation budget is important. In this paper, firstly, we introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP). It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information. On the other hand, for runtime efficiency, state-of-the-art methods will quickly decrease the spatial size of the inputs or feature maps in the early network stages. The final high-resolution result is usually obtained by non-parametric up-sampling operation (e.g. bilinear interpolation). Differently, we rethink this pipeline and treat it as a super-resolution process. We use optimized super-resolution operation in the up-sampling step and improve the accuracy, especially in sub-sampled input image scenario for real-time applications. By fusing the above two improvements, our methods provide better latency-accuracy trade-off than the other state-of-the-art methods. In particular, we achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single NVIDIA Titan X (Maxwell) GPU card. The proposed module can be plugged into any feature extraction CNN and benefits from the CNN structure development.
more details
0.011968.497.981.489.938.643.653.258.864.491.067.794.075.957.393.255.967.855.149.463.9
C3Net [2,3,7,13]nononononononononono22yesyesC3: Concentrated-Comprehensive Convolution and its application to semantic segmentationHyojin Park, Youngjoon Yoo, Geonseok Seo, Dongyoon Han, Sangdoo Yun, Nojun Kwak
more details
n/a64.897.279.088.740.943.652.854.861.390.867.693.572.350.890.747.356.143.641.958.0
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
more details
n/a79.498.787.293.657.760.870.878.081.293.874.195.788.276.496.055.375.179.672.174.0
EKENetyesyesnonononononononononononoAnonymous
more details
0.022974.398.283.591.159.553.555.361.166.992.270.394.778.959.893.764.385.581.055.666.3
SPSSNyesyesnonononononononononononoAnonymousStage Pooling Semantic Segmentation Network
more details
n/a69.497.780.889.843.946.553.158.864.791.568.794.276.259.092.753.571.059.252.963.8
FC-HarDNet-70yesyesnonononononononononoyesyesHarDNet: A Low Memory Traffic NetworkPing Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, Youn-Long LinICCV 2019Fully Convolutional Harmonic DenseNet 70
U-shape encoder-decoder structure with HarDNet blocks
Trained with single scale loss at stride-4
validation mIoU=77.7
more details
0.01575.998.585.592.549.054.464.071.575.693.070.695.484.567.495.767.779.063.660.772.7
BFPyesyesnonononononononononononoBoundary-Aware Feature Propagation for Scene SegmentationHenghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang WangIEEE International Conference on Computer Vision (ICCV), 2019Boundary-Aware Feature Propagation for Scene Segmentation
more details
n/a81.498.787.093.559.863.468.976.880.993.772.895.587.072.196.077.689.086.969.277.6
FasterSegyesyesnonononononononononoyesyesFasterSeg: Searching for Faster Real-time Semantic SegmentationWuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang WangICLR 2020We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, that has been recently found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization, that effectively overcomes our observed phenomenons that the searched networks are prone to "collapsing" to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model's accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.
more details
0.0061371.598.083.591.139.148.758.666.771.692.369.194.581.561.893.755.067.161.155.069.2
VCD-NoCoarseyesyesnonononononononononononoAnonymous
more details
n/a82.398.888.093.856.961.972.980.082.694.173.095.988.676.196.575.588.687.873.479.6
NAVINFO_DLRyesyesyesyesnonononononononononopengfei zhangweighted aspp+ohem+hard region refine
more details
n/a83.898.887.994.265.263.573.480.283.094.173.595.989.076.896.782.193.790.574.180.0
LBPSSyesyesnononononononono22nonoAnonymousCVPR 2020 submission #5455
more details
0.961.097.076.987.431.338.053.653.860.990.465.993.170.343.390.931.650.333.931.858.7
KANet_Res101yesyesnonononononononononononoAnonymous
more details
n/a81.898.686.493.661.963.569.677.581.093.772.895.787.574.796.275.688.287.272.778.4
Learnable Tree Filter V2yesyesnonononononononononoyesyesRethinking Learnable Tree Filter for Generic Feature TransformLin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, Nanning ZhengNeurIPS 2020Based on ResNet-101 backbone and FPN architecture.
more details
n/a82.198.786.593.657.561.271.979.682.794.072.595.988.575.396.676.288.487.974.579.6
GPSNetyesyesnonononononononononononoAnonymous
more details
n/a82.198.988.193.961.563.071.579.081.494.073.295.888.174.996.575.591.284.170.778.8
FTFNetyesyesyesyesnonononononononononoAnonymousAn Efficient Network Focused on Tiny Feature Maps for Real-Time Semantic Segmentation
more details
0.008872.498.283.391.047.947.855.463.068.792.369.594.780.461.694.562.575.963.056.568.6
iFLYTEK-CVyesyesyesyesnonononononononononoAnonymousiFLYTEK Research, CV Group
more details
n/a84.498.988.594.469.066.973.179.783.394.474.396.088.876.396.784.094.391.774.779.3
F2MF-shortyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature Motion Josip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 3 timesteps into the future.
more details
n/a70.297.378.789.054.052.146.058.061.989.666.291.665.954.290.066.580.777.054.061.3
HPNetyesyesnonononononononononononoHigh-Order Paired-ASPP Networks for Semantic SegmentationYu Zhang, Xin Sun, Junyu Dong, Changrui Chen, Yue Shen
more details
n/a81.698.787.293.662.763.268.576.680.393.973.395.787.273.196.176.989.586.770.177.4
HANet (fine-train only)yesyesnonononononononononononoTBAAnonymousWe use only fine-training data.
more details
n/a80.998.787.293.662.362.267.675.279.293.873.195.887.171.796.272.788.685.769.076.7
F2MF-midyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature MotionJosip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 9 timesteps into the future.
more details
n/a59.195.169.283.547.243.822.941.841.384.258.586.046.733.980.353.872.579.039.744.3
EMANetyesyesnonononononononononononoExpectation Maximization Attention Networks for Semantic SegmentationXia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, Hong LiuICCV 2019
more details
n/a81.998.787.393.863.462.370.077.980.793.973.695.787.874.596.275.590.284.571.578.7
PartnerNetyesyesnonononononononononononoAnonymousPARTNERNET: A LIGHTWEIGHT AND EFFICIENT PARTNER NETWORK FOR SEMANTIC SEGMENTATION
more details
0.005872.498.183.091.247.752.059.267.270.492.570.294.682.263.594.453.467.759.858.369.8
SwiftNet RN18 pyr sepBN MVDyesyesnonononononononononoyesyesEfficient semantic segmentation with pyramidal fusionM Oršić, S ŠegvićPattern Recognition 2020
more details
0.02976.498.585.392.753.455.266.173.177.193.470.495.883.965.795.464.678.666.063.073.2
Tencent YYB VisualAlgoyesyesyesyesnonononononononononoAnonymousTencent YYB VisualAlgo Group
more details
n/a83.698.888.194.264.165.072.178.882.794.273.696.188.275.896.582.194.391.973.979.0
MoKu LabyesyesnonononononononononononoAnonymousAlibaba, MoKu AI Lab, CV Group
more details
n/a84.398.988.794.465.768.473.980.383.894.474.596.288.876.296.781.393.191.674.880.1
HRNetV2 + OCR + SegFixyesyesyesyesnonononononononoyesyesObject-Contextual Representations for Semantic SegmentationYuhui Yuan, Xilin Chen, Jingdong WangFirst, we pre-train "HRNet+OCR" method on the Mapillary training set (achieves 50.8% on the Mapillary val set). Second, we fine-tune the model with the Cityscapes training, validation and coarse set. Finally, we apply the "SegFix" scheme to further improve the results.
more details
n/a84.598.988.394.468.067.873.680.683.994.374.596.189.275.996.883.694.291.374.080.1
DecoupleSegNetyesyesnonononononononononoyesyesImproving Semantic Segmentation via Decoupled Body and Edge SupervisionXiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai TongECCV-2020In this paper, we propose a new paradigm for semantic segmentation. Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image. To do so, we first warp the image feature by learning a flow field to make the object part more consistent. The resulting body feature and the residual edge feature are further optimized under decoupled supervision by explicitly sampling pixels from different parts (body or edge). The code and models have been released.
more details
n/a83.798.887.894.466.164.872.378.882.694.274.096.188.775.996.680.293.891.674.379.5
LGE A&B Center: HANet (ResNet-101)yesyesnonononononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val", No coarse, Backbone: ImageNet pretrained ResNet-101
more details
n/a82.198.888.093.960.563.371.378.181.394.072.996.187.974.596.577.088.085.972.779.0
DCNASyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationXiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Wenqi RenNeural Architecture Search (NAS) has shown great potential in automatically designing scalable network architectures for dense image predictions. However, existing NAS algorithms usually compromise on a restricted search space and search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between target and proxy dataset, we propose a Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module to reduce the memory consumption of the ample search space.
more details
n/a83.698.888.094.266.066.172.278.782.794.273.994.088.275.196.582.694.190.973.679.3
GPNet-ResNet101yesyesnonononononononononononoAnonymous
more details
n/a82.598.887.993.961.963.370.478.981.894.072.595.988.374.996.580.591.185.572.178.6
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a79.998.787.093.557.960.470.977.981.493.772.895.687.975.396.165.880.578.772.870.8
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a83.198.887.894.259.868.173.479.582.693.972.796.188.977.596.576.991.292.674.774.5
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a79.598.686.593.452.061.370.277.581.193.572.395.787.975.696.068.581.477.171.070.7
LGE A&B Center: HANet (ResNext-101)yesyesyesyesnonononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val + coarse", Backbone: Mapillary pretrained ResNext-101
more details
n/a83.298.888.094.266.664.872.078.281.494.274.596.188.175.696.580.393.286.672.578.7
ERINet-v2yesyesnonononononononononononoEfficient Residual Inception NetworkMINJONG KIM, SUYOUNG CHIongoing
more details
0.0052631667.497.780.689.641.745.052.255.962.991.569.193.975.953.792.341.866.364.244.362.1
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
more details
n/a85.298.888.394.665.369.675.280.984.494.374.596.290.079.796.783.095.693.478.479.6
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a84.198.988.394.569.167.874.580.383.994.172.896.189.378.296.476.493.091.375.776.9
TUE-5LSM0-g23yesyesyesyesnonononononononononoAnonymousDeeplabv3+decoder
more details
n/a65.895.577.785.945.748.343.953.257.589.265.391.170.754.291.348.167.553.551.860.4
PBRNetyesyesnonononononononononononoAnonymousmodified MobileNetV2 backbone + Prediction and Boundary attention-based Refinement Module (PBRM)
more details
0.010772.498.081.891.346.351.558.569.174.192.971.094.782.664.794.557.762.549.061.972.9
ResNeSt200yesyesnonononononononononononoResNeSt: Split-Attention NetworksHang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander SmolaDeepLabV3+ network with ResNeSt200 backbone.
more details
n/a83.398.988.494.466.066.072.578.682.594.272.996.388.474.896.677.092.390.073.279.1
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
more details
n/a84.598.888.494.464.368.375.381.084.294.273.796.189.778.696.782.293.790.276.479.8
EaNet-V1yesyesnonononononononononononoParsing Very High Resolution Urban Scene Images by Learning Deep ConvNets with Edge-Aware LossXianwei Zheng, Linxi Huan, Gui-Song Xia, Jianya GongParsing very high resolution (VHR) urban scene images into regions with semantic meaning, e.g. buildings and cars, is a fundamental task necessary for interpreting and understanding urban scenes. However, due to the huge quantity of details contained in an image and the large variations of objects in scale and appearance, the existing semantic segmentation methods often break one object into pieces, or confuse adjacent objects and thus fail to depict these objects consistently. To address this issue, we propose a concise and effective edge-aware neural network (EaNet) for urban scene semantic segmentation. The proposed EaNet model is deployed as a standard balanced encoder-decoder framework. Specifically, we devised two plug-and-play modules that are appended on top of the encoder and decoder respectively, i.e., the large kernel pyramid pooling (LKPP) and the edge-aware loss (EA loss) function, to extend the model's ability to learn discriminating features. The LKPP module captures rich multi-scale context with strong continuous feature relations to promote coherent labeling of multi-scale urban objects. The EA loss module learns edge information directly from semantic segmentation prediction, which avoids costly post-processing or extra edge detection. During training, EA loss imposes a strong geometric awareness to guide object structure learning at both the pixel- and image-level, and thus effectively separates confusing objects with sharp contours.
more details
n/a81.798.887.293.867.364.167.575.679.993.872.495.786.973.496.080.489.480.571.777.6
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
more details
n/a84.298.888.294.367.667.773.480.283.394.374.496.088.775.396.683.594.091.173.579.7
FSFNetyesyesnonononononononononoyesyesAccelerator-Aware Fast Spatial Feature Network for Real-Time Semantic SegmentationMinjong Kim, Byungjae Park, Suyoung ChiIEEE AccessSemantic segmentation is performed to understand an image at the pixel level; it is widely used in the field of autonomous driving. In recent years, deep neural networks have achieved good accuracy; however, there exist few models that have a good trade-off between high accuracy and low inference time. In this paper, we propose a fast spatial feature network (FSFNet), an optimized lightweight semantic segmentation model using an accelerator, offering high performance as well as faster inference speed than current methods. FSFNet employs the FSF and MRA modules. The FSF module has three different types of subset modules to extract spatial features efficiently. They are designed in consideration of the size of the spatial domain. The multi-resolution aggregation module combines features that are extracted at different resolutions to reconstruct the segmentation image accurately. Our approach is able to run at over 203 FPS at full resolution (1024 x 2048) on a single NVIDIA 1080Ti GPU, and obtains a result of 69.13% mIoU on the Cityscapes test dataset. Compared with existing models in real-time semantic segmentation, our proposed model retains remarkable accuracy while having high FPS that is over 30% faster than the state-of-the-art model. The experimental results proved that our model is an ideal approach for the Cityscapes dataset.
more details
0.004926169.197.781.290.241.847.154.261.165.491.969.494.277.957.992.947.464.459.453.266.3
Hierarchical Multi-Scale Attention for Semantic SegmentationyesyesyesyesnonononononononoyesyesHierarchical Multi-Scale Attention for Semantic SegmentationAndrew Tao, Karan Sapra, Bryan Catanzaro Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes which leads to greater model accuracy. We demonstrate the result of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach we achieve new state-of-the-art results in both Mapillary (61.1 IOU val) and Cityscapes (85.4 IOU test).
more details
n/a85.499.089.494.971.868.475.982.285.394.575.096.390.179.797.082.694.687.877.281.7
SANetyesyesnononononononono44nonoAnonymous
more details
25.080.998.787.193.661.662.468.175.979.593.873.195.887.371.596.271.988.186.169.477.2
SJTU_hpmyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu*, Haohua Zhao, and Liqing Zhang
more details
n/a81.398.787.393.661.761.969.475.580.793.874.495.986.772.796.275.091.382.969.576.5
FANetyesyesnonononononononononononoFANet: Feature Aggregation Network for Semantic SegmentationTanmay Singha, Duc-Son Pham, and Aneesh KrishnaFeature Aggregation Network for Semantic Segmentation
more details
n/a64.196.775.388.235.437.845.751.357.490.464.393.071.850.491.648.962.052.046.359.0
Hard Pixel Mining for Depth Privileged Semantic SegmentationyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu, Haohua Zhao, and Liqing ZhangSemantic segmentation has achieved remarkable progress but remains challenging due to complex scenes, object occlusion, and so on. Some research works have attempted to use extra information such as a depth map to help RGB based semantic segmentation because the depth map could provide complementary geometric cues. However, due to the inaccessibility of depth sensors, depth information is usually unavailable for the test images. In this paper, we leverage only the depth of training images as the privileged information to mine the hard pixels in semantic segmentation, in which depth information is only available for training images but not available for test images. Specifically, we propose a novel Loss Weight Module, which outputs a loss weight map by employing two depth-related measurements of hard pixels: Depth Prediction Error and Depth-aware Segmentation Error. The loss weight map is then applied to segmentation loss, with the goal of learning a more robust model by paying more attention to the hard pixels. Besides, we also explore a curriculum learning strategy based on the loss weight map. Meanwhile, to fully mine the hard pixels on different scales, we apply our loss weight module to multi-scale side outputs. Our hard pixels mining method achieves state-of-the-art results on three benchmark datasets, and even outperforms the methods which need depth input during testing.
more details
n/a83.498.887.894.365.665.272.979.582.394.374.496.188.675.896.677.993.288.873.079.2
MSeg1080_RVCyesyesnonononononononononoyesyesMSeg: A Composite Dataset for Multi-domain Semantic SegmentationJohn Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen KoltunCVPR 2020We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. A naive merge of the constituent datasets yields poor performance due to inconsistent taxonomies and annotation practices. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images, requiring more than 1.34 years of collective annotator effort. The resulting composite dataset enables training a single semantic segmentation model that functions effectively across domains and generalizes to datasets that were not seen during training. We adopt zero-shot cross-dataset transfer as a benchmark to systematically evaluate a model’s robustness and show that MSeg training yields substantially more robust models in comparison to training on individual datasets or naive mixing of datasets without the presented contributions.
more details
0.4980.798.786.993.864.966.169.376.680.394.074.095.987.370.696.277.284.971.969.875.6
SA-Gate (ResNet-101,OS=16)yesyesnonononoyesyesnonononoyesyesBi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic SegmentationXiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang ZengEuropean Conference on Computer Vision (ECCV), 2020RGB+HHA input, input resolution = 800x800, output stride = 16, training 240 epochs, no coarse data is used.
more details
n/a82.898.787.393.963.862.770.877.982.293.972.895.988.275.296.580.491.689.073.278.9
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
more details
n/a67.087.773.090.347.749.964.171.271.192.452.794.879.345.087.553.469.336.746.959.3
HRNet + LKPP + EA lossyesyesnonononononononononononoAnonymous
more details
n/a82.698.887.593.867.363.768.575.980.593.771.995.687.273.696.483.092.788.572.877.7
SN_RN152pyrx8_RVCyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
1.074.798.484.892.656.053.661.170.374.293.171.195.582.863.195.366.072.053.763.171.6
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
more details
n/a63.297.577.489.840.438.449.047.461.891.261.994.773.228.993.354.264.142.035.060.0
AttaNet_lightyesyesnonononononononononoyesyesAttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing (AAAI21)Anonymous
more details
n/a70.197.880.890.549.649.651.762.767.692.169.494.080.861.094.061.762.847.051.167.6
CFPNetyesyesnonononononononononononoAnonymous
more details
n/a70.197.881.490.546.450.656.461.567.792.168.994.380.460.793.951.468.050.851.267.7
Seg_UJSyesyesnonononononononononononoAnonymous
more details
n/a84.399.089.294.869.767.173.776.482.294.474.896.289.777.496.981.295.186.476.080.8
Bilateral_attention_semanticyesyesnonononononononononononoAnonymousWe use a bilateral attention mechanism for semantic segmentation.
more details
0.014176.598.484.992.548.155.465.773.977.293.371.794.985.668.995.462.777.571.061.574.7
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
more details
n/a80.498.887.893.953.964.173.080.182.993.973.696.089.378.496.467.076.371.674.175.8
ESANet RGB-D (small input)yesyesnonononoyesyesnono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data with half the input resolution.
more details
0.042775.698.585.992.354.155.461.668.072.892.371.395.282.565.894.864.079.972.360.569.8
ESANet RGB (small input)yesyesnononononononono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images with half the input resolution.
more details
0.03172.998.284.191.257.152.655.761.366.891.669.694.679.362.893.964.971.664.857.067.7
ESANet RGB-DyesyesnonononoyesyesnonononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data.
more details
0.161378.498.787.193.349.860.269.176.179.493.672.795.987.271.396.166.278.471.567.176.2
DAHUA-ARIyesyesyesyesnonononononononononoAnonymousmulti-scale and refineNet
more details
n/a85.899.089.595.072.068.875.982.385.194.575.396.390.079.497.083.995.291.978.281.7
ESANet RGByesyesnonononononononononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images only.
more details
0.120577.698.484.992.755.458.964.771.775.893.371.095.384.967.795.764.779.180.964.573.9
DCNAS+ASPP [Mapillary Vistas]yesyesyesyesnonononononononononoAnonymousExisting NAS algorithms usually compromise on a restricted search space or search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy setting, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of the ample search space, hence favoring proxyless searching. Compared with contemporary works, experiments reveal that the proxyless searching scheme is capable of bridging the gap between searching and training environments.
more details
n/a85.399.089.494.971.769.175.682.084.994.575.396.389.979.196.981.695.387.077.181.0
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
more details
n/a84.198.888.494.666.068.575.581.584.694.474.396.289.679.696.677.589.386.277.478.3
DCNAS+ASPPyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationAnonymousExisting NAS algorithms usually compromise on a restricted search space or search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy setting, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of the ample search space, hence favoring proxyless searching.
more details
n/a84.398.988.794.568.067.073.980.383.894.374.896.289.378.096.779.695.086.975.280.7
ddl_segyesyesnonononononononononononoAnonymous
more details
n/a84.699.089.294.869.867.273.776.482.294.474.996.289.777.496.981.495.191.976.180.8
CABiNetyesyesnonononononononononononoCABiNet: Efficient Context Aggregation Network for Low-Latency Semantic SegmentationSaumya Kumaar, Ye Lyu, Francesco Nex, Michael Ying YangWith the increasing demand of autonomous machines, pixel-wise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for any potential real-time applications. In this paper, we propose CABiNet (Context Aggregated Bi-lateral Network), a dual branch convolutional neural network (CNN), with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing multi-branch architectures for high-speed semantic segmentation, we design a cheap high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. Specifically, we achieve 76.6% and 75.9% mIOU on Cityscapes validation and test sets respectively, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. Codes and training models will be made publicly available.
0.01376.098.283.292.944.162.071.178.480.893.970.995.784.567.195.660.070.357.761.475.5
Margin calibrationyesyesyesyesnonononononononononoAnonymousThe model is DeepLab v3+ with an SEResNeXt50 backbone. We used margin calibration with log-loss as the learning objective.
n/a82.198.787.394.063.463.871.778.181.894.173.696.088.172.896.574.989.388.168.978.3
MT-SSSRyesyesnononononononono22nonoAnonymous
n/a79.098.686.393.353.559.769.977.580.193.772.295.787.471.596.165.582.872.368.677.0
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a85.198.988.494.768.268.676.081.384.794.474.196.289.779.896.882.194.292.177.279.2
DSANet: Dilated Spatial Attention for Real-time Semantic Segmentation in Urban Street ScenesyesyesnonononononononononononoAnonymousWe present a computationally efficient network named DSANet, which follows a two-branch strategy to tackle the problem of real-time semantic segmentation in urban scenes. We first design a Context branch, which employs the Depth-wise Asymmetric ShuffleNet (DAS) as its main building block to acquire sufficient receptive fields. In addition, we propose a dual attention module consisting of dilated spatial attention and channel attention to make full use of the multi-level feature maps simultaneously, which helps predict the pixel-wise labels in each stage. Meanwhile, a Spatial Encoding Network is used to enhance semantic information by preserving the spatial details. Finally, to better combine context information and spatial information, we introduce a Simple Feature Fusion Module to combine the features from the two branches.
n/a71.496.878.591.250.550.859.464.071.792.670.094.581.361.892.956.175.650.651.066.9
UJS_modelyesyesnonononononononononononoAnonymous
n/a85.399.088.994.871.267.875.682.085.294.575.096.290.079.397.081.195.189.376.881.2
Mobilenetv3-small-backbone real-time segmentationyesyesnonononononononononoyesyesAnonymousThe model is a dual-path network with mobilenetv3-small backbone. PSP module was used as the context aggregation block. We also use feature fusion module at x16, x32. The features of the two branches are then concatenated and fused with a bottleneck conv.
Only the training data was used to train the model (validation data excluded), and evaluation used single-scale input images.
0.0263.997.075.288.741.144.245.053.261.390.968.493.274.451.092.444.646.439.645.361.9
M2FANetyesyesnonononononononononoyesyesUrban street scene analysis using lightweight multi-level multi-path feature aggregation networkTanmay Singha; Duc-Son Pham; Aneesh KrishnaMultiagent and Grid Systems Journal
n/a68.397.478.690.137.142.956.863.868.392.166.594.579.358.893.648.562.153.248.765.9
AFPNetyesyesnonononononononononononoAnonymous
0.0376.498.384.192.050.855.061.270.674.592.971.594.983.566.595.368.081.278.460.972.2
YOLO V5s with Segmentation Headyesyesnononononononono22yesyesAnonymousMultitask model. Fine-tuned from a COCO detection pretrained model; semantic segmentation and object detection (transferred from instance labels) are trained at the same time.
0.00771.398.081.390.241.944.944.862.667.890.767.493.578.461.694.165.377.170.258.566.3
FSFFNetyesyesyesyesnonononononononoyesyesA Lightweight Multi-scale Feature Fusion Network for Real-Time Semantic SegmentationTanmay Singha, Duc-Son Pham, Aneesh Krishna, Tom GedeonInternational Conference on Neural Information Processing 2021Feature Scaling Feature Fusion Network
n/a69.497.478.590.741.846.157.865.368.592.064.094.479.257.093.955.465.754.450.465.8
Qualcomm AI ResearchyesyesyesyesnonononononononoyesyesInverseForm: A Loss Function for Structured Boundary-Aware SegmentationShubhankar Borse, Ying Wang, Yizhe Zhang, Fatih PorikliCVPR 2021 oral
n/a85.898.889.694.871.869.275.882.385.594.375.096.390.279.897.084.495.790.577.281.7
HIK-CCSLTyesyesyesyesnonononononononononoAnonymous
n/a86.198.989.195.072.570.476.082.285.994.675.296.390.479.897.184.195.092.079.382.1
BFNetyesyesnonononononononononononoBFNetJiaqi Fan
n/a71.097.079.391.251.250.258.065.369.592.367.894.879.860.592.955.072.253.653.164.6
Hai Wang+Yingfeng Cai-research groupyesyesnonononononononononononoAnonymous
0.0016485.399.089.194.971.868.775.682.184.894.574.496.390.079.497.081.395.289.476.981.3
Jiangsu_university_Intelligent_Drive_AIyesyesnonononononononononononoAnonymous
n/a85.399.089.194.971.868.775.782.184.894.574.496.390.079.497.081.395.289.476.981.3
MCANetyesyesyesyesnonononononononoyesyesAnonymous
n/a73.498.384.292.049.551.662.167.873.292.670.395.381.459.294.657.875.964.256.369.1
UFONet (half-resolution)yesyesnonononononononononononoUFO RPN: A Region Proposal Network for Ultra Fast Object DetectionWenkai Li, Andy SongThe 34th Australasian Joint Conference on Artificial Intelligence
n/a48.095.065.183.021.425.936.340.548.388.760.290.659.413.185.54.814.213.917.348.8
SCMNetyesyesnonononononononononononoAnonymous
n/a67.997.981.390.444.244.356.160.667.691.867.794.677.555.692.852.161.250.342.262.9
FsaNetyesyesyesyesnonononononononononoFsaNet: Frequency Self-attention for Semantic SegmentationAnonymous
n/a83.098.887.894.167.765.170.177.580.894.074.095.988.075.196.479.293.991.869.078.9
SCMNet coarseyesyesyesyesnonononononononoyesyesSCMNet: Shared Context Mining Network for Real-time Semantic SegmentationTanmay Singha; Moritz Bergemann; Duc-Son Pham; Aneesh Krishna2021 Digital Image Computing: Techniques and Applications (DICTA)
n/a68.397.981.690.539.844.757.162.268.791.967.494.778.056.192.952.462.051.844.163.7
SAIT SeeThroughNetyesyesyesyesnonononononononononoAnonymous
n/a86.299.089.295.173.971.676.382.585.294.575.496.390.179.697.083.695.292.877.981.8
JSU_IDT_groupyesyesnonononononononononononoAnonymous
n/a85.999.089.494.973.169.275.782.285.094.575.996.390.179.397.083.794.992.377.481.5
DLA_HRNet48OCR_MSFLIP_000yesyesyesyesnonononononononononoAnonymousThis set of predictions is from DLA (differentiable lattice assignment network) with "HRNet48+OCR-Head" as the base segmentation model. The model is first trained on coarse data and then on the fine-annotated train/val sets. A multi-scale (0.5, 0.75, 1.0, 1.25, 1.5, 1.75) and flip scheme is adopted during inference.
n/a84.898.989.294.871.067.575.581.984.994.474.796.289.778.996.980.193.087.076.380.9
MYBank-AIoTyesyesyesyesnonononononononononoAnonymous
n/a86.399.089.495.073.471.276.182.285.394.675.896.490.780.897.084.394.590.979.982.6
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a83.298.888.394.363.567.571.979.783.394.173.596.288.777.596.783.190.382.675.474.5
LeapAIyesyesyesyesnonononononononononoAnonymousUsing advanced AI techniques.
n/a86.499.089.995.173.871.676.082.485.994.676.096.490.380.397.084.295.192.179.382.5
adlab_iiau_ldzyesyesnonononononononononononoAnonymousmeticulous-caiman_2022.05.01_03.32
n/a85.699.089.595.073.068.575.782.185.394.575.796.389.979.396.984.395.188.476.681.4
SFRSegyesyesnonononononononononoyesyesA Real-Time Semantic Segmentation Model Using Iteratively Shared Features In Multiple Sub-EncodersTanmay Singha, Duc-Son Pham, Aneesh KrishnaPattern Recognition
n/a70.698.082.690.945.949.949.062.166.891.468.394.778.054.593.157.969.167.854.766.0
PIDNet-SyesyesnonononononononononoyesyesPIDNet: A Real-time Semantic Segmentation Network Inspired from PID ControllerAnonymous
0.010778.698.585.692.852.859.065.573.876.493.571.895.685.470.395.568.184.285.264.774.8
Vision Transformer Adapter for Dense PredictionsyesyesnonononononononononoyesyesVision Transformer Adapter for Dense PredictionsZhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu QiaoViT-Adapter-L, BEiT pre-train, multi-scale testing
n/a85.298.988.594.566.770.274.580.283.694.473.796.289.779.096.785.594.490.579.981.8
SSNetyesyesnonononoyesyesnonononononoAnonymous
n/a72.597.981.792.144.950.765.072.376.692.668.095.184.865.394.752.263.747.058.373.6
SDBNetyesyesnonononononononononoyesyesSDBNet: Lightweight Real-time Semantic Segmentation Using Short-term Dense BottleneckTanmay Singha, Duc-Son Pham, Aneesh Krishna2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)
n/a70.897.981.491.047.447.655.264.569.792.068.594.479.459.193.653.270.462.951.666.1
MeiTuan-BaseModelyesyesyesyesnonononononononononoAnonymous
n/a86.599.089.695.173.572.676.782.285.394.775.696.490.680.996.884.595.592.478.882.7
SDBNetV2yesyesnonononononononononoyesyesImproved Short-term Dense Bottleneck network for efficient scene analysisTanmay Singha; Duc-Son Pham; Aneesh KrishnaComputer Vision and Image Understanding
n/a72.498.183.091.653.350.757.066.671.492.268.694.880.561.493.858.269.161.856.066.9
mogo_semanticyesyesnonononononononononononoAnonymous
n/a85.399.089.094.971.368.176.182.385.094.474.596.390.279.496.982.295.588.676.181.5
UDSSEG_RVCyesyesnonononononononononononoAnonymousUDSSEG_RVC
n/a79.498.486.793.463.464.567.373.676.993.772.495.985.167.995.973.483.677.166.074.0
MIX6D_RVCyesyesnonononononononononononoAnonymousMIX6D_RVC
n/a79.898.182.592.661.962.361.770.874.793.169.695.084.468.995.078.492.390.669.674.4
FAN_NV_RVCyesyesnonononononononononononoAnonymousHybrid-Base + Segformer
n/a82.098.484.893.865.166.867.374.578.094.072.296.186.871.396.381.193.388.672.276.9
UNIV_CNP_RVCyesyesnonononononononononononoAnonymousRVC 2022
n/a75.197.580.291.662.458.646.866.870.191.965.894.379.764.692.076.085.076.959.866.8
AntGroup-AI-VisionAlgoyesyesyesyesnonononoyesyesnonononoAnonymousAntGroup AI vision algo
n/a86.498.989.295.074.672.676.382.184.894.675.296.390.080.596.884.995.393.578.882.2
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsyesyesyesyesnonononononononoyesyesInternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsWenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu QiaoCVPR 2023We use Mask2Former as the segmentation framework, and initialize our InternImage-H model with the pre-trained weights on the 427M joint dataset of public Laion-400M, YFCC-15M, and CC12M. Following common practices, we first pre-train on Mapillary Vistas for 80k iterations, and then fine-tune on Cityscapes for 80k iterations. The crop size is set to 1024×1024 in this experiment. As a result, our InternImage-H achieves 87.0 multi-scale mIoU on the validation set, and 86.1 multi-scale mIoU on the test set.
n/a86.198.988.894.972.571.275.480.984.794.575.596.390.179.996.885.395.592.680.082.2
Dense Prediction with Attentive Feature aggregationyesyesyesyesnonononononononoyesyesDense Prediction with Attentive Feature AggregationYung-Hsu Yang, Thomas E. Huang, Min Sun, Samuel Rota Bulò, Peter Kontschieder, Fisher YuWACV 2023We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.
n/a83.698.988.694.468.464.972.780.783.494.274.596.189.176.996.776.490.489.073.379.6
W3_FAFMyesyesnonononononononononononoJunyan Yang, Qian Xu, Lei LaTeam: BOSCH-XC-DX-WAVE3
0.02930980.598.786.593.459.459.467.476.979.293.672.195.586.873.695.970.388.186.169.777.3
HRNyesyesnonononononononononononoAnonymousHierarchical residual network
45.077.798.485.192.655.956.863.972.275.993.272.395.384.368.195.566.780.781.663.373.5
HRN+DCNv2_for_DOASyesyesnonononononononononononoAnonymousHRN with DCNv2 for DOAS in paper "Dynamic Obstacle Avoidance System based on Rapid Instance Segmentation Network"
0.03281.298.686.793.762.163.970.477.079.593.872.995.687.773.496.071.287.782.971.478.0
GEELY-ATC-SEGyesyesyesyesnonononononononononoAnonymous
n/a86.798.989.295.074.073.876.381.584.994.675.696.390.380.496.886.995.793.380.482.5
PMSDSENyesyesnonononononononononoyesyesEfficient Parallel Multi-Scale Detail and Semantic Encoding Network for Lightweight Semantic SegmentationXiao Liu, Xiuya Shi, Lufei Chen, Linbo Qing, Chao RenACM International Conference on Multimedia 2023MM '23: Proceedings of the 31st ACM International Conference on Multimedia
n/a74.098.283.991.444.453.162.867.472.692.670.294.881.263.194.465.275.267.058.170.1
ECFDyesyesnonononononononononoyesyesAnonymousbackbone: ConvNext-Large
n/a82.198.787.694.163.667.272.178.682.594.173.296.088.876.496.379.385.071.375.679.5
DWGSeg-L75yesyesnononononononono1.31.3nonoAnonymous
0.0075576.798.585.592.454.355.661.670.574.392.970.795.283.566.695.471.782.172.061.172.5
VLTSegyesyesnonononononononononononoVLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic SegmentationChristoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk
n/a86.498.989.094.870.972.875.681.384.694.575.796.390.681.496.884.295.493.282.482.7
CGMANet_v1yesyesnonononononononononononoContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-sceneSaquib MazharContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-scene
n/a73.397.882.191.652.149.961.868.372.691.968.494.382.763.793.956.475.360.859.070.8
SERNet-Former_v2yesyesyesyesnonononononononononoAnonymous
n/a84.498.787.494.368.768.572.078.281.993.971.795.988.778.396.183.295.492.678.580.1
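
The score cells in the rows above are fused into one run of digits. As a minimal sketch, the snippet below splits such a run back into numbers and compares the listed average with the mean of the 19 class columns. It assumes that the runtime cell is "n/a", that every score carries exactly one decimal digit, and that the "average" column is the unweighted mean over classes; the helper name `split_scores` is hypothetical and not part of any benchmark tooling. Rows with a numeric runtime are rejected because the boundary between runtime and average is ambiguous there.

```python
import re


def split_scores(fused: str) -> list[float]:
    """Split a fused score run such as 'n/a84.498.7...' into floats.

    Assumption: the runtime cell is 'n/a' and each score has one decimal digit.
    """
    if not fused.startswith("n/a"):
        raise ValueError("a numeric runtime makes the split ambiguous")
    # Each score is 'digits.digit', so a non-overlapping scan recovers the cells.
    return [float(token) for token in re.findall(r"\d+\.\d", fused[3:])]


# Score line of the SERNet-Former_v2 row from the table above.
row = ("n/a84.498.787.494.368.768.572.078.281.993.971.795.988.7"
       "78.396.183.295.492.678.580.1")

average, *per_class = split_scores(row)
recomputed = sum(per_class) / len(per_class)

print(f"classes: {len(per_class)}")
print(f"listed average: {average}, recomputed mean: {recomputed:.1f}")
```

Because the listed values are already rounded, the recomputed mean can differ from the listed average in the last digit for other rows.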

iIoU on class-level

namefinefinecoarsecoarse16-bit16-bitdepthdepthvideovideosubsubcodecodetitleauthorsvenuedescriptionRuntime [s]averagepersonridercartruckbustrainmotorcyclebicycle
FCN 8syesyesnonononononononononoyesyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
0.541.755.933.483.922.230.826.731.149.6
RRR-ResNet152-MultiScaleyesyesyesyesnonononononononononoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
n/a48.560.638.088.426.139.643.138.853.6
Dilation10yesyesnonononononononononoyesyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
4.042.056.334.585.821.832.727.628.049.1
AdelaideyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
35.046.756.238.077.134.047.033.438.149.9
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono22yesyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
4.034.940.723.178.621.432.427.620.834.6
DeepLab LargeFOV Strongyesyesnononononononono22yesyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
4.034.540.523.378.820.331.924.821.135.2
DPNyesyesyesyesnononononono33nonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
n/a28.138.912.878.613.424.019.210.727.2
Segnet basicyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
0.0632.044.322.778.416.124.320.715.833.6
Segnet extendedyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
0.0634.249.927.181.115.323.718.519.638.4
CRFasRNNyesyesnononononononono22yesyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
0.734.450.617.881.118.025.030.322.330.1
Scale invariant CNN + CRFyesyesnonononoyesyesnonononoyesyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
n/a44.959.040.084.019.735.833.036.051.4
DPNyesyesnonononononononononononoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
n/a39.153.628.985.020.128.324.924.846.9
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a41.660.633.486.719.525.625.830.550.5
Adelaide_contextyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
n/a51.761.541.286.335.847.742.042.157.4
NVSegNetyesyesnonononononononononononoAnonymousDuring inference, we use the image at 2 different scales; the same is done for training.
0.441.451.933.385.025.634.525.327.847.6
ENetyesyesnononononononono22yesyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
0.01334.447.620.880.017.526.821.820.939.4
DeepLabv2-CRFyesyesnonononononononononoyesyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
n/a42.651.531.285.426.537.834.527.446.5
m-TCFsyesyesyesyesnonononononononononoAnonymousConvolutional Neural Network
1.043.655.833.686.624.836.531.930.548.8
DeepLab+DynamicCRFyesyesnonononononononononononoru.nl
n/a38.344.529.881.319.831.330.428.740.7
LRR-4xyesyesnonononononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
n/a48.061.540.187.331.241.928.436.457.3
LRR-4xyesyesyesyesnonononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
n/a47.961.039.786.730.440.134.535.455.2
Le_Selfdriving_VGGyesyesnonononononononononononoAnonymous
n/a35.648.525.381.016.526.821.625.039.9
SQyesyesnonononononononononononoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
0.0632.347.821.784.67.821.016.614.943.6
SAITyesyesyesyesnonononononononononoAnonymousAnonymous
4.051.863.441.288.933.448.244.438.756.4
FoveaNetyesyesnonononononononononononoFoveaNetXin Li, Jiashi Feng1.caffe-master
2.resnet-101
3.single scale testing

Previously listed as "LXFCRN".
n/a52.466.746.688.532.841.240.942.959.9
RefineNetyesyesnonononononononononoyesyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
n/a47.255.635.886.930.142.642.434.350.0
SegModelyesyesnonononononononononononoAnonymousBoth train set (2975) and val set (500) are used to train model for this submission.
0.856.163.347.089.841.456.248.445.456.7
TuSimpleyesyesnonononononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
n/a53.662.743.988.238.549.740.343.761.4
Global-Local-RefinementyesyesnonononononononononononoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently published and similarly named approach, the submission name was updated.
n/a53.465.345.389.136.550.542.739.458.2
XPARSEyesyesnonononononononononononoAnonymous
n/a49.261.740.587.531.342.735.938.355.4
ResNet-38yesyesnonononononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
n/a59.171.950.690.542.054.251.448.164.2
SegModelyesyesyesyesnonononononononononoAnonymous
n/a56.465.246.790.141.555.148.345.359.2
Deep Layer Cascade (LC)yesyesnonononononononononononoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
n/a47.060.536.187.926.142.132.335.156.0
FRRNyesyesnononononononono22yesyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
n/a45.562.939.087.922.035.628.135.153.0
MNet_MPRGyesyesnonononononononononononoChubu University, MPRGwithout val dataset, external dataset (e.g. image net) and post-processing
0.646.666.940.689.222.532.930.133.357.2
ResNet-38yesyesyesyesnonononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
n/a57.868.548.890.542.051.950.747.062.8
FCN8s-QunjieYuyesyesnonononononononononononoAnonymous
n/a34.553.228.184.99.418.213.322.046.6
RGB-D FCNyesyesyesyesnonoyesyesnonononononoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
n/a42.156.035.585.822.333.523.330.050.6
MultiBoostyesyesyesyesnonoyesyesnono22nonoAnonymousBoosting based solution.
Publication is under review.
0.2532.541.225.877.811.223.624.921.833.4
GoogLeNet FCNyesyesnonononononononononononoGoing Deeper with ConvolutionsChristian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
n/a38.654.028.685.016.929.619.325.749.8
ERFNet (pretrained)yesyesnononononononono22yesyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


0.0244.160.134.786.122.637.631.229.051.4
ERFNet (from scratch)yesyesnononononononono22yesyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining or coarse labels
0.0240.456.731.584.919.435.125.024.346.6
TuSimple_CoarseyesyesyesyesnonononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the
PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
more details
n/a56.967.647.389.238.352.554.844.860.9
SAC-multipleyesyesnonononononononononononoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
more details
n/a55.267.048.490.339.450.642.043.560.7
NetWarpyesyesyesyesnonononoyesyesnonononoAnonymous
more details
n/a59.569.752.390.841.955.752.751.461.8
depthAwareSeg_RNN_ffyesyesnonononononononononoyesyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
more details
n/a56.066.346.788.437.350.752.045.560.8
Ladder DenseNetyesyesnonononononononononoyesyesLadder-style DenseNets for Semantic Segmentation of Large Natural ImagesIvan Krešo, Josip Krapac, Siniša ŠegvićICCV 2017https://ivankreso.github.io/publication/ladder-densenet/
more details
0.4551.668.842.990.129.042.540.738.260.4
Real-time FCNyesyesyesyesnonononononononononoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
more details
0.04445.559.236.185.225.640.035.632.749.7
GridNetyesyesnonononononononononononoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
more details
n/a44.157.336.485.621.437.829.332.052.7
PEARLyesyesnonononononoyesyesnonononoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
more details
n/a51.663.242.588.033.245.842.640.856.8
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnonononononononoyesyesAnonymous
more details
0.6942.153.135.382.920.836.532.128.547.4
PSPNetyesyesyesyesnonononononononoyesyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
more details
n/a59.669.351.290.342.655.156.248.763.5
motovisyesyesyesyesnonononononononononomotovis.com
more details
n/a57.771.348.191.140.950.850.546.662.1
ML-CRNNyesyesnonononononononononononoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
more details
n/a47.159.139.085.830.340.134.134.753.6
Hybrid ModelyesyesnonononononononononononoAnonymous
more details
n/a41.254.132.283.322.033.424.331.249.4
tek-IflyyesyesnonononononononononononoIflytekIflytek-yinusing a fusion strategy of three single models; the best result of a single model is 80.01%, multi-scale
more details
n/a60.169.948.390.144.655.457.750.264.5
GridNetyesyesnonononononononononoyesyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set without extra coarse annotated data (only 2975 images) and no pre-training (ImageNet) nor pre or post-processing.
more details
n/a44.557.737.185.922.038.829.232.053.2
firenetyesyesnononononononono22nonoAnonymous
more details
n/a47.864.940.185.927.640.531.737.454.1
DeepLabv3yesyesyesyesnonononononononononoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects
at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment the ASPP module with image-level features encoding global context and further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
more details
n/a62.172.953.991.246.057.855.952.965.9
EdgeSenseSegyesyesnonononononononononononoAnonymousDeep segmentation network with hard negative mining and other tricks.
more details
n/a57.168.449.188.841.853.246.946.762.3
ScaleNetyesyesyesyesnonononononononononoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krarß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation.
Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a
scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling
field size accordingly. Our model is compact and can be extended easily to other research domains. Finally,
the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate
on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
more details
n/a53.165.644.788.634.647.242.443.258.6
K-netyesyesnonononononononononononoXinLiang Zhong
more details
n/a52.862.846.788.337.051.742.638.655.0
MSNETyesyesnonononononononononononoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
more details
0.257.174.251.489.737.652.838.346.666.6
Multitask LearningyesyesnonononononononononononoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
more details
n/a57.466.949.389.440.050.854.147.261.8
DeepMotionyesyesnonononononononononononoAnonymousWe propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation.
more details
n/a58.667.549.289.944.255.255.646.360.9
SR-AICyesyesyesyesnonononononononononoAnonymous
more details
n/a60.769.351.390.747.057.754.250.964.2
Roadstar.ai_CV(SFNet)yesyesnonononononononononononoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi Chensame focus net (SFNet), based on only fine labels, with focus on the loss distribution and the same focus on every layer of the feature map
more details
0.260.875.852.289.948.959.544.847.867.8
DFNyesyesyesyesnonononononononononoLearning a Discriminative Feature Network for Semantic SegmentationChangqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, Nong SangarxivMost existing methods of semantic segmentation still suffer from two aspects of challenges: intra-class inconsistency and inter-class indistinction. To tackle these two problems, we propose a Discriminative Feature Network (DFN), which contains two sub-networks: Smooth Network and Border Network. Specifically, to handle the intra-class inconsistency problem, we specially design a Smooth Network with Channel Attention Block and global average pooling to select the more discriminative features. Furthermore, we propose a Border Network to make the bilateral features of boundary distinguishable with deep semantic boundary supervision. Based on our proposed DFN, we achieve state-of-the-art performance 86.2% mean IOU on PASCAL VOC 2012 and 80.3% mean IOU on Cityscapes dataset.
more details
n/a58.369.747.590.343.654.152.446.262.5
RelationNet_CoarseyesyesyesyesnonononononononononoRelationNet: Learning Deep-Aligned Representation for Semantic Image SegmentationYueqing ZhuangICPR Semantic image segmentation, which assigns labels in pixel level, plays a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning. However, one central problem of these methods is that deep convolution neural network gives little consideration to the correlation among pixels. To handle this issue, in this paper, we propose a novel deep neural network named RelationNet, which utilizes CNN and RNN to aggregate context information. Besides, a spatial correlation loss is applied to supervise RelationNet to align features of spatial pixels belonging to same category. Importantly, since it is expensive to obtain pixel-wise annotations, we exploit a new training method for combining the coarsely and finely labeled data. Separate experiments show the detailed improvements of each proposal. Experimental results demonstrate the effectiveness of our proposed method to the problem of semantic image segmentation.
more details
n/a61.971.955.891.244.158.655.552.665.5
ARSAITyesyesnonononononononononononoAnonymousanonymous
more details
1.048.261.738.488.227.642.833.637.056.7
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnonononononononoyesyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
more details
n/a65.973.654.290.155.965.366.556.465.6
EFBNETyesyesnonononononononononononoAnonymous
more details
n/a59.968.749.890.243.055.561.049.062.3
Ladder DenseNet v2yesyesnonononononononononononoJournal submissionAnonymousDenseNet-121 model used in downsampling path with ladder-style skip connections upsampling path on top of it.
more details
1.054.668.345.590.535.548.545.441.761.2
ESPNetyesyesnononononononono22yesyesESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh HajishirziWe introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.
more details
0.008931.845.819.281.715.224.316.816.235.5
ENet with the Lovász-Softmax lossyesyesnononononononono22yesyesThe Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networksMaxim Berman, Amal Rannen Triki, Matthew B. BlaschkoarxivThe Lovász-Softmax loss is a novel surrogate for optimizing the IoU measure in neural networks. Here we finetune the weights provided by the authors of ENet (arXiv:1606.02147) with this loss, for 10'000 iterations on training dataset. The runtimes are unchanged with respect to the ENet architecture.
more details
0.01334.142.826.379.418.125.921.722.136.3
DRN_CRL_CoarseyesyesyesyesnonononononononoyesyesDense Relation Network: Learning Consistent and Context-Aware Representation For Semantic Image SegmentationYueqing ZhuangICIPDRN_CoarseSemantic image segmentation, which aims at assigning pixel-wise category, is one of the challenging image understanding problems. Global context plays an important role in local pixel-wise category assignment. To make the best of global context, in this paper, we propose dense relation network (DRN) and context-restricted loss (CRL) to aggregate global and local information. DRN uses Recurrent Neural Network (RNN) with different skip lengths in spatial directions to get context-aware representations while CRL helps aggregate them to learn consistency. Compared with previous methods, our proposed method takes full advantage of hierarchical contextual representations to produce high-quality results. Extensive experiments demonstrate that our methods achieve significant state-of-the-art performance on Cityscapes and Pascal Context benchmarks, with mean-IoU of 82.8% and 49.0% respectively.
more details
n/a61.171.453.990.948.253.956.050.264.3
ShuffleSegyesyesyesyesnonononononononononoShuffleSeg: Real-time Semantic Segmentation NetworkMostafa Gamal, Mennatullah Siam, Mo'men Abdel-RazekUnder Review by ICIP 2018ShuffleSeg: An efficient realtime semantic segmentation network with skip connections and ShuffleNet units
more details
n/a32.444.019.980.016.222.722.216.237.7
SkipNet-MobileNetyesyesyesyesnonononononononononoRTSeg: Real-time Semantic Segmentation FrameworkMennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin JagersandUnder Review by ICIP 2018An efficient realtime semantic segmentation network with skip connections based on MobileNet.

more details
n/a35.245.426.180.117.627.623.522.138.9
ThunderNetyesyesnononononononono22nonoAnonymous
more details
0.010440.453.730.384.021.034.932.626.540.6
PAC: Perspective-adaptive ConvolutionsyesyesnonononononononononononoPerspective-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)Many existing scene parsing methods adopt Convolutional Neural Networks with receptive fields of fixed sizes and shapes, which frequently results in inconsistent predictions of large objects and invisibility of small objects. To tackle this issue, we propose perspective-adaptive convolutions to acquire receptive fields of flexible sizes and shapes during scene parsing. Through adding a new perspective regression layer, we can dynamically infer the position-adaptive perspective coefficient vectors utilized to reshape the convolutional patches. Consequently, the receptive fields can be adjusted automatically according to the various sizes and perspective deformations of the objects in scene images. Our proposed convolutions are differentiable to learn the convolutional parameters and perspective coefficients in an end-to-end way without any extra training supervision of object sizes. Furthermore, considering that the standard convolutions lack contextual information and spatial dependencies, we propose a context adaptive bias to capture both local and global contextual information through average pooling on the local feature patches and global feature maps, followed by flexible attentive summing to the convolutional results. The attentive weights are position-adaptive and context-aware, and can be learned through adding an additional context regression layer. Experiments on Cityscapes and ADE20K datasets well demonstrate the effectiveness of the proposed methods.
more details
n/a55.767.248.590.442.351.742.243.260.0
SU_NetnonononononononononononononoAnonymous
more details
n/a52.362.442.789.035.546.643.442.456.4
MobileNetV2PlusyesyesnonononononononononononoHuijun LiuMobileNetV2Plus
more details
n/a46.859.840.985.730.337.131.334.454.8
DeepLabv3+yesyesyesyesnonononononononoyesyesEncoder-Decoder with Atrous Separable Convolution for Semantic Image SegmentationLiang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig AdamarXivSpatial pyramid pooling module or encoder-decoder structure are used in deep neural networks for the semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We will provide more details in the coming update on the arXiv report.
more details
n/a62.473.153.791.447.158.856.353.265.8
RFMobileNetV2PlusyesyesnonononononononononononoHuijun LiuReceptive Field MobileNetV2Plus for Semantic Segmentation
more details
n/a49.464.244.387.331.439.634.137.657.1
GoogLeNetV1_ROByesyesnonononononononononononoAnonymousGoogLeNet-v1 FCN trained on Cityscapes, KITTI, and ScanNet, as required by the Robust Vision Challenge at CVPR'18 (http://robustvision.net/)
more details
n/a35.146.927.283.416.124.418.322.642.2
SAITv2yesyesyesyesnonononononononononoAnonymous
more details
0.02536.544.424.181.523.532.630.019.136.4
GUNetyesyesnononononononono22nonoGuided Upsampling Network for Real-Time Semantic SegmentationDavide MazziniarxivGuided Upsampling Network for Real-Time Semantic Segmentation
more details
0.0340.853.729.985.621.333.830.425.047.1
RMNetyesyesnonononononononononononoAnonymousA fast and light net for semantic segmentation.
more details
0.01437.351.529.083.919.428.824.221.740.2
ContextNetyesyesnonononononononononononoContextNet: Exploring Context and Detail for Semantic Segmentation in Real-timeRudra PK Poudel, Ujwal Bonde, Stephan Liwicki, Christopher ZacharXivModern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naive adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyze our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.3 frames per second at full (1024x2048) resolution.
more details
0.023836.847.124.782.919.330.628.321.539.9
RFLRyesyesyesyesyesyesnononono44nonoRandom Forest with Learned Representations for Semantic SegmentationByeongkeun Kang, Truong Q. NguyenIEEE Transactions on Image ProcessingRandom Forest with Learned Representations for Semantic Segmentation
more details
0.037.818.50.423.62.44.62.00.610.0
DPCyesyesyesyesnonononononononoyesyesSearching for Efficient Multi-Scale Architectures for Dense Image PredictionLiang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon ShlensNIPS 2018In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that achieve state-of-the-art performance. Additionally, the resulting architecture (called DPC for Dense Prediction Cell) is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.
more details
n/a63.373.954.091.649.359.558.852.966.7
NV-ADLRyesyesyesyesnonononononononononoAnonymousNVIDIA Applied Deep Learning Research
more details
n/a64.272.852.492.451.863.958.354.467.3
Adaptive Affinity Field on PSPNetyesyesnonononononononononoyesyesAdaptive Affinity Field for Semantic SegmentationTsung-Wei Ke*, Jyh-Jing Hwang*, Ziwei Liu, Stella X. YuECCV 2018Existing semantic segmentation methods mostly rely on per-pixel supervision, unable to capture structural regularity present in natural images. Instead of learning to enforce semantic labels on individual pixels, we propose to enforce affinity field patterns in individual pixel neighbourhoods, i.e., the semantic label patterns of whether neighbouring pixels are in the same segment should match between the prediction and the ground-truth. The affinity fields characterize geometric relationships within the image, such as "motorcycles have round wheels". We further develop a novel method for learning the optimal neighbourhood size for each semantic category, with an adversarial loss that optimizes over worst-case scenarios. Unlike the common Conditional Random Field (CRF) approaches, our adaptive affinity field (AAF) method has no extra parameters during inference, and is less sensitive to appearance changes in the image.
more details
n/a56.168.148.589.938.849.349.442.362.8
APMoE_seg_ROByesyesnonononononononononoyesyesPixel-wise Attentional Gating for Parsimonious Pixel LabelingShu Kong, Charless FowlkesarxivThe Pixel-level Attentional Gating (PAG) unit is trained to choose for each pixel the pooling size to adopt to aggregate contextual region around it. There are multiple branches with different dilate rates for varied pooling size, thus varying receptive field. For this ROB challenge, PAG is expected to robustly aggregate information for final prediction.

This is our entry for Robust Vision Challenge 2018 workshop (ROB). The model is based on ResNet50, trained over mixed dataset of Cityscapes, ScanNet and Kitti.
more details
0.930.647.210.884.423.325.39.111.832.7
BatMAN_ROByesyesyesyesnonononononononononoAnonymousbatch-normalized multistage attention network
more details
1.029.344.49.483.510.624.88.213.739.7
HiSS_ROByesyesnononononononono22nonoAnonymous
more details
0.0632.140.919.879.917.527.221.613.436.7
VENUS_ROByesyesnonononononononononononoAnonymousVENUS_ROB
more details
n/a37.151.022.983.321.031.824.720.142.2
VlocNet++_ROBnonononononononononononononoAnonymous
more details
n/a33.940.823.982.418.130.016.121.138.6
AHiSS_ROByesyesyesyesnononononono22nonoAnonymousAugmented Hierarchical Semantic Segmentation
more details
0.0639.845.027.082.729.734.931.724.143.1
IBN-PSP-SA_ROByesyesnonononononononononononoAnonymousIBN-PSP-SA_ROB
more details
n/a46.357.535.188.030.244.833.229.452.4
LDN2_ROByesyesnonononononononononononoAnonymousLadder DenseNet: https://ivankreso.github.io/publication/ladder-densenet/
more details
1.052.365.142.990.134.948.239.838.958.2
MiniNetyesyesnononononononono44nonoAnonymous
more details
0.00415.822.75.867.34.47.13.83.811.6
AdapNetv2_ROByesyesnonononononononononononoAnonymous
more details
n/a34.942.923.783.518.431.617.021.740.5
MapillaryAI_ROByesyesnonononononononononononoAnonymous
more details
n/a60.170.250.690.542.958.554.849.463.5
FCN101_ROByesyesnonononononononononononoAnonymous
more details
n/a11.312.20.065.54.10.04.10.04.2
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]Bosch autodrive challenge
more details
n/a48.558.341.881.934.343.141.531.555.4
EnsembleModel_BoschyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name was MaskRCNN_BOSH,firefly]We ensembled three models (erfnet, deeplab-mobilenet, tusimple) and gained a 0.57 improvement in the IoU Classes value. The best single model is 73.8549
more details
n/a48.960.441.186.230.644.040.133.755.4
EVANetyesyesnonononononononononononoAnonymous
more details
n/a44.060.335.586.623.035.631.231.048.8
CLRCNetyesyesnonononononononononononoCLRCNet: Cascaded Low-Rank Convolutions for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method.
more details
0.01335.951.826.684.213.627.522.018.343.4
Edgenetyesyesnononononononono22nonoAnonymousA lightweight semantic segmentation network combined with edge information and channel-wise attention mechanism.
more details
0.0346.662.538.188.425.438.333.034.452.5
L2-SPyesyesyesyesnonononononononoyesyesExplicit Inductive Bias for Transfer Learning with Convolutional NetworksXuhong Li, Yves Grandvalet, Franck DavoineICML-2018With a simple variant of weight decay, L2-SP regularization (see the paper for details), we reproduced PSPNet based on the original ResNet-101 using "train_fine + val_fine + train_extra" set (2975 + 500 + 20000 images), with a small batch size 8. The sync batch normalization layer is implemented in Tensorflow (see the code).
more details
n/a58.167.949.890.142.652.851.947.762.3
ALV303yesyesnonononononononononononoAnonymous
more details
0.252.067.747.490.231.146.134.040.159.5
NCTU-ITRIyesyesnononononononono22nonoAnonymousFor the purpose of fast semantic segmentation, we design a CNN-based encoder-decoder architecture, which is called DSNet. The encoder part is constructed based on the concept of DenseNet, and a simple decoder is adopted to make the network more efficient without degrading the accuracy. We pre-train the encoder network on the ImageNet dataset. Then, only the fine-annotated Cityscapes dataset (2975 training images) is used to train the complete DSNet. The DSNet demonstrates a good trade-off between accuracy and speed. It can process 68 frames per second on 1024x512 resolution images on a single GTX 1080 Ti GPU.
more details
0.014741.456.933.285.619.333.331.927.044.4
ADSCNetyesyesnonononononononononononoADSCNet: Asymmetric Depthwise Separable Convolution for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method for mobile devices.
more details
0.01336.853.525.284.416.625.524.419.944.6
SRC-B-MachineLearningLabyesyesyesyesnonononononononononoJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoSamsung Research Center MachineLearningLab. The result is tested with multi-scale and flip. The paper is in preparation.
more details
n/a60.772.951.491.044.257.053.949.565.3
Tencent AI LabyesyesyesyesnonononononononononoAnonymous
more details
n/a63.971.555.890.150.663.256.157.166.5
ERINetyesyesnononononononono22nonoAnonymousEfficient residual inception networks for real-time semantic segmentation
more details
0.02344.159.735.487.623.231.932.133.249.6
PGCNet_Res101_fineyesyesnonononononononononononoAnonymousWe choose ResNet101 pretrained on ImageNet as our backbone, then use both the train-fine and the val-fine data to train our model with batch size 8 for 80,000 iterations without any bells and whistles. We will release our paper later.
more details
n/a60.772.253.890.643.757.749.552.066.5
EDANetyesyesnononononononono22yesyesEfficient Dense Modules of Asymmetric Convolution for Real-Time Semantic SegmentationShao-Yuan Lo (NCTU), Hsueh-Ming Hang (NCTU), Sheng-Wei Chan (ITRI), Jing-Jhih Lin (ITRI)Training data: Fine annotations only (train+val. set, 2975+500 images) without any pretraining nor coarse annotations.
For training on fine annotations (train set only, 2975 images), it attains a mIoU of 66.3%.

Runtime: (resolution 512x1024) 0.0092s on a single GTX 1080Ti, 0.0123s on a single Titan X.
more details
0.009241.854.932.785.118.635.431.928.447.1
OCNet_ResNet101_fineyesyesnonononononononononononoAnonymousContext is essential for various computer vision tasks.
The state-of-the-art scene parsing methods define the context as the prior of the scene categories (e.g., bathroom, bedroom, street).
Such scene context is not suitable for the street scene parsing tasks as most of the scenes are similar.

In this work, we propose the Object Context that captures the prior of the object's category that the pixel belongs to.
We compute the object context by aggregating all the pixels' features according to an attention map that encodes the probability that each pixel belongs to the same category as the associated pixel.
Specifically, we employ the self-attention method to compute the pixel-wise attention map.

We further propose the Pyramid Object Context and Atrous Spatial Pyramid Object Context to handle the problem of multi-scales.
more details
n/a61.372.154.690.743.957.553.351.366.7
Knowledge-AwareyesyesnonononononononononononoAnonymousKnowledge-Aware Semantic Segmentation
more details
n/a55.667.645.689.336.349.452.643.659.9
CASIA_IVA_DANet_NoCoarseyesyesnonononononononononoyesyesDual Attention Network for Scene SegmentationJun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing LuCVPR 2019We address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each
position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps.
We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results
n/a62.373.854.691.946.260.352.551.966.7
LDFNetyesyesnonononoyesyesnono22yesyesIncorporating Luminance, Depth and Color Information by Fusion-based Networks for Semantic SegmentationShang-Wei Hung, Shao-Yuan LoWe propose a preferred solution, which incorporates Luminance, Depth and color information by a Fusion-based network named LDFNet. It includes a distinctive encoder sub-network to process the depth maps and further employs the luminance images to assist the depth information in a process. LDFNet achieves very competitive results compared to the other state-of-the-art systems on the challenging Cityscapes dataset, while it maintains an inference speed faster than most of the existing top-performing networks. The experimental results show the effectiveness of the proposed information-fused approach and the potential of LDFNet for road scene understanding tasks.
n/a46.361.438.487.727.939.131.433.051.8
CGNetyesyesnonononononononononoyesyesTianyi Wu et alWe propose a novel Context Guided Network for semantic segmentation on mobile devices. We first design a Context Guided (CG) block by considering the inherent characteristics of semantic segmentation. The CG block aggregates local features, surrounding context features and global context features effectively and efficiently. Based on the CG block, we develop the Context Guided Network (CGNet), which not only has a strong capacity for localization and recognition, but also has a low computational and memory footprint. Under a similar number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on the Cityscapes and CamVid datasets verify the effectiveness of the proposed approach.
Specifically, without any post-processing, the proposed approach achieves 64.8% mean IoU on the Cityscapes test set with less than 0.5 M parameters, and has a frame rate of 50 fps on one NVIDIA Tesla K80 card for a 2048 × 1024 high-resolution image.
0.0235.952.129.384.415.830.016.021.038.7
SAITv2-lightyesyesyesyesnonononononononononoAnonymous
0.02544.053.434.986.027.238.235.329.247.8
Deform_ResNet_BalancedyesyesnonononononononononononoAnonymous
0.25831.248.619.377.47.923.214.618.840.0
NfS-SegyesyesyesyesnonoyesyesyesyesnonononoUncertainty-Aware Knowledge Distillation for Real-Time Scene Segmentation: 7.43 GFLOPs at Full-HD Image with 120 fpsAnonymous
0.0083731244.454.435.786.326.237.936.529.948.5
Improving Semantic Segmentation via Video Propagation and Label RelaxationyesyesyesyesnonononoyesyesnonoyesyesImproving Semantic Segmentation via Video Propagation and Label RelaxationYi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan CatanzaroCVPR 2019Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate misalignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018.
n/a64.472.955.292.250.161.063.053.866.8
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as a post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules, and we test it against different datasets for outdoor scene understanding.
0.0088439.051.229.483.620.334.024.625.343.6
SwiftNetRN-18yesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
0.024352.065.543.589.233.043.844.437.758.8
Fast-SCNNyesyesyesyesnonononononononononoFast-SCNN: Fast Semantic Segmentation NetworkRudra PK Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our 'learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
0.008137.945.125.583.320.032.034.220.742.3
Fast-SCNN (Half-resolution)yesyesyesyesnononononono22nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our 'learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
0.003531.936.717.279.616.827.029.914.933.1
Fast-SCNN (Quarter-resolution)yesyesnononononononono44nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our 'learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
0.0020623.028.210.270.710.418.118.26.921.5
DSNetyesyesyesyesnononononono22yesyesDSNet for Real-Time Driving Scene Semantic SegmentationWenfu WangDSNet for Real-Time Driving Scene Semantic Segmentation
0.02742.857.135.085.421.937.128.430.147.5
SwiftNetRN-18 pyramidyesyesnonononononononononononoAnonymous
n/a48.462.741.387.730.339.934.335.555.6
DF-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019DF1-Seg-d8
0.00745.055.934.784.326.239.337.929.951.8
DF-SegyesyesnonononononononononononoAnonymousDF2-Seg2
0.01850.259.441.387.931.246.045.234.855.7
DDARyesyesyesyesnonononononononononoAnonymousDiDi Labs, AR Group
n/a62.773.054.590.847.758.757.552.666.7
LDN-121yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-121 trained on train+val, fine labels only. Single-scale inference.
0.04854.767.345.290.038.146.748.041.160.9
TKCNyesyesnonononononononononoyesyesTree-structured Kronecker Convolutional Network for Semantic SegmentationTianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, Jintao Li
n/a61.372.452.990.943.558.354.551.166.7
RPNetyesyesnononononononono22yesyesResidual Pyramid Learning for Single-Shot Semantic SegmentationXiaoyu Chen, Xiaotian Lou, Lianfa Bai, Jing HanarXivwe put forward a method for single-shot segmentation in a feature residual pyramid network (RPNet), which learns the main and residuals of segmentation by decomposing the label at different levels of residual blocks.
0.00843.657.734.287.224.534.028.730.951.8
naviyesyesnonononononononononononoyuxbmulti-scale test
n/a60.167.049.791.547.452.260.747.564.8
Auto-DeepLab-LyesyesyesyesnonononononononoyesyesAuto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image SegmentationChenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, Li Fei-FeiarxivIn this work, we study Neural Architecture Search for semantic image segmentation, an important computer vision task that assigns a semantic label to every pixel in an image. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network structure that controls the spatial resolution changes. This choice simplifies the search space, but becomes increasingly problematic for dense image prediction which exhibits a lot more network level architectural variations. Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space. We present a network level search space that includes many popular designs, and develop a formulation that allows efficient gradient-based architecture search (3 P100 GPU days on Cityscapes images). We demonstrate the effectiveness of the proposed method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Without any ImageNet pretraining, our architecture searched specifically for semantic image segmentation attains state-of-the-art performance. Please refer to https://arxiv.org/abs/1901.02985 for details.
n/a61.073.451.391.643.856.855.750.165.3
LiteSeg-Darknet19yesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
0.010249.164.340.988.329.741.135.737.055.6
AdapNet++yesyesyesyesnonononononononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019In this work, we propose the AdapNet++ architecture for semantic segmentation that aims to achieve the right trade-off between performance and computational complexity of the model. AdapNet++ incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling (eASPP) module that has a larger effective receptive field with more than 10x fewer parameters compared to the standard ASPP, complemented with a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance while simultaneously being efficient in terms of both the number of parameters and inference time. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
n/a59.570.350.090.344.655.154.448.263.3
SSMAyesyesyesyesnonoyesyesnonononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams, rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed SSMA fusion mechanism which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. Extensive experimental evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance in addition to providing exceptional robustness in adverse perceptual conditions. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
n/a62.372.652.491.447.858.158.651.765.4
LiteSeg-MobilenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
0.006245.360.637.284.026.435.033.833.251.9
LiteSeg-ShufflenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
0.00751841.052.733.381.722.733.730.726.646.7
Fast OCNetyesyesnonononononononononononoAnonymous
n/a61.071.654.391.043.357.554.551.864.0
ShuffleNet v2 + DPCyesyesyesyesnonononononononoyesyesAn efficient solution for semantic segmentation: ShuffleNet V2 with atrous separable convolutionsSercan Turkmen, Janne HeikkilaShuffleNet v2 with DPC at output_stride 16.
n/a43.655.534.985.727.035.630.430.649.0
ERSNet-coarseyesyesyesyesnononononono44nonoAnonymous
0.01239.251.628.185.120.433.529.524.041.3
MiniNet-v2-coarseyesyesyesyesnononononono22nonoAnonymous
0.01239.252.927.884.719.634.030.623.141.3
SwiftNetRN-18 ensembleyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
n/a51.465.043.589.132.544.041.237.858.3
EFC_syncyesyesnonononononoyesyesnonononoAnonymous
n/a56.766.547.390.639.851.052.643.861.7
PL-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019Following "partial order pruning", we conducted architecture search experiments on the Snapdragon 845 platform and obtained PL1A/PL1A-Seg.

1. Snapdragon 845
2. NCNN Library
3. Latency evaluated at 640x384

0.019241.251.532.585.122.135.930.224.748.0
MiniNet-v2-pretrainedyesyesyesyesnononononono22nonoAnonymous
0.01239.752.129.785.420.133.530.823.442.8
GALD-NetyesyesyesyesyesyesyesyesnonononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose the Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information around the position. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research)
n/a64.573.756.791.152.859.660.453.768.1
GALD-netyesyesyesyesnonononononononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose the Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information surrounding the position.
n/a63.572.955.890.951.258.558.752.567.3
ndnetyesyesnonononononononononononoAnonymous
0.02434.747.424.183.919.226.418.718.939.3
HRNetV2yesyesnonononononononononoyesyesHigh-Resolution Representations for Labeling Pixels and RegionsKe Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, Jingdong WangThe high-resolution network (HRNet), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions.
n/a61.273.856.391.341.656.351.552.266.2
SPGNetyesyesnonononononononononononoSPGNet: Semantic Prediction Guidance for Scene ParsingBowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, Honghui ShiICCV 2019Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance to their single-stage counterparts. However, few efforts have been made to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through the guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, in which our SPGNet attains 81.1% on the test set using only 'fine' annotations.
n/a61.473.554.891.642.657.653.751.366.6
LDN-161yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-161 trained on train+val, fine labels only. Inference on multi-scale inputs.
2.056.468.248.491.239.150.248.843.861.6
GGCFyesyesyesyesnonononononononononoAnonymous
n/a63.072.254.291.150.259.459.751.665.6
GFF-NetyesyesnonononononononononononoGFF: Gated Fully Fusion for Semantic SegmentationXiangtai Li, Houlong Zhao, Yunhai Tong, Kuiyuan YangWe propose Gated Fully Fusion (GFF) to fuse features from multiple levels through gates in a fully connected way. Specifically, features at each level are enhanced by higher-level features with stronger semantics and lower-level features with more details, and gates are used to control the pass of useful information, which significantly reduces noise propagation during fusion. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research)
n/a62.172.753.791.146.058.156.851.667.1
Gated-SCNNyesyesnonononononononononoyesyesGated-SCNN: Gated Shape CNNs for Semantic SegmentationTowaki Takikawa, David Acuna, Varun Jampani, Sanja Fidler
n/a64.373.755.692.349.161.961.452.967.4
ESPNetv2yesyesnononononononono22yesyesESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural NetworkSachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh HajishirziCVPR 2019We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the Penn Treebank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network has better generalization properties than ShuffleNetv2 when tested on the MSCOCO multi-object classification task and the Cityscapes urban scene semantic segmentation task. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2
n/a36.050.523.583.419.331.522.519.238.3
MRFMyesyesyesyesnonononononononononoMulti Receptive Field Network for Semantic SegmentationJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoWACV2020Semantic segmentation is one of the key tasks in computer vision, which is to assign a category label to each pixel in an image. Despite significant progress achieved recently, most existing methods still suffer from two challenging issues: 1) the size of objects and stuff in an image can be very diverse, demanding for incorporating multi-scale features into the fully convolutional networks (FCNs); 2) the pixels close to or at the boundaries of object/stuff are hard to classify due to the intrinsic weakness of convolutional networks. To address the first issue, we propose a new Multi-Receptive Field Module (MRFM), explicitly taking multi-scale features into account. For the second issue, we design an edge-aware loss which is effective in distinguishing the boundaries of object/stuff. With these two designs, our Multi Receptive Field Network achieves new state-of-the-art results on two widely-used semantic segmentation benchmark datasets. Specifically, we achieve a mean IoU of 83.0% on the Cityscapes dataset and 88.4% mean IoU on the Pascal VOC2012 dataset.
n/a62.274.154.791.347.757.053.951.467.6
DGCNetyesyesnonononononononononononoDual Graph Convolutional Network for Semantic SegmentationLi Zhang*, Xiangtai Li*, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, Philip H.S. TorrBMVC 2019We propose the Dual Graph Convolutional Network (DGCNet), which models the global context of the input feature by modelling two orthogonal graphs in a single framework. (Joint work: University of Oxford, Peking University and DeepMotion AI Research)
n/a61.772.053.991.247.157.654.051.966.2
dpcan_trainval_os16_225yesyesnonononononononononononoAnonymous
n/a61.172.152.190.945.056.457.849.564.8
Learnable Tree FilteryesyesnonononononononononoyesyesLearnable Tree Filter for Structure-preserving Feature TransformLin Song; Yanwei Li; Zeming Li; Gang Yu; Hongbin Sun; Jian Sun; Nanning ZhengNeurIPS 2019Learnable Tree Filter for Structure-preserving Feature Transform
n/a60.771.951.191.144.255.456.451.564.4
FreeNetyesyesnonononononononononononoAnonymous
n/a40.356.833.984.525.031.225.326.639.4
HRNetV2 + OCRyesyesyesyesnonononononononoyesyesHigh-Resolution Representations for Labeling Pixels and Regions; OCNet: Object Context Network for Scene ParsingHRNet Team; OCR TeamHRNetV2W48 + OCR. OCR is an extension of object context networks https://arxiv.org/pdf/1809.00916.pdf
n/a62.073.054.191.243.859.157.851.065.6
Valeo DAR GermanyyesyesyesyesnonononononononononoAnonymousValeo DAR Germany, New Algo Lab

n/a62.973.355.191.647.358.657.853.266.8
GLNet_fineyesyesnonononononononononononoAnonymousThe proposed network architecture combines spatial information with multi-scale context information and repairs the boundaries and details of the segmented object through channel attention modules. (Uses the train-fine and the val-fine data.)
n/a58.369.450.090.539.754.652.747.362.1
MCDNyesyesnonononononononononononoAnonymous
n/a62.471.253.891.550.562.953.451.364.7
AAF+GLRyesyesnonononononononononononoAnonymous
n/a54.968.848.590.435.647.145.841.262.0
HRNetV2 + OCR (w/ ASP)yesyesyesyesnonononononononoyesyesopenseg-group (OCR team + HRNet team)Our approach is based on a single HRNet48V2 and an OCR module combined with ASPP. We apply depth-based multi-scale ensemble weights during testing (provided by DeepMotion AI Research).
n/a64.876.057.591.749.662.058.455.368.2
CASIA_IVA_DRANet-101_NoCoarseyesyesnonononononononononononoAnonymous
n/a66.176.958.292.350.265.558.656.570.4
Hyundai Mobis AD LabyesyesyesyesnonononononononononoHyundai Mobis AD Lab, DL-DB Group, AA (Automated Annotator) Team
n/a65.073.755.392.252.261.363.954.667.0
EFRNet-13yesyesnonononononononononononoAnonymous
0.014643.656.232.085.423.438.636.729.847.0
FarSee-Netyesyesnononononononono22nonoFarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolutionZhanpeng Zhang and Kaipeng ZhangIEEE International Conference on Robotics and Automation (ICRA) 2020FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution

Real-time semantic segmentation is desirable in many robotic applications with limited computation resources. One challenge of semantic segmentation is to deal with the object scale variations and leverage the context. How to perform multi-scale context aggregation within limited computation budget is important. In this paper, firstly, we introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP). It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information. On the other hand, for runtime efficiency, state-of-the-art methods will quickly decrease the spatial size of the inputs or feature maps in the early network stages. The final high-resolution result is usually obtained by non-parametric up-sampling operation (e.g. bilinear interpolation). Differently, we rethink this pipeline and treat it as a super-resolution process. We use optimized super-resolution operation in the up-sampling step and improve the accuracy, especially in sub-sampled input image scenario for real-time applications. By fusing the above two improvements, our methods provide better latency-accuracy trade-off than the other state-of-the-art methods. In particular, we achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single NVIDIA Titan X (Maxwell) GPU card. The proposed module can be plugged into any feature extraction CNN and benefits from the CNN structure development.
0.011939.353.929.086.119.330.228.123.444.7
C3Net [2,3,7,13]nononononononononono22yesyesC3: Concentrated-Comprehensive Convolution and its application to semantic segmentationHyojin Park, Youngjoon Yoo, Geonseok Seo, Dongyoon Han, Sangdoo Yun, Nojun Kwak
n/a37.352.128.483.818.329.623.720.642.2
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
n/a58.570.655.186.734.555.255.752.057.8
EKENetyesyesnonononononononononononoAnonymous
0.022944.156.031.885.324.241.237.830.146.7
SPSSNyesyesnonononononononononononoAnonymousStage Pooling Semantic Segmentation Network
n/a41.855.332.285.922.035.929.728.545.0
FC-HarDNet-70yesyesnonononononononononoyesyesHarDNet: A Low Memory Traffic NetworkPing Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, Youn-Long LinICCV 2019Fully Convolutional Harmonic DenseNet 70
U-shape encoder-decoder structure with HarDNet blocks
Trained with single scale loss at stride-4
validation mIoU=77.7
0.01551.465.043.089.234.045.237.438.858.4
BFPyesyesnonononononononononononoBoundary-Aware Feature Propagation for Scene SegmentationHenghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang WangIEEE International Conference on Computer Vision (ICCV), 2019Boundary-Aware Feature Propagation for Scene Segmentation
n/a62.372.152.491.249.958.459.549.965.1
FasterSegyesyesnonononononononononoyesyesFasterSeg: Searching for Faster Real-time Semantic SegmentationWuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang WangICLR 2020We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, which have recently been found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization that effectively overcomes the observed phenomenon that the searched networks are prone to "collapsing" to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model's accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.
0.0061344.361.037.987.621.034.328.530.553.9
VCD-NoCoarseyesyesnonononononononononononoAnonymous
n/a64.275.557.091.446.959.761.553.768.0
NAVINFO_DLRyesyesyesyesnonononononononononopengfei zhangweighted aspp+ohem+hard region refine
n/a65.676.557.891.453.261.861.254.668.3
LBPSSyesyesnononononononono22nonoAnonymousCVPR 2020 submission #5455
0.934.452.022.282.314.526.920.914.841.3
KANet_Res101yesyesnonononononononononononoAnonymous
n/a63.174.155.791.549.057.057.852.667.1
Learnable Tree Filter V2yesyesnonononononononononoyesyesRethinking Learnable Tree Filter for Generic Feature TransformLin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, Nanning ZhengNeurIPS 2020Based on ResNet-101 backbone and FPN architecture.
n/a64.076.357.491.549.258.456.654.467.9
GPSNetyesyesnonononononononononononoAnonymous
n/a62.472.955.191.344.758.357.952.167.2
FTFNetyesyesyesyesnonononononononononoAnonymousAn Efficient Network Focused on Tiny Feature Maps for Real-Time Semantic Segmentation
0.008845.857.235.486.126.441.239.331.050.1
iFLYTEK-CVyesyesyesyesnonononononononononoAnonymousiFLYTEK Research, CV Group
n/a64.973.955.791.851.361.562.954.767.8
F2MF-shortyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature Motion Josip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 3 timesteps into the future.
n/a43.646.331.377.730.443.047.030.443.1
HPNetyesyesnonononononononononononoHigh-Order Paired-ASPP Networks for Semantic SegmentationYu Zhang, Xin Sun, Junyu Dong, Changrui Chen, Yue Shen
n/a60.670.551.691.045.358.153.249.565.4
HANet (fine-train only)yesyesnonononononononononononoTBAAnonymousWe use only fine-training data.
n/a58.668.848.691.539.054.155.947.863.3
F2MF-midyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature MotionJosip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 9 timesteps into the future.
n/a32.931.717.165.524.134.341.820.628.3
EMANetyesyesnonononononononononononoExpectation Maximization Attention Networks for Semantic SegmentationXia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, Hong LiuICCV 2019
n/a61.371.353.690.444.559.654.051.166.1
PartnerNetyesyesnonononononononononononoAnonymousPARTNERNET: A LIGHTWEIGHT AND EFFICIENT PARTNER NETWORK FOR SEMANTIC SEGMENTATION
0.005848.361.040.186.828.842.636.234.956.0
SwiftNet RN18 pyr sepBN MVDyesyesnonononononononononoyesyesEfficient semantic segmentation with pyramidal fusionM Oršić, S ŠegvićPattern Recognition 2020
0.02952.966.243.690.033.248.342.341.658.1
Tencent YYB VisualAlgoyesyesyesyesnonononononononononoAnonymousTencent YYB VisualAlgo Group
n/a64.872.855.592.252.461.763.153.866.7
MoKu LabyesyesnonononononononononononoAnonymousAlibaba, MoKu AI Lab, CV Group
n/a65.174.655.992.450.663.062.653.768.1
HRNetV2 + OCR + SegFixyesyesyesyesnonononononononoyesyesObject-Contextual Representations for Semantic SegmentationYuhui Yuan, Xilin Chen, Jingdong WangFirst, we pre-train "HRNet+OCR" method on the Mapillary training set (achieves 50.8% on the Mapillary val set). Second, we fine-tune the model with the Cityscapes training, validation and coarse set. Finally, we apply the "SegFix" scheme to further improve the results.
n/a65.976.356.991.951.365.262.754.968.3
DecoupleSegNetyesyesnonononononononononoyesyesImproving Semantic Segmentation via Decoupled Body and Edge SupervisionXiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai TongECCV-2020In this paper, we propose a new paradigm for semantic segmentation. Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image. To do so, we first warp the image feature by learning a flow field to make the object part more consistent. The resulting body feature and the residual edge feature are further optimized under decoupled supervision by explicitly sampling different parts (body or edge) pixels. The code and models have been released.
n/a64.472.155.692.050.961.762.154.466.6
LGE A&B Center: HANet (ResNet-101)yesyesnonononononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val", No coarse, Backbone: ImageNet pretrained ResNet-101
n/a62.271.454.791.646.260.557.050.665.5
DCNASyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationXiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Wenqi RenNeural Architecture Search (NAS) has shown great potential in automatically designing scalable network architectures for dense image predictions. However, existing NAS algorithms usually compromise on a restricted search space and search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between target and proxy datasets, we propose a Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module to reduce the memory consumption of ample search space.
n/a65.074.657.091.246.161.565.355.768.7
GPNet-ResNet101yesyesnonononononononononononoAnonymous
n/a62.473.154.291.746.859.154.552.267.3
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a57.369.055.185.237.354.452.350.754.4
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a64.071.258.287.051.364.163.857.958.6
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.

n/a57.568.355.584.839.855.951.251.053.5
LGE A&B Center: HANet (ResNext-101)yesyesyesyesnonononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val + coarse", Backbone: Mapillary pretrained ResNext-101
n/a62.670.655.191.847.761.156.253.265.4
ERINet-v2yesyesnonononononononononononoEfficient Residual Inception NetworkMINJONG KIM, SUYOUNG CHIongoing
0.0052631639.255.429.585.318.031.626.122.345.8
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results on all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
n/a68.875.562.189.057.271.067.262.266.3
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a66.072.059.787.551.769.366.759.661.2
TUE-5LSM0-g23yesyesyesyesnonononononononononoAnonymousDeeplabv3+decoder
n/a42.453.832.682.923.437.832.928.847.1
PBRNetyesyesnonononononononononononoAnonymousmodified MobileNetV2 backbone + Prediction and Boundary attention-based Refinement Module (PBRM)
0.010751.965.946.089.433.945.230.944.659.3
ResNeSt200yesyesnonononononononononononoResNeSt: Split-Attention NetworksHang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander SmolaDeepLabV3+ network with ResNeSt200 backbone.
n/a63.071.953.791.847.263.458.952.165.0
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
n/a68.775.861.990.157.669.564.161.768.4
EaNet-V1yesyesnonononononononononononoParsing Very High Resolution Urban Scene Images by Learning Deep ConvNets with Edge-Aware LossXianwei Zheng, Linxi Huan, Gui-Song Xia, Jianya GongParsing very high resolution (VHR) urban scene images into regions with semantic meaning, e.g. buildings and cars, is a fundamental task necessary for interpreting and understanding urban scenes. However, due to the huge quantity of details contained in an image and the large variations of objects in scale and appearance, the existing semantic segmentation methods often break one object into pieces, or confuse adjacent objects and thus fail to depict these objects consistently. To address this issue, we propose a concise and effective edge-aware neural network (EaNet) for urban scene semantic segmentation. The proposed EaNet model is deployed as a standard balanced encoder-decoder framework. Specifically, we devised two plug-and-play modules that append on top of the encoder and decoder respectively, i.e., the large kernel pyramid pooling (LKPP) and the edge-aware loss (EA loss) function, to extend the model ability in learning discriminating features. The LKPP module captures rich multi-scale context with strong continuous feature relations to promote coherent labeling of multi-scale urban objects. The EA loss module learns edge information directly from semantic segmentation prediction, which avoids costly post-processing or extra edge detection. During training, EA loss imposes a strong geometric awareness to guide object structure learning at both the pixel- and image-level, and thus effectively separates confusing objects with sharp contours.
n/a59.665.949.490.744.954.858.948.863.3
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a65.275.756.792.050.064.461.553.967.6
FSFNetyesyesnonononononononononoyesyesAccelerator-Aware Fast Spatial Feature Network for Real-Time Semantic SegmentationMinjong Kim, Byungjae Park, Suyoung ChiIEEE AccessSemantic segmentation is performed to understand an image at the pixel level; it is widely used in the field of autonomous driving. In recent years, deep neural networks have achieved good accuracy; however, there exist few models that have a good trade-off between high accuracy and low inference time. In this paper, we propose a fast spatial feature network (FSFNet), an optimized lightweight semantic segmentation model using an accelerator, offering high performance as well as faster inference speed than current methods. FSFNet employs the FSF and MRA modules. The FSF module has three different types of subset modules to extract spatial features efficiently. They are designed in consideration of the size of the spatial domain. The multi-resolution aggregation (MRA) module combines features that are extracted at different resolutions to reconstruct the segmentation image accurately. Our approach is able to run at over 203 FPS at full resolution (1024 x 2048) on a single NVIDIA 1080Ti GPU, and obtains a result of 69.13% mIoU on the Cityscapes test dataset. Compared with existing models in real-time semantic segmentation, our proposed model retains remarkable accuracy while having high FPS that is over 30% faster than the state-of-the-art model. The experimental results proved that our model is an ideal approach for the Cityscapes dataset.
0.004926143.059.034.486.722.433.729.027.851.1
Hierarchical Multi-Scale Attention for Semantic SegmentationyesyesyesyesnonononononononoyesyesHierarchical Multi-Scale Attention for Semantic SegmentationAndrew Tao, Karan Sapra, Bryan Catanzaro Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes which leads to greater model accuracy. We demonstrate the result of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach we achieve new state-of-the-art results in both Mapillary (61.1 IOU val) and Cityscapes (85.4 IOU test).
n/a70.478.463.592.758.770.065.662.372.3
SANetyesyesnononononononono44nonoAnonymous
25.059.669.949.691.640.155.457.249.064.1
SJTU_hpmyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu*, Haohua Zhao, and Liqing Zhang
n/a59.168.348.890.845.455.649.551.362.8
FANetyesyesnonononononononononononoFANet: Feature Aggregation Network for Semantic SegmentationTanmay Singha, Duc-Son Pham, and Aneesh KrishnaFeature Aggregation Network for Semantic Segmentation
n/a33.242.122.381.814.226.420.820.138.0
Hard Pixel Mining for Depth Privileged Semantic SegmentationyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu, Haohua Zhao, and Liqing ZhangSemantic segmentation has achieved remarkable progress but remains challenging due to the complex scene, object occlusion, and so on. Some research works have attempted to use extra information such as a depth map to help RGB based semantic segmentation because the depth map could provide complementary geometric cues. However, due to the inaccessibility of depth sensors, depth information is usually unavailable for the test images. In this paper, we leverage only the depth of training images as the privileged information to mine the hard pixels in semantic segmentation, in which depth information is only available for training images but not available for test images. Specifically, we propose a novel Loss Weight Module, which outputs a loss weight map by employing two depth-related measurements of hard pixels: Depth Prediction Error and Depth-aware Segmentation Error. The loss weight map is then applied to segmentation loss, with the goal of learning a more robust model by paying more attention to the hard pixels. Besides, we also explore a curriculum learning strategy based on the loss weight map. Meanwhile, to fully mine the hard pixels on different scales, we apply our loss weight module to multi-scale side outputs. Our hard pixel mining method achieves state-of-the-art results on three benchmark datasets, and even outperforms the methods which need depth input during testing.
n/a65.274.057.891.949.463.860.256.468.5
MSeg1080_RVCyesyesnonononononononononoyesyesMSeg: A Composite Dataset for Multi-domain Semantic SegmentationJohn Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen KoltunCVPR 2020We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. A naive merge of the constituent datasets yields poor performance due to inconsistent taxonomies and annotation practices. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images, requiring more than 1.34 years of collective annotator effort. The resulting composite dataset enables training a single semantic segmentation model that functions effectively across domains and generalizes to datasets that were not seen during training. We adopt zero-shot cross-dataset transfer as a benchmark to systematically evaluate a model’s robustness and show that MSeg training yields substantially more robust models in comparison to training on individual datasets or naive mixing of datasets without the presented contributions.
0.4957.769.947.890.443.557.947.346.458.1
SA-Gate (ResNet-101,OS=16)yesyesnonononoyesyesnonononoyesyesBi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic SegmentationXiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang ZengEuropean Conference on Computer Vision (ECCV), 2020RGB+HHA input, input resolution = 800x800, output stride = 16, training 240 epochs, no coarse data is used.
n/a63.575.255.491.649.259.855.254.866.5
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
n/a44.857.634.277.240.046.422.039.541.6
HRNet + LKPP + EA lossyesyesnonononononononononononoAnonymous
n/a61.866.651.091.347.061.660.651.464.7
SN_RN152pyrx8_RVCyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
1.050.563.736.688.842.347.932.338.254.3
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
n/a33.551.17.486.726.630.012.115.838.2
AttaNet_lightyesyesnonononononononononoyesyesAttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing(AAAI21)Anonymous
n/a42.958.033.588.222.534.527.228.451.1
CFPNetyesyesnonononononononononononoAnonymous
n/a43.758.837.086.323.937.925.727.652.6
Seg_UJSyesyesnonononononononononononoAnonymous
n/a69.077.160.791.957.670.662.061.270.7
Bilateral_attention_semanticyesyesnonononononononononononoAnonymouswe use bilateral attention mechanism for semantic segmentation
0.014155.969.747.389.734.850.647.244.064.2
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a62.274.960.789.245.957.150.856.462.3
ESANet RGB-D (small input)yesyesnonononoyesyesnono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data with half the input resolution.
0.042744.956.135.487.423.038.139.632.147.6
ESANet RGB (small input)yesyesnononononononono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images with half the input resolution.
0.03140.549.631.085.623.333.333.025.842.5
ESANet RGB-DyesyesnonononoyesyesnonononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data.
0.161356.468.448.790.737.049.848.746.261.9
DAHUA-ARIyesyesyesyesnonononononononononoAnonymousmulti-scale and refineNet
n/a70.678.363.292.758.971.465.762.672.4
ESANet RGByesyesnonononononononononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images only.
0.120553.164.143.689.034.146.745.842.958.8
DCNAS+ASPP [Mapillary Vistas]yesyesyesyesnonononononononononoAnonymousExisting NAS algorithms usually compromise on a restricted search space or search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy settings, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of ample search space, hence favoring proxyless searching. Compared with contemporary works, experiments reveal that the proxyless searching scheme is capable of bridging the gap between searching and training environments.
more details
n/a70.078.362.892.658.071.562.861.971.8
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
more details
n/a68.676.763.490.255.269.565.161.966.3
DCNAS+ASPPyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationAnonymousExisting NAS algorithms usually compromise on restricted search space or search on proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy setting, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of ample search space, hence favor the proxyless searching.
more details
n/a68.577.361.092.555.569.762.159.370.7
ddl_segyesyesnonononononononononononoAnonymous
more details
n/a69.477.160.792.057.870.865.261.170.7
CABiNetyesyesnonononononononononononoCABiNet: Efficient Context Aggregation Network for Low-Latency Semantic SegmentationSaumya Kumaar, Ye Lyu, Francesco Nex, Michael Ying YangWith the increasing demand of autonomous machines, pixel-wise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for any potential real-time applications. In this paper, we propose CABiNet (Context Aggregated Bi-lateral Network), a dual branch convolutional neural network (CNN), with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing multi-branch architectures for high-speed semantic segmentation, we design a cheap high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. Specifically, we achieve 76.6% and 75.9% mIOU on Cityscapes validation and test sets respectively, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. Codes and training models will be made publicly available.
more details
0.01349.061.341.790.927.938.332.734.864.2
Margin calibrationyesyesyesyesnonononononononononoAnonymousThe model is DeepLab v3+ backend on SEResNeXt50. We used the margin calibration with log-loss as the learning objective.
more details
n/a62.572.852.191.646.561.660.549.865.4
MT-SSSRyesyesnononononononono22nonoAnonymous
more details
n/a57.673.650.790.338.653.045.844.064.8
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
more details
n/a71.279.464.391.360.073.268.264.668.5
DSANet: Dilated Spatial Attention for Real-time Semantic Segmentation in Urban Street ScenesyesyesnonononononononononononoAnonymousWe present a computationally efficient network named DSANet, which follows a two-branch strategy to tackle the problem of real-time semantic segmentation in urban scenes. We first design a Context branch, which employs Depth-wise Asymmetric ShuffleNet (DAS) as the main building block to acquire sufficient receptive fields. In addition, we propose a dual attention module consisting of dilated spatial attention and channel attention to make full use of the multi-level feature maps simultaneously, which helps predict the pixel-wise labels in each stage. Meanwhile, a Spatial Encoding Network is used to enhance semantic information by preserving the spatial details. Finally, to better combine context information and spatial information, we introduce a Simple Feature Fusion Module to combine the features from the two branches.
more details
n/a42.961.234.685.723.839.224.527.646.9
UJS_modelyesyesnonononononononononononoAnonymous
more details
n/a70.578.063.092.460.072.965.261.371.1
Mobilenetv3-small-backbone real-time segmentationyesyesnonononononononononoyesyesAnonymousThe model is a dual-path network with mobilenetv3-small backbone. PSP module was used as the context aggregation block. We also use feature fusion module at x16, x32. The features of the two branches are then concatenated and fused with a bottleneck conv.
Only the training data is used to train the model; validation data is excluded. Evaluation was done on single-scale input images.
more details
0.0237.852.129.183.917.027.920.123.149.2
M2FANetyesyesnonononononononononoyesyesUrban street scene analysis using lightweight multi-level multi-path feature aggregation networkTanmay Singha; Duc-Son Pham; Aneesh KrishnaMultiagent and Grid Systems Journal
more details
n/a38.752.730.484.718.429.923.423.646.7
AFPNetyesyesnonononononononononononoAnonymous
more details
0.0350.163.341.188.230.542.442.636.256.1
YOLO V5s with Segmentation Headyesyesnononononononono22yesyesAnonymousMultitask model, fine-tuned from a COCO detection pretrained model; semantic segmentation and object detection (transferred from instance labels) are trained at the same time.
more details
0.00746.356.036.486.128.538.840.932.950.5
FSFFNetyesyesyesyesnonononononononoyesyesA Lightweight Multi-scale Feature Fusion Network for Real-Time Semantic SegmentationTanmay Singha, Duc-Son Pham, Aneesh Krishna, Tom GedeonInternational Conference on Neural Information Processing 2021Feature Scaling Feature Fusion Network
more details
n/a40.454.031.585.022.332.026.526.145.5
Qualcomm AI ResearchyesyesyesyesnonononononononoyesyesInverseForm: A Loss Function for Structured Boundary-Aware SegmentationShubhankar Borse, Ying Wang, Yizhe Zhang, Fatih PorikliCVPR 2021 oral
more details
n/a72.078.964.892.661.873.268.263.273.0
HIK-CCSLTyesyesyesyesnonononononononononoAnonymous
more details
n/a70.377.963.192.556.269.767.263.971.5
BFNetyesyesnonononononononononononoBFNetJiaqi Fan
more details
n/a43.157.834.586.024.837.729.530.344.1
Hai Wang+Yingfeng Cai-research groupyesyesnonononononononononononoAnonymous
more details
0.0016470.177.162.992.259.572.764.360.970.9
Jiangsu_university_Intelligent_Drive_AIyesyesnonononononononononononoAnonymous
more details
n/a70.177.162.992.259.572.764.360.970.9
MCANetyesyesyesyesnonononononononoyesyesAnonymous
more details
n/a45.858.536.287.727.241.529.732.652.7
UFONet (half-resolution)yesyesnonononononononononononoUFO RPN: A Region Proposal Network for Ultra Fast Object DetectionWenkai Li, Andy SongThe 34th Australasian Joint Conference on Artificial Intelligence
more details
n/a22.135.25.477.92.110.76.48.231.1
SCMNetyesyesnonononononononononononoAnonymous
more details
n/a37.152.227.684.916.931.022.217.844.2
FsaNetyesyesyesyesnonononononononononoFsaNet: Frequency Self-attention for Semantic SegmentationAnonymous
more details
n/a62.269.553.490.846.860.363.048.165.5
SCMNet coarseyesyesyesyesnonononononononoyesyesSCMNet: Shared Context Mining Network for Real-time Semantic SegmentationTanmay Singha; Moritz Bergemann; Duc-Son Pham; Aneesh Krishna2021 Digital Image Computing: Techniques and Applications (DICTA)
more details
n/a38.353.728.785.618.231.223.220.445.3
SAIT SeeThroughNetyesyesyesyesnonononononononononoAnonymous
more details
n/a71.578.765.292.661.772.464.863.473.4
JSU_IDT_groupyesyesnonononononononononononoAnonymous
more details
n/a69.077.563.392.448.274.063.561.671.9
DLA_HRNet48OCR_MSFLIP_000yesyesyesyesnonononononononononoAnonymousThis set of predictions is from DLA (differentiable lattice assignment network) with "HRNet48+OCR-Head" as the base segmentation model. The model is first trained on coarse data and then trained on the fine-annotated train/val sets. A multi-scale (0.5, 0.75, 1.0, 1.25, 1.5, 1.75) and flip scheme is adopted during inference.
more details
n/a68.676.962.092.355.567.963.260.170.7
MYBank-AIoTyesyesyesyesnonononononononononoAnonymous
more details
n/a72.979.364.992.162.876.269.065.673.5
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
more details
n/a65.976.158.489.454.467.060.458.762.4
LeapAIyesyesyesyesnonononononononononoAnonymousUsing advanced AI techniques.
more details
n/a70.976.763.592.662.972.663.463.871.7
adlab_iiau_ldzyesyesnonononononononononononoAnonymousmeticulous-caiman_2022.05.01_03.32
more details
n/a69.577.962.592.758.270.261.561.371.9
SFRSegyesyesnonononononononononoyesyesA Real-Time Semantic Segmentation Model Using Iteratively Shared Features In Multiple Sub-EncodersTanmay Singha, Duc-Son Pham, Aneesh KrishnaPattern Recognition
more details
n/a41.350.229.882.424.136.833.827.146.5
PIDNet-SyesyesnonononononononononoyesyesPIDNet: A Real-time Semantic Segmentation Network Inspired from PID ControllerAnonymous
more details
0.010755.067.346.689.232.347.153.143.360.9
Vision Transformer Adapter for Dense PredictionsyesyesnonononononononononoyesyesVision Transformer Adapter for Dense PredictionsZhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu QiaoViT-Adapter-L, BEiT pre-train, multi-scale testing
more details
n/a68.375.958.991.554.668.266.760.469.9
SSNetyesyesnonononoyesyesnonononononoAnonymous
more details
n/a51.767.648.088.632.143.931.742.659.0
SDBNetyesyesnonononononononononoyesyesSDBNet: Lightweight Real-time Semantic Segmentation Using Short-term Dense BottleneckTanmay Singha, Duc-Son Pham, Aneesh Krishna2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)
more details
n/a42.054.732.185.521.234.231.827.848.9
MeiTuan-BaseModelyesyesyesyesnonononononononononoAnonymous
more details
n/a73.278.965.891.762.474.872.865.973.3
SDBNetV2yesyesnonononononononononoyesyesImproved Short-term Dense Bottleneck network for efficient scene analysisTanmay Singha; Duc-Son Pham; Aneesh KrishnaComputer Vision and Image Understanding
more details
n/a43.856.135.486.126.036.231.829.549.7
mogo_semanticyesyesnonononononononononononoAnonymous
more details
n/a69.678.163.392.556.469.164.560.672.3
UDSSEG_RVCyesyesnonononononononononononoAnonymousUDSSEG_RVC
more details
n/a55.964.942.789.945.455.050.242.556.7
MIX6D_RVCyesyesnonononononononononononoAnonymousMIX6D_RVC
more details
n/a58.664.145.186.045.958.263.946.758.9
FAN_NV_RVCyesyesnonononononononononononoAnonymousHybrid-Base + Segformer
more details
n/a59.067.246.290.644.258.455.149.360.9
UNIV_CNP_RVCyesyesnonononononononononononoAnonymousRVC 2022
more details
n/a50.257.542.783.336.854.349.932.644.9
AntGroup-AI-VisionAlgoyesyesyesyesnonononoyesyesnonononoAnonymousAntGroup AI vision algo
more details
n/a71.377.363.391.962.572.369.164.469.9
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsyesyesyesyesnonononononononoyesyesInternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsWenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu QiaoCVPR 2023We use Mask2Former as the segmentation framework, and initialize our InternImage-H model with the pre-trained weights on the 427M joint dataset of public Laion-400M, YFCC-15M, and CC12M. Following common practices, we first pre-train on Mapillary Vistas for 80k iterations, and then fine-tune on Cityscapes for 80k iterations. The crop size is set to 1024×1024 in this experiment. As a result, our InternImage-H achieves 87.0 multi-scale mIoU on the validation set, and 86.1 multi-scale mIoU on the test set.
more details
n/a73.678.866.890.866.976.969.266.273.3
Dense Prediction with Attentive Feature aggregationyesyesyesyesnonononononononoyesyesDense Prediction with Attentive Feature AggregationYung-Hsu Yang, Thomas E. Huang, Min Sun, Samuel Rota Bulò, Peter Kontschieder, Fisher YuWACV 2023We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.
more details
n/a65.173.557.291.151.164.057.357.969.0
W3_FAFMyesyesnonononononononononononoJunyan Yang, Qian Xu, Lei LaTeam: BOSCH-XC-DX-WAVE3
more details
0.02930959.371.152.990.340.154.054.346.365.0
HRNyesyesnonononononononononononoAnonymousHierarchical residual network
more details
45.053.466.344.889.033.444.249.942.257.7
HRN+DCNv2_for_DOASyesyesnonononononononononononoAnonymousHRN with DCNv2 for DOAS in paper "Dynamic Obstacle Avoidance System based on Rapid Instance Segmentation Network"
more details
0.03259.872.051.990.340.352.955.748.666.5
GEELY-ATC-SEGyesyesyesyesnonononononononononoAnonymous
more details
n/a73.979.165.191.366.874.873.566.873.8
PMSDSENyesyesnonononononononononoyesyesEfficient Parallel Multi-Scale Detail and Semantic Encoding Network for Lightweight Semantic SegmentationXiao Liu, Xiuya Shi, Lufei Chen, Linbo Qing, Chao RenACM International Conference on Multimedia 2023MM '23: Proceedings of the 31st ACM International Conference on Multimedia
more details
n/a49.963.740.688.129.445.043.834.653.7
ECFDyesyesnonononononononononoyesyesAnonymousbackbone: ConvNext-Large
more details
n/a63.674.055.290.848.361.955.656.067.0
DWGSeg-L75yesyesnononononononono1.31.3nonoAnonymous
more details
0.0075550.362.542.788.529.344.139.938.956.3
VLTSegyesyesnonononononononononononoVLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic SegmentationChristoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk
more details
n/a73.379.466.591.463.073.371.567.973.4
CGMANet_v1yesyesnonononononononononononoContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-sceneSaquib MazharContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-scene
more details
n/a48.363.740.186.926.541.834.736.956.0
SERNet-Former_v2yesyesyesyesnonononononononononoAnonymous
more details
n/a67.172.758.889.753.366.067.860.368.2

IoU on category-level

name | fine | fine | coarse | coarse | 16-bit | 16-bit | depth | depth | video | video | sub | sub | code | code | title | authors | venue | description | Runtime [s] | average | flat | nature | object | sky | construction | human | vehicle
FCN 8syesyesnonononononononononoyesyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
more details
0.585.798.291.157.093.989.678.691.3
RRR-ResNet152-MultiScaleyesyesyesyesnonononononononononoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
more details
n/a89.398.592.569.095.092.383.294.3
Dilation10yesyesnonononononononononoyesyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
more details
4.086.598.391.460.593.790.279.891.8
AdelaideyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
more details
35.082.897.889.748.292.288.773.189.6
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono22yesyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
more details
4.081.397.889.040.492.888.270.990.0
DeepLab LargeFOV Strongyesyesnononononononono22yesyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
more details
4.081.297.889.040.492.788.071.089.7
DPNyesyesyesyesnononononono33nonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
more details
n/a79.597.388.037.793.986.665.487.7
Segnet basicyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0679.197.486.742.591.883.864.787.2
Segnet extendedyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
more details
0.0679.897.587.143.791.782.868.687.5
CRFasRNNyesyesnononononononono22yesyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
more details
0.782.797.790.346.593.588.573.688.9
Scale invariant CNN + CRFyesyesnonononoyesyesnonononoyesyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
more details
n/a85.097.290.259.992.289.078.288.4
DPNyesyesnonononononononononononoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
more details
n/a86.098.291.158.994.589.878.491.2
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
more details
n/a85.998.290.859.393.589.279.291.1
Adelaide_contextyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
more details
n/a87.398.491.760.894.190.982.093.3
NVSegNetyesyesnonononononononononononoAnonymousDuring inference, we use the image at 2 different scales; the same is done during training.
more details
0.487.298.491.663.594.690.580.292.0
ENetyesyesnononononononono22yesyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
more details
0.01380.497.388.346.890.685.465.588.9
DeepLabv2-CRFyesyesnonononononononononoyesyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
more details
n/a86.498.391.557.394.290.880.292.6
m-TCFsyesyesyesyesnonononononononononoAnonymousConvolutional Neural Network
more details
1.087.698.491.963.294.591.580.393.2
DeepLab+DynamicCRFyesyesnonononononononononononoru.nl
more details
n/a83.797.389.448.293.688.877.191.1
LRR-4xyesyesnonononononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
more details
n/a88.298.492.266.294.791.182.492.5
LRR-4xyesyesyesyesnonononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
more details
n/a88.498.492.266.995.091.581.993.1
Le_Selfdriving_VGGyesyesnonononononononononononoAnonymous
more details
n/a84.498.089.954.793.488.975.290.7
SQyesyesnonononononononononononoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
more details
0.0684.396.790.457.093.087.575.689.9
SAITyesyesyesyesnonononononononononoAnonymousAnonymous
more details
4.089.698.692.969.195.292.883.694.6
FoveaNetyesyesnonononononononononononoFoveaNetXin Li, Jiashi Feng1.caffe-master
2.resnet-101
3.single scale testing

Previously listed as "LXFCRN".
more details
n/a89.398.592.469.894.591.984.393.4
RefineNetyesyesnonononononononononoyesyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
more details
n/a87.998.491.963.894.891.781.393.6
SegModelyesyesnonononononononononononoAnonymousBoth train set (2975) and val set (500) are used to train model for this submission.
more details
0.889.898.693.068.195.593.385.295.0
TuSimpleyesyesnonononononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
more details
n/a90.198.693.071.995.293.084.894.5
Global-Local-RefinementyesyesnonononononononononononoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently published and similarly named approach, the submission name was updated.
more details
n/a90.098.793.070.195.493.185.294.8
XPARSEyesyesnonononononononononononoAnonymous
more details
n/a88.798.592.466.595.291.982.893.7
ResNet-38yesyesnonononononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
more details
n/a90.998.793.473.495.593.587.095.1
SegModelyesyesyesyesnonononononononononoAnonymous
more details
n/a90.498.793.271.295.593.585.395.2
Deep Layer Cascade (LC)yesyesnonononononononononononoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
more details
n/a88.198.492.164.594.291.582.593.5
FRRNyesyesnononononononono22yesyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
more details
n/a88.998.592.368.494.991.882.593.8
MNet_MPRGyesyesnonononononononononononoChubu University, MPRGwithout val dataset, external dataset (e.g. image net) and post-processing
more details
0.689.398.592.470.494.791.983.393.6
ResNet-38yesyesyesyesnonononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
more details
n/a91.098.793.473.695.593.686.995.5
FCN8s-QunjieYuyesyesnonononononononononononoAnonymous
more details
n/a81.896.591.235.093.388.377.990.1
RGB-D FCNyesyesyesyesnonoyesyesnonononononoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
more details
n/a87.598.491.664.394.790.980.392.3
MultiBoostyesyesyesyesnonoyesyesnono22nonoAnonymousBoosting based solution.
Publication is under review.
more details
0.2581.997.588.750.990.387.171.487.3
GoogLeNet FCNyesyesnonononononononononononoGoing Deeper with ConvolutionsChristian Szegedy , Wei Liu , Yangqing Jia , Pierre Sermanet , Scott Reed , Dragomir Anguelov , Dumitru Erhan , Vincent Vanhoucke , Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
more details
n/a85.898.290.958.693.789.578.491.2
ERFNet (pretrained)yesyesnononononononono22yesyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


more details
0.0287.398.291.565.194.290.678.992.3
ERFNet (from scratch)yesyesnononononononono22yesyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining nor coarse labels
more details
0.0286.598.291.162.494.290.177.491.9
TuSimple_CoarseyesyesyesyesnonononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the
PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
more details
n/a90.798.793.173.195.493.486.095.4
SAC-multipleyesyesnonononononononononononoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
more details
n/a90.698.793.171.895.693.486.195.3
NetWarpyesyesyesyesnonononoyesyesnonononoAnonymous
more details
n/a91.098.693.274.895.393.586.695.3
depthAwareSeg_RNN_ffyesyesnonononononononononoyesyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
more details
n/a89.798.692.868.394.893.085.595.0
Ladder DenseNetyesyesnonononononononononoyesyesLadder-style DenseNets for Semantic Segmentation of Large Natural ImagesIvan Krešo, Josip Krapac, Siniša ŠegvićICCV 2017https://ivankreso.github.io/publication/ladder-densenet/
more details
0.4589.798.392.171.195.592.384.593.9
Real-time FCNyesyesyesyesnonononononononononoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
more details
0.04487.998.491.564.394.791.481.693.7
GridNetyesyesnonononononononononononoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
more details
n/a87.998.492.165.593.890.981.892.5
PEARLyesyesnonononononoyesyesnonononoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
more details
n/a89.298.592.667.695.292.383.794.2
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnonononononononoyesyesAnonymous
more details
0.6986.598.391.061.994.490.278.491.6
PSPNetyesyesyesyesnonononononononoyesyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
more details
n/a91.298.793.374.295.393.887.195.7
motovisyesyesyesyesnonononononononononomotovis.com
more details
n/a91.598.793.475.495.893.887.395.8
ML-CRNNyesyesnonononononononononononoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
more details
n/a87.798.391.864.794.691.280.792.7
Hybrid ModelyesyesnonononononononononononoAnonymous
more details
n/a85.298.190.856.891.889.678.191.1
tek-IflyyesyesnonononononononononononoIflytekIflytek-yinUsing a fusion strategy of three single models; the best result of a single model is 80.01%; multi-scale
more details
n/a90.998.793.472.995.693.786.695.6
GridNetyesyesnonononononononononoyesyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set without extra coarse annotated data (only 2975 images) and no pre-training (ImageNet) nor pre or post-processing.
more details
n/a88.198.492.166.293.891.182.392.6
firenetyesyesnononononononono22nonoAnonymous
more details
n/a84.995.089.860.692.187.277.691.7
DeepLabv3yesyesyesyesnonononononononononoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects
at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment ASPP module with image-level features encoding global context and further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
more details
n/a91.698.793.576.095.993.987.995.7
EdgeSenseSegyesyesnonononononononononononoAnonymousDeep segmentation network with hard negative mining and other tricks.
more details
n/a89.898.692.968.995.092.985.694.5
ScaleNetyesyesyesyesnonononononononononoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krauß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation.
Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a
scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling
field size accordingly. Our model is compact and can be extended easily to other research domains. Finally,
the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate
on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
more details
n/a89.698.592.869.994.692.984.194.4
K-netyesyesnonononononononononononoXinLiang Zhong
more details
n/a88.898.592.366.494.692.483.094.4
MSNETyesyesnonononononononononononoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
more details
0.290.698.693.173.594.993.086.294.8
Multitask LearningyesyesnonononononononononononoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
more details
n/a89.998.593.069.895.193.285.394.7
DeepMotionyesyesnonononononononononononoAnonymousWe propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation.
more details
n/a90.798.793.372.195.493.686.295.5
SR-AICyesyesyesyesnonononononononononoAnonymous
more details
n/a91.398.793.475.195.594.086.995.7
Roadstar.ai_CV(SFNet)yesyesnonononononononononononoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi ChenSame focus net (SFNet), based only on fine labels, with a focus on the loss distribution and the same focus on every layer of the feature map
more details
0.291.098.793.473.895.393.587.095.3
DFNyesyesyesyesnonononononononononoLearning a Discriminative Feature Network for Semantic SegmentationChangqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, Nong SangarxivMost existing methods of semantic segmentation still suffer from two aspects of challenges: intra-class inconsistency and inter-class indistinction. To tackle these two problems, we propose a Discriminative Feature Network (DFN), which contains two sub-networks: Smooth Network and Border Network. Specifically, to handle the intra-class inconsistency problem, we specially design a Smooth Network with Channel Attention Block and global average pooling to select the more discriminative features. Furthermore, we propose a Border Network to make the bilateral features of boundary distinguishable with deep semantic boundary supervision. Based on our proposed DFN, we achieve state-of-the-art performance 86.2% mean IOU on PASCAL VOC 2012 and 80.3% mean IOU on Cityscapes dataset.
more details
n/a90.898.793.172.795.593.486.795.6
RelationNet_CoarseyesyesyesyesnonononononononononoRelationNet: Learning Deep-Aligned Representation for Semantic Image SegmentationYueqing ZhuangICPR Semantic image segmentation, which assigns labels in pixel level, plays a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning. However, one central problem of these methods is that deep convolution neural network gives little consideration to the correlation among pixels. To handle this issue, in this paper, we propose a novel deep neural network named RelationNet, which utilizes CNN and RNN to aggregate context information. Besides, a spatial correlation loss is applied to supervise RelationNet to align features of spatial pixels belonging to same category. Importantly, since it is expensive to obtain pixel-wise annotations, we exploit a new training method for combining the coarsely and finely labeled data. Separate experiments show the detailed improvements of each proposal. Experimental results demonstrate the effectiveness of our proposed method to the problem of semantic image segmentation.
more details
n/a91.898.893.676.195.894.187.995.9
ARSAITyesyesnonononononononononononoAnonymousanonymous
more details
1.089.098.592.467.695.292.183.293.9
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnonononononononoyesyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
more details
n/a91.298.693.374.495.693.886.995.4
EFBNETyesyesnonononononononononononoAnonymous
more details
n/a90.798.693.272.595.293.686.495.5
Ladder DenseNet v2yesyesnonononononononononononoJournal submissionAnonymousDenseNet-121 model used in downsampling path with ladder-style skip connections upsampling path on top of it.
more details
1.090.898.793.373.395.593.486.495.3
ESPNetyesyesnononononononono22yesyesESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh HajishirziWe introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.
more details
0.008982.295.589.552.992.586.769.888.4
ENet with the Lovász-Softmax lossyesyesnononononononono22yesyesThe Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networksMaxim Berman, Amal Rannen Triki, Matthew B. BlaschkoarxivThe Lovász-Softmax loss is a novel surrogate for optimizing the IoU measure in neural networks. Here we finetune the weights provided by the authors of ENet (arXiv:1606.02147) with this loss, for 10'000 iterations on training dataset. The runtimes are unchanged with respect to the ENet architecture.
more details
0.01383.698.089.654.592.787.672.889.7
DRN_CRL_CoarseyesyesyesyesnonononononononoyesyesDense Relation Network: Learning Consistent and Context-Aware Representation For Semantic Image SegmentationYueqing ZhuangICIPDRN_CoarseSemantic image segmentation, which aims at assigning pixel-wise category, is one of the challenging image understanding problems. Global context plays an important role in local pixel-wise category assignment. To make the best of global context, in this paper, we propose dense relation network (DRN) and context-restricted loss (CRL) to aggregate global and local information. DRN uses Recurrent Neural Network (RNN) with different skip lengths in spatial directions to get context-aware representations while CRL helps aggregate them to learn consistency. Compared with previous methods, our proposed method takes full advantage of hierarchical contextual representations to produce high-quality results. Extensive experiments demonstrate that our method achieves significant state-of-the-art performance on Cityscapes and Pascal Context benchmarks, with mean-IoU of 82.8% and 49.0% respectively.
more details
n/a91.898.893.776.095.894.288.196.1
ShuffleSegyesyesyesyesnonononononononononoShuffleSeg: Real-time Semantic Segmentation NetworkMostafa Gamal, Mennatullah Siam, Mo'men Abdel-RazekUnder Review by ICIP 2018ShuffleSeg: An efficient realtime semantic segmentation network with skip connections and ShuffleNet units
more details
n/a80.295.488.246.992.584.766.487.4
SkipNet-MobileNetyesyesyesyesnonononononononononoRTSeg: Real-time Semantic Segmentation FrameworkMennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin JagersandUnder Review by ICIP 2018An efficient realtime semantic segmentation network with skip connections based on MobileNet.

more details
n/a82.095.989.151.592.986.070.488.3
ThunderNetyesyesnononononononono22nonoAnonymous
more details
0.010484.197.990.356.093.088.473.489.8
PAC: Perspective-adaptive ConvolutionsyesyesnonononononononononononoPerspective-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)Many existing scene parsing methods adopt Convolutional Neural Networks with receptive fields of fixed sizes and shapes, which frequently results in inconsistent predictions of large objects and invisibility of small objects. To tackle this issue, we propose perspective-adaptive convolutions to acquire receptive fields of flexible sizes and shapes during scene parsing. Through adding a new perspective regression layer, we can dynamically infer the position-adaptive perspective coefficient vectors utilized to reshape the convolutional patches. Consequently, the receptive fields can be adjusted automatically according to the various sizes and perspective deformations of the objects in scene images. Our proposed convolutions are differentiable to learn the convolutional parameters and perspective coefficients in an end-to-end way without any extra training supervision of object sizes. Furthermore, considering that the standard convolutions lack contextual information and spatial dependencies, we propose a context adaptive bias to capture both local and global contextual information through average pooling on the local feature patches and global feature maps, followed by flexible attentive summing to the convolutional results. The attentive weights are position-adaptive and context-aware, and can be learned through adding an additional context regression layer. Experiments on Cityscapes and ADE20K datasets well demonstrate the effectiveness of the proposed methods.
more details
n/a90.798.793.272.295.693.586.195.4
SU_NetnonononononononononononononoAnonymous
more details
n/a88.598.592.463.694.592.384.094.4
MobileNetV2PlusyesyesnonononononononononononoHuijun LiuMobileNetV2Plus
more details
n/a87.698.491.964.094.591.280.493.1
DeepLabv3+yesyesyesyesnonononononononoyesyes Encoder-Decoder with Atrous Separable Convolution for Semantic Image SegmentationLiang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig AdamarXivSpatial pyramid pooling module or encoder-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We will provide more details in the coming update on the arXiv report.
more details
n/a92.098.893.777.195.894.288.396.1
RFMobileNetV2PlusyesyesnonononononononononononoHuijun LiuReceptive Field MobileNetV2Plus for Semantic Segmentation
more details
n/a88.398.492.266.694.191.582.093.1
GoogLeNetV1_ROByesyesnonononononononononononoAnonymousGoogLeNet-v1 FCN trained on Cityscapes, KITTI, and ScanNet, as required by the Robust Vision Challenge at CVPR'18 (http://robustvision.net/)
more details
n/a83.097.388.553.992.287.473.588.4
SAITv2yesyesyesyesnonononononononononoAnonymous
more details
0.02584.598.090.154.793.789.673.591.9
GUNetyesyesnononononononono22nonoGuided Upsampling Network for Real-Time Semantic SegmentationDavide MazziniarxivGuided Upsampling Network for Real-Time Semantic Segmentation
more details
0.0386.898.491.459.294.890.879.793.4
RMNetyesyesnonononononononononononoAnonymousA fast and light net for semantic segmentation.
more details
0.01484.698.090.258.193.388.573.690.3
ContextNetyesyesnonononononononononononoContextNet: Exploring Context and Detail for Semantic Segmentation in Real-timeRudra PK Poudel, Ujwal Bonde, Stephan Liwicki, Christopher ZacharXivModern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naive adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyze our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.3 frames per second at full (1024x2048) resolution.
more details
0.023882.897.889.647.792.088.972.690.8
RFLRyesyesyesyesyesyesnononono44nonoRandom Forest with Learned Representations for Semantic SegmentationByeongkeun Kang, Truong Q. NguyenIEEE Transactions on Image ProcessingRandom Forest with Learned Representations for Semantic Segmentation
more details
0.0360.293.873.615.784.661.029.863.3
DPCyesyesyesyesnonononononononoyesyesSearching for Efficient Multi-Scale Architectures for Dense Image PredictionLiang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon ShlensNIPS 2018In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that achieve state-of-the-art performance. Additionally, the resulting architecture (called DPC for Dense Prediction Cell) is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.
more details
n/a92.098.893.676.995.494.188.796.2
NV-ADLRyesyesyesyesnonononononononononoAnonymousNVIDIA Applied Deep Learning Research
more details
n/a92.198.894.077.696.194.488.096.1
Adaptive Affinity Field on PSPNetyesyesnonononononononononoyesyesAdaptive Affinity Field for Semantic SegmentationTsung-Wei Ke*, Jyh-Jing Hwang*, Ziwei Liu, Stella X. YuECCV 2018Existing semantic segmentation methods mostly rely on per-pixel supervision, unable to capture structural regularity present in natural images. Instead of learning to enforce semantic labels on individual pixels, we propose to enforce affinity field patterns in individual pixel neighbourhoods, i.e., the semantic label patterns of whether neighbouring pixels are in the same segment should match between the prediction and the ground-truth. The affinity fields characterize geometric relationships within the image, such as "motorcycles have round wheels". We further develop a novel method for learning the optimal neighbourhood size for each semantic category, with an adversarial loss that optimizes over worst-case scenarios. Unlike the common Conditional Random Field (CRF) approaches, our adaptive affinity field (AAF) method has no extra parameters during inference, and is less sensitive to appearance changes in the image.
more details
n/a90.898.793.472.595.693.586.995.3
APMoE_seg_ROByesyesnonononononononononoyesyesPixel-wise Attentional Gating for Parsimonious Pixel LabelingShu Kong, Charless FowlkesarxivThe Pixel-level Attentional Gating (PAG) unit is trained to choose for each pixel the pooling size to adopt to aggregate contextual region around it. There are multiple branches with different dilate rates for varied pooling size, thus varying receptive field. For this ROB challenge, PAG is expected to robustly aggregate information for final prediction.

This is our entry for the Robust Vision Challenge 2018 workshop (ROB). The model is based on ResNet50 and trained on a mixed dataset of Cityscapes, ScanNet, and KITTI.
more details
0.983.596.990.253.392.388.174.688.9
BatMAN_ROByesyesyesyesnonononononononononoAnonymousbatch-normalized multistage attention network
more details
1.083.997.989.555.094.288.172.090.3
HiSS_ROByesyesnononononononono22nonoAnonymous
more details
0.0681.497.888.945.192.987.568.888.9
VENUS_ROByesyesnonononononononononononoAnonymousVENUS_ROB
more details
n/a84.596.790.752.993.889.576.791.6
VlocNet++_ROBnonononononononononononononoAnonymous
more details
n/a83.498.189.351.692.988.472.990.4
AHiSS_ROByesyesyesyesnononononono22nonoAnonymousAugmented Hierarchical Semantic Segmentation
more details
0.0684.298.090.251.793.790.073.692.0
IBN-PSP-SA_ROByesyesnonononononononononononoAnonymousIBN-PSP-SA_ROB
more details
n/a89.198.692.567.195.492.683.394.6
LDN2_ROByesyesnonononononononononononoAnonymousLadder DenseNet: https://ivankreso.github.io/publication/ladder-densenet/
more details
1.090.198.693.070.295.693.085.295.0
MiniNetyesyesnononononononono44nonoAnonymous
more details
0.00470.596.081.825.889.077.246.577.3
AdapNetv2_ROByesyesnonononononononononononoAnonymous
more details
n/a84.398.290.053.993.888.874.191.2
MapillaryAI_ROByesyesnonononononononononononoAnonymous
more details
n/a91.198.793.574.195.893.886.795.3
FCN101_ROByesyesnonononononononononononoAnonymous
more details
n/a61.192.775.21.576.474.433.973.3
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]Bosch autodrive challenge
more details
n/a87.298.391.665.888.390.782.493.7
EnsembleModel_BoschyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name was MaskRCNN_BOSH, firefly]We ensembled three models (erfnet, deeplab-mobilenet, tusimple) and gained an improvement of 0.57 in the IoU Classes value. The best single model scores 73.8549.
more details
n/a88.598.592.266.094.591.882.693.7
EVANetyesyesnonononononononononononoAnonymous
more details
n/a87.798.391.765.994.690.980.192.4
CLRCNetyesyesnonononononononononononoCLRCNet: Cascaded Low-Rank Convolutions for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method.
more details
0.01384.498.090.656.793.488.574.189.7
Edgenetyesyesnononononononono22nonoAnonymousA lightweight semantic segmentation network combined with edge information and channel-wise attention mechanism.
more details
0.0388.598.492.168.594.991.580.993.1
L2-SPyesyesyesyesnonononononononoyesyesExplicit Inductive Bias for Transfer Learning with Convolutional NetworksXuhong Li, Yves Grandvalet, Franck DavoineICML-2018With a simple variant of weight decay, L2-SP regularization (see the paper for details), we reproduced PSPNet based on the original ResNet-101 using "train_fine + val_fine + train_extra" set (2975 + 500 + 20000 images), with a small batch size 8. The sync batch normalization layer is implemented in Tensorflow (see the code).
more details
n/a91.098.793.473.695.693.786.695.6
ALV303yesyesnonononononononononononoAnonymous
more details
0.289.898.692.572.295.092.184.493.8
NCTU-ITRIyesyesnononononononono22nonoAnonymousFor the purpose of fast semantic segmentation, we design a CNN-based encoder-decoder architecture, which is called DSNet. The encoder part is constructed based on the concept of DenseNet, and a simple decoder is adopted to make the network more efficient without degrading the accuracy. We pre-train the encoder network on the ImageNet dataset. Then, only the fine-annotated Cityscapes dataset (2975 training images) is used to train the complete DSNet. The DSNet demonstrates a good trade-off between accuracy and speed. It can process 68 frames per second on 1024x512 resolution images on a single GTX 1080 Ti GPU.
more details
0.014786.898.391.462.994.390.477.692.5
ADSCNetyesyesnonononononononononononoADSCNet: Asymmetric Depthwise Separable Convolution for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method for mobile devices.
more details
0.01384.998.090.757.793.588.974.990.3
SRC-B-MachineLearningLabyesyesyesyesnonononononononononoJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoSamsung Research Center MachineLearningLab. The result is obtained with multi-scale and flip testing. The paper is in preparation.
more details
n/a91.898.893.776.495.994.187.995.9
Tencent AI LabyesyesyesyesnonononononononononoAnonymous
more details
n/a91.898.793.677.095.994.187.695.8
ERINetyesyesnononononononono22nonoAnonymousEfficient residual inception networks for real-time semantic segmentation
more details
0.02387.498.291.664.994.790.679.292.3
PGCNet_Res101_fineyesyesnonononononononononononoAnonymousWe choose ResNet101 pretrained on ImageNet as our backbone, then use both the train-fine and val-fine data to train our model with batch size 8 for 80k iterations without any bells and whistles. We will release our paper later.
more details
n/a91.598.893.675.395.694.087.595.8
EDANetyesyesnononononononono22yesyesEfficient Dense Modules of Asymmetric Convolution for Real-Time Semantic SegmentationShao-Yuan Lo (NCTU), Hsueh-Ming Hang (NCTU), Sheng-Wei Chan (ITRI), Jing-Jhih Lin (ITRI)Training data: Fine annotations only (train+val. set, 2975+500 images) without any pretraining nor coarse annotations.
For training on fine annotations (train set only, 2975 images), it attains a mIoU of 66.3%.

Runtime: (resolution 512x1024) 0.0092s on a single GTX 1080Ti, 0.0123s on a single Titan X.
more details
0.009285.898.191.059.693.689.876.591.6
OCNet_ResNet101_fineyesyesnonononononononononononoAnonymousContext is essential for various computer vision tasks.
The state-of-the-art scene parsing methods define the context as the prior of the scene categories (e.g., bathroom, bedroom, street).
Such scene context is not suitable for the street scene parsing tasks as most of the scenes are similar.

In this work, we propose the Object Context that captures the prior of the object's category that the pixel belongs to.
We compute the object context by aggregating all the pixels' features according to an attention map that encodes, for each pixel, the probability that it belongs to the same category as the associated pixel.
Specifically, we employ the self-attention method to compute the pixel-wise attention map.

We further propose the Pyramid Object Context and Atrous Spatial Pyramid Object Context to handle the problem of multi-scales.
more details
n/a91.698.893.675.795.894.087.795.9
Knowledge-AwareyesyesnonononononononononononoAnonymousKnowledge-Aware Semantic Segmentation
more details
n/a90.798.793.172.695.793.486.095.5
CASIA_IVA_DANet_NoCoarseyesyesnonononononononononoyesyesDual Attention Network for Scene SegmentationJun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing LuCVPR 2019We address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of a traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation, which contributes to more precise segmentation results.
more details
n/a91.698.793.575.895.793.987.795.8
LDFNetyesyesnonononoyesyesnono22yesyesIncorporating Luminance, Depth and Color Information by Fusion-based Networks for Semantic SegmentationShang-Wei Hung, Shao-Yuan LoWe propose a preferred solution, which incorporates Luminance, Depth and color information by a Fusion-based network named LDFNet. It includes a distinctive encoder sub-network to process the depth maps and further employs the luminance images to assist the depth information in a process. LDFNet achieves very competitive results compared to the other state-of-the-art systems on the challenging Cityscapes dataset, while it maintains an inference speed faster than most of the existing top-performing networks. The experimental results show the effectiveness of the proposed information-fused approach and the potential of LDFNet for road scene understanding tasks.
more details
n/a88.598.492.268.094.891.781.393.1
CGNetyesyesnonononononononononoyesyesTianyi Wu et alWe propose a novel Context Guided Network for semantic segmentation on mobile devices. We first design a Context Guided (CG) block by considering the inherent characteristic of semantic segmentation. The CG block aggregates local features, surrounding context features and global context features effectively and efficiently. Based on the CG block, we develop the Context Guided Network (CGNet), which not only has a strong capacity of localization and recognition, but also has a low computational and memory footprint. Under a similar number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing, the proposed approach achieves 64.8% mean IoU on the Cityscapes test set with less than 0.5 M parameters, and has a frame rate of 50 fps on one NVIDIA Tesla K80 card for 2048 × 1024 high-resolution images.
more details
0.0285.797.791.359.394.190.277.490.3
SAITv2-lightyesyesyesyesnonononononononononoAnonymous
more details
0.02587.498.492.063.494.591.478.993.1
Deform_ResNet_BalancedyesyesnonononononononononononoAnonymous
more details
0.25878.094.588.141.388.283.363.487.5
NfS-SegyesyesyesyesnonoyesyesyesyesnonononoUncertainty-Aware Knowledge Distillation for Real-Time Scene Segmentation: 7.43 GFLOPs at Full-HD Image with 120 fpsAnonymous
more details
0.0083731287.598.492.064.094.691.479.093.2
Improving Semantic Segmentation via Video Propagation and Label RelaxationyesyesyesyesnonononoyesyesnonoyesyesImproving Semantic Segmentation via Video Propagation and Label RelaxationYi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan CatanzaroCVPR 2019Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples lead to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018.
more details
n/a92.298.893.978.096.194.488.296.1
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and we test it against different datasets for outdoor scene understanding.
more details
0.0088485.998.291.158.394.090.477.392.4
SwiftNetRN-18yesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
0.024389.898.692.870.095.492.684.794.7
Fast-SCNNyesyesyesyesnonononononononononoFast-SCNN: Fast Semantic Segmentation NetworkRudra PK Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.008184.798.290.355.194.389.674.291.4
Fast-SCNN (Half-resolution)yesyesyesyesnononononono22nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.003580.597.788.042.592.787.365.989.4
Fast-SCNN (Quarter-resolution)yesyesnononononononono44nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.0020674.296.883.925.489.582.955.685.1
DSNetyesyesyesyesnononononono22yesyesDSNet for Real-Time Driving Scene Semantic SegmentationWenfu WangDSNet for Real-Time Driving Scene Semantic Segmentation
more details
0.02786.097.790.563.193.589.976.191.3
SwiftNetRN-18 pyramidyesyesnonononononononononononoAnonymous
more details
n/a89.598.692.369.195.392.284.394.6
DF-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019DF1-Seg-d8
more details
0.00786.698.191.361.093.991.078.992.3
DF-SegyesyesnonononononononononononoAnonymousDF2-Seg2
more details
0.01889.298.592.568.795.392.582.994.3
DDARyesyesyesyesnonononononononononoAnonymousDiDi Labs, AR Group
more details
n/a91.998.793.576.795.894.188.395.9
LDN-121yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-121 trained on train+val, fine labels only. Single-scale inference.
more details
0.04890.798.793.272.595.793.386.195.4
TKCNyesyesnonononononononononoyesyesTree-structured Kronecker Convolutional Network for Semantic SegmentationTianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, Jintao Li
more details
n/a91.198.693.374.195.493.587.395.2
RPNetyesyesnononononononono22yesyesResidual Pyramid Learning for Single-Shot Semantic SegmentationXiaoyu Chen, Xiaotian Lou, Lianfa Bai, Jing HanarXivWe put forward a method for single-shot segmentation in a feature residual pyramid network (RPNet), which learns the main and residuals of segmentation by decomposing the label at different levels of residual blocks.
more details
0.00886.898.291.363.294.590.278.691.7
naviyesyesnonononononononononononoyuxbmulti-scale test
more details
n/a91.298.893.574.095.693.787.095.6
Auto-DeepLab-LyesyesyesyesnonononononononoyesyesAuto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image SegmentationChenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, Li Fei-FeiarxivIn this work, we study Neural Architecture Search for semantic image segmentation, an important computer vision task that assigns a semantic label to every pixel in an image. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network structure that controls the spatial resolution changes. This choice simplifies the search space, but becomes increasingly problematic for dense image prediction which exhibits a lot more network level architectural variations. Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space. We present a network level search space that includes many popular designs, and develop a formulation that allows efficient gradient-based architecture search (3 P100 GPU days on Cityscapes images). We demonstrate the effectiveness of the proposed method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Without any ImageNet pretraining, our architecture searched specifically for semantic image segmentation attains state-of-the-art performance. Please refer to https://arxiv.org/abs/1901.02985 for details.
more details
n/a91.998.893.776.796.094.287.896.0
LiteSeg-Darknet19yesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.010288.398.492.365.995.091.781.793.0
AdapNet++yesyesyesyesnonononononononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019In this work, we propose the AdapNet++ architecture for semantic segmentation that aims to achieve the right trade-off between performance and computational complexity of the model. AdapNet++ incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling (eASPP) module that has a larger effective receptive field with more than 10x fewer parameters compared to the standard ASPP, complemented with a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance while simultaneously being efficient in terms of both the number of parameters and inference time. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
more details
n/a91.098.793.273.795.393.686.795.8
SSMAyesyesyesyesnonoyesyesnonononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real-world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed SSMA fusion mechanism which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. Extensive experimental evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance in addition to providing exceptional robustness in adverse perceptual conditions. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
more details
n/a91.598.793.575.295.393.987.896.1
LiteSeg-MobilenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.006286.897.991.762.894.690.479.390.9
LiteSeg-ShufflenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.00751885.497.991.257.494.089.777.390.3
Fast OCNetyesyesnonononononononononononoAnonymous
more details
n/a91.798.893.576.295.694.087.695.9
ShuffleNet v2 + DPCyesyesyesyesnonononononononoyesyesAn efficient solution for semantic segmentation: ShuffleNet V2 with atrous separable convolutionsSercan Turkmen, Janne HeikkilaShuffleNet v2 with DPC at output_stride 16.
more details
n/a86.598.391.359.593.990.978.693.0
ERSNet-coarseyesyesyesyesnononononono44nonoAnonymous
more details
0.01285.998.291.160.594.589.975.591.8
MiniNet-v2-coarseyesyesyesyesnononononono22nonoAnonymous
more details
0.01286.198.291.260.794.590.175.992.1
SwiftNetRN-18 ensembleyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
n/a90.198.692.870.695.592.885.295.0
EFC_syncyesyesnonononononoyesyesnonononoAnonymous
more details
n/a90.998.793.372.995.693.686.895.5
PL-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019Following "partial order pruning", we conducted architecture search experiments on the Snapdragon 845 platform and obtained PL1A/PL1A-Seg.

1. Snapdragon 845
2. NCNN Library
3. Latency evaluated at 640x384

more details
0.019286.498.291.459.894.790.478.292.0
MiniNet-v2-pretrainedyesyesyesyesnononononono22nonoAnonymous
more details
0.01286.298.391.360.994.590.276.292.1
GALD-NetyesyesyesyesyesyesyesyesnonononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose the Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information around the position. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research)
more details
n/a92.398.893.878.396.094.488.596.1
GALD-netyesyesyesyesnonononononononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose the Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information surrounding the position.
more details
n/a92.298.893.878.096.094.488.596.1
ndnetyesyesnonononononononononononoAnonymous
more details
0.02484.598.190.553.693.789.274.791.4
HRNetV2yesyesnonononononononononoyesyesHigh-Resolution Representations for Labeling Pixels and RegionsKe Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, Jingdong WangThe high-resolution network (HRNet), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions.
more details
n/a92.298.893.777.796.094.388.696.0
SPGNetyesyesnonononononononononononoSPGNet: Semantic Prediction Guidance for Scene ParsingBowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, Honghui ShiICCV 2019Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance than their single-stage counterpart. However, few efforts have been attempted to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through the guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, in which our SPGNet attains 81.1% on the test set using only 'fine' annotations.
more details
n/a92.198.893.877.596.194.288.895.9
LDN-161yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-161 trained on train+val, fine labels only. Inference on multi-scale inputs.
more details
2.091.398.793.474.395.893.887.095.8
GGCFyesyesyesyesnonononononononononoAnonymous
more details
n/a92.098.893.877.095.894.388.296.0
GFF-NetyesyesnonononononononononononoGFF: Gated Fully Fusion for Semantic SegmentationXiangtai Li, Houlong Zhao, Yunhai Tong, Kuiyuan YangWe propose Gated Fully Fusion (GFF) to fuse features from multiple levels through gates in a fully connected way. Specifically, features at each level are enhanced by higher-level features with stronger semantics and lower-level features with more details, and gates are used to control the pass of useful information, which significantly reduces noise propagation during fusion. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research)
more details
n/a92.098.893.777.295.994.288.496.0
Gated-SCNNyesyesnonononononononononoyesyesGated-SCNN: Gated Shape CNNs for Semantic SegmentationTowaki Takikawa, David Acuna, Varun Jampani, Sanja Fidler
more details
n/a92.398.894.078.396.294.588.496.2
ESPNetv2yesyesnononononononono22yesyesESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural NetworkSachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh HajishirziCVPR 2019We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the Penn Treebank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network has better generalization properties than ShuffleNetv2 when tested on the MSCOCO multi-object classification task and the Cityscapes urban scene semantic segmentation task. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2
more details
n/a84.397.990.156.293.388.973.390.7
MRFMyesyesyesyesnonononononononononoMulti Receptive Field Network for Semantic SegmentationJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoWACV 2020Semantic segmentation is one of the key tasks in computer vision, which is to assign a category label to each pixel in an image. Despite significant progress achieved recently, most existing methods still suffer from two challenging issues: 1) the size of objects and stuff in an image can be very diverse, demanding for incorporating multi-scale features into the fully convolutional networks (FCNs); 2) the pixels close to or at the boundaries of object/stuff are hard to classify due to the intrinsic weakness of convolutional networks. To address the first issue, we propose a new Multi-Receptive Field Module (MRFM), explicitly taking multi-scale features into account. For the second issue, we design an edge-aware loss which is effective in distinguishing the boundaries of object/stuff. With these two designs, our Multi Receptive Field Network achieves new state-of-the-art results on two widely-used semantic segmentation benchmark datasets. Specifically, we achieve a mean IoU of 83.0% on the Cityscapes dataset and 88.4% mean IoU on the Pascal VOC2012 dataset.
more details
n/a92.098.893.877.695.794.188.495.9
DGCNetyesyesnonononononononononononoDual Graph Convolutional Network for Semantic SegmentationLi Zhang*, Xiangtai Li*, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, Philip H.S. TorrBMVC 2019We propose the Dual Graph Convolutional Network (DGCNet), which models the global context of the input feature by modelling two orthogonal graphs in a single framework. (Joint work: University of Oxford, Peking University and DeepMotion AI Research)
more details
n/a91.898.893.676.695.894.188.095.9
dpcan_trainval_os16_225yesyesnonononononononononononoAnonymous
more details
n/a91.798.893.575.995.794.088.095.8
Learnable Tree FilteryesyesnonononononononononoyesyesLearnable Tree Filter for Structure-preserving Feature TransformLin Song; Yanwei Li; Zeming Li; Gang Yu; Hongbin Sun; Jian Sun; Nanning ZhengNeurIPS 2019Learnable Tree Filter for Structure-preserving Feature Transform
more details
n/a91.698.793.475.795.893.888.095.7
FreeNetyesyesnonononononononononononoAnonymous
more details
n/a85.196.691.256.193.589.377.392.0
HRNetV2 + OCRyesyesyesyesnonononononononoyesyesHigh-Resolution Representations for Labeling Pixels and Regions; OCNet: Object Context Network for Scene ParsingHRNet Team; OCR TeamHRNetV2W48 + OCR. OCR is an extension of object context networks https://arxiv.org/pdf/1809.00916.pdf
more details
n/a92.198.893.877.796.094.488.296.1
Valeo DAR GermanyyesyesyesyesnonononononononononoAnonymousValeo DAR Germany, New Algo Lab

more details
n/a92.098.893.776.995.494.188.996.2
GLNet_fineyesyesnonononononononononononoAnonymousThe proposed network architecture combines spatial information with multi-scale context information and repairs the boundaries and details of segmented objects through channel attention modules. (Uses the train-fine and the val-fine data.)
more details
n/a91.398.793.474.495.993.887.395.7
MCDNyesyesnonononononononononononoAnonymous
more details
n/a91.598.793.575.895.893.987.295.6
AAF+GLRyesyesnonononononononononononoAnonymous
more details
n/a90.998.793.373.995.693.386.695.2
HRNetV2 + OCR (w/ ASP)yesyesyesyesnonononononononoyesyesopenseg-group (OCR team + HRNet team)Our approach is based on a single HRNet48V2 and an OCR module combined with ASPP. We apply depth based multi-scale ensemble weights during testing (provided by DeepMotion AI Research).
more details
n/a92.498.893.978.796.094.588.696.1
CASIA_IVA_DRANet-101_NoCoarseyesyesnonononononononononononoAnonymous
more details
n/a92.498.893.978.496.094.488.996.1
Hyundai Mobis AD LabyesyesyesyesnonononononononononoHyundai Mobis AD Lab, DL-DB Group, AA (Automated Annotator) Team
more details
n/a92.498.894.078.396.194.588.796.3
EFRNet-13yesyesnonononononononononononoAnonymous
more details
0.014687.098.391.861.594.791.078.992.7
FarSee-Netyesyesnononononononono22nonoFarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolutionZhanpeng Zhang and Kaipeng ZhangIEEE International Conference on Robotics and Automation (ICRA) 2020FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution

Real-time semantic segmentation is desirable in many robotic applications with limited computation resources. One challenge of semantic segmentation is to deal with the object scale variations and leverage the context. How to perform multi-scale context aggregation within limited computation budget is important. In this paper, firstly, we introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP). It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information. On the other hand, for runtime efficiency, state-of-the-art methods will quickly decrease the spatial size of the inputs or feature maps in the early network stages. The final high-resolution result is usually obtained by non-parametric up-sampling operation (e.g. bilinear interpolation). Differently, we rethink this pipeline and treat it as a super-resolution process. We use optimized super-resolution operation in the up-sampling step and improve the accuracy, especially in sub-sampled input image scenario for real-time applications. By fusing the above two improvements, our methods provide better latency-accuracy trade-off than the other state-of-the-art methods. In particular, we achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card. The proposed module can be plugged into any feature extraction CNN and benefits from the CNN structure development.
more details
0.011985.998.190.660.194.089.976.592.2
C3Net [2,3,7,13]nononononononononono22yesyesC3: Concentrated-Comprehensive Convolution and its application to semantic segmentationHyojin Park, Youngjoon Yoo, Geonseok Seo, Dongyoon Han, Sangdoo Yun, Nojun Kwak
more details
n/a84.797.890.558.893.588.873.589.8
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
n/a 91.7 98.7 93.5 76.5 95.7 93.9 88.4 95.5
EKENetyesyesnonononononononononononoAnonymous
0.0229 87.2 98.4 91.8 62.0 94.7 91.2 79.1 93.0
SPSSNyesyesnonononononononononononoAnonymousStage Pooling Semantic Segmentation Network
n/a 86.0 98.1 91.1 60.3 94.2 90.0 76.6 91.8
FC-HarDNet-70yesyesnonononononononononoyesyesHarDNet: A Low Memory Traffic NetworkPing Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, Youn-Long LinICCV 2019Fully Convolutional Harmonic DenseNet 70
U-shape encoder-decoder structure with HarDNet blocks
Trained with single scale loss at stride-4
validation mIoU=77.7
0.015 89.9 98.7 92.8 70.5 95.4 92.7 84.6 94.9
BFPyesyesnonononononononononononoBoundary-Aware Feature Propagation for Scene SegmentationHenghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang WangIEEE International Conference on Computer Vision (ICCV), 2019Boundary-Aware Feature Propagation for Scene Segmentation
n/a 91.4 98.7 93.4 75.2 95.5 93.9 87.3 95.6
FasterSegyesyesnonononononononononoyesyesFasterSeg: Searching for Faster Real-time Semantic SegmentationWuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang WangICLR 2020We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, which has been recently found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization that effectively overcomes our observed phenomenon that the searched networks are prone to "collapsing" to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model's accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.
0.00613 88.1 98.2 92.1 65.6 94.5 91.6 81.8 92.8
VCD-NoCoarseyesyesnonononononononononononoAnonymous
n/a 92.3 98.8 93.8 78.3 95.9 94.3 88.8 96.1
NAVINFO_DLRyesyesyesyesnonononononononononopengfei zhangweighted aspp+ohem+hard region refine
n/a 92.4 98.8 93.9 78.7 95.9 94.4 89.1 96.3
LBPSSyesyesnononononononono22nonoAnonymousCVPR 2020 submission #5455
0.9 84.3 97.9 90.0 58.9 93.1 88.1 72.8 89.4
KANet_Res101yesyesnonononononononononononoAnonymous
n/a 91.6 98.7 93.4 75.7 95.7 93.9 87.6 95.8
Learnable Tree Filter V2yesyesnonononononononononoyesyesRethinking Learnable Tree Filter for Generic Feature TransformLin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, Nanning ZhengNeurIPS 2020Based on ResNet-101 backbone and FPN architecture.
n/a 92.1 98.7 93.6 77.5 95.9 94.1 88.7 96.0
GPSNetyesyesnonononononononononononoAnonymous
n/a 92.0 98.8 93.8 77.0 95.8 94.2 88.1 96.1
FTFNetyesyesyesyesnonononononononononoAnonymousAn Efficient Network Focused on Tiny Feature Maps for Real-Time Semantic Segmentation
0.0088 87.6 98.4 92.0 62.7 94.7 91.4 80.5 93.6
iFLYTEK-CVyesyesyesyesnonononononononononoAnonymousiFLYTEK Research, CV Group
n/a 92.4 98.8 94.1 78.5 96.0 94.6 88.8 96.3
F2MF-shortyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature Motion Josip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 3 timesteps into the future.
n/a 82.5 97.3 89.2 54.5 91.6 89.2 66.2 89.2
HPNetyesyesnonononononononononononoHigh-Order Paired-ASPP Networks for Semantic SegmentationYu Zhang, Xin Sun, Junyu Dong, Changrui Chen, Yue Shen
n/a 91.4 98.7 93.6 74.7 95.7 93.9 87.3 95.6
HANet (fine-train only)yesyesnonononononononononononoTBAAnonymousWe use only fine-training data.
n/a 91.2 98.8 93.6 73.8 95.8 93.9 87.1 95.8
F2MF-midyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature MotionJosip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 9 timesteps into the future.
n/a 72.4 95.3 83.7 32.4 86.0 83.4 46.3 79.4
EMANetyesyesnonononononononononononoExpectation Maximization Attention Networks for Semantic SegmentationXia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, Hong LiuICCV 2019
n/a 91.6 98.7 93.6 75.9 95.7 94.0 87.8 95.6
PartnerNetyesyesnonononononononononononoAnonymousPARTNERNET: A LIGHTWEIGHT AND EFFICIENT PARTNER NETWORK FOR SEMANTIC SEGMENTATION
0.0058 88.2 98.4 92.1 65.7 94.6 91.5 82.1 93.2
SwiftNet RN18 pyr sepBN MVDyesyesnonononononononononoyesyesEfficient semantic segmentation with pyramidal fusionM Oršić, S ŠegvićPattern Recognition 2020
0.029 90.3 98.6 93.0 72.3 95.8 93.1 84.3 94.9
Tencent YYB VisualAlgoyesyesyesyesnonononononononononoAnonymousTencent YYB VisualAlgo Group
n/a 92.2 98.8 93.9 77.7 96.1 94.4 88.1 96.1
MoKu LabyesyesnonononononononononononoAnonymousAlibaba, MoKu AI Lab, CV Group
n/a 92.6 98.9 94.1 79.3 96.2 94.7 89.0 96.3
HRNetV2 + OCR + SegFixyesyesyesyesnonononononononoyesyesObject-Contextual Representations for Semantic SegmentationYuhui Yuan, Xilin Chen, Jingdong WangFirst, we pre-train "HRNet+OCR" method on the Mapillary training set (achieves 50.8% on the Mapillary val set). Second, we fine-tune the model with the Cityscapes training, validation and coarse set. Finally, we apply the "SegFix" scheme to further improve the results.
n/a 92.7 98.8 94.0 79.2 96.1 94.6 89.4 96.5
DecoupleSegNetyesyesnonononononononononoyesyesImproving Semantic Segmentation via Decoupled Body and Edge SupervisionXiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai TongECCV-2020In this paper, we propose a new paradigm for semantic segmentation. Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image. To do so, we first warp the image feature by learning a flow field to make the object part more consistent. The resulting body feature and the residual edge feature are further optimized under decoupled supervision by explicitly sampling different parts (body or edge) pixels. The code and models have been released.
n/a 92.3 98.8 94.0 77.8 96.1 94.5 88.6 96.3
LGE A&B Center: HANet (ResNet-101)yesyesnonononononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val", No coarse, Backbone: ImageNet pretrained ResNet-101
n/a 92.0 98.8 93.7 76.8 96.1 94.2 88.1 96.0
DCNASyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationXiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Wenqi RenNeural Architecture Search (NAS) has shown great potential in automatically designing scalable network architectures for dense image predictions. However, existing NAS algorithms usually compromise on a restricted search space and search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between target and proxy dataset, we propose a Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module to reduce the memory consumption of ample search space.
n/a 91.9 98.8 93.8 77.7 94.0 94.4 88.3 96.1
GPNet-ResNet101yesyesnonononononononononononoAnonymous
n/a 91.9 98.8 93.7 76.4 95.9 94.2 88.4 96.1
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a 91.6 98.7 93.4 76.5 95.6 93.8 88.1 95.3
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a 92.3 98.7 93.7 78.6 96.1 94.4 88.9 95.9
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.

n/a 91.5 98.7 93.2 76.0 95.7 93.6 88.1 95.2
LGE A&B Center: HANet (ResNext-101)yesyesyesyesnonononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val + coarse", Backbone: Mapillary pretrained ResNext-101
n/a 92.1 98.8 93.8 77.3 96.1 94.3 88.0 96.1
ERINet-v2yesyesnonononononononononononoEfficient Residual Inception NetworkMINJONG KIM, SUYOUNG CHIongoing
0.00526316 85.7 98.2 91.2 58.9 93.9 89.8 76.6 91.6
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
n/a 92.9 98.8 94.1 80.2 96.2 94.8 89.8 96.4
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a 92.6 98.8 93.9 79.6 96.1 94.7 89.3 96.1
TUE-5LSM0-g23yesyesyesyesnonononononononononoAnonymousDeeplabv3+decoder
n/a 81.8 95.7 88.8 51.5 91.1 85.7 70.6 89.4
PBRNetyesyesnonononononononononononoAnonymousmodified MobileNetV2 backbone + Prediction and Boundary attention-based Refinement Module (PBRM)
0.0107 88.7 98.5 92.6 66.2 94.7 91.9 83.2 93.4
ResNeSt200yesyesnonononononononononononoResNeSt: Split-Attention NetworksHang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander SmolaDeepLabV3+ network with ResNeSt200 backbone.
n/a 92.3 98.8 93.9 77.8 96.3 94.6 88.5 96.2
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
n/a 92.9 98.8 93.9 80.2 96.1 94.7 89.7 96.4
EaNet-V1yesyesnonononononononononononoParsing Very High Resolution Urban Scene Images by Learning Deep ConvNets with Edge-Aware LossXianwei Zheng, Linxi Huan, Gui-Song Xia, Jianya GongParsing very high resolution (VHR) urban scene images into regions with semantic meaning, e.g. buildings and cars, is a fundamental task necessary for interpreting and understanding urban scenes. However, due to the huge quantity of details contained in an image and the large variations of objects in scale and appearance, the existing semantic segmentation methods often break one object into pieces, or confuse adjacent objects and thus fail to depict these objects consistently. To address this issue, we propose a concise and effective edge-aware neural network (EaNet) for urban scene semantic segmentation. The proposed EaNet model is deployed as a standard balanced encoder-decoder framework. Specifically, we devised two plug-and-play modules that append on top of the encoder and decoder respectively, i.e., the large kernel pyramid pooling (LKPP) and the edge-aware loss (EA loss) function, to extend the model ability in learning discriminating features. The LKPP module captures rich multi-scale context with strong continuous feature relations to promote coherent labeling of multi-scale urban objects. The EA loss module learns edge information directly from semantic segmentation prediction, which avoids costly post-processing or extra edge detection. During training, EA loss imposes a strong geometric awareness to guide object structure learning at both the pixel- and image-level, and thus effectively separates confusing objects with sharp contours.
n/a 91.2 98.8 93.5 73.8 95.7 93.9 86.8 95.8
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a 92.5 98.8 93.9 78.8 96.0 94.6 88.8 96.3
FSFNetyesyesnonononononononononoyesyesAccelerator-Aware Fast Spatial Feature Network for Real-Time Semantic SegmentationMinjong Kim, Byungjae Park, Suyoung ChiIEEE AccessSemantic segmentation is performed to understand an image at the pixel level; it is widely used in the field of autonomous driving. In recent years, deep neural networks achieve good accuracy performance; however, there exist few models that have a good trade-off between high accuracy and low inference time. In this paper, we propose a fast spatial feature network (FSFNet), an optimized lightweight semantic segmentation model using an accelerator, offering high performance as well as faster inference speed than current methods. FSFNet employs the FSF and MRA modules. The FSF module has three different types of subset modules to extract spatial features efficiently. They are designed in consideration of the size of the spatial domain. The multi-resolution aggregation module combines features that are extracted at different resolutions to reconstruct the segmentation image accurately. Our approach is able to run at over 203 FPS at full resolution (1024 x 2048) on a single NVIDIA 1080Ti GPU, and obtains a result of 69.13% mIoU on the Cityscapes test dataset. Compared with existing models in real-time semantic segmentation, our proposed model retains remarkable accuracy while having high FPS that is over 30% faster than the state-of-the-art model. The experimental results proved that our model is an ideal approach for the Cityscapes dataset.
0.0049261 86.6 98.2 91.6 61.2 94.2 90.4 78.5 92.0
Hierarchical Multi-Scale Attention for Semantic SegmentationyesyesyesyesnonononononononoyesyesHierarchical Multi-Scale Attention for Semantic SegmentationAndrew Tao, Karan Sapra, Bryan Catanzaro Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes which leads to greater model accuracy. We demonstrate the result of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach we achieve new state-of-the-art results in both Mapillary (61.1 IOU val) and Cityscapes (85.4 IOU test).
n/a 93.2 98.9 94.3 80.9 96.3 94.9 90.2 96.6
SANetyesyesnononononononono44nonoAnonymous
25.0 91.4 98.8 93.6 74.3 95.8 93.9 87.4 95.8
SJTU_hpmyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu*, Haohua Zhao, and Liqing Zhang
n/a 91.4 98.8 93.5 75.2 95.9 93.9 86.9 95.8
FANetyesyesnonononononononononononoFANet: Feature Aggregation Network for Semantic SegmentationTanmay Singha, Duc-Son Pham, and Aneesh KrishnaFeature Aggregation Network for Semantic Segmentation
n/a 83.1 96.2 89.8 52.2 93.0 88.1 72.2 90.1
Hard Pixel Mining for Depth Privileged Semantic SegmentationyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu, Haohua Zhao, and Liqing ZhangSemantic segmentation has achieved remarkable progress but remains challenging due to the complex scene, object occlusion, and so on. Some research works have attempted to use extra information such as a depth map to help RGB based semantic segmentation because the depth map could provide complementary geometric cues. However, due to the inaccessibility of depth sensors, depth information is usually unavailable for the test images. In this paper, we leverage only the depth of training images as the privileged information to mine the hard pixels in semantic segmentation, in which depth information is only available for training images but not available for test images. Specifically, we propose a novel Loss Weight Module, which outputs a loss weight map by employing two depth-related measurements of hard pixels: Depth Prediction Error and Depth-aware Segmentation Error. The loss weight map is then applied to segmentation loss, with the goal of learning a more robust model by paying more attention to the hard pixels. Besides, we also explore a curriculum learning strategy based on the loss weight map. Meanwhile, to fully mine the hard pixels on different scales, we apply our loss weight module to multi-scale side outputs. Our hard pixels mining method achieves the state-of-the-art results on three benchmark datasets, and even outperforms the methods which need depth input during testing.
n/a 92.3 98.8 93.9 78.2 96.1 94.4 88.6 96.3
MSeg1080_RVCyesyesnonononononononononoyesyesMSeg: A Composite Dataset for Multi-domain Semantic SegmentationJohn Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen KoltunCVPR 2020We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. A naive merge of the constituent datasets yields poor performance due to inconsistent taxonomies and annotation practices. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images, requiring more than 1.34 years of collective annotator effort. The resulting composite dataset enables training a single semantic segmentation model that functions effectively across domains and generalizes to datasets that were not seen during training. We adopt zero-shot cross-dataset transfer as a benchmark to systematically evaluate a model’s robustness and show that MSeg training yields substantially more robust models in comparison to training on individual datasets or naive mixing of datasets without the presented contributions.
0.49 91.5 98.8 93.7 75.3 95.9 94.1 87.2 95.7
SA-Gate (ResNet-101,OS=16)yesyesnonononoyesyesnonononoyesyesBi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic SegmentationXiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang ZengEuropean Conference on Computer Vision (ECCV), 2020RGB+HHA input, input resolution = 800x800, output stride = 16, training 240 epochs, no coarse data is used.
n/a 91.9 98.8 93.6 76.8 95.9 94.1 88.4 96.1
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
n/a 86.0 88.9 91.3 68.7 94.8 89.8 82.4 86.4
HRNet + LKPP + EA lossyesyesnonononononononononononoAnonymous
n/a 91.4 98.8 93.5 74.6 95.6 94.0 87.1 95.9
SN_RN152pyrx8_RVCyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
1.0 89.4 98.5 92.7 68.1 95.5 92.8 83.7 94.6
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
n/a 85.3 98.1 90.4 54.5 94.7 90.0 77.2 92.7
AttaNet_lightyesyesnonononononononononoyesyesAttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing(AAAI21)Anonymous
n/a 87.0 98.4 91.9 59.7 94.0 91.0 81.1 93.1
CFPNetyesyesnonononononononononononoAnonymous
n/a 87.4 98.4 91.8 62.8 94.3 91.0 80.8 93.0
Seg_UJSyesyesnonononononononononononoAnonymous
n/a 92.7 98.9 94.2 78.8 96.2 94.8 89.6 96.6
Bilateral_attention_semanticyesyesnonononononononononononoAnonymouswe use bilateral attention mechanism for semantic segmentation
0.0141 90.4 98.6 92.9 72.0 94.9 92.9 86.1 95.0
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a 92.3 98.8 93.7 78.4 96.0 94.3 89.3 95.9
ESANet RGB-D (small input)yesyesnonononoyesyesnono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data with half the input resolution.
0.0427 89.0 98.5 92.0 68.0 95.2 92.4 82.4 94.2
ESANet RGB (small input)yesyesnononononononono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images with half the input resolution.
0.031 87.1 98.3 91.2 61.6 94.6 91.3 79.3 93.2
ESANet RGB-DyesyesnonononoyesyesnonononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data.
0.1613 91.3 98.7 93.3 75.1 95.9 93.7 87.1 95.5
DAHUA-ARIyesyesyesyesnonononononononononoAnonymousmulti-scale and refineNet
n/a93.298.994.380.896.395.090.196.7
ESANet RGByesyesnonononononononononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images only.
0.120590.298.693.071.095.393.185.195.0
DCNAS+ASPP [Mapillary Vistas]yesyesyesyesnonononononononononoAnonymousExisting NAS algorithms usually compromise on restricted search space or search on proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy setting, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of ample search space, hence favor the proxyless searching. Compared with contemporary works, experiments reveal that the proxyless searching scheme is capable of bridging the gap between searching and training environments.
n/a93.198.994.380.696.394.990.096.6
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a92.998.894.180.596.294.989.796.4
DCNAS+ASPPyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationAnonymousExisting NAS algorithms usually compromise on restricted search space or search on proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy setting, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of ample search space, hence favor the proxyless searching.
n/a92.798.994.179.296.294.689.396.4
ddl_segyesyesnonononononononononononoAnonymous
n/a92.898.994.378.896.294.989.696.7
CABiNetyesyesnonononononononononononoCABiNet: Efficient Context Aggregation Network for Low-Latency Semantic SegmentationSaumya Kumaar, Ye Lyu, Francesco Nex, Michael Ying YangWith the increasing demand of autonomous machines, pixel-wise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for any potential real-time applications. In this paper, we propose CABiNet (Context Aggregated Bi-lateral Network), a dual branch convolutional neural network (CNN), with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing multi-branch architectures for high-speed semantic segmentation, we design a cheap high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. Specifically, we achieve 76.6% and 75.9% mIOU on Cityscapes validation and test sets respectively, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. Codes and training models will be made publicly available.
0.01391.198.693.376.695.793.585.094.8
Margin calibrationyesyesyesyesnonononononononononoAnonymousThe model is DeepLab v3+ backend on SEResNeXt50. We used the margin calibration with log-loss as the learning objective.
n/a92.198.893.977.296.094.388.296.1
MT-SSSRyesyesnononononononono22nonoAnonymous
n/a91.498.793.475.895.793.787.395.5
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a93.098.894.180.996.294.989.896.4
DSANet: Dilated Spatial Attention for Real-time Semantic Segmentation in Urban Street ScenesyesyesnonononononononononononoAnonymousWe present a computationally efficient network named DSANet, which follows a two-branch strategy to tackle the problem of real-time semantic segmentation in urban scenes. We first design a Context branch, which employs Depth-wise Asymmetric ShuffleNet (DAS) as its main building block to acquire sufficient receptive fields. In addition, we propose a dual attention module consisting of dilated spatial attention and channel attention to make full use of the multi-level feature maps simultaneously, which helps predict the pixel-wise labels in each stage. Meanwhile, a Spatial Encoding Network is used to enhance semantic information by preserving the spatial details. Finally, to better combine context information and spatial information, we introduce a Simple Feature Fusion Module to combine the features from the two branches.
n/a88.098.092.365.694.591.881.692.0
UJS_modelyesyesnonononononononononononoAnonymous
n/a93.198.994.380.796.294.890.196.7
Mobilenetv3-small-backbone real-time segmentationyesyesnonononononononononoyesyesAnonymousThe model is a dual-path network with mobilenetv3-small backbone. PSP module was used as the context aggregation block. We also use feature fusion module at x16, x32. The features of the two branches are then concatenated and fused with a bottleneck conv.
Only the training data is used to train the model; the validation data is excluded. Evaluation was done on single-scale input images.
0.0284.398.190.653.193.288.975.890.8
M2FANetyesyesnonononononononononoyesyesUrban street scene analysis using lightweight multi-level multi-path feature aggregation networkTanmay Singha; Duc-Son Pham; Aneesh KrishnaMultiagent and Grid Systems Journal
n/a86.996.991.663.294.590.179.692.2
AFPNetyesyesnonononononononononononoAnonymous
0.0389.398.692.568.394.992.483.794.4
YOLO V5s with Segmentation Headyesyesnononononononono22yesyesAnonymousMultitask model. Fine-tuned from a COCO detection pretrained model; semantic segmentation and object detection (transferred from instance labels) are trained at the same time.
0.00785.798.290.455.693.590.378.593.2
FSFFNetyesyesyesyesnonononononononoyesyesA Lightweight Multi-scale Feature Fusion Network for Real-Time Semantic SegmentationTanmay Singha, Duc-Son Pham, Aneesh Krishna, Tom GedeonInternational Conference on Neural Information Processing 2021Feature Scaling Feature Fusion Network
n/a87.196.891.564.194.490.279.892.7
Qualcomm AI ResearchyesyesyesyesnonononononononoyesyesInverseForm: A Loss Function for Structured Boundary-Aware SegmentationShubhankar Borse, Ying Wang, Yizhe Zhang, Fatih PorikliCVPR 2021 oral
n/a93.198.794.180.996.394.990.396.7
HIK-CCSLTyesyesyesyesnonononononononononoAnonymous
n/a93.398.994.481.196.395.190.496.8
BFNetyesyesnonononononononononononoBFNetJiaqi Fan
n/a87.697.991.964.694.891.580.292.2
Hai Wang+Yingfeng Cai-research groupyesyesnonononononononononononoAnonymous
0.0016493.198.994.280.696.394.990.196.7
Jiangsu_university_Intelligent_Drive_AIyesyesnonononononononononononoAnonymous
n/a93.198.994.280.696.394.990.196.7
MCANetyesyesyesyesnonononononononoyesyesAnonymous
n/a88.998.592.268.495.392.082.393.8
UFONet (half-resolution)yesyesnonononononononononononoUFO RPN: A Region Proposal Network for Ultra Fast Object DetectionWenkai Li, Andy SongThe 34th Australasian Joint Conference on Artificial Intelligence
n/a78.697.288.643.490.684.262.983.3
SCMNetyesyesnonononononononononononoAnonymous
n/a86.898.291.562.494.690.778.392.0
FsaNetyesyesyesyesnonononononononononoFsaNet: Frequency Self-attention for Semantic SegmentationAnonymous
n/a91.898.893.775.895.994.288.096.0
SCMNet coarseyesyesyesyesnonononononononoyesyesSCMNet: Shared Context Mining Network for Real-time Semantic SegmentationTanmay Singha; Moritz Bergemann; Duc-Son Pham; Aneesh Krishna2021 Digital Image Computing: Techniques and Applications (DICTA)
n/a87.298.391.763.594.790.979.192.2
SAIT SeeThroughNetyesyesyesyesnonononononononononoAnonymous
n/a93.298.994.381.196.395.090.296.8
JSU_IDT_groupyesyesnonononononononononononoAnonymous
n/a93.298.994.380.796.395.090.296.7
DLA_HRNet48OCR_MSFLIP_000yesyesyesyesnonononononononononoAnonymousThis set of predictions is from DLA (differentiable lattice assignment network) with "HRNet48+OCR-Head" as the base segmentation model. The model is first trained on coarse data and then trained on the fine-annotated train/val sets. A multi-scale (0.5, 0.75, 1.0, 1.25, 1.5, 1.75) and flip scheme is adopted during inference.
n/a93.098.994.280.696.294.889.896.6
MYBank-AIoTyesyesyesyesnonononononononononoAnonymous
n/a93.398.994.581.196.495.190.696.7
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a92.398.893.877.896.294.588.996.2
LeapAIyesyesyesyesnonononononononononoAnonymousUsing advanced AI techniques.
n/a93.299.094.481.296.494.990.096.7
adlab_iiau_ldzyesyesnonononononononononononoAnonymousmeticulous-caiman_2022.05.01_03.32
n/a93.198.994.480.896.395.090.096.6
SFRSegyesyesnonononononononononoyesyesA Real-Time Semantic Segmentation Model Using Iteratively Shared Features In Multiple Sub-EncodersTanmay Singha, Duc-Son Pham, Aneesh KrishnaPattern Recognition
n/a86.398.291.158.094.790.978.592.4
PIDNet-SyesyesnonononononononononoyesyesPIDNet: A Real-time Semantic Segmentation Network Inspired from PID ControllerAnonymous
0.010790.598.793.271.895.693.385.795.1
Vision Transformer Adapter for Dense PredictionsyesyesnonononononononononoyesyesVision Transformer Adapter for Dense PredictionsZhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu QiaoViT-Adapter-L, BEiT pre-train, multi-scale testing
n/a92.898.894.079.696.294.989.996.4
SSNetyesyesnonononoyesyesnonononononoAnonymous
n/a89.998.492.371.795.192.285.394.3
SDBNetyesyesnonononononononononoyesyesSDBNet: Lightweight Real-time Semantic Segmentation Using Short-term Dense BottleneckTanmay Singha, Duc-Son Pham, Aneesh Krishna2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)
n/a87.298.291.762.894.491.179.792.7
MeiTuan-BaseModelyesyesyesyesnonononononononononoAnonymous
n/a93.498.994.581.496.495.290.696.6
SDBNetV2yesyesnonononononononononoyesyesImproved Short-term Dense Bottleneck network for efficient scene analysisTanmay Singha; Duc-Son Pham; Aneesh KrishnaComputer Vision and Image Understanding
n/a87.998.391.964.594.891.780.893.1
mogo_semanticyesyesnonononononononononononoAnonymous
n/a93.298.994.281.096.394.990.296.6
UDSSEG_RVCyesyesnonononononononononononoAnonymousUDSSEG_RVC
n/a90.798.393.373.095.993.785.595.3
MIX6D_RVCyesyesnonononononononononononoAnonymousMIX6D_RVC
n/a89.397.792.668.395.092.484.794.4
FAN_NV_RVCyesyesnonononononononononononoAnonymousHybrid-Base + Segformer
n/a91.097.993.673.096.193.687.195.8
UNIV_CNP_RVCyesyesnonononononononononononoAnonymousRVC 2022
n/a86.297.791.357.494.391.779.791.4
AntGroup-AI-VisionAlgoyesyesyesyesnonononoyesyesnonononoAnonymousAntGroup AI vision algo
n/a93.298.994.381.096.395.190.196.5
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsyesyesyesyesnonononononononoyesyesInternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsWenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu QiaoCVPR 2023We use Mask2Former as the segmentation framework, and initialize our InternImage-H model with the pre-trained weights on the 427M joint dataset of public Laion-400M, YFCC-15M, and CC12M. Following common practices, we first pre-train on Mapillary Vistas for 80k iterations, and then fine-tune on Cityscapes for 80k iterations. The crop size is set to 1024×1024 in this experiment. As a result, our InternImage-H achieves 87.0 multi-scale mIoU on the validation set, and 86.1 multi-scale mIoU on the test set.
n/a93.098.994.380.496.395.090.096.4
Dense Prediction with Attentive Feature aggregationyesyesyesyesnonononononononoyesyesDense Prediction with Attentive Feature AggregationYung-Hsu Yang, Thomas E. Huang, Min Sun, Samuel Rota Bulò, Peter Kontschieder, Fisher YuWACV 2023We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.
n/a92.598.994.078.496.194.589.396.3
W3_FAFMyesyesnonononononononononononoJunyan Yang, Qian Xu, Lei LaTeam: BOSCH-XC-DX-WAVE3
0.02930991.198.793.373.995.593.787.095.6
HRNyesyesnonononononononononononoAnonymousHierarchical residual network
45.090.098.693.070.595.393.084.794.9
HRN+DCNv2_for_DOASyesyesnonononononononononononoAnonymousHRN with DCNv2 for DOAS in paper "Dynamic Obstacle Avoidance System based on Rapid Instance Segmentation Network"
0.03291.698.793.675.995.694.087.995.7
GEELY-ATC-SEGyesyesyesyesnonononononononononoAnonymous
n/a93.398.994.481.196.395.290.396.5
PMSDSENyesyesnonononononononononoyesyesEfficient Parallel Multi-Scale Detail and Semantic Encoding Network for Lightweight Semantic SegmentationXiao Liu, Xiuya Shi, Lufei Chen, Linbo Qing, Chao RenACM International Conference on Multimedia 2023MM '23: Proceedings of the 31st ACM International Conference on Multimedia
n/a88.898.492.368.794.891.881.693.7
ECFDyesyesnonononononononononoyesyesAnonymousbackbone: ConvNext-Large
n/a92.298.793.877.896.094.488.995.9
DWGSeg-L75yesyesnononononononono1.31.3nonoAnonymous
0.0075589.498.592.668.495.292.783.694.7
VLTSegyesyesnonononononononononononoVLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic SegmentationChristoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk
n/a93.198.994.280.596.395.090.596.5
CGMANet_v1yesyesnonononononononononononoContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-sceneSaquib MazharContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-scene
n/a88.597.791.568.194.391.682.993.3
SERNet-Former_v2yesyesyesyesnonononononononononoAnonymous
n/a92.198.793.577.395.994.488.795.9

iIoU on category-level
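
For orientation, the iIoU reported in this table is an instance-weighted variant of IoU, iIoU = iTP / (iTP + FP + iFN), in which each ground-truth pixel is weighted by the ratio of the class's average instance size to the size of its own ground-truth instance; the Benchmark Suite page holds the authoritative definition. The following is a minimal sketch under stated assumptions, not the official evaluation script: the function names `iou` and `iiou`, the flat per-pixel lists, and the `avg_instance_size` argument are all hypothetical.

```python
from collections import Counter


def iou(pred, gt_label, cls):
    """Plain IoU = TP / (TP + FP + FN) for one class over flat pixel lists."""
    tp = sum(1 for p, l in zip(pred, gt_label) if p == cls and l == cls)
    fp = sum(1 for p, l in zip(pred, gt_label) if p == cls and l != cls)
    fn = sum(1 for p, l in zip(pred, gt_label) if p != cls and l == cls)
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")


def iiou(pred, gt_label, gt_instance, cls, avg_instance_size):
    """Instance-weighted IoU = iTP / (iTP + FP + iFN) for one class.

    gt_instance holds a ground-truth instance id per pixel (hypothetical layout).
    avg_instance_size is an assumed average pixel count per instance of `cls`.
    """
    # Pixel count of every ground-truth instance of the class.
    sizes = Counter(i for l, i in zip(gt_label, gt_instance) if l == cls)

    itp = ifn = 0.0
    fp = 0
    for p, l, i in zip(pred, gt_label, gt_instance):
        if l == cls:
            # Small instances get a weight above 1, large ones below 1.
            weight = avg_instance_size / sizes[i]
            if p == cls:
                itp += weight
            else:
                ifn += weight
        elif p == cls:
            # False positives belong to no ground-truth instance: unweighted.
            fp += 1
    denom = itp + fp + ifn
    return itp / denom if denom else float("nan")


# Toy example: instance "a" (4 px) is fully found, instance "b" (1 px) is missed,
# and one background pixel is wrongly predicted as the class.
pred = [1, 1, 1, 1, 0, 1, 0]
gt_label = [1, 1, 1, 1, 1, 0, 0]
gt_instance = ["a", "a", "a", "a", "b", None, None]
print(iou(pred, gt_label, 1))
print(iiou(pred, gt_label, gt_instance, 1, avg_instance_size=2.5))
```

In this toy case the missed one-pixel instance counts as much as the whole four-pixel instance, so the instance-weighted score falls below the plain IoU. This is why iIoU is informative for the human and vehicle categories, where instance sizes vary widely.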

name | fine | fine | coarse | coarse | 16-bit | 16-bit | depth | depth | video | video | sub | sub | code | code | title | authors | venue | description | Runtime [s] | average | human | vehicle
FCN 8syesyesnonononononononononoyesyesFully Convolutional Networks for Semantic SegmentationJ. Long, E. Shelhamer, and T. DarrellCVPR 2015Trained by Marius Cordts on a pre-release version of the dataset
0.570.158.082.3
RRR-ResNet152-MultiScaleyesyesyesyesnonononononononononoAnonymousupdate: this submission actually used the coarse labels, which was previously not marked accordingly
n/a74.061.886.1
Dilation10yesyesnonononononononononoyesyesMulti-Scale Context Aggregation by Dilated ConvolutionsFisher Yu and Vladlen KoltunICLR 2016Dilation10 is a convolutional network that consists of a front-end prediction module and a context aggregation module. Both are described in the paper. The combined network was trained jointly. The context module consists of 10 layers, each of which has C=19 feature maps. The larger number of layers in the context module (10 for Cityscapes versus 8 for Pascal VOC) is due to the high input resolution. The Dilation10 model is a pure convolutional network: there is no CRF and no structured prediction. Dilation10 can therefore be used as the baseline input for structured prediction models. Note that the reported results were produced by training on the training set only; the network was not retrained on train+val.
4.071.158.383.9
AdelaideyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationG. Lin, C. Shen, I. Reid, and A. van den HengelarXiv preprint 2015Trained on a pre-release version of the dataset
35.067.458.276.7
DeepLab LargeFOV StrongWeakyesyesyesyesnononononono22yesyesWeakly- and Semi-Supervised Learning of a DCNN for Semantic Image SegmentationG. Papandreou, L.-C. Chen, K. Murphy, and A. L. YuilleICCV 2015Trained on a pre-release version of the dataset
4.058.741.475.9
DeepLab LargeFOV Strongyesyesnononononononono22yesyesSemantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFsL.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. YuilleICLR 2015Trained on a pre-release version of the dataset
4.058.741.376.1
DPNyesyesyesyesnononononono33nonoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015Trained on a pre-release version of the dataset
n/a57.939.976.0
Segnet basicyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
0.0661.947.076.8
Segnet extendedyesyesnononononononono44yesyesSegNet: A Deep Convolutional Encoder-Decoder Architecture for Image SegmentationV. Badrinarayanan, A. Kendall, and R. CipollaarXiv preprint 2015Trained on a pre-release version of the dataset
0.0666.451.980.9
CRFasRNNyesyesnononononononono22yesyesConditional Random Fields as Recurrent Neural NetworksS. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. TorrICCV 2015Trained on a pre-release version of the dataset
0.766.053.478.6
Scale invariant CNN + CRFyesyesnonononoyesyesnonononoyesyesConvolutional Scale Invariance for Semantic SegmentationI. Kreso, D. Causevic, J. Krapac, and S. SegvicGCPR 2016We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
n/a71.260.681.7
DPNyesyesnonononononononononononoSemantic Image Segmentation via Deep Parsing NetworkZ. Liu, X. Li, P. Luo, C. C. Loy, and X. TangICCV 2015DPN trained on full resolution images
n/a69.155.083.1
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a73.962.685.2
Adelaide_contextyesyesnonononononononononononoEfficient Piecewise Training of Deep Structured Models for Semantic SegmentationGuosheng Lin, Chunhua Shen, Anton van den Hengel, Ian ReidCVPR 2016We explore contextual information to improve semantic image segmentation. Details are described in the paper. We trained contextual networks for coarse level prediction and a refinement network for refining the coarse prediction. Our models are trained on the training set only (2975 images) without adding the validation set.
n/a74.163.185.1
NVSegNetyesyesnonononononononononononoAnonymousAt inference, we use the image at 2 different scales. The same applies to training.
0.468.153.582.7
ENetyesyesnononononononono22yesyesENet: A Deep Neural Network Architecture for Real-Time Semantic SegmentationAdam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
0.01364.049.378.7
DeepLabv2-CRFyesyesnonononononononononoyesyesDeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFsLiang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. YuillearXiv preprintDeepLabv2-CRF is based on three main methods. First, we employ convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool to repurpose ResNet-101 (trained on image classification task) in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and fully connected Conditional Random Fields (CRFs). The model is only trained on train set.
n/a67.752.582.9
m-TCFsyesyesyesyesnonononononononononoAnonymousConvolutional Neural Network
1.070.657.084.1
DeepLab+DynamicCRFyesyesnonononononononononononoru.nl
n/a62.445.879.0
LRR-4xyesyesnonononononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained on the training set (2975 images). The segmentation predictions were not post-processed using CRF. (This is a revision of a previous submission in which we didn't use the correct basis functions; the method name changed from 'LLR-4x' to 'LRR-4x')
n/a74.763.386.2
LRR-4xyesyesyesyesnonononononononoyesyesLaplacian Pyramid Reconstruction and Refinement for Semantic SegmentationGolnaz Ghiasi, Charless C. FowlkesECCV 2016We introduce a CNN architecture that reconstructs high-resolution class label predictions from low-resolution feature maps using class-specific basis functions. Our multi-resolution architecture also uses skip connections from higher resolution feature maps to successively refine segment boundaries reconstructed from lower resolution maps. The model used for this submission is based on VGG-16 and it was trained using both coarse and fine annotations. The segmentation predictions were not post-processed using CRF.
n/a73.962.785.0
Le_Selfdriving_VGGyesyesnonononononononononononoAnonymous
n/a64.350.078.5
SQyesyesnonononononononononononoSpeeding up Semantic Segmentation for Autonomous DrivingMichael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, Bernhard Nessler, Sepp HochreiterNIPS 2016 Workshop - MLITS Machine Learning for Intelligent Transportation Systems Neural Information Processing Systems 2016, Barcelona, Spain
0.0666.050.082.0
SAITyesyesyesyesnonononononononononoAnonymousAnonymous
4.075.564.386.7
FoveaNetyesyesnonononononononononononoFoveaNetXin Li, Jiashi Feng1.caffe-master
2.resnet-101
3.single scale testing

Previously listed as "LXFCRN".
n/a77.668.386.9
RefineNetyesyesnonononononononononoyesyesRefineNet: Multi-Path Refinement Networks for High-Resolution Semantic SegmentationGuosheng Lin; Anton Milan; Chunhua Shen; Ian Reid;Please refer to our technical report for details: "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation" (https://arxiv.org/abs/1611.06612). Our source code is available at: https://github.com/guosheng/refinenet
2975 images (training set with fine labels) are used for training.
n/a70.656.884.5
SegModelyesyesnonononononononononononoAnonymousBoth train set (2975) and val set (500) are used to train model for this submission.
0.875.964.287.6
TuSimpleyesyesnonononononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell
n/a75.264.086.5
Global-Local-RefinementyesyesnonononononononononononoGlobal-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing PredictionsRui Zhang, Sheng Tang, Min Lin, Jintao Li, Shuicheng YanInternational Joint Conference on Artificial Intelligence (IJCAI) 2017global-residual and local-boundary refinement

The method was previously listed as "RefineNet". To avoid confusion with a recently published, similarly named approach, the submission name was updated.
n/a76.866.786.9
XPARSEyesyesnonononononononononononoAnonymous
n/a74.263.085.4
ResNet-38yesyesnonononononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, single scale, no post-processing with CRFs
Model A2, 2 conv., fine only, single scale testing

The submission was previously listed as "Model A2, 2 conv.". The name was changed for consistency with the other submission of the same work.
n/a81.173.289.0
SegModelyesyesyesyesnonononononononononoAnonymous
n/a77.066.287.9
Deep Layer Cascade (LC)yesyesnonononononononononononoNot All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou TangCVPR 2017We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and
n/a74.162.086.2
FRRNyesyesnononononononono22yesyesFull-Resolution Residual Networks for Semantic Segmentation in Street ScenesTobias Pohlen, Alexander Hermans, Markus Mathias, Bastian LeibeArxivFull-Resolution Residual Networks (FRRN) combine multi-scale context with pixel-level accuracy by using two processing streams within one network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition.
n/a75.164.985.4
MNet_MPRGyesyesnonononononononononononoChubu University, MPRGwithout the val dataset, external datasets (e.g. ImageNet), or post-processing
0.677.968.687.1
ResNet-38yesyesyesyesnonononononononoyesyesWider or Deeper: Revisiting the ResNet Model for Visual RecognitionZifeng Wu, Chunhua Shen, Anton van den Hengelarxivsingle model, no post-processing with CRFs
Model A2, 2 conv., fine+coarse, multi scale testing
n/a79.169.688.5
FCN8s-QunjieYuyesyesnonononononononononononoAnonymous
n/a68.755.681.7
RGB-D FCNyesyesyesyesnonoyesyesnonononononoAnonymousGoogLeNet + depth branch, single model
no data augmentation, no training on validation set, no graphical model
Used coarse labels to initialize depth branch
n/a71.058.083.9
MultiBoostyesyesyesyesnonoyesyesnono22nonoAnonymousBoosting based solution.
Publication is under review.
0.2560.245.075.5
GoogLeNet FCNyesyesnonononononononononononoGoing Deeper with ConvolutionsChristian Szegedy , Wei Liu , Yangqing Jia , Pierre Sermanet , Scott Reed , Dragomir Anguelov , Dumitru Erhan , Vincent Vanhoucke , Andrew RabinovichCVPR 2015GoogLeNet
No data augmentation, no graphical model
Trained by Lukas Schneider, following "Fully Convolutional Networks for Semantic Segmentation", Long et al. CVPR 2015
n/a69.856.383.3
ERFNet (pretrained)yesyesnononononononono22yesyesERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoTransactions on Intelligent Transportation Systems (T-ITS)ERFNet pretrained on ImageNet and trained only on the fine train (2975) annotated images


0.0272.761.284.1
ERFNet (from scratch)yesyesnononononononono22yesyesEfficient ConvNet for Real-time Semantic SegmentationEduardo Romera, Jose M. Alvarez, Luis M. Bergasa and Roberto ArroyoIV2017ERFNet trained entirely on the fine train set (2975 images) without any pretraining nor coarse labels
0.0270.458.082.8
TuSimple_CoarseyesyesyesyesnonononononononoyesyesUnderstanding Convolution for Semantic SegmentationPanqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison CottrellHere we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are better for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a new state-of-art result of 80.1% mIOU in the test set. We also are state-of-the-art overall on the KITTI road estimation benchmark and the
PASCAL VOC2012 segmentation task. Pretrained models are available at https://goo.gl/DQMeun.
n/a77.868.687.1
SAC-multipleyesyesnonononononononononononoScale-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanInternational Conference on Computer Vision (ICCV) 2017
n/a78.368.488.2
NetWarpyesyesyesyesnonononoyesyesnonononoAnonymous
n/a79.870.788.9
depthAwareSeg_RNN_ffyesyesnonononononononononoyesyesAnonymoustraining with fine-annotated training images only (val set is not used); flip-augmentation only in training; single GPU for train&test; softmax loss; resnet101 as front end; multiscale test.
n/a76.967.486.5
Ladder DenseNetyesyesnonononononononononoyesyesLadder-style DenseNets for Semantic Segmentation of Large Natural ImagesIvan Krešo, Josip Krapac, Siniša ŠegvićICCV 2017https://ivankreso.github.io/publication/ladder-densenet/
0.4579.570.488.6
Real-time FCNyesyesyesyesnonononononononononoUnderstanding Cityscapes: Efficient Urban Semantic Scene UnderstandingMarius CordtsDissertationCombines the following concepts:
Network architecture: "Going deeper with convolutions". Szegedy et al., CVPR 2015
Framework and skip connections: "Fully convolutional networks for semantic segmentation". Long et al., CVPR 2015
Context modules: "Multi-scale context aggregation by dilated convolutions". Yu and Koltun, ICLR 2016
0.04471.660.582.7
GridNetyesyesnonononononononononononoAnonymousConv-Deconv Grid-Network for semantic segmentation.
Using only the training set without extra coarse annotated data (only 2975 images).
No pre-training (ImageNet).
No post-processing (like CRF).
n/a71.158.384.0
PEARLyesyesnonononononoyesyesnonononoVideo Scene Parsing with Predictive Feature LearningXiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng YanICCV 2017We proposed a novel Parsing with prEdictive feAtuRe Learning (PEARL) model to address the following two problems in video scene parsing: firstly, how to effectively learn meaningful video representations for producing the temporally consistent labeling maps; secondly, how to overcome the problem of insufficient labeled video training data, i.e. how to effectively conduct unsupervised deep learning. To our knowledge, this is the first model to employ predictive feature learning in the video scene parsing.
n/a75.164.385.9
pruned & dilated inception-resnet-v2 (PD-IR2)yesyesyesyesnonononononononoyesyesAnonymous
0.6968.355.381.2
PSPNetyesyesyesyesnonononononononoyesyesPyramid Scene Parsing NetworkHengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya JiaCVPR 2017This submission is trained on coarse+fine(train+val set, 2975+500 images).

Former submission is trained on coarse+fine(train set, 2975 images) which gets 80.2 mIoU: https://www.cityscapes-dataset.com/method-details/?submissionID=314

Previous versions of this method were listed as "SenseSeg_1026".
n/a79.270.288.2
motovisyesyesyesyesnonononononononononomotovis.com
n/a80.772.389.0
ML-CRNNyesyesnonononononononononononoMulti-level Contextual RNNs with Attention Model for Scene LabelingHeng Fan, Xue Mei, Danil Prokhorov, Haibin LingarXivA framework based on CNNs and RNNs is proposed, in which the RNNs are used to model spatial dependencies among image units. Besides, to enrich deep features, we use different features from multiple levels, and adopt a novel attention model to fuse them.
n/a72.560.984.1
Hybrid ModelyesyesnonononononononononononoAnonymous
n/a68.555.681.5
tek-IflyyesyesnonononononononononononoIflytekIflytek-yinusing a fusion strategy of three single models; the best result of a single model is 80.01%, multi-scale
n/a79.670.788.4
GridNetyesyesnonononononononononoyesyesResidual Conv-Deconv Grid Network for Semantic SegmentationDamien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau & Christian WolfBMVC 2017We used a new architecture for semantic image segmentation called GridNet, following a grid pattern allowing multiple interconnected streams to work at different resolutions (see paper).
We used only the training set (2975 images), without extra coarse annotated data, pre-training (ImageNet), or pre- or post-processing.
n/a71.458.784.2
firenetyesyesnononononononono22nonoAnonymous
n/a75.566.484.5
DeepLabv3yesyesyesyesnonononononononononoRethinking Atrous Convolution for Semantic Image SegmentationLiang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig AdamarXiv preprintIn this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects
at multiple scales, we employ a module, called Atrous Spatial Pyramid Pooling (ASPP), which adopts atrous convolution in parallel to capture multi-scale context with multiple atrous rates. Furthermore, we propose to augment the ASPP module with image-level features encoding global context to further boost performance.
Results obtained with a single model (no ensemble), trained with fine + coarse annotations. More details will be shown in the updated arXiv report.
n/a81.774.089.4
EdgeSenseSegyesyesnonononononononononononoAnonymousDeep segmentation network with hard negative mining and other tricks.
n/a78.569.787.3
ScaleNetyesyesyesyesnonononononononononoScaleNet: Scale Invariant Network for Semantic Segmentation in Urban Driving ScenesMohammad Dawud Ansari, Stephan Krarß, Oliver Wasenmüller and Didier StrickerInternational Conference on Computer Vision Theory and Applications, Funchal, Portugal, 2018The scale difference in driving scenarios is one of the essential challenges in semantic scene segmentation.
Close objects cover significantly more pixels than far objects. In this paper, we address this challenge with a
scale invariant architecture. Within this architecture, we explicitly estimate the depth and adapt the pooling
field size accordingly. Our model is compact and can be extended easily to other research domains. Finally,
the accuracy of our approach is comparable to the state-of-the-art and superior for scale problems. We evaluate
on the widely used automotive dataset Cityscapes as well as a self-recorded dataset.
n/a76.866.986.7
K-netyesyesnonononononononononononoXinLiang Zhong
n/a75.464.686.3
MSNETyesyesnonononononononononononoAnonymouspreviously also listed as "MultiPathJoin" and "MultiPath_Scale".
0.281.675.088.3
Multitask LearningyesyesnonononononononononononoMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaNumerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
n/a77.768.087.4
DeepMotionyesyesnonononononononononononoAnonymousWe propose a novel method based on convnets to extract multi-scale features in a large range particularly for solving street scene segmentation.
n/a78.168.587.8
SR-AICyesyesyesyesnonononononononononoAnonymous
n/a79.670.289.0
Roadstar.ai_CV(SFNet)yesyesnonononononononononononoRoadstar.ai-CVMaosheng Ye, Guang Zhou, Tongyi Cao, YongTao Huang, Yinzi Chensame focus net (SFNet), based on only fine labels, with focus on the loss distribution and the same focus on every layer of the feature map
0.282.676.488.7
DFNyesyesyesyesnonononononononononoLearning a Discriminative Feature Network for Semantic SegmentationChangqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, Nong SangarxivMost existing methods of semantic segmentation still suffer from two aspects of challenges: intra-class inconsistency and inter-class indistinction. To tackle these two problems, we propose a Discriminative Feature Network (DFN), which contains two sub-networks: Smooth Network and Border Network. Specifically, to handle the intra-class inconsistency problem, we specially design a Smooth Network with Channel Attention Block and global average pooling to select the more discriminative features. Furthermore, we propose a Border Network to make the bilateral features of boundary distinguishable with deep semantic boundary supervision. Based on our proposed DFN, we achieve state-of-the-art performance 86.2% mean IOU on PASCAL VOC 2012 and 80.3% mean IOU on Cityscapes dataset.
n/a79.670.688.5
RelationNet_CoarseyesyesyesyesnonononononononononoRelationNet: Learning Deep-Aligned Representation for Semantic Image SegmentationYueqing ZhuangICPR Semantic image segmentation, which assigns labels in pixel level, plays a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning. However, one central problem of these methods is that deep convolution neural network gives little consideration to the correlation among pixels. To handle this issue, in this paper, we propose a novel deep neural network named RelationNet, which utilizes CNN and RNN to aggregate context information. Besides, a spatial correlation loss is applied to supervise RelationNet to align features of spatial pixels belonging to same category. Importantly, since it is expensive to obtain pixel-wise annotations, we exploit a new training method for combining the coarsely and finely labeled data. Separate experiments show the detailed improvements of each proposal. Experimental results demonstrate the effectiveness of our proposed method to the problem of semantic image segmentation.
n/a81.473.389.4
ARSAITyesyesnonononononononononononoAnonymousanonymous
1.074.863.186.4
Mapillary Research: In-Place Activated BatchNormyesyesyesyesnonononononononoyesyesIn-Place Activated BatchNorm for Memory-Optimized Training of DNNsSamuel Rota Bulò, Lorenzo Porzi, Peter KontschiederarXivIn-Place Activated Batch Normalization (InPlace-ABN) is a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8-2%) in computation time. Test results are obtained using a single model.
n/a81.774.489.0
EFBNETyesyesnonononononononononononoAnonymous
n/a78.869.488.2
Ladder DenseNet v2yesyesnonononononononononononoJournal submissionAnonymousDenseNet-121 model used in downsampling path with ladder-style skip connections upsampling path on top of it.
1.078.769.188.4
ESPNetyesyesnononononononono22yesyesESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh HajishirziWe introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively
0.008963.147.179.0
ENet with the Lovász-Softmax lossyesyesnononononononono22yesyesThe Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networksMaxim Berman, Amal Rannen Triki, Matthew B. BlaschkoarxivThe Lovász-Softmax loss is a novel surrogate for optimizing the IoU measure in neural networks. Here we finetune the weights provided by the authors of ENet (arXiv:1606.02147) with this loss, for 10'000 iterations on training dataset. The runtimes are unchanged with respect to the ENet architecture.
0.01361.045.077.1
DRN_CRL_CoarseyesyesyesyesnonononononononoyesyesDense Relation Network: Learning Consistent and Context-Aware Representation For Semantic Image SegmentationYueqing ZhuangICIPDRN_CoarseSemantic image segmentation, which aims at assigning pixel-wise categories, is one of the challenging image understanding problems. Global context plays an important role in local pixel-wise category assignment. To make the best of global context, in this paper, we propose dense relation network (DRN) and context-restricted loss (CRL) to aggregate global and local information. DRN uses Recurrent Neural Network (RNN) with different skip lengths in spatial directions to get context-aware representations while CRL helps aggregate them to learn consistency. Compared with previous methods, our proposed method takes full advantage of hierarchical contextual representations to produce high-quality results. Extensive experiments demonstrate that our method achieves significant state-of-the-art performance on Cityscapes and Pascal Context benchmarks, with mean-IoU of 82.8% and 49.0% respectively.
n/a80.772.489.0
ShuffleSegyesyesyesyesnonononononononononoShuffleSeg: Real-time Semantic Segmentation NetworkMostafa Gamal, Mennatullah Siam, Mo'men Abdel-RazekUnder Review by ICIP 2018ShuffleSeg: An efficient realtime semantic segmentation network with skip connections and ShuffleNet units
n/a62.246.577.9
SkipNet-MobileNetyesyesyesyesnonononononononononoRTSeg: Real-time Semantic Segmentation FrameworkMennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin JagersandUnder Review by ICIP 2018An efficient realtime semantic segmentation network with skip connections based on MobileNet.

n/a63.047.678.4
ThunderNetyesyesnononononononono22nonoAnonymous
0.010469.356.082.6
PAC: Perspective-adaptive ConvolutionsyesyesnonononononononononononoPerspective-adaptive Convolutions for Scene ParsingRui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng YanIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)Many existing scene parsing methods adopt Convolutional Neural Networks with receptive fields of fixed sizes and shapes, which frequently results in inconsistent predictions of large objects and invisibility of small objects. To tackle this issue, we propose perspective-adaptive convolutions to acquire receptive fields of flexible sizes and shapes during scene parsing. Through adding a new perspective regression layer, we can dynamically infer the position-adaptive perspective coefficient vectors utilized to reshape the convolutional patches. Consequently, the receptive fields can be adjusted automatically according to the various sizes and perspective deformations of the objects in scene images. Our proposed convolutions are differentiable to learn the convolutional parameters and perspective coefficients in an end-to-end way without any extra training supervision of object sizes. Furthermore, considering that the standard convolutions lack contextual information and spatial dependencies, we propose a context adaptive bias to capture both local and global contextual information through average pooling on the local feature patches and global feature maps, followed by flexible attentive summing to the convolutional results. The attentive weights are position-adaptive and context-aware, and can be learned through adding an additional context regression layer. Experiments on Cityscapes and ADE20K datasets well demonstrate the effectiveness of the proposed methods.
n/a78.368.488.3
SU_NetnonononononononononononononoAnonymous
n/a75.063.486.6
MobileNetV2PlusyesyesnonononononononononononoHuijun LiuMobileNetV2Plus
n/a72.961.883.9
DeepLabv3+yesyesyesyesnonononononononoyesyes Encoder-Decoder with Atrous Separable Convolution for Semantic Image SegmentationLiang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig AdamarXivSpatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We will provide more details in the coming update on the arXiv report.
n/a81.974.189.8
RFMobileNetV2PlusyesyesnonononononononononononoHuijun LiuReceptive Field MobileNetV2Plus for Semantic Segmentation
n/a75.866.385.3
GoogLeNetV1_ROByesyesnonononononononononononoAnonymousGoogLeNet-v1 FCN trained on Cityscapes, KITTI, and ScanNet, as required by the Robust Vision Challenge at CVPR'18 (http://robustvision.net/)
n/a64.448.680.3
SAITv2yesyesyesyesnonononononononononoAnonymous
0.02562.145.578.7
GUNetyesyesnononononononono22nonoGuided Upsampling Network for Real-Time Semantic SegmentationDavide MazziniarxivGuided Upsampling Network for Real-Time Semantic Segmentation
0.0369.155.283.0
RMNetyesyesnonononononononononononoAnonymousA fast and light net for semantic segmentation.
0.01467.753.581.9
ContextNetyesyesnonononononononononononoContextNet: Exploring Context and Detail for Semantic Segmentation in Real-timeRudra PK Poudel, Ujwal Bonde, Stephan Liwicki, Christopher ZacharXivModern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naive adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyze our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.3 frames per second at full (1024x2048) resolution.
0.023864.348.180.5
RFLRyesyesyesyesyesyesnononono44nonoRandom Forest with Learned Representations for Semantic SegmentationByeongkeun Kang, Truong Q. NguyenIEEE Transactions on Image ProcessingRandom Forest with Learned Representations for Semantic Segmentation
0.0322.620.324.9
DPCyesyesyesyesnonononononononoyesyesSearching for Efficient Multi-Scale Architectures for Dense Image PredictionLiang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon ShlensNIPS 2018In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that achieve state-of-the-art performance. Additionally, the resulting architecture (called DPC for Dense Prediction Cell) is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.
n/a82.574.990.0
NV-ADLRyesyesyesyesnonononononononononoAnonymousNVIDIA Applied Deep Learning Research
n/a82.273.690.9
Adaptive Affinity Field on PSPNetyesyesnonononononononononoyesyesAdaptive Affinity Field for Semantic SegmentationTsung-Wei Ke*, Jyh-Jing Hwang*, Ziwei Liu, Stella X. YuECCV 2018Existing semantic segmentation methods mostly rely on per-pixel supervision, unable to capture structural regularity present in natural images. Instead of learning to enforce semantic labels on individual pixels, we propose to enforce affinity field patterns in individual pixel neighbourhoods, i.e., the semantic label patterns of whether neighbouring pixels are in the same segment should match between the prediction and the ground-truth. The affinity fields characterize geometric relationships within the image, such as "motorcycles have round wheels". We further develop a novel method for learning the optimal neighbourhood size for each semantic category, with an adversarial loss that optimizes over worst-case scenarios. Unlike the common Conditional Random Field (CRF) approaches, our adaptive affinity field (AAF) method has no extra parameters during inference, and is less sensitive to appearance changes in the image.
n/a78.569.187.8
APMoE_seg_ROByesyesnonononononononononoyesyesPixel-wise Attentional Gating for Parsimonious Pixel LabelingShu Kong, Charless FowlkesarxivThe Pixel-level Attentional Gating (PAG) unit is trained to choose for each pixel the pooling size to adopt to aggregate contextual region around it. There are multiple branches with different dilate rates for varied pooling size, thus varying receptive field. For this ROB challenge, PAG is expected to robustly aggregate information for final prediction.

This is our entry for the Robust Vision Challenge 2018 workshop (ROB). The model is based on ResNet50, trained on a mixed dataset of Cityscapes, ScanNet and KITTI.
0.966.150.981.4
BatMAN_ROByesyesyesyesnonononononononononoAnonymousbatch-normalized multistage attention network
1.065.048.281.9
HiSS_ROByesyesnononononononono22nonoAnonymous
0.0660.542.278.8
VENUS_ROByesyesnonononononononononononoAnonymousVENUS_ROB
n/a66.752.780.8
VlocNet++_ROBnonononononononononononononoAnonymous
n/a60.842.279.3
AHiSS_ROByesyesyesyesnononononono22nonoAnonymousAugmented Hierarchical Semantic Segmentation
0.0662.945.979.8
IBN-PSP-SA_ROByesyesnonononononononononononoAnonymousIBN-PSP-SA_ROB
n/a72.058.685.4
LDN2_ROByesyesnonononononononononononoAnonymousLadder DenseNet: https://ivankreso.github.io/publication/ladder-densenet/
1.077.166.188.1
MiniNetyesyesnononononononono44nonoAnonymous
0.00444.825.164.4
AdapNetv2_ROByesyesnonononononononononononoAnonymous
n/a62.444.380.5
MapillaryAI_ROByesyesnonononononononononononoAnonymous
n/a80.271.389.0
FCN101_ROByesyesnonononononononononononoAnonymous
n/a38.512.964.0
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]Bosch autodrive challenge
n/a70.060.080.0
EnsembleModel_BoschyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name was MaskRCNN_BOSH,firefly]We've ensembled three models (erfnet, deeplab-mobilenet, tusimple) and gained a 0.57 improvement in the IoU Classes value. The best single model is 73.8549
n/a72.961.784.1
EVANetyesyesnonononononononononononoAnonymous
n/a73.161.784.4
CLRCNetyesyesnonononononononononononoCLRCNet: Cascaded Low-Rank Convolutions for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method.
0.01368.053.482.5
Edgenetyesyesnononononononono22nonoAnonymousA lightweight semantic segmentation network combined with edge information and channel-wise attention mechanism.
0.0375.063.986.1
L2-SPyesyesyesyesnonononononononoyesyesExplicit Inductive Bias for Transfer Learning with Convolutional NetworksXuhong Li, Yves Grandvalet, Franck DavoineICML-2018With a simple variant of weight decay, L2-SP regularization (see the paper for details), we reproduced PSPNet based on the original ResNet-101 using "train_fine + val_fine + train_extra" set (2975 + 500 + 20000 images), with a small batch size 8. The sync batch normalization layer is implemented in Tensorflow (see the code).
n/a78.568.988.1
ALV303yesyesnonononononononononononoAnonymous
0.279.269.988.6
NCTU-ITRIyesyesnononononononono22nonoAnonymousFor the purpose of fast semantic segmentation, we design a CNN-based encoder-decoder architecture, which is called DSNet. The encoder part is constructed based on the concept of DenseNet, and a simple decoder is adopted to make the network more efficient without degrading the accuracy. We pre-train the encoder network on the ImageNet dataset. Then, only the fine-annotated Cityscapes dataset (2975 training images) is used to train the complete DSNet. The DSNet demonstrates a good trade-off between accuracy and speed. It can process 68 frames per second on 1024x512 resolution images on a single GTX 1080 Ti GPU.
0.014770.858.483.3
ADSCNetyesyesnonononononononononononoADSCNet: Asymmetric Depthwise Separable Convolution for Semantic Segmentation in Real-timeAnonymousA lightweight and real-time semantic segmentation method for mobile devices.
0.01368.755.182.3
SRC-B-MachineLearningLabyesyesyesyesnonononononononononoJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoSamsung Research Center MachineLearningLab. The result is tested with multi-scale inputs and flipping. The paper is in preparation.
n/a81.573.789.2
Tencent AI LabyesyesyesyesnonononononononononoAnonymous
n/a80.472.388.4
ERINetyesyesnononononononono22nonoAnonymousEfficient residual inception networks for real-time semantic segmentation
0.02373.461.585.4
PGCNet_Res101_fineyesyesnonononononononononononoAnonymousWe choose ResNet101 pretrained on ImageNet as our backbone, then use both the train-fine and the val-fine data to train our model with batch size 8 for 8w iterations without any bells and whistles. We will release our paper later.
n/a81.173.388.9
EDANetyesyesnononononononono22yesyesEfficient Dense Modules of Asymmetric Convolution for Real-Time Semantic SegmentationShao-Yuan Lo (NCTU), Hsueh-Ming Hang (NCTU), Sheng-Wei Chan (ITRI), Jing-Jhih Lin (ITRI)Training data: Fine annotations only (train+val. set, 2975+500 images) without any pretraining nor coarse annotations.
For training on fine annotations (train set only, 2975 images), it attains a mIoU of 66.3%.

Runtime: (resolution 512x1024) 0.0092s on a single GTX 1080Ti, 0.0123s on a single Titan X.
0.009269.956.683.3
OCNet_ResNet101_fineyesyesnonononononononononononoAnonymousContext is essential for various computer vision tasks.
The state-of-the-art scene parsing methods define the context as the prior of the scene categories (e.g., bathroom, bedroom, street).
Such scene context is not suitable for the street scene parsing tasks as most of the scenes are similar.

In this work, we propose the Object Context that captures the prior of the object's category that the pixel belongs to.
We compute the object context by aggregating all the pixels' features according to an attention map that encodes the probability that each pixel belongs to the same category as the associated pixel.
Specifically, we employ the self-attention method to compute the pixel-wise attention map.

We further propose the Pyramid Object Context and Atrous Spatial Pyramid Object Context to handle the multi-scale problem.
n/a81.173.289.1
Knowledge-AwareyesyesnonononononononononononoAnonymousKnowledge-Aware Semantic Segmentation
n/a78.068.787.4
CASIA_IVA_DANet_NoCoarseyesyesnonononononononononoyesyesDual Attention Network for Scene SegmentationJun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing LuCVPR 2019We address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each
position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps.
We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results
more details
n/a82.674.890.5
LDFNetyesyesnonononoyesyesnono22yesyesIncorporating Luminance, Depth and Color Information by Fusion-based Networks for Semantic SegmentationShang-Wei Hung, Shao-Yuan LoWe propose a preferred solution, which incorporates Luminance, Depth and color information by a Fusion-based network named LDFNet. It includes a distinctive encoder sub-network to process the depth maps and further employs the luminance images to assist the depth information in a process. LDFNet achieves very competitive results compared to the other state-of-art systems on the challenging Cityscapes dataset, while it maintains an inference speed faster than most of the existing top-performing networks. The experimental results show the effectiveness of the proposed information-fused approach and the potential of LDFNet for road scene understanding tasks.
more details
n/a74.262.685.8
CGNetyesyesnonononononononononoyesyesTianyi Wu et alwe propose a novel Context Guided Network for semantic segmentation on mobile devices. We first design a Context Guided (CG) block by considering the inherent characteristic of semantic segmentation. CG Block aggregates local feature, surrounding context feature and global context feature effectively and efficiently. Based on the CG block, we develop Context Guided Network (CGNet), which not only has a strong capacity of localization and recognition, but also has a low computational and memory footprint. Under a similar number of parameters, the proposed
CGNet significantly outperforms existing segmentation networks. Extensive experiments on Cityscapes and CamVid datasets verify the effectiveness of the proposed approach.
Specifically, without any post-processing, the proposed approach achieves 64.8% mean IoU on Cityscapes test set with less than 0.5 M parameters, and has a frame-rate of 50 fps on one NVIDIA Tesla K80 card for 2048 × 1024 high-resolution image.
more details
0.0267.553.781.3
SAITv2-lightyesyesyesyesnonononononononononoAnonymous
more details
0.02569.455.183.7
Deform_ResNet_BalancedyesyesnonononononononononononoAnonymous
more details
0.25863.250.675.8
NfS-SegyesyesyesyesnonoyesyesyesyesnonononoUncertainty-Aware Knowledge Distillation for Real-Time Scene Segmentation: 7.43 GFLOPs at Full-HD Image with 120 fpsAnonymous
more details
0.0083731270.156.184.0
Improving Semantic Segmentation via Video Propagation and Label RelaxationyesyesyesyesnonononoyesyesnonoyesyesImproving Semantic Segmentation via Video Propagation and Label RelaxationYi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan CatanzaroCVPR 2019Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples lead to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018.
more details
n/a82.073.690.5
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and we test it against different datasets for outdoor scene understanding.
more details
0.0088466.552.480.7
SwiftNetRN-18yesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
0.024377.267.287.1
Fast-SCNNyesyesyesyesnonononononononononoFast-SCNN: Fast Semantic Segmentation NetworkRudra PK Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.008163.546.180.8
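The runtime listed for an entry (in seconds per image) and a frame rate quoted in its description are reciprocals of each other. The small sketch below makes that conversion explicit; the helper name `fps_from_runtime` is hypothetical, and the input value is the runtime listed for the Fast-SCNN entry above.

```python
def fps_from_runtime(runtime_s):
    """Convert a per-image runtime in seconds to frames per second."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return 1.0 / runtime_s

# Runtime listed for the Fast-SCNN entry above.
fast_scnn_fps = fps_from_runtime(0.0081)
print(round(fast_scnn_fps, 1))  # 123.5
```

The result is consistent with the 123.5 frames per second quoted in the entry's description.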
Fast-SCNN (Half-resolution)yesyesyesyesnononononono22nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.003557.137.676.6
Fast-SCNN (Quarter-resolution)yesyesnononononononono44nonoFast-SCNN: Fast Semantic Segmentation NetworkRudra P K Poudel, Stephan Liwicki, Roberto CipollaThe encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample' module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
more details
0.0020648.228.767.7
DSNetyesyesyesyesnononononono22yesyesDSNet for Real-Time Driving Scene Semantic SegmentationWenfu WangDSNet for Real-Time Driving Scene Semantic Segmentation
more details
0.02770.758.083.4
SwiftNetRN-18 pyramidyesyesnonononononononononononoAnonymous
more details
n/a74.664.285.0
DF-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019DF1-Seg-d8
more details
0.00769.657.182.2
DF-SegyesyesnonononononononononononoAnonymousDF2-Seg2
more details
0.01873.360.786.0
DDARyesyesyesyesnonononononononononoAnonymousDiDi Labs, AR Group
more details
n/a81.574.089.1
LDN-121yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-121 trained on train+val, fine labels only. Single-scale inference.
more details
0.04878.468.688.1
TKCNyesyesnonononononononononoyesyesTree-structured Kronecker Convolutional Network for Semantic SegmentationTianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, Jintao Li
more details
n/a81.573.689.5
RPNetyesyesnononononononono22yesyesResidual Pyramid Learning for Single-Shot Semantic SegmentationXiaoyu Chen, Xiaotian Lou, Lianfa Bai, Jing HanarXivwe put forward a method for single-shot segmentation in a feature residual pyramid network (RPNet), which learns the main and residuals of segmentation by decomposing the label at different levels of residual blocks.
more details
0.00872.359.085.7
naviyesyesnonononononononononononoyuxbmulti-scale test
more details
n/a79.168.289.9
Auto-DeepLab-LyesyesyesyesnonononononononoyesyesAuto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image SegmentationChenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, Li Fei-FeiarxivIn this work, we study Neural Architecture Search for semantic image segmentation, an important computer vision task that assigns a semantic label to every pixel in an image. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network structure that controls the spatial resolution changes. This choice simplifies the search space, but becomes increasingly problematic for dense image prediction which exhibits a lot more network level architectural variations. Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space. We present a network level search space that includes many popular designs, and develop a formulation that allows efficient gradient-based architecture search (3 P100 GPU days on Cityscapes images). We demonstrate the effectiveness of the proposed method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Without any ImageNet pretraining, our architecture searched specifically for semantic image segmentation attains state-of-the-art performance. Please refer to https://arxiv.org/abs/1901.02985 for details.
more details
n/a82.074.289.8
LiteSeg-Darknet19yesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.010276.165.986.3
AdapNet++yesyesyesyesnonononononononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019In this work, we propose the AdapNet++ architecture for semantic segmentation that aims to achieve the right trade-off between performance and computational complexity of the model. AdapNet++ incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling (eASPP) module that has a larger effective receptive field with more than 10x fewer parameters compared to the standard ASPP, complemented with a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance while simultaneously being efficient in terms of both the number of parameters and inference time. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
more details
n/a80.171.588.7
SSMAyesyesyesyesnonoyesyesnonononoyesyes Self-Supervised Model Adaptation for Multimodal Semantic SegmentationAbhinav Valada, Rohit Mohan, Wolfram BurgardIJCV 2019Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real-world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging
complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams
rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed SSMA fusion mechanism which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. Extensive experimental evaluations on the challenging Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest datasets demonstrate that our architecture achieves state-of-the-art performance in addition to providing exceptional robustness in adverse perceptual conditions. Please refer to https://arxiv.org/abs/1808.03833 for details.

A live demo on various datasets can be viewed at http://deepscene.cs.uni-freiburg.de
more details
n/a81.773.689.8
LiteSeg-MobilenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.006272.062.182.0
LiteSeg-ShufflenetyesyesyesyesnonononononononoyesyesLiteSeg: A Lightweight ConvNet for Semantic SegmentationTaha Emara, Hossam E. Abd El Munim, Hazem M. AbbasDICTA 2019
more details
0.00751867.355.079.5
Fast OCNetyesyesnonononononononononononoAnonymous
more details
n/a80.772.489.0
ShuffleNet v2 + DPCyesyesyesyesnonononononononoyesyesAn efficient solution for semantic segmentation: ShuffleNet V2 with atrous separable convolutionsSercan Turkmen, Janne HeikkilaShuffleNet v2 with DPC at output_stride 16.
more details
n/a69.956.783.1
ERSNet-coarseyesyesyesyesnononononono44nonoAnonymous
more details
0.01267.853.182.6
MiniNet-v2-coarseyesyesyesyesnononononono22nonoAnonymous
more details
0.01268.354.482.2
SwiftNetRN-18 ensembleyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
n/a76.566.486.6
EFC_syncyesyesnonononononoyesyesnonononoAnonymous
more details
n/a77.867.388.4
PL-SegyesyesnonononononononononoyesyesPartial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture SearchXin Li, Yiming Zhou, Zheng Pan, Jiashi FengCVPR 2019Following "partial order pruning", we conducted architecture search experiments on the Snapdragon 845 platform and obtained PL1A/PL1A-Seg.

1. Snapdragon 845
2. NCNN Library
3. Latency evaluated at 640x384

more details
0.019267.752.982.5
MiniNet-v2-pretrainedyesyesyesyesnononononono22nonoAnonymous
more details
0.01268.453.882.9
GALD-NetyesyesyesyesyesyesyesyesnonononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information around the position. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research )
more details
n/a81.974.589.4
GALD-netyesyesyesyesnonononononononoyesyesGlobal Aggregation then Local Distribution in Fully Convolutional NetworksXiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai TongBMVC 2019We propose Global Aggregation then Local Distribution (GALD) scheme to distribute global information to each position adaptively according to the local information surrounding the position.
more details
n/a81.473.889.1
ndnetyesyesnonononononononononononoAnonymous
more details
0.02464.748.580.9
HRNetV2yesyesnonononononononononoyesyesHigh-Resolution Representations for Labeling Pixels and RegionsKe Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, Jingdong WangThe high-resolution network (HRNet) recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions.
more details
n/a82.174.889.4
SPGNetyesyesnonononononononononononoSPGNet: Semantic Prediction Guidance for Scene ParsingBowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, Honghui ShiICCV 2019Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance than their single-stage counterpart. However, few efforts have been attempted to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through the guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, in which our SPGNet attains 81.1% on the test set using only 'fine' annotations.
more details
n/a82.174.589.7
LDN-161yesyesnonononononononononononoEfficient Ladder-style DenseNets for Semantic Segmentation of Large ImagesIvan Kreso, Josip Krapac, Sinisa SegvicLadder DenseNet-161 trained on train+val, fine labels only. Inference on multi-scale inputs.
more details
2.079.169.189.2
GGCFyesyesyesyesnonononononononononoAnonymous
more details
n/a81.373.189.4
GFF-NetyesyesnonononononononononononoGFF: Gated Fully Fusion for Semantic SegmentationXiangtai Li, Houlong Zhao, Yunhai Tong, Kuiyuan YangWe proposed Gated Fully Fusion (GFF) to fuse features from multiple levels through gates in a fully connected way. Specifically, features at each level are enhanced by higher-level features with stronger semantics and lower-level features with more details, and gates are used to control the pass of useful information, which significantly reduces noise propagation during fusion. (Joint work: Key Laboratory of Machine Perception, School of EECS @Peking University and DeepMotion AI Research)
more details
n/a81.473.689.3
Gated-SCNNyesyesnonononononononononoyesyesGated-SCNN: Gated Shape CNNs for Semantic SegmentationTowaki Takikawa, David Acuna, Varun Jampani, Sanja Fidler
more details
n/a82.774.690.7
ESPNetv2yesyesnononononononono22yesyesESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural NetworkSachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh HajishirziCVPR 2019We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the Penn Treebank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network has better generalization properties than ShuffleNetv2 when tested on the MSCOCO multi-object classification task and the Cityscapes urban scene semantic segmentation task. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2
more details
n/a66.351.481.2
MRFMyesyesyesyesnonononononononononoMulti Receptive Field Network for Semantic SegmentationJianlong Yuan, Zelu Deng, Shu Wang, Zhenbo LuoWACV2020Semantic segmentation is one of the key tasks in computer vision, which is to assign a category label to each pixel in an image. Despite significant progress achieved recently, most existing methods still suffer from two challenging issues: 1) the size of objects and stuff in an image can be very diverse, demanding for incorporating multi-scale features into the fully convolutional networks (FCNs); 2) the pixels close to or at the boundaries of object/stuff are hard to classify due to the intrinsic weakness of convolutional networks. To address the first issue, we propose a new Multi-Receptive Field Module (MRFM), explicitly taking multi-scale features into account. For the second issue, we design an edge-aware loss which is effective in distinguishing the boundaries of object/stuff. With these two designs, our Multi Receptive Field Network achieves new state-of-the-art results on two widely-used semantic segmentation benchmark datasets. Specifically, we achieve a mean IoU of 83.0% on the Cityscapes dataset and 88.4% mean IoU on the Pascal VOC2012 dataset.
more details
n/a82.074.889.2
DGCNetyesyesnonononononononononononoDual Graph Convolutional Network for Semantic SegmentationLi Zhang*, Xiangtai Li*, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, Philip H.S. TorrBMVC 2019We propose Dual Graph Convolutional Network (DGCNet), which models the global context of the input feature by modelling two orthogonal graphs in a single framework. (Joint work: University of Oxford, Peking University and DeepMotion AI Research)
more details
n/a81.172.989.2
dpcan_trainval_os16_225yesyesnonononononononononononoAnonymous
more details
n/a81.173.189.1
Learnable Tree FilteryesyesnonononononononononoyesyesLearnable Tree Filter for Structure-preserving Feature TransformLin Song; Yanwei Li; Zeming Li; Gang Yu; Hongbin Sun; Jian Sun; Nanning ZhengNeurIPS 2019Learnable Tree Filter for Structure-preserving Feature Transform
more details
n/a81.172.989.3
FreeNetyesyesnonononononononononononoAnonymous
more details
n/a70.258.981.6
HRNetV2 + OCRyesyesyesyesnonononononononoyesyesHigh-Resolution Representations for Labeling Pixels and Regions; OCNet: Object Context Network for Scene ParsingHRNet Team; OCR TeamHRNetV2W48 + OCR. OCR is an extension of object context networks https://arxiv.org/pdf/1809.00916.pdf
more details
n/a81.773.989.4
Valeo DAR GermanyyesyesyesyesnonononononononononoAnonymousValeo DAR Germany, New Algo Lab

more details
n/a82.274.489.9
GLNet_fineyesyesnonononononononononononoAnonymousThe proposed network architecture combines spatial information with multi-scale context information and repairs the boundaries and details of segmented objects through channel attention modules. (Uses the train-fine and val-fine data.)
more details
n/a79.570.488.5
MCDNyesyesnonononononononononononoAnonymous
more details
n/a81.272.490.1
AAF+GLRyesyesnonononononononononononoAnonymous
more details
n/a79.370.388.3
HRNetV2 + OCR (w/ ASP)yesyesyesyesnonononononononoyesyesopenseg-group (OCR team + HRNet team)Our approach is based on a single HRNet48V2 and an OCR module combined with ASPP. We apply depth based multi-scale ensemble weights during testing (provided by DeepMotion AI Research).
more details
n/a83.576.890.1
CASIA_IVA_DRANet-101_NoCoarseyesyesnonononononononononononoAnonymous
more details
n/a84.477.891.0
Hyundai Mobis AD LabyesyesyesyesnonononononononononoHyundai Mobis AD Lab, DL-DB Group, AA (Automated Annotator) Team
more details
n/a82.474.490.4
EFRNet-13yesyesnonononononononononononoAnonymous
more details
0.014670.157.582.7
FarSee-Netyesyesnononononononono22nonoFarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolutionZhanpeng Zhang and Kaipeng ZhangIEEE International Conference on Robotics and Automation (ICRA) 2020FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution

Real-time semantic segmentation is desirable in many robotic applications with limited computation resources. One challenge of semantic segmentation is to deal with the object scale variations and leverage the context. How to perform multi-scale context aggregation within limited computation budget is important. In this paper, firstly, we introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP). It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information. On the other hand, for runtime efficiency, state-of-the-art methods will quickly decrease the spatial size of the inputs or feature maps in the early network stages. The final high-resolution result is usually obtained by non-parametric up-sampling operation (e.g. bilinear interpolation). Differently, we rethink this pipeline and treat it as a super-resolution process. We use optimized super-resolution operation in the up-sampling step and improve the accuracy, especially in sub-sampled input image scenario for real-time applications. By fusing the above two improvements, our methods provide better latency-accuracy trade-off than the other state-of-the-art methods. In particular, we achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card. The proposed module can be plugged into any feature extraction CNN and benefits from the CNN structure development.
more details
0.011969.755.883.7
C3Net [2,3,7,13]nononononononononono22yesyesC3: Concentrated-Comprehensive Convolution and its application to semantic segmentationHyojin Park, Youngjoon Yoo, Geonseok Seo, Dongyoon Han, Sangdoo Yun, Nojun Kwak
more details
n/a67.853.881.7
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
more details
n/a78.471.884.9
EKENetyesyesnonononononononononononoAnonymous
more details
0.022969.957.282.6
SPSSNyesyesnonononononononononononoAnonymousStage Pooling Semantic Segmentation Network
more details
n/a70.256.683.9
FC-HarDNet-70yesyesnonononononononononoyesyesHarDNet: A Low Memory Traffic NetworkPing Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, Youn-Long LinICCV 2019Fully Convolutional Harmonic DenseNet 70
U-shape encoder-decoder structure with HarDNet blocks
Trained with single scale loss at stride-4
validation mIoU=77.7
more details
0.01576.766.387.0
BFPyesyesnonononononononononononoBoundary-Aware Feature Propagation for Scene SegmentationHenghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang WangIEEE International Conference on Computer Vision (ICCV), 2019Boundary-Aware Feature Propagation for Scene Segmentation
more details
n/a81.473.089.9
FasterSegyesyesnonononononononononoyesyesFasterSeg: Searching for Faster Real-time Semantic SegmentationWuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang WangICLR 2020We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, that has been recently found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization, that effectively overcomes our observed phenomenons that the searched networks are prone to "collapsing" to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model's accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.
more details
0.0061373.662.284.9
VCD-NoCoarseyesyesnonononononononononononoAnonymous
more details
n/a83.276.589.8
NAVINFO_DLRyesyesyesyesnonononononononononopengfei zhangweighted aspp+ohem+hard region refine
more details
n/a83.777.490.0
LBPSSyesyesnononononononono22nonoAnonymousCVPR 2020 submission #5455
more details
0.967.554.780.3
KANet_Res101yesyesnonononononononononononoAnonymous
more details
n/a82.574.990.1
Learnable Tree Filter V2yesyesnonononononononononoyesyesRethinking Learnable Tree Filter for Generic Feature TransformLin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, Nanning ZhengNeurIPS 2020Based on ResNet-101 backbone and FPN architecture.
more details
n/a83.677.389.8
GPSNetyesyesnonononononononononononoAnonymous
more details
n/a81.673.789.5
FTFNetyesyesyesyesnonononononononononoAnonymousAn Efficient Network Focused on Tiny Feature Maps for Real-Time Semantic Segmentation
more details
0.008870.958.483.5
iFLYTEK-CVyesyesyesyesnonononononononononoAnonymousiFLYTEK Research, CV Group
more details
n/a82.474.690.2
F2MF-shortyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature Motion Josip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 3 timesteps into the future.
more details
n/a61.547.175.9
HPNetyesyesnonononononononononononoHigh-Order Paired-ASPP Networks for Semantic SegmentationYu Zhang, Xin Sun, Junyu Dong, Changrui Chen, Yue Shen
more details
n/a80.471.589.3
HANet (fine-train only)yesyesnonononononononononononoTBAAnonymousWe use only fine-training data.
more details
n/a79.569.589.5
F2MF-midyesyesnonononononoyesyesnonoyesyesWarp to the Future: Joint Forecasting of Features and Feature MotionJosip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa SegvicThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020Our method forecasts semantic segmentation 9 timesteps into the future.
more details
n/a48.032.064.0
EMANetyesyesnonononononononononononoExpectation Maximization Attention Networks for Semantic SegmentationXia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, Hong LiuICCV 2019
more details
n/a80.372.188.5
PartnerNetyesyesnonononononononononononoAnonymousPARTNERNET: A LIGHTWEIGHT AND EFFICIENT PARTNER NETWORK FOR SEMANTIC SEGMENTATION
more details
0.005873.662.285.0
SwiftNet RN18 pyr sepBN MVDyesyesnonononononononononoyesyesEfficient semantic segmentation with pyramidal fusionM Oršić, S ŠegvićPattern Recognition 2020
more details
0.02977.967.788.0
Tencent YYB VisualAlgoyesyesyesyesnonononononononononoAnonymousTencent YYB VisualAlgo Group
more details
n/a82.073.590.5
MoKu LabyesyesnonononononononononononoAnonymousAlibaba, MoKu AI Lab, CV Group
more details
n/a83.175.590.7
HRNetV2 + OCR + SegFixyesyesyesyesnonononononononoyesyesObject-Contextual Representations for Semantic SegmentationYuhui Yuan, Xilin Chen, Jingdong WangFirst, we pre-train "HRNet+OCR" method on the Mapillary training set (achieves 50.8% on the Mapillary val set). Second, we fine-tune the model with the Cityscapes training, validation and coarse set. Finally, we apply the "SegFix" scheme to further improve the results.
more details
n/a83.977.490.4
DecoupleSegNetyesyesnonononononononononoyesyesImproving Semantic Segmentation via Decoupled Body and Edge SupervisionXiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai TongECCV-2020In this paper, we propose a new paradigm for semantic segmentation. Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image. To do so, we first warp the image feature by learning a flow field to make the object part more consistent. The resulting body feature and the residual edge feature are further optimized under decoupled supervision by explicitly sampling different parts (body or edge) pixels. The code and models have been released.
more details
n/a81.472.790.1
LGE A&B Center: HANet (ResNet-101)yesyesnonononononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val", No coarse, Backbone: ImageNet pretrained ResNet-101
more details
n/a81.272.489.9
DCNASyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationXiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Wenqi RenNeural Architecture Search (NAS) has shown great potential in automatically designing scalable network architectures for dense image predictions. However, existing NAS algorithms usually compromise on a restricted search space and search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between target and proxy dataset, we propose a Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module to reduce the memory consumption of ample search space.
more details
n/a82.575.489.5
GPNet-ResNet101yesyesnonononononononononononoAnonymous
more details
n/a82.074.090.0
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a76.770.483.0
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a78.972.385.5
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.

more details
n/a76.269.782.7
LGE A&B Center: HANet (ResNext-101)yesyesyesyesnonononononononoyesyesCars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention NetworksSungha Choi (LGE, Korea Univ.), Joanne T. Kim (Korea Univ.), Jaegul Choo (KAIST)CVPR 2020Dataset: "fine train + fine val + coarse", Backbone: Mapillary pretrained ResNext-101
more details
n/a80.771.489.9
ERINet-v2yesyesnonononononononononononoEfficient Residual Inception NetworkMINJONG KIM, SUYOUNG CHIongoing
more details
0.0052631670.056.883.3
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results on all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
more details
n/a82.076.187.8
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a79.773.286.3
TUE-5LSM0-g23yesyesyesyesnonononononononononoAnonymousDeeplabv3+decoder
more details
n/a67.854.781.0
PBRNetyesyesnonononononononononononoAnonymousmodified MobileNetV2 backbone + Prediction and Boundary attention-based Refinement Module (PBRM)
more details
0.010777.367.087.6
ResNeSt200yesyesnonononononononononononoResNeSt: Split-Attention NetworksHang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander SmolaDeepLabV3+ network with ResNeSt200 backbone.
more details
n/a81.372.689.9
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
more details
n/a82.876.888.9
EaNet-V1yesyesnonononononononononononoParsing Very High Resolution Urban Scene Images by Learning Deep ConvNets with Edge-Aware LossXianwei Zheng, Linxi Huan, Gui-Song Xia, Jianya GongParsing very high resolution (VHR) urban scene images into regions with semantic meaning, e.g. buildings and cars, is a fundamental task necessary for interpreting and understanding urban scenes. However, due to the huge quantity of details contained in an image and the large variations of objects in scale and appearance, the existing semantic segmentation methods often break one object into pieces, or confuse adjacent objects and thus fail to depict these objects consistently. To address this issue, we propose a concise and effective edge-aware neural network (EaNet) for urban scene semantic segmentation. The proposed EaNet model is deployed as a standard balanced encoder-decoder framework. Specifically, we devised two plug-and-play modules that append on top of the encoder and decoder respectively, i.e., the large kernel pyramid pooling (LKPP) and the edge-aware loss (EA loss) function, to extend the model ability in learning discriminating features. The LKPP module captures rich multi-scale context with strong continuous feature relations to promote coherent labeling of multi-scale urban objects. The EA loss module learns edge information directly from semantic segmentation prediction, which avoids costly post-processing or extra edge detection. During training, EA loss imposes a strong geometric awareness to guide object structure learning at both the pixel- and image-level, and thus effectively separates confusing objects with sharp contours.
more details
n/a77.866.888.9
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
more details
n/a83.576.890.3
FSFNetyesyesnonononononononononoyesyesAccelerator-Aware Fast Spatial Feature Network for Real-Time Semantic SegmentationMinjong Kim, Byungjae Park, Suyoung ChiIEEE AccessSemantic segmentation is performed to understand an image at the pixel level; it is widely used in the field of autonomous driving. In recent years, deep neural networks have achieved good accuracy; however, there exist few models that have a good trade-off between high accuracy and low inference time. In this paper, we propose a fast spatial feature network (FSFNet), an optimized lightweight semantic segmentation model using an accelerator, offering high performance as well as faster inference speed than current methods. FSFNet employs the FSF and MRA modules. The FSF module has three different types of subset modules to extract spatial features efficiently. They are designed in consideration of the size of the spatial domain. The multi-resolution aggregation module combines features that are extracted at different resolutions to reconstruct the segmentation image accurately. Our approach is able to run at over 203 FPS at full resolution (1024 x 2048) on a single NVIDIA 1080Ti GPU, and obtains a result of 69.13% mIoU on the Cityscapes test dataset. Compared with existing models in real-time semantic segmentation, our proposed model retains remarkable accuracy while having high FPS that is over 30% faster than the state-of-the-art model. The experimental results proved that our model is an ideal approach for the Cityscapes dataset.
more details
0.004926172.660.484.7
Hierarchical Multi-Scale Attention for Semantic SegmentationyesyesyesyesnonononononononoyesyesHierarchical Multi-Scale Attention for Semantic SegmentationAndrew Tao, Karan Sapra, Bryan Catanzaro Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes which leads to greater model accuracy. We demonstrate the result of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach we achieve new state-of-the-art results in both Mapillary (61.1 IOU val) and Cityscapes (85.4 IOU test).
more details
n/a85.479.391.5
SANetyesyesnononononononono44nonoAnonymous
more details
25.080.270.889.7
SJTU_hpmyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu*, Haohua Zhao, and Liqing Zhang
more details
n/a79.169.289.0
FANetyesyesnonononononononononononoFANet: Feature Aggregation Network for Semantic SegmentationTanmay Singha, Duc-Son Pham, and Aneesh KrishnaFeature Aggregation Network for Semantic Segmentation
more details
n/a61.143.478.8
Hard Pixel Mining for Depth Privileged Semantic SegmentationyesyesyesyesnonoyesyesnonononononoHard Pixel Mining for Depth Privileged Semantic SegmentationZhangxuan Gu, Li Niu, Haohua Zhao, and Liqing ZhangSemantic segmentation has achieved remarkable progress but remains challenging due to the complex scene, object occlusion, and so on. Some research works have attempted to use extra information such as a depth map to help RGB based semantic segmentation because the depth map could provide complementary geometric cues. However, due to the inaccessibility of depth sensors, depth information is usually unavailable for the test images. In this paper, we leverage only the depth of training images as the privileged information to mine the hard pixels in semantic segmentation, in which depth information is only available for training images but not available for test images. Specifically, we propose a novel Loss Weight Module, which outputs a loss weight map by employing two depth-related measurements of hard pixels: Depth Prediction Error and Depth-aware Segmentation Error. The loss weight map is then applied to segmentation loss, with the goal of learning a more robust model by paying more attention to the hard pixels. Besides, we also explore a curriculum learning strategy based on the loss weight map. Meanwhile, to fully mine the hard pixels on different scales, we apply our loss weight module to multi-scale side outputs. Our hard pixel mining method achieves state-of-the-art results on three benchmark datasets, and even outperforms the methods which need depth input during testing.
more details
n/a82.674.990.3
MSeg1080_RVCyesyesnonononononononononoyesyesMSeg: A Composite Dataset for Multi-domain Semantic SegmentationJohn Lambert*, Zhuang Liu*, Ozan Sener, James Hays, Vladlen KoltunCVPR 2020We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. A naive merge of the constituent datasets yields poor performance due to inconsistent taxonomies and annotation practices. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images, requiring more than 1.34 years of collective annotator effort. The resulting composite dataset enables training a single semantic segmentation model that functions effectively across domains and generalizes to datasets that were not seen during training. We adopt zero-shot cross-dataset transfer as a benchmark to systematically evaluate a model’s robustness and show that MSeg training yields substantially more robust models in comparison to training on individual datasets or naive mixing of datasets without the presented contributions.
more details
0.4979.570.988.2
SA-Gate (ResNet-101,OS=16)yesyesnonononoyesyesnonononoyesyesBi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic SegmentationXiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang ZengEuropean Conference on Computer Vision (ECCV), 2020RGB+HHA input, input resolution = 800x800, output stride = 16, training 240 epochs, no coarse data is used.
more details
n/a83.076.190.0
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
more details
n/a67.960.875.0
HRNet + LKPP + EA lossyesyesnonononononononononononoAnonymous
more details
n/a78.667.689.6
SN_RN152pyrx8_RVCyesyesnonononononononononoyesyesIn Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving ImagesMarin Oršić, Ivan Krešo, Petra Bevandić, Siniša ŠegvićCVPR 2019
more details
1.075.965.286.6
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
more details
n/a69.254.384.0
AttaNet_lightyesyesnonononononononononoyesyesAttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing (AAAI21)Anonymous
more details
n/a72.158.885.4
CFPNetyesyesnonononononononononononoAnonymous
more details
n/a72.260.484.0
Seg_UJSyesyesnonononononononononononoAnonymous
more details
n/a84.477.990.9
Bilateral_attention_semanticyesyesnonononononononononononoAnonymouswe use bilateral attention mechanism for semantic segmentation
more details
0.014179.571.087.9
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
more details
n/a81.775.887.6
ESANet RGB-D (small input)yesyesnonononoyesyesnono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data with half the input resolution.
more details
0.042770.756.884.6
ESANet RGB (small input)yesyesnononononononono22yesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images with half the input resolution.
more details
0.03166.550.582.6
ESANet RGB-DyesyesnonononoyesyesnonononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossEfficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB-D data.
more details
0.161379.069.388.7
DAHUA-ARIyesyesyesyesnonononononononononoAnonymousmulti-scale and refineNet
more details
n/a85.479.291.6
ESANet RGByesyesnonononononononononoyesyesEfficient RGB-D Semantic Segmentation for Indoor Scene AnalysisDaniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld and Horst-Michael GrossESANet: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis.
ESANet-R34-NBt1D using RGB images only.
more details
0.120576.265.586.9
DCNAS+ASPP [Mapillary Vistas]yesyesyesyesnonononononononononoAnonymousExisting NAS algorithms usually compromise on a restricted search space or search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy setting, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of ample search space, hence favoring proxyless searching. Compared with contemporary works, experiments reveal that the proxyless searching scheme is capable of bridging the gap between searching and training environments.
more details
n/a85.379.291.5
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
more details
n/a83.377.789.0
DCNAS+ASPPyesyesyesyesnonononononononononoDCNAS: Densely Connected Neural Architecture Search for Semantic Image SegmentationAnonymousExisting NAS algorithms usually compromise on a restricted search space or search on a proxy task to meet the achievable computational demands. To allow as wide as possible network architectures and avoid the gap between realistic and proxy setting, we propose a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Specifically, by connecting cells with each other using learnable weights, we introduce a densely connected search space to cover an abundance of mainstream network designs. Moreover, by combining both path-level and channel-level sampling strategies, we design a fusion module and mixture layer to reduce the memory consumption of ample search space, hence favoring proxyless searching.
more details
n/a84.678.191.2
ddl_segyesyesnonononononononononononoAnonymous
more details
n/a84.477.990.9
CABiNetyesyesnonononononononononononoCABiNet: Efficient Context Aggregation Network for Low-Latency Semantic SegmentationSaumya Kumaar, Ye Lyu, Francesco Nex, Michael Ying YangWith the increasing demand of autonomous machines, pixel-wise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for any potential real-time applications. In this paper, we propose CABiNet (Context Aggregated Bi-lateral Network), a dual branch convolutional neural network (CNN), with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing multi-branch architectures for high-speed semantic segmentation, we design a cheap high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. Specifically, we achieve 76.6% and 75.9% mIOU on Cityscapes validation and test sets respectively, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. Codes and training models will be made publicly available.
0.013 75.7 62.8 88.7
Margin calibrationyesyesyesyesnonononononononononoAnonymousThe model is DeepLab v3+ backend on SEResNeXt50. We used the margin calibration with log-loss as the learning objective.
n/a 81.8 73.6 90.0
MT-SSSRyesyesnononononononono22nonoAnonymous
n/a 81.6 74.6 88.6
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a 85.1 80.1 90.1
DSANet: Dilated Spatial Attention for Real-time Semantic Segmentation in Urban Street ScenesyesyesnonononononononononononoAnonymousWe present a computationally efficient network named DSANet, which follows a two-branch strategy to tackle the problem of real-time semantic segmentation in urban scenes. We first design a Context branch, which employs Depth-wise Asymmetric ShuffleNet (DAS) as the main building block to acquire sufficient receptive fields. In addition, we propose a dual attention module consisting of dilated spatial attention and channel attention to make full use of the multi-level feature maps simultaneously, which helps predict the pixel-wise labels in each stage. Meanwhile, a Spatial Encoding Network is used to enhance semantic information by preserving the spatial details. Finally, to better combine context information and spatial information, we introduce a Simple Feature Fusion Module to combine the features from the two branches.
n/a 72.5 62.2 82.7
UJS_modelyesyesnonononononononononononoAnonymous
n/a 85.1 78.9 91.4
Mobilenetv3-small-backbone real-time segmentationyesyesnonononononononononoyesyesAnonymousThe model is a dual-path network with a mobilenetv3-small backbone. A PSP module is used as the context aggregation block. We also use a feature fusion module at x16 and x32. The features of the two branches are then concatenated and fused with a bottleneck conv.
Only the training data was used to train the model; validation data was excluded. Evaluation was done with single-scale input images.
0.02 67.5 53.5 81.4
M2FANetyesyesnonononononononononoyesyesUrban street scene analysis using lightweight multi-level multi-path feature aggregation networkTanmay Singha; Duc-Son Pham; Aneesh KrishnaMultiagent and Grid Systems Journal
n/a 67.8 53.8 81.7
AFPNetyesyesnonononononononononononoAnonymous
0.03 75.2 64.3 86.0
YOLO V5s with Segmentation Headyesyesnononononononono22yesyesAnonymousMultitask model, fine-tuned from a COCO detection pretrained model. Semantic segmentation and object detection (transferred from instance labels) are trained at the same time.
0.007 70.4 57.1 83.7
FSFFNetyesyesyesyesnonononononononoyesyesA Lightweight Multi-scale Feature Fusion Network for Real-Time Semantic SegmentationTanmay Singha, Duc-Son Pham, Aneesh Krishna, Tom GedeonInternational Conference on Neural Information Processing 2021Feature Scaling Feature Fusion Network
n/a 68.5 55.2 81.8
Qualcomm AI ResearchyesyesyesyesnonononononononoyesyesInverseForm: A Loss Function for Structured Boundary-Aware SegmentationShubhankar Borse, Ying Wang, Yizhe Zhang, Fatih PorikliCVPR 2021 oral
n/a 85.6 79.8 91.5
HIK-CCSLTyesyesyesyesnonononononononononoAnonymous
n/a 85.0 78.8 91.2
BFNetyesyesnonononononononononononoBFNetJiaqi Fan
n/a 71.3 59.1 83.4
Hai Wang+Yingfeng Cai-research groupyesyesnonononononononononononoAnonymous
0.00164 84.6 78.0 91.2
Jiangsu_university_Intelligent_Drive_AIyesyesnonononononononononononoAnonymous
n/a 84.6 78.0 91.2
MCANetyesyesyesyesnonononononononoyesyesAnonymous
n/a 72.8 60.1 85.6
UFONet (half-resolution)yesyesnonononononononononononoUFO RPN: A Region Proposal Network for Ultra Fast Object DetectionWenkai Li, Andy SongThe 34th Australasian Joint Conference on Artificial Intelligence
n/a 56.7 37.4 76.0
SCMNetyesyesnonononononononononononoAnonymous
n/a 68.0 53.6 82.3
FsaNetyesyesyesyesnonononononononononoFsaNet: Frequency Self-attention for Semantic SegmentationAnonymous
n/a 79.5 70.2 88.8
SCMNet coarseyesyesyesyesnonononononononoyesyesSCMNet: Shared Context Mining Network for Real-time Semantic SegmentationTanmay Singha; Moritz Bergemann; Duc-Son Pham; Aneesh Krishna2021 Digital Image Computing: Techniques and Applications (DICTA)
n/a 69.1 55.1 83.2
SAIT SeeThroughNetyesyesyesyesnonononononononononoAnonymous
n/a 85.7 79.8 91.6
JSU_IDT_groupyesyesnonononononononononononoAnonymous
n/a 84.8 78.4 91.1
DLA_HRNet48OCR_MSFLIP_000yesyesyesyesnonononononononononoAnonymousThis set of predictions is from DLA (differentiable lattice assignment network) with "HRNet48+OCR-Head" as the base segmentation model. The model is first trained on the coarse data and then on the fine-annotated train/val sets. A multi-scale (0.5, 0.75, 1.0, 1.25, 1.5, 1.75) and flip scheme is adopted during inference.
n/a 84.5 77.8 91.2
MYBank-AIoTyesyesyesyesnonononononononononoAnonymous
n/a 85.8 80.1 91.4
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a 82.7 77.0 88.4
LeapAIyesyesyesyesnonononononononononoAnonymousUsing advanced AI techniques.
n/a 84.2 77.1 91.3
adlab_iiau_ldzyesyesnonononononononononononoAnonymousmeticulous-caiman_2022.05.01_03.32
n/a 85.1 78.7 91.4
SFRSegyesyesnonononononononononoyesyesA Real-Time Semantic Segmentation Model Using Iteratively Shared Features In Multiple Sub-EncodersTanmay Singha, Duc-Son Pham, Aneesh KrishnaPattern Recognition
n/a 65.8 51.5 80.2
PIDNet-SyesyesnonononononononononoyesyesPIDNet: A Real-time Semantic Segmentation Network Inspired from PID ControllerAnonymous
0.0107 77.9 68.5 87.3
Vision Transformer Adapter for Dense PredictionsyesyesnonononononononononoyesyesVision Transformer Adapter for Dense PredictionsZhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu QiaoViT-Adapter-L, BEiT pre-train, multi-scale testing
n/a 83.4 76.6 90.2
SSNetyesyesnonononoyesyesnonononononoAnonymous
n/a 78.2 69.2 87.2
SDBNetyesyesnonononononononononoyesyesSDBNet: Lightweight Real-time Semantic Segmentation Using Short-term Dense BottleneckTanmay Singha, Duc-Son Pham, Aneesh Krishna2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)
n/a 69.4 55.7 83.1
MeiTuan-BaseModelyesyesyesyesnonononononononononoAnonymous
n/a 85.4 79.7 91.0
SDBNetV2yesyesnonononononononononoyesyesImproved Short-term Dense Bottleneck network for efficient scene analysisTanmay Singha; Duc-Son Pham; Aneesh KrishnaComputer Vision and Image Understanding
n/a 70.6 57.3 83.8
mogo_semanticyesyesnonononononononononononoAnonymous
n/a 85.2 79.1 91.3
UDSSEG_RVCyesyesnonononononononononononoAnonymousUDSSEG_RVC
n/a 77.1 66.2 88.0
MIX6D_RVCyesyesnonononononononononononoAnonymousMIX6D_RVC
n/a 74.6 65.0 84.2
FAN_NV_RVCyesyesnonononononononononononoAnonymousHybrid-Base + Segformer
n/a 78.3 68.1 88.5
UNIV_CNP_RVCyesyesnonononononononononononoAnonymousRVC 2022
n/a 70.2 58.8 81.5
AntGroup-AI-VisionAlgoyesyesyesyesnonononoyesyesnonononoAnonymousAntGroup AI vision algo
n/a 84.5 78.1 90.8
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsyesyesyesyesnonononononononoyesyesInternImage: Exploring Large-Scale Vision Foundation Models with Deformable ConvolutionsWenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu QiaoCVPR 2023We use Mask2Former as the segmentation framework, and initialize our InternImage-H model with the pre-trained weights on the 427M joint dataset of public Laion-400M, YFCC-15M, and CC12M. Following common practices, we first pre-train on Mapillary Vistas for 80k iterations, and then fine-tune on Cityscapes for 80k iterations. The crop size is set to 1024×1024 in this experiment. As a result, our InternImage-H achieves 87.0 multi-scale mIoU on the validation set, and 86.1 multi-scale mIoU on the test set.
n/a 85.0 79.7 90.2
Dense Prediction with Attentive Feature aggregationyesyesyesyesnonononononononoyesyesDense Prediction with Attentive Feature AggregationYung-Hsu Yang, Thomas E. Huang, Min Sun, Samuel Rota Bulò, Peter Kontschieder, Fisher YuWACV 2023We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.
n/a 82.2 74.6 89.8
W3_FAFMyesyesnonononononononononononoJunyan Yang, Qian Xu, Lei LaTeam: BOSCH-XC-DX-WAVE3
0.029309 80.3 72.1 88.6
HRNyesyesnonononononononononononoAnonymousHierarchical residual network
45.0 77.2 67.5 86.9
HRN+DCNv2_for_DOASyesyesnonononononononononononoAnonymousHRN with DCNv2 for DOAS in paper "Dynamic Obstacle Avoidance System based on Rapid Instance Segmentation Network"
0.032 81.0 73.3 88.7
GEELY-ATC-SEGyesyesyesyesnonononononononononoAnonymous
n/a 85.4 80.0 90.8
PMSDSENyesyesnonononononononononoyesyesEfficient Parallel Multi-Scale Detail and Semantic Encoding Network for Lightweight Semantic SegmentationXiao Liu, Xiuya Shi, Lufei Chen, Linbo Qing, Chao RenACM International Conference on Multimedia 2023MM '23: Proceedings of the 31st ACM International Conference on Multimedia
n/a 75.7 65.3 86.1
ECFDyesyesnonononononononononoyesyesAnonymousbackbone: ConvNext-Large
n/a 82.1 74.9 89.4
DWGSeg-L75yesyesnononononononono1.31.3nonoAnonymous
0.00755 75.0 63.8 86.2
VLTSegyesyesnonononononononononononoVLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic SegmentationChristoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk
n/a 85.3 80.0 90.7
CGMANet_v1yesyesnonononononononononononoContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-sceneSaquib MazharContext Guided Multi-scale Attention for Real-time Semantic Segmentation of Road-scene
n/a 74.9 64.9 84.8
SERNet-Former_v2yesyesyesyesnonononononononononoAnonymous
n/a 81.0 73.5 88.5

Instance-Level Semantic Labeling Task

AP on class-level

name, fine, coarse, 16-bit, depth, video, sub, code, title, authors, venue, description, Runtime [s], average, person, rider, car, truck, bus, train, motorcycle, bicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
60.0 4.6 1.3 0.6 10.5 6.1 9.7 5.9 1.7 0.5
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a 8.9 12.5 11.7 22.5 3.3 5.9 3.2 6.9 5.1
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses the fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one pixel wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
0.2 2.3 0.0 0.0 18.2 0.0 0.0 0.0 0.0 0.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation"
n/a 17.4 14.6 12.9 35.7 16.0 23.2 19.0 10.3 7.8
RecAttendyesyesnononononononono44nonoAnonymous
n/a 9.5 9.2 3.1 27.5 8.0 12.1 7.9 4.8 3.3
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
n/a 9.8 6.5 9.3 23.1 6.7 10.9 10.3 6.8 4.6
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
n/a 13.0 10.0 8.0 23.7 14.0 19.5 15.2 9.3 4.7
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22yesyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
n/a 17.5 13.5 16.2 24.4 16.8 23.9 19.2 15.2 10.7
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
n/a 25.0 21.8 20.1 39.4 24.8 33.2 30.8 17.7 12.4
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
n/a 32.0 34.8 27.0 49.1 30.1 40.9 30.9 24.1 18.7
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
n/a 26.2 30.5 23.7 46.9 22.8 32.2 18.6 19.1 16.0
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
n/a 19.4 15.5 14.1 31.5 22.5 27.0 22.9 13.9 8.0
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
n/a 12.5 13.4 11.4 24.5 9.4 14.5 12.2 8.0 6.7
SegNetyesyesyesyesnonononononononononoAnonymous
0.5 29.5 29.9 23.4 43.4 29.8 41.0 33.3 18.7 16.7
DCMEyesyesnononononononono33nonoDistance to Center of Mass Encoding for Instance SegmentationThomio Watanabe and Denis Wolf2018 21st International Conference on Intelligent Transportation Systems (ITSC)
n/a 3.8 1.8 0.7 15.5 2.0 4.3 4.6 0.9 0.3
RRLyesyesnonononoyesyesnonononononoAnonymous
n/a 29.7 33.8 26.9 51.9 24.2 35.6 25.3 20.9 18.7
PANet [fine-only]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only, training hyper-parameters are adopted from Mask R-CNN.
n/a 31.8 36.8 30.4 54.8 27.0 36.3 25.5 22.6 20.8
PANet [COCO]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only + COCO, training hyper-parameters are adopted from Mask R-CNN.
n/a 36.4 41.5 33.6 58.2 31.8 45.3 28.7 28.2 24.1
LCISyesyesnonononononononononononoAnonymous
n/a 15.1 15.1 14.8 23.7 12.9 16.8 15.4 12.4 9.3
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H. S. TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label (this has recently been termed "Panoptic Segmentation"). Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).

n/a 23.4 21.0 18.4 31.7 22.8 31.1 31.0 19.6 11.7
PolygonRNN++yesyesnonononononononononoyesyesEfficient Annotation of Segmentation Datasets with Polygon-RNN++D. Acuna, H. Ling, A. Kar, and S. FidlerCVPR 2018
n/a 25.5 29.4 21.8 48.3 21.1 32.3 23.7 13.6 13.6
GMIS: Graph Merge for Instance SegmentationyesyesyesyesnonononononononononoYiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, Yan Lu
n/a 27.6 29.3 24.1 42.7 25.4 37.2 32.9 17.6 11.9
TCnetyesyesnonononononononononononoAnonymousTCnet
n/a 32.6 37.3 29.9 51.5 28.7 41.1 28.7 24.9 19.1
MaskRCNN_ROByesyesnonononononononononononoAnonymousMaskRCNN Instance segmentation baseline for ROB challenge using default parameters from Matterport's implementation of Mask RCNN
https://github.com/matterport/Mask_RCNN
n/a 10.2 19.1 10.5 34.1 2.7 7.2 0.0 8.0 0.0
Multitask LearningyesyesnonononononononononoyesyesMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaCVPR 2018Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
n/a 21.6 19.2 21.4 36.6 18.8 26.8 15.9 19.4 14.5
Deep ColoringyesyesnonononononononononononoAnonymousAnonymous ECCV submission #2955
n/a 24.9 22.3 21.3 40.9 23.1 33.6 28.3 17.8 12.3
MRCNN_VSCMLab_ROByesyesnonononononononononononoAnonymousMaskRCNN+FPN with pre-trained COCO model.
ms-training with short edge [800, 1024]
inference with short edge size 800
Randomly subsample ScanNet to a size close to Cityscapes

optimizer: Adam
learning rate: linear warm-up from 1e-4 to 1e-3, then decreased by a factor of 0.1 at epochs 200 and 300.

epochs: 400
steps per epoch: 500
roi_per_im: 512
1.0 14.8 15.7 11.5 36.7 13.6 18.7 14.3 8.3 0.0
BAMRCNN_ROByesyesnonononononononononononoAnonymous
n/a 0.3 0.0 0.0 1.5 0.2 0.3 0.0 0.0 0.0
NL_ROI_ROByesyesnonononononononononononoAnonymousNon-local ROI on Mask R-CNN
n/a 24.0 28.3 20.6 45.5 22.3 30.6 17.3 15.1 12.1
RUSH_ROByesyesnonononononononononononoAnonymous
n/a 32.1 35.6 30.9 50.3 30.8 38.0 25.2 25.7 20.4
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]MaskRCNN segmentation baseline for Bosh autodrive challenge,
using Matterport's implementation of Mask RCNN https://github.com/matterport/Mask_RCNN
55k iterations, default parameters (backbone: ResNet-101)
19 hours for training
n/a 12.8 14.3 9.4 25.2 10.9 15.2 13.6 7.6 6.3
NV-ADLRyesyesnonononononononononononoAnonymousNVIDIA Applied Deep Learning Research
n/a 35.3 39.5 29.5 56.3 34.2 44.7 30.3 27.1 21.1
Sogou_MMyesyesnonononononononononononoGlobal Concatenating Feature Enhancement for Instance SegmentationHang Yang, Xiaozhe Xin, Wenwen Yang, Bin LiGlobal Concatenating Feature Enhancement for Instance Segmentation
n/a 37.2 39.1 32.0 54.6 37.9 47.7 36.8 27.6 21.5
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthyesyesnonononononononononoyesyesInstance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthDavy Neven, Bert De Brabandere, Marc Proesmans and Luc Van GoolCVPR 2019Fine only - ERFNet backbone
0.1 27.7 34.5 26.1 52.4 21.7 31.2 16.4 20.1 18.9
Instance Annotationyesyesnononononononono22nonoInstance Segmentation as Image Segmentation AnnotationThomio Watanabe and Denis F. Wolf2019 IEEE Intelligent Vehicles Symposium (IV)Based on DCME
4.416 7.7 6.7 3.1 24.1 6.0 9.8 6.4 3.6 2.1
NJUSTyesyesnonononononononononononoAng Li, Chongyang ZhangMask R-CNN based on FPN enhancement and Mask Rescore, etc. Only a single SE-ResNext-152 model with COCO pre-training is used.
n/a 38.9 44.0 35.2 57.9 36.2 48.7 35.1 30.5 23.9
BshapeNet+ [fine-only]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+, ResNet-50-FPN as base model, Cityscapes [fine-only]
n/a 27.3 29.7 23.4 46.7 26.1 33.3 24.8 20.3 14.1
SSAPyesyesnonononononononononononoSSAP: Single-Shot Instance Segmentation With Affinity PyramidNaiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, Kaiqi HuangICCV 2019SSAP, ResNet-101, Cityscapes fine-only
n/a 32.7 35.4 25.5 55.9 33.2 43.9 31.9 19.5 16.2
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and we test it against different datasets for outdoor scene understanding.
0.00884 9.2 8.8 3.2 24.0 10.0 13.2 8.5 4.4 1.5
UPSNetyesyesnonononononononononoyesyesUPSNet: A Unified Panoptic Segmentation NetworkYuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, Raquel UrtasunCVPR 2019
0.227 33.0 35.9 27.4 51.9 31.8 43.1 31.4 23.8 19.1
Sem2InsyesyesnonononononononononononoAnonymousAnonymous NeurIPS19 submission #4671
n/a 19.3 17.7 17.4 27.2 21.1 26.2 20.5 14.1 10.1
BshapeNet+ [COCO]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+ single model, ResNet-50-FPN as base model, Cityscapes [fine-only + COCO]
n/a 32.9 36.6 24.8 50.4 33.7 41.0 33.7 25.4 17.8
AdaptISyesyesnonononononononononononoAnonymousAdaptive Instance Selection network architecture for class-agnostic instance segmentation. Given an input image and a point (x, y), it generates a mask for the object located at (x, y). The network adapts to the input point with the help of AdaIN layers, thus producing different masks for different objects in the same image. AdaptIS generates pixel-accurate object masks, so it accurately segments objects of complex shape or severely occluded ones.
n/a 32.5 31.4 29.1 49.8 31.6 41.7 39.4 24.7 12.1
AInnoSegmentationyesyesnonononononononononononoFaen Zhang, Jiahong Wu, Haotian Cao, Zhizheng Yang, Jianfei Song, Ze Huang, Jiashui Huang, Shenglan BenAInnoSegmentation uses SE-ResNet-152 as the backbone with an FPN to extract multi-level features, a self-developed method to combine the multi-level features, and the COCO dataset for pre-training, among other techniques.
n/a 39.5 42.3 32.6 57.6 40.0 51.3 39.8 30.6 22.1
iFLYTEK-CVyesyesnonononononononononononoAnonymousiFLYTEK Research, CV Group
n/a 42.3 44.1 36.9 59.2 45.4 52.8 43.2 31.7 24.6
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
This entry fixes a minor inference bug (i.e., same trained model) for instance segmentation, compared to the previous submission.
n/a 34.6 34.3 28.9 55.1 32.8 41.5 36.6 26.3 21.6
snakeyesyesnonononononononononoyesyesDeep Snake for Real-Time Instance SegmentationSida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, Xiaowei ZhouCVPR 2020
0.217 31.7 37.2 27.0 56.0 29.5 40.5 28.2 19.0 16.4
PolyTransformyesyesnonononononononononononoPolyTransform: Deep Polygon Transformer for Instance SegmentationJustin Liang, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, Raquel Urtasun
n/a 40.1 42.4 34.8 58.5 39.8 50.0 41.3 30.9 23.4
StixelPointNetyesyesnonononononononononononoLearning Stixel-Based Instance SegmentationMonty Santarossa, Lukas Schneider, Claudius Zelenka, Lars Schmarje, Reinhard Koch, Uwe Franke IV 2021An adapted version of the PointNet is trained on Stixels as input for instance segmentation.
0.035 8.5 9.0 7.3 15.8 12.8 16.3 0.0 3.5 3.5
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a 34.0 32.3 28.2 52.6 32.6 41.8 38.1 25.9 20.5
PolyTransform + SegFixyesyesnonononononononononoyesyesAnonymousopensegWe simply apply a novel post-processing scheme based on the PolyTransform (thanks to the authors of PolyTransform for providing their segmentation results). The performance of the baseline PolyTransform is 40.1% and our method achieves 41.2%. Besides, our method also could improve the results of PointRend and PANet by more than 1.0% without any re-training or fine-tuning the segmentation models.
n/a 41.2 44.3 35.9 60.5 40.5 51.2 41.6 31.7 24.1
GAIS-NetyesyesnonononoyesyesnonononoyesyesGeometry-Aware Instance Segmentation with Disparity MapsCho-Ying Wu, Xiaoyan Hu, Michael Happold, Qiangeng Xu, Ulrich NeumannScalability in Autonomous Driving, workshop at CVPR 2020Geometry-Aware Instance Segmentation with Disparity Maps
n/a 32.3 36.0 29.0 52.8 29.7 39.8 28.9 23.3 18.5
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 38.1 | 34.7 | 30.4 | 55.1 | 40.9 | 49.7 | 43.5 | 29.0 | 21.7
LevelSet R-CNN [fine-only]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
n/a | 33.3 | 37.0 | 29.3 | 54.6 | 30.4 | 39.4 | 30.2 | 25.5 | 20.3
LevelSet R-CNN [COCO]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
n/a | 40.0 | 43.4 | 33.9 | 59.0 | 37.6 | 49.4 | 39.4 | 32.5 | 24.9
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.

n/a | 33.3 | 32.0 | 27.8 | 52.4 | 30.7 | 44.2 | 34.0 | 25.7 | 19.2
Deep Affinity Net [fine-only]yesyesnonononononononononononoDeep Affinity Net: Instance Segmentation via AffinityXingqian Xu, Mangtik Chiu, Thomas Huang, Honghui ShiA proposal-free method that uses FPN generated features and network predicted 4-neighbor affinities to reconstruct instance segments. During inference time, an efficient graph partitioning algorithm, Cascade-GAEC, is introduced to overcome the long execution time in the high-resolution graph partitioning problem.
n/a | 27.5 | 24.5 | 22.2 | 43.7 | 29.5 | 38.3 | 31.9 | 18.0 | 12.1
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
n/a | 42.6 | 40.5 | 35.3 | 60.0 | 44.7 | 53.4 | 44.1 | 35.8 | 26.7
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 39.6 | 36.6 | 32.5 | 56.6 | 41.0 | 52.4 | 43.7 | 30.8 | 23.5
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
n/a | 40.6 | 37.9 | 32.2 | 58.3 | 44.2 | 53.5 | 39.5 | 34.4 | 25.2
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a | 39.8 | 43.1 | 34.8 | 59.0 | 38.1 | 49.6 | 38.9 | 29.0 | 25.7
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
n/a | 22.1 | 27.1 | 18.0 | 37.5 | 26.4 | 30.4 | 9.8 | 15.8 | 11.7
UniDet_RVCyesyesnononononononono22nonoAnonymous
300.0 | 29.8 | 31.4 | 22.4 | 45.9 | 30.5 | 41.5 | 31.8 | 20.7 | 14.4
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
n/a | 21.3 | 25.6 | 19.5 | 44.2 | 23.8 | 29.4 | 2.0 | 13.5 | 12.1
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 38.0 | 36.8 | 33.2 | 57.2 | 38.8 | 45.0 | 38.9 | 30.2 | 23.8
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 42.2 | 37.7 | 34.6 | 58.2 | 45.1 | 54.8 | 47.2 | 34.0 | 26.0
PolyTransform + SegFix + BPRyesyesnonononononononononoyesyesLook Closer to Segment Better: Boundary Patch Refinement for Instance SegmentationChufeng Tang*, Hang Chen*, Xiao Li, Jianmin Li, Zhaoxiang Zhang, Xiaolin HuCVPR 2021Tremendous efforts have been made on instance segmentation but the mask quality is still not satisfactory. The boundaries of predicted instance masks are usually imprecise due to the low spatial resolution of feature maps and the imbalance problem caused by the extremely low proportion of boundary pixels. To address these issues, we propose a conceptually simple yet effective post-processing refinement framework to improve the boundary quality based on the results of any instance segmentation model, termed BPR. Following the idea of looking closer to segment boundaries better, we extract and refine a series of small boundary patches along the predicted instance boundaries. The refinement is accomplished by a boundary patch refinement network at higher resolution. The proposed BPR framework yields significant improvements over the Mask R-CNN baseline on Cityscapes benchmark, especially on the boundary-aware metrics. Moreover, by applying the BPR framework to the PolyTransform + SegFix baseline, we reached 1st place on the Cityscapes leaderboard.
n/a | 42.7 | 46.0 | 37.1 | 62.8 | 41.3 | 52.7 | 43.7 | 32.6 | 25.1
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a | 43.4 | 39.3 | 34.9 | 59.6 | 47.9 | 57.4 | 45.9 | 35.8 | 26.8
CenterPolyyesyesnononononononono44nonoAnonymous
0.045 | 15.5 | 17.5 | 11.5 | 33.9 | 13.1 | 16.6 | 15.3 | 9.0 | 7.3
HRI-INSTyesyesyesyesnonononononononononoAnonymousHRI-INST
n/a | 43.8 | 42.2 | 37.0 | 59.2 | 46.9 | 57.3 | 47.3 | 33.9 | 26.4
DH-ARIyesyesnonononononononononononoAnonymousDH-ARI
n/a | 44.4 | 47.8 | 39.4 | 64.9 | 44.1 | 54.2 | 40.1 | 36.5 | 28.0
HRI-TRANSyesyesnonononononononononononoAnonymousHRI transformer instance segmentation
n/a | 44.5 | 44.6 | 37.0 | 62.4 | 46.9 | 57.3 | 47.3 | 34.0 | 26.4
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a | 39.7 | 36.2 | 33.8 | 55.2 | 40.4 | 53.1 | 47.4 | 29.9 | 21.9
QueryInst-Parallel CompletionyesyesnonononononononononononoHai Wang ;Shilin Zhu ;PuPu ;Meng; Le; Apple; RongWe propose QueryInst Parallel Completion, a novel feature-completion network framework. First, a global context module is introduced into the backbone network to obtain instance information. Then, a parallel semantic branch and a parallel global branch are proposed to extract the semantic and global information of the feature layer, so as to complete the ROI features. In addition, we propose a feature transfer structure, which explicitly increases the connection between the detection and segmentation branches, changes the gradient back-propagation path, and indirectly complements the ROI features.
n/a | 35.4 | 41.4 | 31.5 | 58.4 | 29.2 | 44.0 | 31.6 | 25.0 | 21.9
CenterPoly v2yesyesnonononononononononoyesyesReal-time instance segmentation with polygons using an Intersection-over-Union lossKatia Jodogne-del Litto, Guillaume-Alexandre Bilodeau
0.045 | 16.6 | 18.0 | 10.7 | 32.8 | 17.7 | 23.9 | 13.7 | 10.0 | 6.3
Jiangsu-University-Environmental-PerceptionyesyesnonononononononononononoAnonymous
n/a | 40.3 | 45.3 | 35.8 | 59.6 | 35.7 | 50.5 | 36.2 | 33.8 | 25.6

AP 50 % on class-level

name | fine | coarse | 16-bit | depth | video | sub | code | title | authors | venue | description | Runtime [s] | average | person | rider | car | truck | bus | train | motorcycle | bicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
60.0 | 12.9 | 5.6 | 3.9 | 26.0 | 13.8 | 26.3 | 15.8 | 8.6 | 3.1
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a | 21.1 | 31.8 | 33.8 | 37.8 | 7.6 | 12.0 | 8.5 | 20.5 | 17.2
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses the fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one pixel wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
0.2 | 3.7 | 0.0 | 0.0 | 29.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation"
n/a | 36.7 | 34.0 | 40.4 | 54.7 | 27.2 | 40.1 | 38.9 | 32.2 | 26.0
RecAttendyesyesnononononononono44nonoAnonymous
n/a | 18.9 | 21.2 | 12.7 | 41.9 | 13.9 | 20.7 | 15.5 | 14.7 | 10.5
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
n/a | 23.2 | 18.4 | 29.5 | 38.3 | 16.1 | 21.5 | 24.5 | 21.4 | 16.0
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
n/a | 27.9 | 28.0 | 26.8 | 44.8 | 22.2 | 30.4 | 30.1 | 25.1 | 15.7
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22yesyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
n/a | 35.9 | 32.0 | 40.7 | 43.2 | 28.5 | 39.1 | 35.7 | 37.9 | 29.8
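The pull/push idea in the description of the entry above can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the function names, the margin values `pull_margin` and `push_margin`, and the exact hinge form are assumptions made for this example.

```python
import math
from itertools import combinations


def mean_point(points):
    """Return the centroid of a list of equal-length embedding vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def discriminative_loss(instances, pull_margin=0.5, push_margin=1.5):
    """Toy pull/push loss over per-instance pixel embeddings.

    instances: list of instances, each a list of embedding vectors.
    The margins are hypothetical values chosen only for illustration.
    """
    centers = [mean_point(pts) for pts in instances]

    # Pull term: pixels farther than pull_margin from their own center are penalized.
    pull = 0.0
    for pts, center in zip(instances, centers):
        pull += sum(
            max(0.0, math.dist(p, center) - pull_margin) ** 2 for p in pts
        ) / len(pts)
    pull /= len(instances)

    # Push term: centers closer than 2 * push_margin to each other are penalized.
    pairs = list(combinations(centers, 2))
    push = 0.0
    if pairs:
        push = sum(
            max(0.0, 2 * push_margin - math.dist(a, b)) ** 2 for a, b in pairs
        ) / len(pairs)

    return pull + push
```

With tight, well-separated clusters both terms vanish; when two instances overlap in embedding space the push term becomes positive, which is the behavior the description refers to as separating instances by a wide margin.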
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
n/a | 44.9 | 45.2 | 47.7 | 59.7 | 36.3 | 45.4 | 53.7 | 39.5 | 31.8
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
n/a | 58.1 | 67.1 | 65.4 | 71.8 | 42.3 | 61.0 | 53.9 | 54.3 | 49.0
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
n/a | 49.9 | 60.7 | 59.5 | 68.3 | 33.1 | 48.2 | 38.9 | 46.5 | 43.9
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
n/a | 35.3 | 34.0 | 36.9 | 48.5 | 31.3 | 40.1 | 36.2 | 32.9 | 22.9
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
n/a | 25.2 | 31.5 | 29.7 | 40.0 | 16.0 | 23.8 | 21.7 | 19.2 | 19.9
SegNetyesyesyesyesnonononononononononoAnonymous
0.5 | 55.6 | 57.9 | 56.5 | 67.5 | 46.2 | 65.1 | 63.0 | 43.9 | 44.8
DCMEyesyesnononononononono33nonoDistance to Center of Mass Encoding for Instance SegmentationThomio Watanabe and Denis Wolf2018 21st International Conference on Intelligent Transportation Systems (ITSC)
n/a | 7.7 | 5.9 | 3.3 | 25.6 | 4.0 | 8.3 | 10.0 | 3.4 | 1.4
RRLyesyesnonononoyesyesnonononononoAnonymous
n/a | 56.1 | 67.1 | 63.7 | 78.4 | 35.8 | 54.3 | 49.6 | 50.9 | 48.8
PANet [fine-only]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only, training hyper-parameters are adopted from Mask R-CNN.
n/a | 57.1 | 68.2 | 66.3 | 78.5 | 38.7 | 55.2 | 48.5 | 51.9 | 49.9
PANet [COCO]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only + COCO, training hyper-parameters are adopted from Mask R-CNN.
n/a | 63.1 | 74.4 | 71.5 | 83.3 | 43.0 | 65.2 | 50.9 | 59.6 | 57.3
LCISyesyesnonononononononononononoAnonymous
n/a | 30.8 | 33.3 | 37.4 | 42.5 | 21.9 | 27.6 | 27.9 | 32.0 | 23.9
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H. S. TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label (this has recently been termed "Panoptic Segmentation"). Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).

n/a | 45.2 | 46.8 | 48.2 | 55.8 | 33.4 | 45.5 | 53.7 | 44.9 | 33.0
PolygonRNN++yesyesnonononononononononoyesyesEfficient Annotation of Segmentation Datasets with Polygon-RNN++D. Acuna, H. Ling, A. Kar, and S. FidlerCVPR 2018
n/a | 45.5 | 55.0 | 49.3 | 70.0 | 29.7 | 47.5 | 41.4 | 34.1 | 36.8
GMIS: Graph Merge for Instance SegmentationyesyesyesyesnonononononononononoYiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, Yan Lu
n/a | 44.6 | 50.6 | 47.4 | 56.9 | 33.9 | 47.0 | 52.8 | 39.2 | 29.2
TCnetyesyesnonononononononononononoAnonymousTCnet
n/a | 59.0 | 70.0 | 69.2 | 76.7 | 42.0 | 61.1 | 48.9 | 55.2 | 49.1
MaskRCNN_ROByesyesnonononononononononononoAnonymousMaskRCNN Instance segmentation baseline for ROB challenge using default parameters from Matterport's implementation of Mask RCNN
https://github.com/matterport/Mask_RCNN
n/a | 25.2 | 50.0 | 41.8 | 60.2 | 7.3 | 15.2 | 0.0 | 27.4 | 0.0
Multitask LearningyesyesnonononononononononoyesyesMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaCVPR 2018Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
n/a | 39.0 | 38.1 | 46.3 | 54.8 | 28.4 | 40.8 | 25.0 | 42.2 | 36.5
Deep ColoringyesyesnonononononononononononoAnonymousAnonymous ECCV submission #2955
n/a | 46.2 | 47.7 | 49.2 | 63.5 | 34.5 | 46.3 | 52.9 | 42.0 | 33.1
MRCNN_VSCMLab_ROByesyesnonononononononononononoAnonymousMaskRCNN+FPN with pre-trained COCO model.
ms-training with short edge [800, 1024]
inference with short edge size 800
Randomly subsample ScanNet to a size close to that of Cityscapes

optimizer: Adam
learning rate: linear warm-up from 1e-4 to 1e-3, then decreased by a factor of 0.1 at epochs 200 and 300.

epoch: 400
step per epoch: 500
roi_per_im: 512
1.0 | 29.5 | 37.7 | 38.1 | 57.7 | 21.1 | 29.4 | 26.6 | 25.1 | 0.0
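The learning-rate settings listed for MRCNN_VSCMLab_ROB above could be sketched roughly as below. The entry does not state how long the warm-up lasts, so `warmup_epochs=5` is a hypothetical value chosen only for illustration; the function name and parameter names are likewise assumptions, not part of the submission.

```python
def learning_rate(epoch, warmup_epochs=5, start_lr=1e-4, peak_lr=1e-3,
                  milestones=(200, 300), decay=0.1):
    """Per-epoch learning rate: linear warm-up, then step decay.

    warmup_epochs is a hypothetical value; the entry only says that the
    rate goes from 1e-4 to 1e-3 with a linear warm-up schedule.
    """
    if epoch < warmup_epochs:
        # Linear interpolation from start_lr to peak_lr during warm-up.
        return start_lr + (peak_lr - start_lr) * epoch / warmup_epochs
    # Multiply by `decay` once for every milestone already reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return peak_lr * decay ** passed


# Print the rate at a few epochs of the 400-epoch run described above.
for e in (0, 5, 199, 200, 300, 399):
    print(e, learning_rate(e))
```

In a real training setup the same shape would normally come from the framework's own scheduler rather than a hand-written function; the sketch only makes the listed numbers concrete.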
BAMRCNN_ROByesyesnonononononononononononoAnonymous
n/a | 0.9 | 0.0 | 0.0 | 5.3 | 0.7 | 0.9 | 0.0 | 0.2 | 0.0
NL_ROI_ROByesyesnonononononononononononoAnonymousNon-local ROI on Mask R-CNN
n/a | 45.8 | 58.1 | 54.7 | 70.0 | 32.2 | 46.3 | 33.1 | 36.2 | 35.6
RUSH_ROByesyesnonononononononononononoAnonymous
n/a | 55.5 | 62.9 | 64.9 | 71.9 | 42.4 | 53.8 | 45.9 | 53.9 | 48.2
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]MaskRCNN segmentation baseline for Bosh autodrive challenge ,
using Matterport's implementation of Mask RCNN https://github.com/matterport/Mask_RCNN
55k iterations, default parameters (backbone: ResNet-101)
19 hours for training
n/a | 28.0 | 34.2 | 31.8 | 46.9 | 16.8 | 25.1 | 26.1 | 22.6 | 20.9
NV-ADLRyesyesnonononononononononononoAnonymousNVIDIA Applied Deep Learning Research
n/a | 61.5 | 73.3 | 67.7 | 81.2 | 46.4 | 61.9 | 52.6 | 56.8 | 52.2
Sogou_MMyesyesnonononononononononononoGlobal Concatenating Feature Enhancement for Instance SegmentationHang Yang, Xiaozhe Xin, Wenwen Yang, Bin LiGlobal Concatenating Feature Enhancement for Instance Segmentation
n/a | 64.5 | 73.4 | 70.8 | 80.3 | 51.7 | 66.4 | 60.7 | 58.6 | 54.0
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthyesyesnonononononononononoyesyesInstance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthDavy Neven, Bert De Brabandere, Marc Proesmans and Luc Van GoolCVPR 2019Fine only - ERFNet backbone
0.1 | 50.9 | 65.1 | 58.8 | 75.3 | 33.1 | 45.2 | 32.4 | 48.4 | 48.8
Instance Annotationyesyesnononononononono22nonoInstance Segmentation as Image Segmentation AnnotationThomio Watanabe and Denis F. Wolf2019 IEEE Intelligent Vehicles Symposium (IV)Based on DCME
4.416 | 14.9 | 17.1 | 8.8 | 38.1 | 10.7 | 15.1 | 12.7 | 10.7 | 6.5
NJUSTyesyesnonononononononononononoAng Li, Chongyang ZhangMask R-CNN based on FPN enhancement, Mask Rescore, etc. Only a single SE-ResNeXt-152 model with COCO pre-training is used.
n/a | 64.1 | 76.0 | 70.5 | 81.1 | 48.1 | 65.5 | 57.8 | 58.4 | 55.7
BshapeNet+ [fine-only]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+, ResNet-50-FPN as base model, Cityscapes [fine-only]
n/a | 50.4 | 57.8 | 57.2 | 68.7 | 37.2 | 48.0 | 48.6 | 46.5 | 39.3
SSAPyesyesnonononononononononononoSSAP: Single-Shot Instance Segmentation With Affinity PyramidNaiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, Kaiqi HuangICCV 2019SSAP, ResNet-101, Cityscapes fine-only
n/a | 51.8 | 62.5 | 52.3 | 75.7 | 41.8 | 55.4 | 51.8 | 38.2 | 36.4
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and we test it against different datasets for outdoor scene understanding.
0.00884 | 16.8 | 22.0 | 11.7 | 35.4 | 13.4 | 19.3 | 16.2 | 11.4 | 5.4
UPSNetyesyesnonononononononononoyesyesUPSNet: A Unified Panoptic Segmentation NetworkYuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, Raquel UrtasunCVPR 2019
0.227 | 59.6 | 69.1 | 66.4 | 76.9 | 44.6 | 62.4 | 54.6 | 53.9 | 49.2
Sem2InsyesyesnonononononononononononoAnonymousAnonymous NeurIPS19 submission #4671
n/a | 36.4 | 39.9 | 42.7 | 46.3 | 29.9 | 37.1 | 35.2 | 32.2 | 27.7
BshapeNet+ [COCO]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+ single model, ResNet-50-FPN as base model, Cityscapes [fine-only + COCO]
n/a58.870.763.175.146.757.553.056.647.9
AdaptISyesyesnonononononononononononoAnonymousAdaptive Instance Selection network architecture for class-agnostic instance segmentation. Given an input image and a point (x, y), it generates a mask for the object located at (x, y). The network adapts to the input point with the help of AdaIN layers, thus producing different masks for different objects in the same image. AdaptIS generates pixel-accurate object masks, so it accurately segments objects of complex shape as well as severely occluded ones.
n/a52.559.556.475.139.052.856.647.533.2
AInnoSegmentationyesyesnonononononononononononoFaen Zhang, Jiahong Wu, Haotian Cao, Zhizheng Yang, Jianfei Song, Ze Huang, Jiashui Huang, Shenglan BenAInnoSegmentation uses SE-ResNet 152 as the backbone and an FPN to extract multi-level features, combines these features with a self-developed method, and pre-trains the model on the COCO dataset.
n/a66.075.669.483.353.770.062.059.853.9
iFLYTEK-CVyesyesnonononononononononononoAnonymousiFLYTEK Research, CV Group
n/a71.179.476.185.261.073.270.863.859.3
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
This entry fixes a minor inference bug (i.e., same trained model) for instance segmentation, compared to the previous submission.
n/a57.363.961.177.942.053.456.153.350.5
snakeyesyesnonononononononononoyesyesDeep Snake for Real-Time Instance SegmentationSida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, Xiaowei ZhouCVPR 2020
0.21758.471.766.081.541.258.851.450.446.6
PolyTransformyesyesnonononononononononononoPolyTransform: Deep Polygon Transformer for Instance SegmentationJustin Liang, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, Raquel Urtasun
n/a65.975.871.882.552.268.763.358.754.4
StixelPointNetyesyesnonononononononononononoLearning Stixel-Based Instance SegmentationMonty Santarossa, Lukas Schneider, Claudius Zelenka, Lars Schmarje, Reinhard Koch, Uwe Franke IV 2021An adapted version of the PointNet is trained on Stixels as input for instance segmentation.
0.03519.324.624.931.822.327.30.011.611.4
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a55.961.259.275.041.752.958.051.947.0
PolyTransform + SegFixyesyesnonononononononononoyesyesAnonymousopensegWe simply apply a novel post-processing scheme based on PolyTransform (thanks to the authors of PolyTransform for providing their segmentation results). The baseline PolyTransform achieves 40.1% and our method achieves 41.2%. Our method can also improve the results of PointRend and PANet by more than 1.0% without re-training or fine-tuning the segmentation models.
n/a66.176.272.182.852.468.763.358.854.3
GAIS-NetyesyesnonononoyesyesnonononoyesyesGeometry-Aware Instance Segmentation with Disparity MapsCho-Ying Wu, Xiaoyan Hu, Michael Happold, Qiangeng Xu, Ulrich NeumannScalability in Autonomous Driving, workshop at CVPR 2020Geometry-Aware Instance Segmentation with Disparity Maps
n/a59.568.766.878.042.759.156.155.049.6
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a61.664.062.877.352.763.163.158.850.6
LevelSet R-CNN [fine-only]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
n/a58.268.366.677.342.755.653.453.348.8
LevelSet R-CNN [COCO]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
n/a65.776.372.583.249.466.858.861.857.1
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.

n/a54.960.560.574.039.555.953.550.245.5
Deep Affinity Net [fine-only]yesyesnonononononononononononoDeep Affinity Net: Instance Segmentation via AffinityXingqian Xu, Mangtik Chiu, Thomas Huang, Honghui ShiA proposal-free method that uses FPN generated features and network predicted 4-neighbor affinities to reconstruct instance segments. During inference time, an efficient graph partitioning algorithm, Cascade-GAEC, is introduced to overcome the long execution time in the high-resolution graph partitioning problem.
n/a48.051.453.266.738.851.249.840.232.8
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
n/a67.671.869.182.956.668.865.666.259.7
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a64.266.165.579.052.967.866.062.553.6
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
n/a66.469.568.881.756.269.260.866.258.9
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a64.974.469.780.048.864.964.358.059.1
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
n/a39.452.640.755.437.244.217.836.730.3
UniDet_RVCyesyesnononononononono22nonoAnonymous
300.052.459.756.268.441.057.652.846.137.7
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
n/a38.548.449.463.832.941.33.735.333.1
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a61.067.666.080.649.456.854.858.554.3
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a67.569.869.082.857.069.069.065.258.1
PolyTransform + SegFix + BPRyesyesnonononononononononoyesyesLook Closer to Segment Better: Boundary Patch Refinement for Instance SegmentationChufeng Tang*, Hang Chen*, Xiao Li, Jianmin Li, Zhaoxiang Zhang, Xiaolin HuCVPR 2021Tremendous efforts have been made on instance segmentation but the mask quality is still not satisfactory. The boundaries of predicted instance masks are usually imprecise due to the low spatial resolution of feature maps and the imbalance problem caused by the extremely low proportion of boundary pixels. To address these issues, we propose a conceptually simple yet effective post-processing refinement framework to improve the boundary quality based on the results of any instance segmentation model, termed BPR. Following the idea of looking closer to segment boundaries better, we extract and refine a series of small boundary patches along the predicted instance boundaries. The refinement is accomplished by a boundary patch refinement network at higher resolution. The proposed BPR framework yields significant improvements over the Mask R-CNN baseline on Cityscapes benchmark, especially on the boundary-aware metrics. Moreover, by applying the BPR framework to the PolyTransform + SegFix baseline, we reached 1st place on the Cityscapes leaderboard.
n/a66.577.072.483.852.768.663.359.454.9
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a68.771.969.083.059.472.367.067.260.2
CenterPolyyesyesnononononononono44nonoAnonymous
0.04539.549.746.761.224.835.033.535.529.6
HRI-INSTyesyesyesyesnonononononononononoAnonymousHRI-INST
n/a70.474.972.884.560.674.869.164.861.8
DH-ARIyesyesnonononononononononononoAnonymousDH-ARI
n/a68.478.273.886.355.468.760.365.059.2
HRI-TRANSyesyesnonononononononononononoAnonymousHRI transformer instance segmentation
n/a71.478.272.887.360.674.869.166.961.8
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a61.363.564.975.250.965.964.456.049.7
QueryInst-Parallel CompletionyesyesnonononononononononononoHai Wang; Shilin Zhu; PuPu; Meng; Le; Apple; RongWe propose a novel feature-completion network framework, QueryInst-Parallel Completion. First, a global context module is introduced into the backbone network to obtain instance information. Then, a parallel semantic branch and a parallel global branch are proposed to extract the semantic and global information of the feature layer, so as to complete the ROI features. In addition, we propose a feature transfer structure that explicitly strengthens the connection between the detection and segmentation branches, changes the gradient back-propagation path, and indirectly complements the ROI features.
n/a60.972.967.683.041.261.953.453.353.6
CenterPoly v2yesyesnonononononononononoyesyesReal-time instance segmentation with polygons using an Intersection-over-Union lossKatia Jodogne-del Litto, Guillaume-Alexandre Bilodeau
0.04539.451.642.859.931.541.927.934.825.0
Jiangsu-University-Environmental-PerceptionyesyesnonononononononononononoAnonymous
n/a67.678.473.285.350.670.757.165.659.8
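
As a reading aid for the score rows above: each row lists the runtime, then the average, then the eight per-class scores. The short Python sketch below recomputes the average for the UPSNet row as the unweighted mean of its class scores. The helper names and the tolerance of 0.1 are illustrative assumptions, not part of the benchmark tooling; the listed averages are presumably computed from unrounded class scores, so a small rounding difference is possible for some rows.

```python
from statistics import mean

# Per-class scores copied from the UPSNet row of the table above
# (the eight values that follow the average column).
upsnet_classes = [69.1, 66.4, 76.9, 44.6, 62.4, 54.6, 53.9, 49.2]
upsnet_reported_average = 59.6


def class_mean(scores):
    """Unweighted mean of the per-class scores, rounded to one decimal."""
    return round(mean(scores), 1)


def matches_reported(scores, reported, tolerance=0.1):
    """True if the recomputed mean is within `tolerance` of the listed average.

    The tolerance is an assumption: the listed class scores are already
    rounded, so the recomputed mean can differ slightly from the listed one.
    """
    return abs(mean(scores) - reported) <= tolerance


print(class_mean(upsnet_classes))
print(matches_reported(upsnet_classes, upsnet_reported_average))
```

The same check can be applied to any other row by swapping in its class scores and listed average.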

AP 100 m on class-level

namefinefinecoarsecoarse16-bit16-bitdepthdepthvideovideosubsubcodecodetitleauthorsvenuedescriptionRuntime [s]averagepersonridercartruckbustrainmotorcyclebicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
60.07.72.61.117.510.617.49.22.60.9
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
n/a15.324.420.336.45.510.65.210.59.2
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses a fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one-pixel-wide structure that cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
0.23.90.00.031.00.00.00.00.00.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation"
n/a29.330.322.758.224.938.629.915.314.3
RecAttendyesyesnononononononono44nonoAnonymous
n/a16.819.65.546.814.221.513.17.26.1
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
n/a16.813.516.638.411.319.216.910.48.3
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
n/a22.119.714.038.924.834.423.113.78.0
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22yesyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
n/a27.825.127.540.024.439.426.522.217.9
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
n/a38.936.732.760.139.953.744.124.420.0
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
n/a45.851.339.367.942.858.846.831.427.9
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
n/a37.646.235.665.531.146.027.524.924.3
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
n/a31.427.723.150.837.946.433.719.412.7
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
n/a20.424.519.639.314.524.218.511.111.1
SegNetyesyesyesyesnonononononononononoAnonymous
0.543.248.135.161.443.261.744.626.125.4
DCMEyesyesnononononononono33nonoDistance to Center of Mass Encoding for Instance SegmentationThomio Watanabe and Denis Wolf2018 21st International Conference on Intelligent Transportation Systems (ITSC)
n/a6.63.71.326.63.68.17.71.30.6
RRLyesyesnonononoyesyesnonononononoAnonymous
n/a40.949.738.469.331.349.634.526.527.6
PANet [fine-only]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only, training hyper-parameters are adopted from Mask R-CNN.
n/a44.253.943.273.437.250.736.029.030.6
PANet [COCO]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only + COCO, training hyper-parameters are adopted from Mask R-CNN.
n/a49.258.746.775.841.761.938.935.734.4
LCISyesyesnonononononononononononoAnonymous
n/a24.228.624.536.821.127.021.617.616.2
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H. S. TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label (this has recently been termed "Panoptic Segmentation"). Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).

more details
n/a | 36.8 | 38.4 | 30.4 | 47.6 | 35.6 | 49.8 | 44.8 | 27.5 | 20.3
PolygonRNN++yesyesnonononononononononoyesyesEfficient Annotation of Segmentation Datasets with Polygon-RNN++D. Acuna, H. Ling, A. Kar, and S. FidlerCVPR 2018
more details
n/a | 39.3 | 49.6 | 34.6 | 69.3 | 32.4 | 52.5 | 36.1 | 18.3 | 21.7
GMIS: Graph Merge for Instance SegmentationyesyesyesyesnonononononononononoYiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, Yan Lu
more details
n/a | 42.7 | 47.9 | 38.0 | 66.3 | 39.5 | 60.8 | 47.2 | 23.9 | 18.1
TCnetyesyesnonononononononononononoAnonymousTCnet
more details
n/a | 45.0 | 53.1 | 41.9 | 69.1 | 38.4 | 56.8 | 41.4 | 31.3 | 27.7
MaskRCNN_ROByesyesnonononononononononononoAnonymousMaskRCNN Instance segmentation baseline for ROB challenge using default parameters from Matterport's implementation of Mask RCNN
https://github.com/matterport/Mask_RCNN
more details
n/a | 14.6 | 29.6 | 15.8 | 48.7 | 2.5 | 11.4 | 0.0 | 8.8 | 0.0
Multitask LearningyesyesnonononononononononoyesyesMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaCVPR 2018Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
more details
n/a | 35.0 | 38.2 | 35.5 | 60.4 | 26.7 | 42.7 | 24.4 | 27.5 | 24.7
Deep ColoringyesyesnonononononononononononoAnonymousAnonymous ECCV submission #2955
more details
n/a | 39.0 | 38.8 | 34.4 | 61.1 | 38.5 | 54.2 | 39.2 | 25.3 | 20.1
MRCNN_VSCMLab_ROByesyesnonononononononononononoAnonymousMaskRCNN+FPN with pre-trained COCO model.
ms-training with short edge [800, 1024]
inference with short edge size 800
Randomly subsample ScanNet to a size close to Cityscapes

optimizer: Adam
learning rate: linear warm-up from 1e-4 to 1e-3, then decreased by a factor of 0.1 at epochs 200 and 300.

epoch: 400
step per epoch: 500
roi_per_im: 512
more details
1.0 | 24.8 | 30.3 | 19.6 | 57.7 | 22.4 | 32.4 | 24.1 | 12.1 | 0.0
BAMRCNN_ROByesyesnonononononononononononoAnonymous
more details
n/a | 0.2 | 0.0 | 0.0 | 1.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NL_ROI_ROByesyesnonononononononononononoAnonymousNon-local ROI on Mask R-CNN
more details
n/a | 36.1 | 45.1 | 31.8 | 65.6 | 33.3 | 47.6 | 26.1 | 19.9 | 19.1
RUSH_ROByesyesnonononononononononononoAnonymous
more details
n/a | 45.2 | 53.2 | 44.3 | 69.3 | 40.8 | 54.9 | 34.7 | 33.8 | 30.8
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]MaskRCNN segmentation baseline for Bosh autodrive challenge ,
using Matterport's implementation of Mask RCNN https://github.com/matterport/Mask_RCNN
55k iterations, default parameters (backbone: ResNet-101)
19 hours for training
more details
n/a | 22.1 | 28.6 | 16.6 | 40.3 | 18.5 | 26.8 | 23.0 | 11.5 | 11.5
NV-ADLRyesyesnonononononononononononoAnonymousNVIDIA Applied Deep Learning Research
more details
n/a | 49.3 | 56.7 | 42.8 | 75.2 | 46.7 | 63.2 | 43.0 | 35.7 | 30.9
Sogou_MMyesyesnonononononononononononoGlobal Concatenating Feature Enhancement for Instance SegmentationHang Yang, Xiaozhe Xin, Wenwen Yang, Bin LiGlobal Concatenating Feature Enhancement for Instance Segmentation
more details
n/a | 51.1 | 54.8 | 45.0 | 72.6 | 52.2 | 66.5 | 51.4 | 35.3 | 30.8
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthyesyesnonononononononononoyesyesInstance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthDavy Neven, Bert De Brabandere, Marc Proesmans and Luc Van GoolCVPR 2019Fine only - ERFNet backbone
more details
0.1 | 37.8 | 49.9 | 36.9 | 71.1 | 26.2 | 43.6 | 22.7 | 25.2 | 27.0
Instance Annotationyesyesnononononononono22nonoInstance Segmentation as Image Segmentation AnnotationThomio Watanabe and Denis F. Wolf2019 IEEE Intelligent Vehicles Symposium (IV)Based on DCME
more details
4.416 | 13.6 | 14.1 | 5.6 | 41.4 | 10.1 | 17.7 | 10.9 | 5.5 | 3.7
NJUSTyesyesnonononononononononononoAng Li, Chongyang ZhangMask R-CNN based on FPN enhancement and Mask Rescore, etc. Only a single SE-ResNeXt-152 model with COCO pre-training is used.
more details
n/a | 53.0 | 60.9 | 49.3 | 75.8 | 48.3 | 68.4 | 47.8 | 39.8 | 33.7
BshapeNet+ [fine-only]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+, ResNet-50-FPN as base model, Cityscapes [fine-only]
more details
n/a | 40.5 | 47.9 | 36.8 | 67.8 | 36.7 | 49.4 | 36.2 | 27.3 | 21.9
SSAPyesyesnonononononononononononoSSAP: Single-Shot Instance Segmentation With Affinity PyramidNaiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, Kaiqi HuangICCV 2019SSAP, ResNet-101, Cityscapes fine-only
more details
n/a | 47.3 | 54.3 | 39.2 | 78.3 | 48.1 | 67.6 | 42.7 | 25.5 | 22.5
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and we test it against different datasets for outdoor scene understanding.
more details
0.00884 | 16.4 | 18.6 | 5.8 | 41.0 | 18.1 | 24.4 | 14.3 | 6.5 | 2.6
UPSNetyesyesnonononononononononoyesyesUPSNet: A Unified Panoptic Segmentation NetworkYuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, Raquel UrtasunCVPR 2019
more details
0.227 | 46.8 | 52.6 | 39.9 | 71.1 | 44.4 | 62.0 | 45.0 | 31.0 | 28.5
Sem2InsyesyesnonononononononononononoAnonymousAnonymous NeurIPS19 submission #4671
more details
n/a | 29.3 | 28.3 | 27.8 | 42.1 | 31.7 | 41.8 | 28.9 | 19.4 | 14.5
BshapeNet+ [COCO]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+ single model, ResNet-50-FPN as base model, Cityscapes [fine-only + COCO]
more details
n/a | 47.3 | 53.7 | 37.2 | 70.3 | 46.1 | 61.1 | 49.0 | 33.6 | 27.4
AdaptISyesyesnonononononononononononoAnonymousAdaptive Instance Selection network architecture for class-agnostic instance segmentation. Given an input image and a point (x, y), it generates a mask for the object located at (x, y). The network adapts to the input point with the help of AdaIN layers, thus producing different masks for different objects in the same image. AdaptIS generates pixel-accurate object masks, so it accurately segments objects that have complex shapes or are severely occluded.
more details
n/a | 48.2 | 49.7 | 45.9 | 69.4 | 45.6 | 65.0 | 58.0 | 33.8 | 18.3
AInnoSegmentationyesyesnonononononononononononoFaen Zhang, Jiahong Wu, Haotian Cao, Zhizheng Yang, Jianfei Song, Ze Huang, Jiashui Huang, Shenglan BenAInnoSegmentation uses SE-ResNet-152 as the backbone with an FPN model to extract multi-level features, a self-developed method to combine these features, and COCO pre-training, among other techniques.
more details
n/a | 53.9 | 59.6 | 46.7 | 76.4 | 53.6 | 70.8 | 52.8 | 38.8 | 32.4
iFLYTEK-CVyesyesnonononononononononononoAnonymousiFLYTEK Research, CV Group
more details
n/a | 55.7 | 60.9 | 51.0 | 77.4 | 57.3 | 70.2 | 55.3 | 39.2 | 34.6
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
This entry fixes a minor inference bug (i.e., same trained model) for instance segmentation, compared to the previous submission.
more details
n/a | 50.5 | 54.2 | 43.8 | 76.4 | 47.7 | 63.4 | 49.3 | 35.4 | 33.6
snakeyesyesnonononononononononoyesyesDeep Snake for Real-Time Instance SegmentationSida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, Xiaowei ZhouCVPR 2020
more details
0.217 | 43.2 | 52.2 | 37.6 | 74.7 | 38.6 | 57.1 | 38.9 | 23.5 | 23.0
PolyTransformyesyesnonononononononononononoPolyTransform: Deep Polygon Transformer for Instance SegmentationJustin Liang, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, Raquel Urtasun
more details
n/a | 54.8 | 60.0 | 49.0 | 77.3 | 53.9 | 68.0 | 57.2 | 40.1 | 33.2
StixelPointNetyesyesnonononononononononononoLearning Stixel-Based Instance SegmentationMonty Santarossa, Lukas Schneider, Claudius Zelenka, Lars Schmarje, Reinhard Koch, Uwe Franke IV 2021An adapted version of the PointNet is trained on Stixels as input for instance segmentation.
more details
0.035 | 15.1 | 18.6 | 13.1 | 26.5 | 21.4 | 29.3 | 0.0 | 5.6 | 6.5
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a | 49.6 | 51.4 | 42.5 | 74.4 | 47.5 | 63.6 | 50.2 | 35.6 | 31.6
PolyTransform + SegFixyesyesnonononononononononoyesyesAnonymousopensegWe simply apply a novel post-processing scheme based on the PolyTransform (thanks to the authors of PolyTransform for providing their segmentation results). The performance of the baseline PolyTransform is 40.1% and our method achieves 41.2%. Our method can also improve the results of PointRend and PANet by more than 1.0% without re-training or fine-tuning the segmentation models.
more details
n/a | 56.0 | 61.8 | 50.4 | 79.2 | 54.8 | 69.3 | 57.5 | 40.8 | 34.3
GAIS-NetyesyesnonononoyesyesnonononoyesyesGeometry-Aware Instance Segmentation with Disparity MapsCho-Ying Wu, Xiaoyan Hu, Michael Happold, Qiangeng Xu, Ulrich NeumannScalability in Autonomous Driving, workshop at CVPR 2020Geometry-Aware Instance Segmentation with Disparity Maps
more details
n/a | 44.6 | 51.8 | 41.3 | 71.2 | 40.0 | 54.9 | 41.5 | 29.4 | 26.3
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a | 54.3 | 54.2 | 45.1 | 76.6 | 57.4 | 73.1 | 58.1 | 37.0 | 33.1
LevelSet R-CNN [fine-only]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
more details
n/a | 47.5 | 55.2 | 42.5 | 74.9 | 41.7 | 57.5 | 44.5 | 33.4 | 30.1
LevelSet R-CNN [COCO]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
more details
n/a | 54.5 | 60.8 | 47.8 | 77.6 | 51.4 | 68.6 | 53.2 | 41.0 | 35.4
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.

more details
n/a | 48.8 | 51.3 | 42.4 | 74.5 | 45.2 | 66.6 | 45.3 | 34.8 | 30.0
Deep Affinity Net [fine-only]yesyesnonononononononononononoDeep Affinity Net: Instance Segmentation via AffinityXingqian Xu, Mangtik Chiu, Thomas Huang, Honghui ShiA proposal-free method that uses FPN generated features and network predicted 4-neighbor affinities to reconstruct instance segments. During inference time, an efficient graph partitioning algorithm, Cascade-GAEC, is introduced to overcome the long execution time in the high-resolution graph partitioning problem.
more details
n/a | 41.5 | 39.4 | 35.1 | 63.0 | 43.4 | 61.2 | 46.3 | 24.9 | 18.8
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
more details
n/a | 57.9 | 59.9 | 50.8 | 80.2 | 59.5 | 73.2 | 55.6 | 45.1 | 39.2
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
more details
n/a | 55.5 | 56.2 | 47.9 | 77.8 | 55.9 | 73.5 | 57.3 | 39.7 | 35.6
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
more details
n/a | 55.6 | 56.5 | 46.2 | 78.6 | 59.6 | 72.7 | 50.3 | 43.6 | 37.0
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
more details
n/a | 52.9 | 58.5 | 48.5 | 77.1 | 48.5 | 69.3 | 48.3 | 36.3 | 36.3
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
more details
n/a | 31.2 | 40.5 | 27.5 | 53.2 | 33.2 | 41.7 | 15.6 | 19.8 | 18.3
UniDet_RVCyesyesnononononononono22nonoAnonymous
more details
300.0 | 44.1 | 49.8 | 34.3 | 65.9 | 45.3 | 62.0 | 45.3 | 27.5 | 22.9
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
more details
n/a | 31.7 | 41.7 | 29.6 | 62.2 | 33.4 | 46.0 | 3.3 | 18.0 | 19.2
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
more details
n/a | 53.7 | 56.1 | 48.3 | 77.7 | 51.9 | 67.0 | 53.1 | 39.7 | 35.5
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
more details
n/a | 57.8 | 56.6 | 49.9 | 77.9 | 60.0 | 76.1 | 59.0 | 43.8 | 38.7
PolyTransform + SegFix + BPRyesyesnonononononononononoyesyesLook Closer to Segment Better: Boundary Patch Refinement for Instance SegmentationChufeng Tang*, Hang Chen*, Xiao Li, Jianmin Li, Zhaoxiang Zhang, Xiaolin HuCVPR 2021Tremendous efforts have been made on instance segmentation but the mask quality is still not satisfactory. The boundaries of predicted instance masks are usually imprecise due to the low spatial resolution of feature maps and the imbalance problem caused by the extremely low proportion of boundary pixels. To address these issues, we propose a conceptually simple yet effective post-processing refinement framework to improve the boundary quality based on the results of any instance segmentation model, termed BPR. Following the idea of looking closer to segment boundaries better, we extract and refine a series of small boundary patches along the predicted instance boundaries. The refinement is accomplished by a boundary patch refinement network at higher resolution. The proposed BPR framework yields significant improvements over the Mask R-CNN baseline on Cityscapes benchmark, especially on the boundary-aware metrics. Moreover, by applying the BPR framework to the PolyTransform + SegFix baseline, we reached 1st place on the Cityscapes leaderboard.
more details
n/a | 57.5 | 63.4 | 51.6 | 81.1 | 55.8 | 71.0 | 59.9 | 41.9 | 35.3
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
more details
n/a | 58.9 | 58.6 | 50.7 | 79.8 | 62.7 | 77.6 | 56.0 | 45.5 | 39.8
CenterPolyyesyesnononononononono44nonoAnonymous
more details
0.045 | 23.3 | 29.4 | 18.3 | 50.7 | 18.2 | 23.8 | 22.2 | 12.2 | 11.7
HRI-INSTyesyesyesyesnonononononononononoAnonymousHRI-INST
more details
n/a | 57.5 | 58.5 | 50.8 | 77.6 | 59.4 | 76.9 | 58.5 | 42.3 | 36.0
DH-ARIyesyesnonononononononononononoAnonymousDH-ARI
more details
n/a | 58.5 | 64.3 | 54.6 | 82.7 | 57.3 | 73.1 | 51.9 | 45.4 | 38.9
HRI-TRANSyesyesnonononononononononononoAnonymousHRI transformer instance segmentation
more details
n/a | 57.9 | 60.4 | 50.8 | 80.1 | 59.4 | 76.9 | 58.5 | 41.0 | 36.0
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
more details
n/a | 57.2 | 56.9 | 50.6 | 78.4 | 57.1 | 76.5 | 64.9 | 39.8 | 33.1
QueryInst-Parallel CompletionyesyesnonononononononononononoHai Wang ;Shilin Zhu ;PuPu ;Meng; Le; Apple; RongWe propose a novel feature complete network framework queryinst parallel completion. First, the global context module is introduced into the backbone network to obtain instance information. Then, parallel semantic branch and parallel global branch are proposed to extract the semantic information and global information of feature layer, so as to complete the ROI features. In addition, we also propose a feature transfer structure, which explicitly increases the connection between detection and segmentation branches, changes the gradient back-propagation path, and indirectly complements the ROI features.
more details
n/a | 48.1 | 58.3 | 43.7 | 76.8 | 38.8 | 61.6 | 43.6 | 31.8 | 30.7
CenterPoly v2yesyesnonononononononononoyesyesReal-time instance segmentation with polygons using an Intersection-over-Union lossKatia Jodogne-del Litto, Guillaume-Alexandre Bilodeau
more details
0.045 | 24.8 | 28.3 | 16.2 | 49.0 | 26.1 | 35.9 | 19.3 | 13.1 | 10.2
Jiangsu-University-Environmental-PerceptionyesyesnonononononononononononoAnonymous
more details
n/a | 53.4 | 61.2 | 49.0 | 77.8 | 46.5 | 69.0 | 46.3 | 42.2 | 35.5
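
As a reading aid for the table above: in each row, the average value appears to be the unweighted mean of the eight per-class scores that follow it. The sketch below checks this for two rows copied from the table (DCME and MRCNN_VSCMLab_ROB). The helper name `matches_average` and the tolerance of 0.1 are illustrative assumptions, not part of the benchmark tooling.

```python
from statistics import mean


def matches_average(average, class_scores, tol=0.1):
    """Return True if `average` agrees with the mean of `class_scores`.

    The tolerance absorbs rounding, since every table entry is printed
    with a single decimal place.
    """
    return abs(mean(class_scores) - average) <= tol


# (average, per-class scores) copied from two rows of the table above.
rows = {
    "DCME": (6.6, [3.7, 1.3, 26.6, 3.6, 8.1, 7.7, 1.3, 0.6]),
    "MRCNN_VSCMLab_ROB": (24.8, [30.3, 19.6, 57.7, 22.4, 32.4, 24.1, 12.1, 0.0]),
}

for name, (average, scores) in rows.items():
    # Print the recomputed mean next to the consistency check.
    print(name, round(mean(scores), 3), matches_average(average, scores))
```

The same check can help when separating a fused runtime value from the average that follows it, because only one split leaves an average consistent with the class scores.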

AP 50 m on class-level

name | fine | fine | coarse | coarse | 16-bit | 16-bit | depth | depth | video | video | sub | sub | code | code | title | authors | venue | description | Runtime [s] | average | person | rider | car | truck | bus | train | motorcycle | bicycle
R-CNN + MCG convex hullyesyesnononononononono22nonoThe Cityscapes Dataset for Semantic Urban Scene UnderstandingM. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. SchieleCVPR 2016We compute MCG object proposals [1] and use their convex hulls as instance candidates. These proposals are scored by a Fast R-CNN detector [2].

[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
more details
60.0 | 10.3 | 2.7 | 1.1 | 21.2 | 14.0 | 25.2 | 14.2 | 2.7 | 1.0
Pixel-level Encoding for Instance SegmentationyesyesnonononoyesyesnonononononoPixel-level Encoding and Depth Layering for Instance-level Semantic LabelingJ. Uhrig, M. Cordts, U. Franke, and T. BroxGCPR 2016We predict three encoding channels from a single image using an FCN: semantic labels, depth classes, and an instance-aware representation based on directions towards instance centers. Using low-level computer vision techniques, we obtain pixel-level and instance-level semantic labeling paired with a depth estimate of the instances.
more details
n/a | 16.7 | 25.0 | 21.0 | 40.7 | 6.7 | 13.5 | 6.4 | 11.2 | 9.3
Instance-level Segmentation of Vehicles by Deep Contoursyesyesnononononononono22nonoInstance-level Segmentation of Vehicles by Deep ContoursJan van den Brand, Matthias Ochs and Rudolf MesterAsian Conference on Computer Vision - Workshop on Computer Vision Technologies for Smart VehicleOur method uses a fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one-pixel-wide structure which cannot be directly learned by a CNN, our network addresses this by providing areas around the contours. Based on these areas, we separate the individual vehicle instances.
0.2 | 4.9 | 0.0 | 0.0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
Boundary-aware Instance Segmentationyesyesnononononononono22nonoBoundary-aware Instance SegmentationZeeshan Hayder, Xuming He, Mathieu SalzmannCVPR 2017End-to-end model for instance segmentation using VGG16 network

Previously listed as "Shape-Aware Instance Segmentation"
n/a | 34.0 | 31.5 | 23.4 | 63.1 | 32.2 | 50.5 | 40.4 | 16.5 | 14.6
RecAttendyesyesnononononononono44nonoAnonymous
n/a | 20.9 | 20.7 | 5.8 | 54.2 | 17.9 | 32.1 | 21.9 | 7.8 | 6.4
Joint Graph Decomposition and Node Labelingyesyesnononononononono88nonoJoint Graph Decomposition and Node Labeling: Problem, Algorithms, ApplicationsEvgeny Levinkov, Jonas Uhrig, Siyu Tang, Mohamed Omran, Eldar Insafutdinov, Alexander Kirillov, Carsten Rother, Thomas Brox, Bernt Schiele, Bjoern AndresComputer Vision and Pattern Recognition (CVPR) 2017
n/a | 20.3 | 14.0 | 17.4 | 43.9 | 15.0 | 26.1 | 26.2 | 11.6 | 8.5
InstanceCutyesyesyesyesnonononononononononoInstanceCut: from Edges to Instances with MultiCutA. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. RotherComputer Vision and Pattern Recognition (CVPR) 2017InstanceCut represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance-boundaries. The former is computed from a standard CNN for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation.
n/a | 26.1 | 20.1 | 14.6 | 42.5 | 32.3 | 44.7 | 31.7 | 14.3 | 8.2
Semantic Instance Segmentation with a Discriminative Loss Functionyesyesnononononononono22yesyesSemantic Instance Segmentation with a Discriminative Loss FunctionBert De Brabandere, Davy Neven, Luc Van GoolDeep Learning for Robotic Vision, workshop at CVPR 2017This method uses a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin.

Previously listed as "PPLoss".
n/a | 31.0 | 25.1 | 28.2 | 44.0 | 28.6 | 47.7 | 32.5 | 23.5 | 18.0
SGNyesyesyesyesnonononononononononoSGN: Sequential Grouping Networks for Instance SegmentationShu Liu, Jiaya Jia, Sanja Fidler, Raquel UrtasunICCV 2017Instance segmentation using a sequence of neural networks, each solving a sub-grouping problem of increasing semantic complexity in order to gradually compose objects out of pixels.
n/a | 44.5 | 36.8 | 33.3 | 63.2 | 50.7 | 67.4 | 59.2 | 25.3 | 20.0
Mask R-CNN [COCO]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes [fine-only] + COCO
n/a | 49.5 | 51.5 | 40.1 | 69.9 | 49.3 | 69.2 | 55.9 | 31.9 | 27.9
Mask R-CNN [fine-only]yesyesnonononononononononononoMask R-CNNKaiming He, Georgia Gkioxari, Piotr Dollár, Ross GirshickMask R-CNN, ResNet-50-FPN, Cityscapes fine-only
n/a | 40.1 | 46.2 | 35.9 | 67.4 | 37.8 | 51.2 | 32.9 | 25.3 | 24.3
Deep Watershed Transformationyesyesnononononononono22nonoDeep Watershed Transformation for Instance SegmentationMin Bai and Raquel UrtasunCVPR 2017Instance segmentation using a watershed transformation inspired CNN. The input RGB image is augmented using the semantic segmentation from the recent PSPNet by H. Zhao et al.
Previously named "DWT".
n/a | 36.8 | 27.4 | 23.7 | 53.5 | 47.1 | 64.3 | 45.1 | 20.2 | 13.1
Foveal Vision for Instance Segmentation of Road ImagesyesyesnonononoyesyesnonononononoFoveal Vision for Instance Segmentation of Road ImagesBenedikt Ortelt, Christian Herrmann, Dieter Willersinn, Jürgen BeyererVISAPP 2018Directly based on 'Pixel-level Encoding for Instance Segmentation'. Adds an improved angular distance measure and a foveal concept to better address small objects at the vanishing point of the road.
n/a | 22.1 | 24.7 | 20.2 | 42.5 | 17.2 | 27.6 | 21.8 | 11.7 | 11.3
SegNetyesyesyesyesnonononononononononoAnonymous
0.5 | 45.8 | 48.0 | 35.2 | 62.5 | 50.8 | 70.0 | 48.1 | 26.3 | 25.3
DCMEyesyesnononononononono33nonoDistance to Center of Mass Encoding for Instance SegmentationThomio Watanabe and Denis Wolf2018 21st International Conference on Intelligent Transportation Systems (ITSC)
n/a | 9.5 | 4.1 | 1.4 | 35.5 | 5.1 | 14.7 | 12.9 | 1.5 | 0.6
RRLyesyesnonononoyesyesnonononononoAnonymous
n/a | 41.8 | 49.4 | 38.4 | 70.8 | 32.0 | 53.6 | 36.7 | 26.5 | 27.2
PANet [fine-only]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only, training hyper-parameters are adopted from Mask R-CNN.
n/a | 46.0 | 53.7 | 43.5 | 75.5 | 40.1 | 56.2 | 39.0 | 29.5 | 30.2
PANet [COCO]yesyesnonononononononononoyesyesPath Aggregation Network for Instance SegmentationShu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya JiaCVPR 2018PANet, ResNet-50 as base model, Cityscapes fine-only + COCO, training hyper-parameters are adopted from Mask R-CNN.
n/a | 51.8 | 58.7 | 46.9 | 77.9 | 47.6 | 70.8 | 42.3 | 36.0 | 34.2
LCISyesyesnonononononononononononoAnonymous
n/a | 25.8 | 29.3 | 25.3 | 38.1 | 27.0 | 27.7 | 24.3 | 18.1 | 16.3
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H. S. TorrComputer Vision and Pattern Recognition (CVPR) 2017We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label (this has recently been termed "Panoptic Segmentation"). Our method is based on an initial semantic segmentation module which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. As a result, it reasons about occlusions (unlike some related work, a single pixel cannot belong to multiple instances).

n/a | 40.9 | 39.2 | 31.2 | 49.7 | 43.3 | 61.6 | 53.5 | 28.7 | 20.4
PolygonRNN++yesyesnonononononononononoyesyesEfficient Annotation of Segmentation Datasets with Polygon-RNN++D. Acuna, H. Ling, A. Kar, and S. FidlerCVPR 2018
n/a | 43.4 | 50.1 | 35.2 | 72.7 | 41.3 | 62.1 | 44.8 | 19.1 | 22.1
GMIS: Graph Merge for Instance SegmentationyesyesyesyesnonononononononononoYiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, Yan Lu
n/a | 47.9 | 47.9 | 38.4 | 70.6 | 48.0 | 75.1 | 59.7 | 25.1 | 18.3
TCnetyesyesnonononononononononononoAnonymousTCnet
n/a | 47.8 | 53.1 | 42.3 | 70.9 | 44.6 | 65.1 | 47.5 | 31.5 | 27.5
MaskRCNN_ROByesyesnonononononononononononoAnonymousMaskRCNN Instance segmentation baseline for ROB challenge using default parameters from Matterport's implementation of Mask RCNN
https://github.com/matterport/Mask_RCNN
n/a | 14.6 | 29.3 | 15.5 | 49.6 | 1.6 | 12.8 | 0.0 | 8.1 | 0.0
Multitask LearningyesyesnonononononononononoyesyesMulti-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and SemanticsAlex Kendall, Yarin Gal and Roberto CipollaCVPR 2018Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
n/a | 37.0 | 38.9 | 36.3 | 64.0 | 29.1 | 46.6 | 28.2 | 28.5 | 24.7
Deep ColoringyesyesnonononononononononononoAnonymousAnonymous ECCV submission #2955
n/a | 44.0 | 38.9 | 35.0 | 63.6 | 48.5 | 68.7 | 50.4 | 26.6 | 20.2
MRCNN_VSCMLab_ROByesyesnonononononononononononoAnonymousMaskRCNN+FPN with pre-trained COCO model.
ms-training with short edge [800, 1024]
inference with short edge size 800
Randomly subsample ScanNet to a size close to Cityscapes

optimizer: Adam
learning rate: linear warm-up from 1e-4 to 1e-3, then decreased by a factor of 0.1 at epochs 200 and 300.

epoch: 400
step per epoch: 500
roi_per_im: 512
1.0 | 29.3 | 30.9 | 20.3 | 62.7 | 28.1 | 40.9 | 38.4 | 13.1 | 0.0
BAMRCNN_ROByesyesnonononononononononononoAnonymous
n/a | 0.1 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NL_ROI_ROByesyesnonononononononononononoAnonymousNon-local ROI on Mask R-CNN
n/a | 40.8 | 45.2 | 32.6 | 69.1 | 41.2 | 59.7 | 38.7 | 20.8 | 19.2
RUSH_ROByesyesnonononononononononononoAnonymous
n/a | 46.3 | 53.2 | 44.6 | 70.5 | 42.6 | 59.5 | 35.4 | 34.0 | 30.4
MaskRCNN_BOSHyesyesnonononononononononononoJin shengtao, Yi zhihao, Liu wei [Our team name is firefly]MaskRCNN segmentation baseline for Bosh autodrive challenge ,
using Matterport's implementation of Mask RCNN https://github.com/matterport/Mask_RCNN
55k iterations, default parameters (backbone: ResNet-101)
19 hours for training
n/a | 26.7 | 29.5 | 17.4 | 43.7 | 25.3 | 41.1 | 32.5 | 12.7 | 11.8
NV-ADLRyesyesnonononononononononononoAnonymousNVIDIA Applied Deep Learning Research
n/a | 53.5 | 56.8 | 43.7 | 77.9 | 54.9 | 73.2 | 54.1 | 36.2 | 31.0
Sogou_MMyesyesnonononononononononononoGlobal Concatenating Feature Enhancement for Instance SegmentationHang Yang, Xiaozhe Xin, Wenwen Yang, Bin LiGlobal Concatenating Feature Enhancement for Instance Segmentation
n/a | 54.5 | 54.9 | 45.5 | 74.5 | 60.1 | 75.4 | 59.5 | 35.6 | 30.7
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthyesyesnonononononononononoyesyesInstance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering BandwidthDavy Neven, Bert De Brabandere, Marc Proesmans and Luc Van GoolCVPR 2019Fine only - ERFNet backbone
0.1 | 37.3 | 49.4 | 37.0 | 72.5 | 24.2 | 43.5 | 19.6 | 26.0 | 26.5
Instance Annotationyesyesnononononononono22nonoInstance Segmentation as Image Segmentation AnnotationThomio Watanabe and Denis F. Wolf2019 IEEE Intelligent Vehicles Symposium (IV)Based on DCME
4.416 | 16.6 | 15.3 | 6.0 | 50.6 | 11.6 | 22.5 | 16.5 | 6.4 | 3.9
NJUSTyesyesnonononononononononononoAng Li, Chongyang ZhangMask R-CNN based on FPN enhancement and Mask Rescore, etc. Only a single SE-ResNeXt-152 model with COCO pre-training is used.
n/a | 55.4 | 60.9 | 50.0 | 77.6 | 53.2 | 75.9 | 52.0 | 40.0 | 33.6
BshapeNet+ [fine-only]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+, ResNet-50-FPN as base model, Cityscapes [fine-only]
n/a | 43.1 | 48.2 | 37.2 | 70.5 | 42.3 | 56.7 | 39.5 | 28.8 | 21.7
SSAPyesyesnonononononononononononoSSAP: Single-Shot Instance Segmentation With Affinity PyramidNaiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, Kaiqi HuangICCV 2019SSAP, ResNet-101, Cityscapes fine-only
n/a | 51.4 | 54.5 | 39.8 | 81.8 | 54.9 | 77.8 | 54.4 | 25.9 | 22.3
Spatial Sampling Netyesyesnononononononono22nonoSpatial Sampling Network for Fast Scene UnderstandingDavide Mazzini, Raimondo SchettiniCVPR 2019 Workshop on Autonomous DrivingWe propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks.
Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and we test it against different datasets for outdoor scene understanding.
0.00884 | 21.4 | 19.9 | 6.2 | 49.4 | 25.2 | 36.8 | 23.4 | 7.5 | 2.7
UPSNetyesyesnonononononononononoyesyesUPSNet: A Unified Panoptic Segmentation NetworkYuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, Raquel UrtasunCVPR 2019
0.227 | 50.7 | 52.9 | 40.7 | 73.6 | 52.5 | 72.9 | 53.3 | 31.7 | 28.4
Sem2InsyesyesnonononononononononononoAnonymousAnonymous NeurIPS19 submission #4671
n/a | 32.8 | 27.9 | 28.2 | 43.1 | 38.8 | 50.8 | 39.4 | 19.4 | 14.6
BshapeNet+ [COCO]yesyesnonononononononononononoBshapeNet: Object Detection and Instance Segmentation with Bounding Shape MasksBa Rom Kang, Ha Young KimBshapeNet+ single model, ResNet-50-FPN as base model, Cityscapes [fine-only + COCO]
n/a | 50.7 | 54.0 | 38.1 | 72.9 | 53.0 | 69.4 | 56.8 | 33.9 | 27.3
AdaptISyesyesnonononononononononononoAnonymousAdaptive Instance Selection network architecture for class-agnostic instance segmentation. Given an input image and a point (x, y), it generates a mask for the object located at (x, y). The network adapts to the input point with the help of AdaIN layers, thus producing different masks for different objects in the same image. AdaptIS generates pixel-accurate object masks and therefore accurately segments objects that have complex shapes or are severely occluded.
n/a | 52.1 | 49.8 | 46.8 | 71.3 | 54.3 | 77.2 | 63.8 | 35.4 | 18.1
AInnoSegmentationyesyesnonononononononononononoFaen Zhang, Jiahong Wu, Haotian Cao, Zhizheng Yang, Jianfei Song, Ze Huang, Jiashui Huang, Shenglan BenAInnoSegmentation uses SE-ResNet-152 as the backbone with an FPN to extract multi-level features, a self-developed method to combine these features, and pre-training on the COCO dataset, among other techniques.
n/a | 56.7 | 59.9 | 47.5 | 78.9 | 58.5 | 80.7 | 56.5 | 39.3 | 32.6
iFLYTEK-CVyesyesnonononononononononononoAnonymousiFLYTEK Research, CV Group
n/a | 58.7 | 61.2 | 51.8 | 79.7 | 63.4 | 78.8 | 60.9 | 39.6 | 34.6
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center. This submission exploits only Cityscapes fine annotations.
This entry fixes a minor inference bug (i.e., same trained model) for instance segmentation, compared to the previous submission.
n/a | 53.1 | 54.3 | 44.6 | 78.9 | 52.7 | 70.2 | 54.3 | 36.3 | 33.6
snakeyesyesnonononononononononoyesyesDeep Snake for Real-Time Instance SegmentationSida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, Xiaowei ZhouCVPR 2020
0.217 | 44.7 | 52.1 | 37.7 | 77.1 | 39.9 | 62.6 | 42.2 | 23.4 | 22.7
PolyTransformyesyesnonononononononononononoPolyTransform: Deep Polygon Transformer for Instance SegmentationJustin Liang, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, Raquel Urtasun
n/a | 58.0 | 60.1 | 49.4 | 80.1 | 62.3 | 75.3 | 63.0 | 40.8 | 33.3
StixelPointNetyesyesnonononononononononononoLearning Stixel-Based Instance SegmentationMonty Santarossa, Lukas Schneider, Claudius Zelenka, Lars Schmarje, Reinhard Koch, Uwe Franke IV 2021An adapted version of the PointNet is trained on Stixels as input for instance segmentation.
0.035 | 18.3 | 19.3 | 13.8 | 29.5 | 27.7 | 42.8 | 0.0 | 6.3 | 6.9
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 53.1 | 51.6 | 42.9 | 77.1 | 57.5 | 73.4 | 54.1 | 36.6 | 31.7
PolyTransform + SegFixyesyesnonononononononononoyesyesAnonymousopensegWe simply apply a novel post-processing scheme based on the PolyTransform (thanks to the authors of PolyTransform for providing their segmentation results). The performance of the baseline PolyTransform is 40.1% and our method achieves 41.2%. Besides, our method also could improve the results of PointRend and PANet by more than 1.0% without any re-training or fine-tuning the segmentation models.
n/a | 59.2 | 61.9 | 51.0 | 81.8 | 63.5 | 76.4 | 63.5 | 41.5 | 34.2
GAIS-NetyesyesnonononoyesyesnonononoyesyesGeometry-Aware Instance Segmentation with Disparity MapsCho-Ying Wu, Xiaoyan Hu, Michael Happold, Qiangeng Xu, Ulrich NeumannScalability in Autonomous Driving, workshop at CVPR 2020Geometry-Aware Instance Segmentation with Disparity Maps
n/a | 46.6 | 51.9 | 41.9 | 73.4 | 45.4 | 58.4 | 45.8 | 29.8 | 26.3
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 57.3 | 54.3 | 45.6 | 78.9 | 63.5 | 83.6 | 61.7 | 37.6 | 33.2
LevelSet R-CNN [fine-only]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
n/a | 50.3 | 55.3 | 42.8 | 78.0 | 48.4 | 62.1 | 51.3 | 34.7 | 30.0
LevelSet R-CNN [COCO]yesyesnonononononononononononoLevelSet R-CNN: A Deep Variational Method for Instance SegmentationNamdar Homayounfar*, Yuwen Xiong*, Justin Liang*, Wei-Chiu Ma, Raquel UrtasunECCV 2020Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
n/a | 58.1 | 61.1 | 48.7 | 80.4 | 60.2 | 76.6 | 61.0 | 41.9 | 35.3
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.

n/a | 52.0 | 51.3 | 43.0 | 77.1 | 54.4 | 76.7 | 47.4 | 35.7 | 30.1
Deep Affinity Net [fine-only]yesyesnonononononononononononoDeep Affinity Net: Instance Segmentation via AffinityXingqian Xu, Mangtik Chiu, Thomas Huang, Honghui ShiA proposal-free method that uses FPN generated features and network predicted 4-neighbor affinities to reconstruct instance segments. During inference time, an efficient graph partitioning algorithm, Cascade-GAEC, is introduced to overcome the long execution time in the high-resolution graph partitioning problem.
n/a | 46.9 | 39.6 | 36.0 | 66.5 | 54.6 | 76.6 | 56.8 | 26.2 | 18.8
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
n/a | 59.8 | 60.1 | 51.4 | 82.6 | 65.7 | 79.6 | 53.8 | 46.0 | 39.1
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 57.4 | 56.4 | 48.4 | 80.2 | 62.2 | 78.2 | 58.1 | 40.5 | 35.4
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
n/a | 57.0 | 56.6 | 46.9 | 80.7 | 63.1 | 79.5 | 48.6 | 44.2 | 36.8
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a | 55.8 | 58.1 | 49.0 | 78.6 | 55.6 | 79.0 | 54.1 | 36.1 | 35.9
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
n/a | 32.1 | 40.0 | 27.2 | 53.7 | 33.3 | 41.8 | 23.3 | 19.2 | 18.0
UniDet_RVCyesyesnononononononono22nonoAnonymous
300.0 | 48.0 | 50.0 | 35.1 | 68.2 | 55.6 | 70.4 | 52.6 | 28.8 | 22.9
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
n/a | 35.1 | 41.4 | 29.8 | 63.4 | 42.0 | 59.9 | 6.0 | 18.8 | 19.2
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 55.4 | 56.2 | 48.8 | 80.1 | 57.2 | 73.2 | 51.3 | 40.7 | 35.6
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 59.6 | 56.8 | 50.4 | 80.3 | 63.3 | 83.7 | 58.8 | 45.1 | 38.6
PolyTransform + SegFix + BPRyesyesnonononononononononoyesyesLook Closer to Segment Better: Boundary Patch Refinement for Instance SegmentationChufeng Tang*, Hang Chen*, Xiao Li, Jianmin Li, Zhaoxiang Zhang, Xiaolin HuCVPR 2021Tremendous efforts have been made on instance segmentation but the mask quality is still not satisfactory. The boundaries of predicted instance masks are usually imprecise due to the low spatial resolution of feature maps and the imbalance problem caused by the extremely low proportion of boundary pixels. To address these issues, we propose a conceptually simple yet effective post-processing refinement framework to improve the boundary quality based on the results of any instance segmentation model, termed BPR. Following the idea of looking closer to segment boundaries better, we extract and refine a series of small boundary patches along the predicted instance boundaries. The refinement is accomplished by a boundary patch refinement network at higher resolution. The proposed BPR framework yields significant improvements over the Mask R-CNN baseline on Cityscapes benchmark, especially on the boundary-aware metrics. Moreover, by applying the BPR framework to the PolyTransform + SegFix baseline, we reached 1st place on the Cityscapes leaderboard.
n/a | 60.7 | 63.5 | 52.2 | 83.7 | 64.2 | 77.7 | 66.6 | 42.7 | 35.2
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a | 60.9 | 58.9 | 51.1 | 82.3 | 68.4 | 85.1 | 55.4 | 46.2 | 39.9
CenterPolyyesyesnononononononono44nonoAnonymous
0.045 | 24.5 | 29.5 | 18.6 | 54.3 | 21.1 | 23.8 | 24.0 | 12.7 | 11.6
HRI-INSTyesyesyesyesnonononononononononoAnonymousHRI-INST
n/a | 61.0 | 58.9 | 51.8 | 81.0 | 67.9 | 86.7 | 61.7 | 43.7 | 36.1
DH-ARIyesyesnonononononononononononoAnonymousDH-ARI
n/a | 61.6 | 64.4 | 55.3 | 85.2 | 62.1 | 82.0 | 58.9 | 46.3 | 38.7
HRI-TRANSyesyesnonononononononononononoAnonymousHRI transformer instance segmentation
n/a | 61.3 | 60.7 | 51.8 | 83.0 | 67.9 | 86.7 | 61.7 | 42.2 | 36.1
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a | 61.2 | 57.1 | 51.4 | 81.4 | 65.1 | 85.6 | 75.9 | 40.3 | 33.1
QueryInst-Parallel CompletionyesyesnonononononononononononoHai Wang ;Shilin Zhu ;PuPu ;Meng; Le; Apple; RongWe propose QueryInst Parallel Completion, a novel feature-completion network framework. First, a global context module is introduced into the backbone network to obtain instance information. Then, a parallel semantic branch and a parallel global branch are proposed to extract the semantic and global information of the feature layer, so as to complete the RoI features. In addition, we propose a feature transfer structure that explicitly strengthens the connection between the detection and segmentation branches, changes the gradient back-propagation path, and indirectly complements the RoI features.
n/a | 50.8 | 58.5 | 43.9 | 79.3 | 43.5 | 68.7 | 49.9 | 32.5 | 30.6
CenterPoly v2yesyesnonononononononononoyesyesReal-time instance segmentation with polygons using an Intersection-over-Union lossKatia Jodogne-del Litto, Guillaume-Alexandre Bilodeau
0.045 | 27.2 | 28.0 | 16.3 | 51.8 | 31.6 | 43.2 | 23.4 | 13.4 | 10.0
Jiangsu-University-Environmental-PerceptionyesyesnonononononononononononoAnonymous
n/a | 55.9 | 61.5 | 49.1 | 80.3 | 47.3 | 76.3 | 54.4 | 43.1 | 35.3

Panoptic Semantic Labeling Task

PQ on class-level

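name | fine | fine | coarse | coarse | 16-bit | 16-bit | depth | depth | video | video | sub | sub | code | code | title | authors | venue | description | Runtime [s] | all | things | stuff | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle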
HANetyesyesnonononononononononononoAnonymousHolistic Attention Network for End-to-End Panoptic Segmentation
n/a | 51.2 | 40.4 | 59.0 | 97.6 | 69.9 | 84.9 | 17.6 | 21.0 | 42.7 | 42.5 | 60.7 | 89.0 | 34.5 | 87.9 | 46.0 | 33.9 | 57.1 | 39.4 | 46.8 | 31.3 | 36.3 | 32.5
TASCNet-enhancedyesyesnonononononononononononoLearning to Fuse Things and StuffJie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, Adrien GaidonArxivWe proposed a joint network for panoptic segmentation, which is a variation of our previous work, TASCNet. (https://arxiv.org/pdf/1812.01192.pdf)
A shared backbone (ResNeXt-101) pretrained on COCO detection is used.
n/a | 60.7 | 53.4 | 66.0 | 98.2 | 75.5 | 87.8 | 33.3 | 35.2 | 53.9 | 52.5 | 66.8 | 90.0 | 43.1 | 89.9 | 55.2 | 52.6 | 66.9 | 49.4 | 57.6 | 55.7 | 46.9 | 43.2
Sem2InsyesyesnonononononononononononoAnonymousAnonymous NeurIPS19 submission #4671
n/a | 52.3 | 37.8 | 62.8 | 89.0 | 74.2 | 86.0 | 30.5 | 35.5 | 51.9 | 47.7 | 66.9 | 88.3 | 37.3 | 83.8 | 39.8 | 39.7 | 47.7 | 36.8 | 43.2 | 31.6 | 33.9 | 29.5
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H.S TorrComputer Vision and Pattern Recognition (CVPR) 2017Results are produced using the method from our CVPR 2017 paper, "Pixelwise Instance Segmentation with a Dynamically Instantiated Network."

On the instance segmentation benchmark, the identical model achieved a mean AP of 23.4

This model also served as the fully supervised baseline in our ECCV 2018 paper, "Weakly- and Semi-Supervised Panoptic Segmentation".
n/a | 55.4 | 44.0 | 63.7 | 98.3 | 75.2 | 87.6 | 31.2 | 35.7 | 43.3 | 47.7 | 65.1 | 89.6 | 38.1 | 88.7 | 44.7 | 43.1 | 50.8 | 40.6 | 49.9 | 46.2 | 42.4 | 34.7
Seamless Scene SegmentationyesyesnonononononononononoyesyesSeamless Scene SegmentationLorenzo Porzi, Samuel Rota Bulò, Aleksander Colovic and Peter KontschiederThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019Seamless Scene Segmentation is a CNN-based architecture that can be trained end-to-end to predict a complete class- and instance-specific labeling for each pixel in an image. To tackle this task, also known as "Panoptic Segmentation", we take advantage of a novel segmentation head that seamlessly integrates multi-scale features generated by a Feature Pyramid Network with contextual information conveyed by a light-weight DeepLab-like module.

In this submission we use a single model, with a ResNet50 backbone, pre-trained on ImageNet and Mapillary Vistas Research Edition, and fine-tuned on Cityscapes' fine training set. Inference is single-shot, without any form of test-time augmentation. Validation scores of the submitted model are 64.97 PQ, 68.04 PQ stuff, 60.75 PQ thing, 80.73 IoU.
n/a | 62.6 | 56.0 | 67.5 | 98.3 | 76.8 | 88.8 | 36.6 | 39.7 | 59.0 | 51.6 | 65.7 | 90.4 | 45.8 | 89.6 | 57.7 | 53.5 | 68.9 | 52.6 | 62.2 | 54.7 | 51.2 | 47.0
SSAPyesyesnonononononononononononoSSAP: Single-Shot Instance Segmentation With Affinity PyramidNaiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, Kaiqi HuangICCV 2019SSAP, ResNet-101, Cityscapes fine-only
n/a | 58.9 | 48.4 | 66.5 | 98.0 | 75.9 | 88.1 | 34.7 | 38.0 | 56.1 | 51.6 | 68.8 | 89.7 | 43.6 | 87.3 | 50.3 | 46.1 | 65.6 | 43.2 | 54.8 | 48.1 | 41.9 | 37.3
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center.
This submission exploits only Cityscapes fine annotations.
n/a | 62.3 | 52.1 | 69.7 | 98.5 | 78.1 | 89.0 | 38.8 | 38.6 | 64.3 | 61.5 | 70.9 | 90.8 | 46.0 | 90.4 | 54.1 | 50.3 | 66.8 | 44.9 | 58.1 | 51.1 | 47.4 | 44.4
iFLYTEK-CVyesyesyesyesnonononononononononoAnonymousiFLYTEK Research, CV Group
n/a | 66.0 | 58.0 | 71.8 | 98.7 | 79.6 | 90.1 | 45.5 | 46.3 | 65.4 | 59.6 | 74.5 | 91.5 | 47.9 | 90.3 | 59.2 | 56.6 | 70.0 | 54.9 | 63.7 | 61.5 | 51.8 | 46.2
Unifying Training and Inference for Panoptic Segmentation [Cityscapes-fine]yesyesnonononononononononoyesyesUnifying Training and Inference for Panoptic SegmentationQizhu Li, Xiaojuan Qi, Philip H.S. TorrThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation. In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both “stuff” and “thing” classes, without any post-processing.

This model uses a ResNet-50 backbone, and is trained with only Cityscapes' fine data.
n/a | 61.0 | 52.7 | 67.1 | 98.2 | 76.0 | 87.5 | 33.5 | 37.5 | 56.3 | 56.0 | 69.1 | 89.9 | 44.3 | 89.5 | 54.0 | 52.7 | 64.3 | 49.5 | 57.4 | 52.4 | 47.8 | 43.9
Unifying Training and Inference for Panoptic Segmentation [COCO]yesyesnonononononononononoyesyesUnifying Training and Inference for Panoptic SegmentationQizhu Li, Xiaojuan Qi, Philip H.S. TorrThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation. In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both “stuff” and “thing” classes, without any post-processing.

This model uses a ResNet-101 backbone, and is pretrained on COCO 2017 training images and finetuned on Cityscapes' fine data.
n/a | 63.3 | 56.0 | 68.5 | 98.4 | 77.3 | 88.7 | 40.1 | 39.1 | 58.6 | 56.3 | 69.8 | 90.2 | 45.7 | 89.8 | 56.5 | 54.7 | 65.8 | 51.5 | 62.5 | 61.2 | 50.6 | 45.4
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 62.8 | 53.8 | 69.3 | 98.5 | 78.5 | 88.8 | 38.1 | 41.4 | 62.4 | 56.5 | 71.8 | 90.8 | 45.7 | 90.3 | 54.5 | 50.8 | 67.8 | 50.2 | 59.3 | 55.6 | 48.3 | 43.8
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 65.6 | 56.9 | 71.9 | 98.6 | 79.5 | 89.9 | 44.1 | 47.5 | 66.8 | 59.7 | 72.9 | 91.3 | 49.4 | 90.9 | 55.9 | 52.9 | 68.8 | 55.2 | 64.5 | 62.2 | 50.6 | 44.8
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 62.7 | 53.4 | 69.5 | 98.4 | 77.7 | 88.6 | 37.9 | 40.6 | 63.3 | 59.4 | 71.6 | 90.7 | 45.2 | 90.4 | 53.8 | 50.8 | 66.7 | 47.6 | 61.1 | 57.7 | 47.0 | 42.4
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
n/a | 67.8 | 61.5 | 72.4 | 98.6 | 80.5 | 90.3 | 45.2 | 47.7 | 68.8 | 59.6 | 74.1 | 91.9 | 48.5 | 90.9 | 60.2 | 58.0 | 72.2 | 60.0 | 69.2 | 64.5 | 56.6 | 51.3
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 66.6 | 58.7 | 72.3 | 98.7 | 79.8 | 90.3 | 47.6 | 50.0 | 67.6 | 58.9 | 73.4 | 91.5 | 46.4 | 91.0 | 57.2 | 54.9 | 69.8 | 57.3 | 69.0 | 62.7 | 52.5 | 46.6
EfficientPS [Cityscapes-fine]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a | 64.1 | 56.7 | 69.4 | 98.5 | 79.1 | 89.1 | 41.7 | 41.7 | 59.9 | 55.6 | 70.1 | 90.9 | 47.2 | 90.1 | 60.9 | 57.5 | 70.3 | 48.4 | 60.0 | 55.1 | 50.9 | 50.4
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
n/a | 66.5 | 58.8 | 72.0 | 98.7 | 79.8 | 90.1 | 44.4 | 46.7 | 68.8 | 59.9 | 74.5 | 91.6 | 47.5 | 90.6 | 58.5 | 55.3 | 70.6 | 57.7 | 67.5 | 57.3 | 54.2 | 49.5
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a | 67.1 | 60.9 | 71.6 | 98.6 | 79.8 | 90.1 | 44.3 | 45.6 | 66.1 | 58.6 | 73.7 | 91.4 | 48.7 | 90.4 | 61.6 | 58.6 | 71.5 | 58.8 | 68.1 | 65.3 | 51.8 | 51.4
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
n/a | 51.9 | 41.4 | 59.5 | 87.0 | 59.8 | 85.1 | 30.8 | 30.2 | 50.2 | 47.0 | 58.5 | 88.7 | 29.0 | 87.8 | 49.7 | 40.7 | 57.9 | 39.2 | 49.1 | 25.6 | 36.9 | 32.5
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
n/a | 48.0 | 40.8 | 53.2 | 97.2 | 64.6 | 84.1 | 22.1 | 20.5 | 28.6 | 20.4 | 42.8 | 86.8 | 30.9 | 87.9 | 47.9 | 45.5 | 62.0 | 42.7 | 48.6 | 8.8 | 36.7 | 34.5
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 64.8 | 56.5 | 70.9 | 98.6 | 79.3 | 89.5 | 39.5 | 47.0 | 66.5 | 58.6 | 73.3 | 91.2 | 45.9 | 90.7 | 58.4 | 56.3 | 70.7 | 51.3 | 62.0 | 53.5 | 51.6 | 48.2
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 67.8 | 60.9 | 72.8 | 98.7 | 79.9 | 90.3 | 46.4 | 48.8 | 69.3 | 60.7 | 76.1 | 91.7 | 47.7 | 91.7 | 59.8 | 58.3 | 72.0 | 59.4 | 69.1 | 62.9 | 55.2 | 50.7
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a | 68.5 | 61.9 | 73.3 | 98.7 | 80.8 | 90.6 | 46.9 | 53.7 | 70.6 | 60.0 | 74.6 | 91.9 | 47.6 | 90.9 | 60.6 | 57.4 | 72.6 | 61.7 | 72.1 | 62.2 | 56.5 | 51.7
hri_panopticyesyesnonononononononononononoAnonymous
n/a | 68.0 | 61.0 | 73.1 | 98.8 | 81.0 | 90.5 | 47.9 | 50.3 | 67.4 | 60.4 | 75.2 | 91.7 | 49.0 | 91.7 | 60.6 | 56.3 | 72.4 | 62.2 | 68.6 | 61.7 | 55.5 | 50.4
COPSyesyesnononononononono44nonoCombinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach Ahmed Abbas, Paul SwobodaNeurIPS 2021COPS fully differentiable with ResNet 50 backbone.
n/a | 60.0 | 51.8 | 65.9 | 98.1 | 75.1 | 87.6 | 35.0 | 37.1 | 51.3 | 53.5 | 66.2 | 89.8 | 42.7 | 88.8 | 51.8 | 49.0 | 64.0 | 45.8 | 57.2 | 59.6 | 45.3 | 41.8
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a | 66.2 | 59.6 | 70.9 | 98.6 | 79.6 | 89.8 | 45.6 | 45.9 | 60.1 | 57.6 | 72.1 | 91.2 | 48.7 | 91.2 | 56.0 | 56.6 | 68.9 | 57.7 | 69.2 | 67.8 | 53.4 | 47.2
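
For readers comparing the PQ and SQ tables, the sketch below shows how the two metrics relate under the commonly used panoptic quality definition, where PQ is the product of segmentation quality (SQ) and recognition quality (RQ) for a single class. It is an illustration only and not the benchmark's evaluation script: the names `ClassMatches` and `panoptic_quality` are hypothetical, and the sketch assumes that predicted and ground-truth segments have already been matched (a match being a pair with IoU above 0.5). The tables report these scores as percentages.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ClassMatches:
    """Matching statistics for one class (hypothetical container)."""
    tp_ious: Sequence[float]  # IoU of each matched (prediction, ground truth) pair
    num_fp: int               # unmatched predicted segments
    num_fn: int               # unmatched ground-truth segments


def panoptic_quality(m: ClassMatches) -> Tuple[float, float, float]:
    """Return (PQ, SQ, RQ) for one class as fractions in [0, 1]."""
    num_tp = len(m.tp_ious)
    denom = num_tp + 0.5 * m.num_fp + 0.5 * m.num_fn
    if denom == 0:
        # No segments of this class in prediction or ground truth.
        return 0.0, 0.0, 0.0
    # SQ: mean IoU over matched segments.
    sq = sum(m.tp_ious) / num_tp if num_tp else 0.0
    # RQ: an F1-style detection score over segments.
    rq = num_tp / denom
    return sq * rq, sq, rq


if __name__ == "__main__":
    pq, sq, rq = panoptic_quality(ClassMatches(tp_ious=[0.8, 0.6], num_fp=1, num_fn=1))
    print(f"PQ={100 * pq:.1f}  SQ={100 * sq:.1f}  RQ={100 * rq:.1f}")
```

Under this definition, a high SQ with a much lower PQ for the same class indicates that the segments which were matched are accurate, but many segments were missed or spuriously predicted.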

SQ on class-level

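name | fine | fine | coarse | coarse | 16-bit | 16-bit | depth | depth | video | video | sub | sub | code | code | title | authors | venue | description | Runtime [s] | all | things | stuff | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle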
HANetyesyesnonononononononononononoAnonymousHolistic Attention Network for End-to-End Panoptic Segmentation
n/a | 77.7 | 75.3 | 79.4 | 97.7 | 82.3 | 88.4 | 69.7 | 69.4 | 65.4 | 69.0 | 74.1 | 90.5 | 74.8 | 92.2 | 73.8 | 64.3 | 81.4 | 82.5 | 82.8 | 77.9 | 71.6 | 67.7
TASCNet-enhancedyesyesnonononononononononononoLearning to Fuse Things and StuffJie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, Adrien GaidonArxivWe proposed a joint network for panoptic segmentation, which is a variation of our previous work, TASCNet. (https://arxiv.org/pdf/1812.01192.pdf)
A shared backbone (ResNeXt-101) pretrained on COCO detection is used.
n/a | 81.0 | 79.7 | 82.0 | 98.3 | 84.1 | 90.4 | 74.2 | 73.0 | 68.0 | 73.8 | 77.7 | 91.3 | 78.2 | 93.2 | 77.4 | 74.2 | 84.6 | 85.3 | 86.5 | 81.2 | 75.5 | 72.5
Sem2InsyesyesnonononononononononononoAnonymousAnonymous NeurIPS19 submission #4671
n/a | 78.9 | 77.2 | 80.2 | 89.1 | 82.7 | 88.9 | 75.0 | 75.3 | 67.3 | 71.8 | 76.4 | 89.9 | 76.7 | 88.6 | 73.6 | 71.7 | 80.0 | 85.3 | 85.4 | 78.7 | 73.6 | 69.4
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H.S TorrComputer Vision and Pattern Recognition (CVPR) 2017Results are produced using the method from our CVPR 2017 paper, "Pixelwise Instance Segmentation with a Dynamically Instantiated Network."

On the instance segmentation benchmark, the identical model achieved a mean AP of 23.4

This model also served as the fully supervised baseline in our ECCV 2018 paper, "Weakly- and Semi-Supervised Panoptic Segmentation".
n/a | 79.7 | 77.3 | 81.5 | 98.4 | 84.3 | 90.2 | 74.0 | 73.7 | 66.3 | 72.3 | 75.8 | 90.9 | 77.9 | 92.5 | 74.4 | 70.8 | 78.5 | 84.3 | 85.2 | 81.3 | 73.9 | 70.1
Seamless Scene SegmentationyesyesnonononononononononoyesyesSeamless Scene SegmentationLorenzo Porzi, Samuel Rota Bulò, Aleksander Colovic and Peter KontschiederThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019Seamless Scene Segmentation is a CNN-based architecture that can be trained end-to-end to predict a complete class- and instance-specific labeling for each pixel in an image. To tackle this task, also known as "Panoptic Segmentation", we take advantage of a novel segmentation head that seamlessly integrates multi-scale features generated by a Feature Pyramid Network with contextual information conveyed by a light-weight DeepLab-like module.

In this submission we use a single model, with a ResNet50 backbone, pre-trained on ImageNet and Mapillary Vistas Research Edition, and fine-tuned on Cityscapes' fine training set. Inference is single-shot, without any form of test-time augmentation. Validation scores of the submitted model are 64.97 PQ, 68.04 PQ stuff, 60.75 PQ thing, 80.73 IoU.
n/a | 82.1 | 80.3 | 83.5 | 98.4 | 85.1 | 91.0 | 76.4 | 75.0 | 70.4 | 77.3 | 80.6 | 91.9 | 78.5 | 93.6 | 78.5 | 74.8 | 84.9 | 87.4 | 86.3 | 81.0 | 76.3 | 73.2
SSAPyesyesnonononononononononononoSSAP: Single-Shot Instance Segmentation With Affinity PyramidNaiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, Kaiqi HuangICCV 2019SSAP, ResNet-101, Cityscapes fine-only
n/a | 82.4 | 82.9 | 82.0 | 98.1 | 84.8 | 90.7 | 74.8 | 74.6 | 68.0 | 72.4 | 76.8 | 91.4 | 78.8 | 91.8 | 81.3 | 76.3 | 88.2 | 91.5 | 91.7 | 83.9 | 76.1 | 74.3
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center.
This submission exploits only Cityscapes fine annotations.
n/a | 82.4 | 80.7 | 83.6 | 98.6 | 85.4 | 91.1 | 77.1 | 75.8 | 71.1 | 75.4 | 80.7 | 91.9 | 79.3 | 93.2 | 77.7 | 73.6 | 84.9 | 87.2 | 88.5 | 83.9 | 76.8 | 72.8
iFLYTEK-CVyesyesyesyesnonononononononononoAnonymousiFLYTEK Research, CV Group
n/a | 83.2 | 81.3 | 84.6 | 98.8 | 86.6 | 92.0 | 78.6 | 76.3 | 72.8 | 78.2 | 82.0 | 92.4 | 79.3 | 93.5 | 78.4 | 75.6 | 85.3 | 88.4 | 88.1 | 84.0 | 77.0 | 73.4
Unifying Training and Inference for Panoptic Segmentation [Cityscapes-fine]yesyesnonononononononononoyesyesUnifying Training and Inference for Panoptic SegmentationQizhu Li, Xiaojuan Qi, Philip H.S. TorrThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation. In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both “stuff” and “thing” classes, without any post-processing.

This model uses a ResNet-50 backbone, and is trained with only Cityscapes' fine data.
n/a | 81.4 | 79.6 | 82.8 | 98.3 | 84.3 | 90.4 | 75.6 | 74.5 | 68.8 | 75.9 | 79.2 | 91.4 | 78.8 | 93.1 | 77.4 | 74.4 | 83.8 | 85.3 | 85.1 | 82.5 | 75.8 | 72.7
Unifying Training and Inference for Panoptic Segmentation [COCO]yesyesnonononononononononoyesyesUnifying Training and Inference for Panoptic SegmentationQizhu Li, Xiaojuan Qi, Philip H.S. TorrThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation. In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both “stuff” and “thing” classes, without any post-processing.

This model uses a ResNet-101 backbone, and is pretrained on COCO 2017 training images and finetuned on Cityscapes' fine data.
n/a | 82.4 | 81.0 | 83.4 | 98.6 | 85.6 | 90.9 | 76.9 | 75.6 | 69.7 | 76.2 | 79.6 | 91.7 | 79.7 | 93.1 | 78.1 | 75.2 | 84.2 | 87.7 | 88.4 | 84.2 | 77.1 | 73.3
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 82.4 | 81.0 | 83.5 | 98.6 | 85.2 | 90.9 | 76.1 | 75.1 | 71.4 | 76.9 | 80.3 | 91.8 | 78.6 | 93.2 | 77.5 | 74.3 | 85.2 | 88.3 | 89.0 | 84.4 | 76.5 | 73.0
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 83.0 | 81.0 | 84.5 | 98.7 | 85.7 | 91.7 | 78.1 | 77.9 | 72.9 | 78.0 | 82.1 | 92.0 | 78.9 | 93.5 | 78.0 | 74.3 | 85.3 | 87.7 | 88.8 | 84.9 | 76.1 | 72.8
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 82.2 | 80.7 | 83.3 | 98.6 | 85.0 | 90.9 | 76.3 | 75.4 | 70.9 | 75.3 | 80.1 | 91.6 | 79.0 | 93.2 | 77.6 | 73.2 | 85.3 | 87.3 | 89.0 | 83.9 | 77.1 | 72.6
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab)yesyesnonononononoyesyesnonononoSemi-Supervised Learning in Video Sequences for Urban Scene SegmentationLiang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon ShlensSupervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
n/a | 83.8 | 81.6 | 85.4 | 98.8 | 86.7 | 92.1 | 79.1 | 78.5 | 74.7 | 80.1 | 83.5 | 92.5 | 79.6 | 93.8 | 78.6 | 75.1 | 85.5 | 88.0 | 88.3 | 85.9 | 77.7 | 73.9
Axial-DeepLab-XL [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 83.5 | 81.3 | 85.0 | 98.8 | 86.3 | 91.9 | 78.4 | 78.8 | 73.9 | 79.3 | 83.0 | 92.2 | 79.2 | 93.5 | 78.4 | 74.9 | 85.5 | 87.6 | 88.4 | 85.6 | 76.9 | 73.3
EfficientPS [Cityscapes-fine]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a | 82.6 | 80.9 | 83.8 | 98.6 | 85.9 | 91.2 | 78.5 | 76.0 | 70.2 | 76.3 | 80.3 | 92.0 | 78.9 | 93.4 | 79.0 | 75.5 | 85.2 | 87.9 | 87.5 | 82.6 | 76.1 | 73.5
Panoptic-DeepLab [Mapillary Vistas]yesyesnonononononononononononoPanoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic SegmentationBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenWe employ a stronger backbone, WR-41, in Panoptic-DeepLab.
For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194.
For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266.
n/a | 83.5 | 81.1 | 85.3 | 98.7 | 86.6 | 92.0 | 79.8 | 78.1 | 74.4 | 79.9 | 83.2 | 92.4 | 79.8 | 93.4 | 78.2 | 74.2 | 85.3 | 87.7 | 88.5 | 83.9 | 77.4 | 73.2
EfficientPS [Mapillary Vistas]yesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaUnderstanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
n/a | 83.4 | 81.5 | 84.8 | 98.7 | 86.3 | 91.7 | 78.4 | 77.6 | 72.7 | 79.1 | 82.5 | 92.2 | 80.4 | 93.6 | 79.5 | 75.9 | 85.9 | 88.2 | 87.8 | 84.0 | 76.8 | 73.9
seamseg_rvcsubsetnonononononononononononoyesyesSeamless Scene SegmentationPorzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, PeterThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission)
n/a | 78.5 | 78.0 | 78.9 | 88.6 | 79.1 | 87.4 | 73.4 | 70.9 | 67.2 | 72.7 | 75.9 | 90.6 | 68.9 | 92.7 | 76.5 | 72.2 | 84.7 | 84.9 | 84.4 | 79.4 | 71.9 | 69.7
EffPS_b1bs4_RVCyesyesnonononononononononoyesyesEfficientPS: Efficient Panoptic SegmentationRohit Mohan, Abhinav ValadaEfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4.
n/a | 78.3 | 78.7 | 78.0 | 97.4 | 80.1 | 87.3 | 73.0 | 66.9 | 62.0 | 63.3 | 71.5 | 89.4 | 74.5 | 93.0 | 78.3 | 71.6 | 84.4 | 85.8 | 86.1 | 80.8 | 72.4 | 70.4
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 83.4 | 81.7 | 84.5 | 98.7 | 86.0 | 91.6 | 76.8 | 76.9 | 72.9 | 79.2 | 82.5 | 92.2 | 79.7 | 93.6 | 78.5 | 75.4 | 85.4 | 87.8 | 89.2 | 86.4 | 77.2 | 73.8
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas]yesyesnonononononononononononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
n/a | 83.8 | 81.8 | 85.3 | 98.7 | 86.6 | 92.2 | 78.5 | 77.8 | 74.9 | 80.4 | 83.2 | 92.5 | 79.7 | 93.3 | 78.4 | 75.6 | 85.5 | 88.3 | 89.3 | 85.7 | 77.5 | 73.9
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels]yesyesnonononononoyesyesnonononoScaling Wide Residual Networks for Panoptic SegmentationLiang-Chieh Chen, Huiyu Wang, Siyuan QiaoWe revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.
Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are).
n/a | 83.9 | 81.7 | 85.6 | 98.8 | 86.6 | 92.2 | 79.8 | 78.6 | 74.9 | 80.4 | 83.9 | 92.6 | 79.7 | 93.8 | 78.2 | 75.7 | 85.4 | 88.9 | 89.0 | 85.3 | 77.4 | 74.0
hri_panopticyesyesnonononononononononononoAnonymous
n/a | 84.3 | 82.1 | 85.9 | 98.9 | 87.2 | 92.4 | 80.2 | 79.5 | 75.0 | 80.9 | 83.9 | 92.8 | 81.0 | 93.4 | 78.9 | 76.7 | 85.6 | 88.8 | 89.7 | 83.8 | 78.9 | 74.7
COPSyesyesnononononononono44nonoCombinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach Ahmed Abbas, Paul SwobodaNeurIPS 2021COPS fully differentiable with ResNet 50 backbone.
n/a | 81.4 | 80.2 | 82.3 | 98.3 | 84.7 | 90.1 | 76.6 | 75.1 | 67.0 | 74.8 | 77.3 | 91.2 | 78.3 | 92.5 | 76.5 | 74.1 | 83.6 | 87.6 | 89.0 | 83.8 | 75.3 | 71.7
kMaX-DeepLab [Cityscapes-fine]yesyesnonononononononononoyesyesk-means Mask TransformerQihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2022kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset.
n/a | 84.0 | 82.6 | 85.1 | 98.7 | 86.4 | 91.9 | 79.4 | 78.5 | 73.6 | 78.7 | 82.7 | 92.2 | 80.3 | 93.4 | 79.7 | 76.4 | 86.2 | 88.5 | 89.4 | 87.9 | 78.7 | 74.2

RQ on class-level
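
For orientation, recognition quality (RQ) in the panoptic quality metric is commonly defined as TP / (TP + 0.5 FP + 0.5 FN), where TP, FP and FN count the matched, unmatched predicted and unmatched ground-truth segments of a class. The following is a minimal Python sketch of that formula, not the benchmark's evaluation code. The function names, the example counts, and the choice to return 0 for a class without any segments are assumptions made for illustration. Scores are scaled to percent to match the table.

```python
from typing import Dict, Tuple


def recognition_quality(tp: int, fp: int, fn: int) -> float:
    """RQ for one class: TP / (TP + 0.5 * FP + 0.5 * FN), in [0, 1]."""
    denom = tp + 0.5 * fp + 0.5 * fn
    # Assumption of this sketch: a class with no segments at all scores 0.
    return tp / denom if denom > 0 else 0.0


def mean_rq_percent(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Average per-class RQ over the given classes, scaled to percent."""
    scores = [recognition_quality(tp, fp, fn) for tp, fp, fn in counts.values()]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical segment counts (tp, fp, fn) for two classes.
example_counts = {"car": (8, 2, 2), "bus": (1, 1, 1)}
print(recognition_quality(*example_counts["car"]))
print(mean_rq_percent(example_counts))
```

Averaging over all classes, only the "thing" classes, or only the "stuff" classes would correspond to the three summary columns that precede the per-class columns; please refer to the Benchmark Suite for the authoritative definition.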

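name | fine | fine | coarse | coarse | 16-bit | 16-bit | depth | depth | video | video | sub | sub | code | code | title | authors | venue | description | Runtime [s] | all | things | stuff | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle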
HANetyesyesnonononononononononononoAnonymousHolistic Attention Network for End-to-End Panoptic Segmentation
n/a | 63.9 | 53.5 | 71.4 | 99.8 | 84.9 | 96.1 | 25.3 | 30.3 | 65.4 | 61.6 | 82.0 | 98.4 | 46.1 | 95.4 | 62.3 | 52.6 | 70.2 | 47.8 | 56.6 | 40.2 | 50.7 | 48.0
TASCNet-enhancedyesyesnonononononononononononoLearning to Fuse Things and StuffJie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, Adrien GaidonArxivWe proposed a joint network for panoptic segmentation, which is a variation of our previous work, TASCNet. (https://arxiv.org/pdf/1812.01192.pdf)
A shared backbone (ResNeXt-101) pretrained on COCO detection is used.
n/a | 73.8 | 67.0 | 78.8 | 99.9 | 89.7 | 97.1 | 44.9 | 48.3 | 79.4 | 71.1 | 85.9 | 98.6 | 55.1 | 96.5 | 71.3 | 70.9 | 79.0 | 57.9 | 66.6 | 68.6 | 62.1 | 59.5
Sem2InsyesyesnonononononononononononoAnonymousAnonymous NeurIPS19 submission #4671
n/a | 65.2 | 48.9 | 77.0 | 99.9 | 89.7 | 96.7 | 40.6 | 47.2 | 77.2 | 66.4 | 87.6 | 98.2 | 48.6 | 94.6 | 54.0 | 55.3 | 59.6 | 43.2 | 50.6 | 40.2 | 46.1 | 42.5
Pixelwise Instance Segmentation with a Dynamically Instantiated NetworkyesyesyesyesnonononononononononoPixelwise Instance Segmentation with a Dynamically Instantiated NetworkAnurag Arnab and Philip H.S. TorrComputer Vision and Pattern Recognition (CVPR) 2017Results are produced using the method from our CVPR 2017 paper, "Pixelwise Instance Segmentation with a Dynamically Instantiated Network."

On the instance segmentation benchmark, the identical model achieved a mean AP of 23.4.

This model also served as the fully supervised baseline in our ECCV 2018 paper, "Weakly- and Semi-Supervised Panoptic Segmentation".
n/a | 68.1 | 57.0 | 76.1 | 99.9 | 89.1 | 97.1 | 42.2 | 48.4 | 65.3 | 66.0 | 85.9 | 98.5 | 48.9 | 95.9 | 60.0 | 60.9 | 64.7 | 48.1 | 58.6 | 56.8 | 57.4 | 49.5
Seamless Scene SegmentationyesyesnonononononononononoyesyesSeamless Scene SegmentationLorenzo Porzi, Samuel Rota Bulò, Aleksander Colovic and Peter KontschiederThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019Seamless Scene Segmentation is a CNN-based architecture that can be trained end-to-end to predict a complete class- and instance-specific labeling for each pixel in an image. To tackle this task, also known as "Panoptic Segmentation", we take advantage of a novel segmentation head that seamlessly integrates multi-scale features generated by a Feature Pyramid Network with contextual information conveyed by a light-weight DeepLab-like module.

In this submission we use a single model, with a ResNet50 backbone, pre-trained on ImageNet and Mapillary Vistas Research Edition, and fine-tuned on Cityscapes' fine training set. Inference is single-shot, without any form of test-time augmentation. Validation scores of the submitted model are 64.97 PQ, 68.04 PQ stuff, 60.75 PQ thing, 80.73 IoU.
n/a | 75.3 | 69.6 | 79.4 | 99.9 | 90.3 | 97.6 | 47.9 | 53.0 | 83.9 | 66.7 | 81.6 | 98.4 | 58.4 | 95.7 | 73.5 | 71.5 | 81.1 | 60.2 | 72.0 | 67.5 | 67.0 | 64.3
SSAPyesyesnonononononononononononoSSAP: Single-Shot Instance Segmentation With Affinity PyramidNaiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, Kaiqi HuangICCV 2019SSAP, ResNet-101, Cityscapes fine-only
n/a | 70.6 | 58.3 | 79.6 | 99.9 | 89.5 | 97.1 | 46.4 | 50.9 | 82.5 | 71.3 | 89.5 | 98.1 | 55.3 | 95.1 | 61.8 | 60.5 | 74.3 | 47.2 | 59.8 | 57.3 | 55.0 | 50.1
Panoptic-DeepLab [Cityscapes-fine]yesyesnonononononononononononoPanoptic-DeepLabBowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh ChenOur proposed bottom-up Panoptic-DeepLab is conceptually simple yet delivers state-of-the-art results. The Panoptic-DeepLab adopts dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation respectively. The semantic segmentation prediction follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation prediction involves a simple instance center regression, where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center.
This submission exploits only Cityscapes fine annotations.
n/a | 74.8 | 64.7 | 82.1 | 99.8 | 91.4 | 97.6 | 50.3 | 50.9 | 90.3 | 81.6 | 87.9 | 98.7 | 58.0 | 97.0 | 69.7 | 68.3 | 78.6 | 51.4 | 65.6 | 60.8 | 61.7 | 61.0
iFLYTEK-CVyesyesyesyesnonononononononononoAnonymousiFLYTEK Research, CV Group
n/a | 78.5 | 71.3 | 83.8 | 99.9 | 91.9 | 98.0 | 57.9 | 60.6 | 89.8 | 76.3 | 90.8 | 99.0 | 60.4 | 96.6 | 75.5 | 74.9 | 82.0 | 62.1 | 72.3 | 73.2 | 67.3 | 62.9
Unifying Training and Inference for Panoptic Segmentation [Cityscapes-fine]yesyesnonononononononononoyesyesUnifying Training and Inference for Panoptic SegmentationQizhu Li, Xiaojuan Qi, Philip H.S. TorrThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation. In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both “stuff” and “thing” classes, without any post-processing.

This model uses a ResNet-50 backbone, and is trained with only Cityscapes' fine data.
n/a | 73.9 | 66.2 | 79.6 | 99.9 | 90.2 | 96.9 | 44.3 | 50.3 | 81.8 | 73.8 | 87.1 | 98.4 | 56.3 | 96.2 | 69.7 | 70.8 | 76.7 | 58.0 | 67.5 | 63.5 | 63.1 | 60.3
Unifying Training and Inference for Panoptic Segmentation [COCO]yesyesnonononononononononoyesyesUnifying Training and Inference for Panoptic SegmentationQizhu Li, Xiaojuan Qi, Philip H.S. TorrThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation. In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both “stuff” and “thing” classes, without any post-processing.

This model uses a ResNet-101 backbone, and is pretrained on COCO 2017 training images and finetuned on Cityscapes' fine data.
n/a | 75.9 | 69.1 | 80.9 | 99.9 | 90.3 | 97.7 | 52.2 | 51.7 | 84.1 | 73.9 | 87.7 | 98.4 | 57.3 | 96.5 | 72.4 | 72.8 | 78.2 | 58.7 | 70.7 | 72.6 | 65.6 | 61.9
Axial-DeepLab-XL [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 75.2 | 66.3 | 81.7 | 99.9 | 92.1 | 97.6 | 50.1 | 55.2 | 87.4 | 73.4 | 89.3 | 98.9 | 58.1 | 97.0 | 70.3 | 68.4 | 79.5 | 56.8 | 66.7 | 65.8 | 63.1 | 60.0
Axial-DeepLab-L [Mapillary Vistas]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 78.1 | 70.1 | 84.0 | 99.9 | 92.8 | 98.1 | 56.4 | 61.0 | 91.7 | 76.5 | 88.8 | 99.2 | 62.7 | 97.2 | 71.6 | 71.2 | 80.6 | 63.0 | 72.6 | 73.3 | 66.5 | 61.6
Axial-DeepLab-L [Cityscapes-fine]yesyesnonononononononononoyesyesAxial-DeepLab: Stand-Alone Axial-Attention for Panoptic SegmentationHuiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh ChenECCV 2020 (spotlight)Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
n/a | 75.3 | 66.0 | 82.1 | 99.9 | 91.4 | 97.5 | 49.7 | 53.9 | 89.4 | 78.9 | 89.4 | 99.0 | 57.2 | 97.0 | 69.4 | 69.4 | 78.3 | 54.5 | 68.6 | 68.8 | 60.9 | 58.4
Naive-Student (iterative semi-supervised learning with Panoptic-DeepLab) | yes | no | no | no | yes | no | no | Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation | Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon Shlens | | Supervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks. | n/a | 80.2 | 75.3 | 83.8 | 99.9 | 92.9 | 98.1 | 57.2 | 60.8 | 92.1 | 74.3 | 88.8 | 99.4 | 61.0 | 96.9 | 76.6 | 77.2 | 84.5 | 68.2 | 78.4 | 75.2 | 72.9 | 69.5
Axial-DeepLab-XL [Mapillary Vistas] | yes | no | no | no | no | no | yes | Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen | ECCV 2020 (spotlight) | Convolution exploits locality for efficiency at a cost of missing long range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In companion, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant that is 3.8x parameter-efficient and 27x computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes. | n/a | 79.0 | 72.0 | 84.0 | 100.0 | 92.5 | 98.3 | 60.7 | 63.5 | 91.4 | 74.3 | 88.4 | 99.2 | 58.6 | 97.3 | 72.9 | 73.3 | 81.7 | 65.4 | 78.0 | 73.2 | 68.2 | 63.5
EfficientPS [Cityscapes-fine] | yes | no | no | no | no | no | yes | EfficientPS: Efficient Panoptic Segmentation | Rohit Mohan, Abhinav Valada | | Understanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date. | n/a | 76.8 | 70.2 | 81.7 | 99.9 | 92.0 | 97.8 | 53.1 | 55.0 | 85.4 | 72.9 | 87.2 | 98.7 | 59.8 | 96.4 | 77.1 | 76.2 | 82.6 | 55.1 | 68.5 | 66.7 | 66.8 | 68.7
Panoptic-DeepLab [Mapillary Vistas] | yes | no | no | no | no | no | no | Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation | Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh Chen | | We employ a stronger backbone, WR-41, in Panoptic-DeepLab. For Panoptic-DeepLab, please refer to https://arxiv.org/abs/1911.10194. For wide-ResNet-41 (WR-41) backbone, please refer to https://arxiv.org/abs/2005.10266. | n/a | 78.8 | 72.5 | 83.5 | 99.9 | 92.1 | 97.9 | 55.7 | 59.8 | 92.5 | 75.0 | 89.6 | 99.2 | 59.5 | 96.9 | 74.8 | 74.5 | 82.7 | 65.8 | 76.3 | 68.3 | 70.1 | 67.6
EfficientPS [Mapillary Vistas] | yes | no | no | no | no | no | yes | EfficientPS: Efficient Panoptic Segmentation | Rohit Mohan, Abhinav Valada | | Understanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date. | n/a | 79.6 | 74.6 | 83.3 | 99.9 | 92.4 | 98.3 | 56.5 | 58.7 | 90.8 | 74.1 | 89.4 | 99.1 | 60.5 | 96.6 | 77.5 | 77.1 | 83.2 | 66.7 | 77.6 | 77.7 | 67.5 | 69.5
seamseg_rvcsubset | no | no | no | no | no | no | yes | Seamless Scene Segmentation | Porzi, Lorenzo and Rota Bulò, Samuel and Colovic, Aleksander and Kontschieder, Peter | The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019 | Seamless Scene Segmentation Resnet101, pretrained on Imagenet; supplied with altered MVD to include WildDash2 classes; does not contain other RVC label policies (i.e. no ADE20K/COCO-specific classes -> rvcsubset and not a proper submission) | n/a | 64.8 | 53.0 | 73.3 | 98.2 | 75.6 | 97.4 | 41.9 | 42.6 | 74.7 | 64.7 | 77.1 | 97.9 | 42.1 | 94.7 | 65.0 | 56.4 | 68.5 | 46.2 | 58.1 | 32.2 | 51.3 | 46.6
EffPS_b1bs4_RVC | yes | no | no | no | no | no | yes | EfficientPS: Efficient Panoptic Segmentation | Rohit Mohan, Abhinav Valada | | EfficientPS with EfficientNet-b1 backbone. Trained with a batch size of 4. | n/a | 59.2 | 51.9 | 64.4 | 99.8 | 80.7 | 96.3 | 30.3 | 30.7 | 46.1 | 32.2 | 59.9 | 97.1 | 41.4 | 94.5 | 61.2 | 63.6 | 73.4 | 49.8 | 56.4 | 10.9 | 50.7 | 49.0
Panoptic-DeepLab w/ SWideRNet [Cityscapes-fine] | yes | no | no | no | no | no | no | Scaling Wide Residual Networks for Panoptic Segmentation | Liang-Chieh Chen, Huiyu Wang, Siyuan Qiao | | We revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. | n/a | 77.0 | 69.2 | 82.7 | 99.9 | 92.2 | 97.7 | 51.5 | 61.2 | 91.3 | 74.0 | 88.8 | 99.0 | 57.6 | 96.9 | 74.4 | 74.7 | 82.7 | 58.4 | 69.5 | 61.9 | 66.9 | 65.3
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas] | yes | no | no | no | no | no | no | Scaling Wide Residual Networks for Panoptic Segmentation | Liang-Chieh Chen, Huiyu Wang, Siyuan Qiao | | We revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. | n/a | 80.2 | 74.4 | 84.4 | 99.9 | 92.3 | 97.9 | 59.1 | 62.7 | 92.6 | 75.5 | 91.4 | 99.2 | 59.8 | 98.3 | 76.2 | 77.1 | 84.2 | 67.3 | 77.3 | 73.4 | 71.2 | 68.6
Panoptic-DeepLab w/ SWideRNet [Mapillary Vistas + Pseudo-labels] | yes | no | no | no | yes | no | no | Scaling Wide Residual Networks for Panoptic Segmentation | Liang-Chieh Chen, Huiyu Wang, Siyuan Qiao | | We revisit the architecture design of Wide Residual Networks. We design a baseline model by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution to the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime. Following Naive-Student, this model is additionally trained with pseudo-labels generated from Cityscapes Video and train-extra set (i.e., the coarse annotations are not used, but the images are). | n/a | 80.9 | 75.6 | 84.8 | 99.9 | 93.3 | 98.4 | 58.7 | 68.3 | 94.3 | 74.6 | 88.9 | 99.2 | 59.8 | 96.9 | 77.5 | 75.9 | 85.0 | 69.5 | 81.0 | 72.8 | 73.0 | 69.9
hri_panoptic | yes | no | no | no | no | no | no | | Anonymous | | | n/a | 79.9 | 74.1 | 84.1 | 99.9 | 92.9 | 97.9 | 59.8 | 63.3 | 89.9 | 74.7 | 89.6 | 98.9 | 60.4 | 98.2 | 76.8 | 73.4 | 84.6 | 70.1 | 76.5 | 73.6 | 70.4 | 67.5
COPS | yes | no | no | no | no | 4 | no | Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach | Ahmed Abbas, Paul Swoboda | NeurIPS 2021 | COPS fully differentiable with ResNet 50 backbone. | n/a | 72.6 | 64.6 | 78.5 | 99.8 | 88.6 | 97.2 | 45.6 | 49.4 | 76.6 | 71.5 | 85.7 | 98.4 | 54.6 | 96.1 | 67.7 | 66.1 | 76.6 | 52.2 | 64.3 | 71.1 | 60.2 | 58.3
kMaX-DeepLab [Cityscapes-fine] | yes | no | no | no | no | no | yes | k-means Mask Transformer | Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen | ECCV 2022 | kMaX-DeepLab w/ ConvNeXt-L backbone (ImageNet-22k + 1k pretrained). This result is obtained by the kMaX-DeepLab trained for Panoptic Segmentation task. No test-time augmentation or other external dataset. | n/a | 77.9 | 71.9 | 82.3 | 99.9 | 92.2 | 97.7 | 57.4 | 58.4 | 81.6 | 73.2 | 87.2 | 99.0 | 60.6 | 97.7 | 70.3 | 74.1 | 79.9 | 65.2 | 77.4 | 77.1 | 67.9 | 63.6

3D Vehicle Detection Task

All average metrics

name | 3d | 16-bit | depth | video | sub | code | title | authors | venue | description | Runtime [s] | DS | AP | BEV | OS Yaw | OS PitchRoll | SizeSim
3D-GCK | yes | no | no | no | no | no | Single-Shot 3D Detection of Vehicles from Monocular RGB Images via Geometry Constrained Keypoints in Real-Time | Nils Gählert, Jun-Jun Wan, Nicolas Jourdan, Jan Finkbeiner, Uwe Franke, and Joachim Denzler | IV 2020 | 3D-GCK is based on the standard SSD 2D object detection framework and lifts the 2D detections to 3D space by predicting additional regression and classification parameters. Hence, the runtime is kept close to pure 2D object detection. The additional parameters are transformed to 3D bounding box keypoints within the network under geometric constraints. 3D-GCK features a full 3D description including all three angles of rotation without supervision by any labeled ground truth data for the object's orientation, as it focuses on certain keypoints within the image plane. | 0.04 | 37.4 | 42.5 | 96.1 | 81.9 | 100.0 | 70.7
HW-Noah-AVPNet2.3 | yes | no | no | no | no | no | | Anonymous | | | 0.04 | 40.1 | 43.5 | 96.0 | 88.0 | 100.0 | 82.1
iFlytek-ZBGKRD-fcos3d-depth-norm | yes | no | no | no | no | no | | Anonymous | | | n/a | 42.9 | 47.6 | 96.6 | 80.4 | 100.0 | 80.4

DS on class-level

name | 3d | 16-bit | depth | video | sub | code | title | authors | venue | description | Runtime [s] | all | car | truck | bus | train | motorcycle | bicycle
3D-GCK | yes | no | no | no | no | no | Single-Shot 3D Detection of Vehicles from Monocular RGB Images via Geometry Constrained Keypoints in Real-Time | Nils Gählert, Jun-Jun Wan, Nicolas Jourdan, Jan Finkbeiner, Uwe Franke, and Joachim Denzler | IV 2020 | 3D-GCK is based on the standard SSD 2D object detection framework and lifts the 2D detections to 3D space by predicting additional regression and classification parameters. Hence, the runtime is kept close to pure 2D object detection. The additional parameters are transformed to 3D bounding box keypoints within the network under geometric constraints. 3D-GCK features a full 3D description including all three angles of rotation without supervision by any labeled ground truth data for the object's orientation, as it focuses on certain keypoints within the image plane. | 0.04 | 37.4 | 67.5 | 29.0 | 32.3 | 23.1 | 32.9 | 39.9
HW-Noah-AVPNet2.3 | yes | no | no | no | no | no | | Anonymous | | | 0.04 | 40.1 | 77.2 | 30.0 | 29.9 | 24.5 | 37.2 | 42.0
iFlytek-ZBGKRD-fcos3d-depth-norm | yes | no | no | no | no | no | | Anonymous | | | n/a | 42.9 | 75.8 | 33.3 | 41.7 | 23.6 | 39.6 | 43.5
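
The "all" column in the DS on class-level table appears to be the unweighted mean of the six per-class scores. That is an assumption read off the numbers, not something this page states; the Benchmark Suite holds the authoritative definition. A minimal Python sketch that checks it against the three rows above (the names `DS_ROWS` and `class_mean` are hypothetical, chosen only for illustration):

```python
# DS values copied from the "DS on class-level" table.
# Per-class order: car, truck, bus, train, motorcycle, bicycle.
# DS_ROWS and class_mean are illustrative names, not part of any benchmark tooling.
DS_ROWS = {
    "3D-GCK": (37.4, [67.5, 29.0, 32.3, 23.1, 32.9, 39.9]),
    "HW-Noah-AVPNet2.3": (40.1, [77.2, 30.0, 29.9, 24.5, 37.2, 42.0]),
    "iFlytek-ZBGKRD-fcos3d-depth-norm": (42.9, [75.8, 33.3, 41.7, 23.6, 39.6, 43.5]),
}


def class_mean(per_class):
    """Unweighted mean over the per-class scores (assumed aggregation)."""
    return sum(per_class) / len(per_class)


for name, (reported_all, per_class) in DS_ROWS.items():
    mean = class_mean(per_class)
    # The table shows one decimal, so allow a small rounding tolerance.
    status = "consistent" if abs(mean - reported_all) < 0.1 else "differs"
    print(f"{name}: reported all = {reported_all}, class mean = {mean:.2f} ({status})")
```

The comparison uses a tolerance instead of exact rounding because the table values are already rounded to one decimal place.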