Instance segmentation survey

Unlike existing materials, which are informational slide decks that only review papers, this deck shows how each problem was approached and how those approaches evolved, so readers can naturally infer how to tackle the problem they are working on themselves.
overview


Instance Segmentation

Outline

Introduction

Network Architecture

- FCN-driven Methods (Segmentation-first)

Instancecut CVPR17
Deep watershed CVPR17
Pixelwise instance segmentation with a dynamically instantiated network CVPR17
SGN CVPR17

- RCNN-driven Methods (Instance-first)

DeepMask NIPS15
MNC ECCV16
(InstanceFCN ECCV16)
Learning to Refine Object Segments ECCV16

(FCIS CVPR17)
Mask R-CNN

- Advanced Works

Cascade HTC
Sliding Window TensorMask
Panoptic Segmentation Panoptic FPN

Efficiency

YOLACT
CenterMask

Augmentation & Regularization

InstaBoost
Mask Scoring R-CNN


Introduction

Definition

from CS231N

  • Image Classification: Image level Classification
  • Object Detection: Multi-object Localization + Classification
  • Semantic Segmentation: Pixel-level Classification
  • Instance Segmentation: Detection + Pixel-level Segmentation of Each Instance

Some Differences from Semantic Segmentation

  • Differentiates individual objects that belong to the same class.
  • Essential to tasks such as counting the number of objects.

Some Differences from Object Detection

  • A bounding box is a very coarse object boundary: many pixels that are irrelevant to the detected object are also included in the box.
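
The two differences above can be made concrete with a toy example. The sketch below uses made-up 4x6 label maps (all values and helper names are illustrative assumptions): a semantic map cannot tell two objects of the same class apart, while an instance map can be used for counting, and a bounding box can contain many pixels that do not belong to the object.

```python
# Toy label maps (0 = background). All values are invented for illustration.
# Semantic map: every foreground pixel carries the same class id.
semantic = [
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Instance map: same foreground pixels, but each object has its own id.
instance = [
    [0, 1, 0, 0, 2, 2],
    [0, 1, 0, 0, 2, 2],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def count_ids(label_map):
    """Number of distinct non-background ids in a label map."""
    return len({v for row in label_map for v in row if v != 0})


def box_fill_ratio(label_map, obj_id):
    """Fraction of the tight bounding box that is covered by the object."""
    ys = [y for y, row in enumerate(label_map) for v in row if v == obj_id]
    xs = [x for row in label_map for x, v in enumerate(row) if v == obj_id]
    box_area = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return len(ys) / box_area


print(count_ids(semantic))          # distinct ids in the semantic map
print(count_ids(instance))          # distinct ids in the instance map
print(box_fill_ratio(instance, 1))  # L-shaped object fills only part of its box
```

The L-shaped object covers 5 of the 9 pixels in its bounding box, which is the kind of coarseness the bullet above refers to.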

Network Architecture

Kaiming He's Slide

RCNN-driven Methods

MNC
Mask R-CNN
Mask Scoring R-CNN
HTC
TensorMask
Panoptic FPN

FCN-driven Methods

DeepMask
InstanceFCN
FCIS

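1. Contribution (the paper's value from a non-technical/technical standpoint, e.g. 1) it was the winning method of the COCO Challenge 2017, 2) later works adopted this framework, 3) it was the first to solve ~ with deep learning, etc.)
2. Key idea (a more technical summary of the novelty, e.g. adding a scoring head solved something earlier methods handled poorly.)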
3.Detailed method (technical detail)
4.Results


DeepMask: Learning to Segment Object Candidates_NIPS15

#FAIR, #483 cited #earliest_instance #Review: DeepMask

1. Contribution

  • One of the earliest CNN approaches to instance segmentation
  • DeepMask is an object-proposal-based instance segmentation method that beats other methods by a large margin while considering a smaller number of proposals.
  • It generalizes to categories unseen during training.
  • A 2015 NIPS paper with more than 480 citations

2. Key Idea

  • Unlike previous approaches for generating segmentation proposals, this work does not rely on edges, superpixels, or any other form of low-level segmentation. It is the first to learn to generate segmentation proposals directly from raw image data.
  • DeepMask jointly predicts a class-agnostic mask and an object score.

3. Technical Details

3.1. Network Architecture

Model Architecture (Top), Positive Samples (Green, Left Bottom), Negative Samples (Red, Right Bottom)

The figure above gives an overall view of the model, called DeepMask. The top branch is responsible for predicting a high-quality object segmentation mask, and the bottom branch predicts the likelihood that an object is present and satisfies the following two constraints:

  1. the patch contains an object roughly centered in the input patch
  2. the object is fully contained in the patch and in a given scale range
3.2. Joint Learning

The network is trained jointly: given an input patch x_k, it predicts a pixel-wise segmentation map f_segm(x_k), with one output per location (i, j), and an object score f_score(x_k). The loss function is a sum of binary logistic regression losses, one for each location of the segmentation network and one for the object score, over all training triplets (x_k, m_k, y_k):

DeepMask joint learning loss
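
Since the loss figure itself is not reproduced here, a minimal sketch of a loss of this shape is given below for one training triplet. It assumes labels m_k and y_k take values in {-1, +1}, that the segmentation term is averaged over the output map and switched off for negative patches through the factor (1 + y_k)/2, and that the score term is weighted by a constant `lam`. The default `lam = 1/32` is what I recall from the paper and should be treated as an assumption, as should all function and argument names.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def deepmask_loss(seg_logits, mask, score_logit, y, lam=1.0 / 32):
    """Joint loss for one triplet (x_k, m_k, y_k).

    seg_logits:  h x w nested list, the outputs f_segm(x_k) per location
    mask:        h x w nested list of labels in {-1, +1} (m_k)
    score_logit: scalar output f_score(x_k)
    y:           patch label in {-1, +1} (y_k)
    lam:         weight of the score term (assumed value)
    """
    h, w = len(seg_logits), len(seg_logits[0])
    # Binary logistic loss at every location of the segmentation output.
    seg_term = sum(
        softplus(-mask[i][j] * seg_logits[i][j])
        for i in range(h)
        for j in range(w)
    )
    # (1 + y) / 2 is 1 for positive patches and 0 for negative ones,
    # so the mask is only supervised when an object is present.
    seg_term *= (1 + y) / (2.0 * h * w)
    # Binary logistic loss on the object score.
    score_term = lam * softplus(-y * score_logit)
    return seg_term + score_term
```

For a negative patch (y = -1) only the score term contributes, which matches the idea that a mask is meaningful only when the patch contains a centered object.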

4. Results

4.1. MS COCO (Boxes & Segmentation Masks)

Average Recall (AR) Detection Boxes (Left) and Segmentation Masks (Right) on MS COCO Validation Set (AR@n: the AR when n region proposals are generated. AUCx: x is the size of objects)

The table above shows results on the MS COCO dataset for both bounding box and segmentation proposals. It reports AR at different numbers of proposals (10, 100 and 1000) and also AUC (AR averaged across all proposal counts). For segmentation proposals, it reports overall AUC and also AUC at different scales (small/medium/large objects indicated by superscripts S/M/L).

4.2. Fast R-CNN results on PASCAL

The figure above shows the mean average precision (mAP) for Fast R-CNN with a varying number of proposals. Most notably, with just 100 DeepMask proposals Fast R-CNN achieves an mAP of 68.2% and outperforms the best result obtained with 2000 SelectiveSearch proposals (mAP of 66.9%). In other words, DeepMask outperforms SelectiveSearch with 20× fewer proposals.


MNC: Instance-aware Semantic Segmentation via Multi-task Network Cascades_CVPR16

#MSRA #He #712cited #oral #Review: MNC

1. Contribution

  • Three Stages: Differentiating Instances, Estimating Masks, and Categorizing Objects.
  • MNC won 1st place in the 2015 COCO segmentation challenge.
  • A 2016 CVPR paper with more than 710 citations

2. Key Idea

Multi-task Network Cascades for instance-aware semantic segmentation. At the top right corner is a simplified illustration.


InstanceFCN: Instance-sensitive Fully Convolutional Networks_ECCV16

#MSRA #He #228cited #Review: InstanceFCN

1. Contribution

  • A Fully Convolutional Network (FCN) with instance-sensitive score maps that is better than DeepMask and competitive with MNC (Multi-task Network Cascades).
  • All Fully Connected (FC) layers are removed, and competitive instance segment proposal results are obtained on both PASCAL VOC and MS COCO.
  • A 2016 ECCV paper with more than 220 citations

2. Key Idea

  • Instead of the single score map per category used in FCN, InstanceFCN predicts a set of k × k instance-sensitive score maps, each of which encodes one relative position within an object instance (top-left, ..., bottom-right).
  • An assembling step then copies the matching part of each score map into a sliding window to form an instance mask, so no Fully Connected (FC) layers are needed.

3. Technical Details

3.1. Network Architecture

  • On top of the feature map, there are two fully convolutional branches, one for estimating segment instances and the other for scoring the instances.
  • The idea is very similar to the position-sensitive score maps in R-FCN. However, R-FCN uses position-sensitive score maps for object detection, while InstanceFCN uses instance-sensitive score maps for generating segment proposals.
3.2. Instance-Sensitive Score Maps
3.2.1. Compared with FCN

  • In FCN (top), when two persons are too close to each other, it is difficult to separate them in the generated score map.
  • In InstanceFCN (bottom), however, each score map is responsible for capturing one relative position of an object instance. For example, the top-left score map is responsible for capturing the top-left part of an object instance. After assembling, a separate mask for each person can be generated.
  • Some examples of instance masks with k=3 are shown below:

Examples of instance-sensitive maps and assembled instances on the PASCAL VOC validation set. For simplicity, only the cases of k = 3 (9 instance-sensitive score maps) are shown in this figure.
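
The assembling step described above can be sketched as follows. This is a simplified illustration rather than the paper's implementation: it assumes the k × k score maps are stored in row-major order (index `r * k + c` for relative cell `(r, c)`), that the window size `m` is divisible by `k`, and that the window lies inside the maps. The function name and arguments are hypothetical.

```python
def assemble_instance(score_maps, top, left, m, k=3):
    """Assemble an m x m instance mask from k*k instance-sensitive score maps.

    score_maps: list of k*k maps, each an H x W nested list; map r*k + c
                scores the relative position (r, c) inside an instance
    top, left:  top-left corner of the sliding window
    m:          window size (assumed divisible by k)
    """
    cell = m // k
    out = [[0.0] * m for _ in range(m)]
    for dy in range(m):
        for dx in range(m):
            # Which of the k x k relative cells does this pixel fall into?
            r = min(dy // cell, k - 1)
            c = min(dx // cell, k - 1)
            # Copy the value from the score map responsible for that cell.
            out[dy][dx] = score_maps[r * k + c][top + dy][left + dx]
    return out


# Tiny check: make map i constant and equal to i, then assemble a 6x6 window.
maps = [[[float(i)] * 6 for _ in range(6)] for i in range(9)]
window = assemble_instance(maps, top=0, left=0, m=6, k=3)
print(window[0])  # first row comes from maps 0, 1 and 2
```

Because the same set of score maps is reused for every window position, two adjacent people produce different assembled masks even where a single FCN score map would merge them.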

3.2.2. Compared with DeepMask

  • DeepMask uses FC layers, which makes the model large.
  • InstanceFCN has no FC layers, which makes the model more compact.

4. Result - MS COCO Validation Set

Segment Proposals on the First 5k Images of MS COCO Validation Set


FCIS: Fully Convolutional Instance-aware Semantic Segmentation_CVPR17

#MSRA, #CVPR2017spotlight #406cited #Review: FCIS

1. Implication

  • The first fully convolutional end-to-end solution for instance segmentation
  • By introducing position-sensitive inside/outside score maps, the convolutional representation is fully shared between the detection and segmentation sub-tasks, which yields both high accuracy and efficiency.
  • FCIS won 1st place in the 2016 COCO segmentation challenge, outperforming the second-place entry by 12% relative in accuracy. It also ranked 2nd on the 2016 COCO detection leaderboard at that time.
  • Much faster than the previous winner (MNC: 1.4s/image, FCIS: 0.24s/image)

2. Position-Sensitive Inside/Outside Score Maps

(a) Conventional fully convolutional network (FCN) for semantic segmentation. A single score map is used for each category, which is unaware of individual object instances. (b) InstanceFCN for instance segment proposal, where 3 × 3 position-sensitive score maps are used to encode relative position information. A downstream network is used for segment proposal classification. (c) Our fully convolutional instance-aware semantic segmentation method (FCIS), where position-sensitive inside/outside score maps are used to perform object segmentation and detection jointly and simultaneously.

  • R-FCN produces position-sensitive score maps for object detection, while InstanceFCN produces instance-sensitive score maps for generating segment proposals. The position-sensitive inside/outside score maps are easier to understand once R-FCN and InstanceFCN are understood.
  • In FCIS, position-sensitive inside/outside score maps are used to perform object segmentation and detection jointly and simultaneously.
  • Each score map is responsible for capturing one relative position of an object instance. For example, the top-left score map is responsible for capturing the top-left part of an object instance. After assembling, a separate mask for each person can be generated.
  • Unlike R-FCN and InstanceFCN, FCIS has two sets of score maps.
  • To assemble an ROI inside map, the top-left, top-center, top-right, … and bottom-right parts are each taken from the corresponding position-sensitive inside score map. The ROI outside map is assembled in the same way from the position-sensitive outside score maps.
  • The result is two assembled maps per ROI: an ROI inside map and an ROI outside map.

  • Based on these two maps, there are two pathways. One is for the instance mask, where a pixel-wise softmax is used for the segmentation loss. The other is for the category likelihood, where the detection score is obtained by average pooling over all pixels' likelihoods. Thus, the convolutional representation is fully shared between the detection and segmentation sub-tasks.
  • Some examples:
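
Apart from the figure examples, the two pathways can be sketched numerically. The sketch below assumes, following my reading of the paper, that the mask pathway applies a softmax over the (inside, outside) pair at each pixel and that the detection pathway takes the pixel-wise maximum of the two scores before average pooling. It handles a single category, and a softmax across categories would still follow for the final class likelihood. The function name is hypothetical.

```python
import math


def fuse_inside_outside(inside, outside):
    """Turn one category's ROI inside/outside maps into a mask and a score.

    inside, outside: h x w nested lists of raw scores for the same ROI
    Returns (mask_prob, det_score):
      mask_prob: per-pixel foreground probability, softmax over {inside, outside}
      det_score: average over pixels of max(inside, outside); a high value
                 means most pixels vote that the ROI contains this category
    """
    mask_prob = []
    det_sum, n = 0.0, 0
    for row_in, row_out in zip(inside, outside):
        prob_row = []
        for a, b in zip(row_in, row_out):
            peak = max(a, b)
            ea, eb = math.exp(a - peak), math.exp(b - peak)
            prob_row.append(ea / (ea + eb))  # pixel-wise softmax (mask pathway)
            det_sum += peak                  # pixel-wise max (detection pathway)
            n += 1
        mask_prob.append(prob_row)
    return mask_prob, det_sum / n


inside = [[2.0, 0.0], [0.0, -3.0]]
outside = [[0.0, 0.0], [2.0, -3.0]]
mask, score = fuse_inside_outside(inside, outside)
print(mask, score)
```

A pixel where both scores are low (the bottom-right one here) pulls the detection score down without claiming to be foreground or background, which is how one pair of maps serves both tasks.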

3. Network Architecture

FCIS (Fully Convolutional Instance-aware Semantic Segmentation) Architecture

  • During training, an ROI is positive if its IoU with the nearest ground-truth box is larger than 0.5. There are 3 loss terms: a softmax detection loss over C+1 categories, a softmax segmentation loss over the ground-truth category only, and a bbox regression loss. The latter two are only effective on positive ROIs.
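
The positive/negative ROI rule above can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in continuous coordinates, and "nearest ground truth" is interpreted here as the ground-truth box with the highest IoU. Both are assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive_roi(roi, gt_boxes, thresh=0.5):
    """An ROI is positive if its best IoU with any ground truth exceeds thresh."""
    best = max((box_iou(roi, g) for g in gt_boxes), default=0.0)
    return best > thresh


roi = (0, 0, 10, 10)
print(is_positive_roi(roi, [(0, 0, 10, 8)]))  # IoU 0.8
print(is_positive_roi(roi, [(0, 0, 10, 5)]))  # IoU exactly 0.5, not larger
```

Only ROIs for which this check is true would contribute to the segmentation and bbox regression losses.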

4. Result

Instance-aware semantic segmentation results of different entries for the COCO segmentation challenge (2015 and 2016) on COCO test-dev set.

  • FAIRCNN: Actually it is the team name of MultiPathNet, 2nd place in 2015.
  • MNC+++: MNC submission results which won the 1st place in 2015.
  • G-RMI: 2nd place in 2016, by the Google Research and Machine Intelligence team. (This is not the approach that won the object detection challenge.)
  • FCIS baseline: It’s already better than MultiPathNet and MNC.
  • +Multi-scale testing: using pyramid of testing images, where the shorter sides are of {480, 576, 688, 864, 1200, 1400} pixels for testing.
  • +horizontal flip: Flip the image horizontally and test again, then average the results.
  • +multi-scale training: multi-scale training at the same scales as in multi-scale inference is applied.
  • +ensemble: 6 networks are ensembled.
  • Finally, FCIS with the above tricks is 3.8% higher than G-RMI in absolute terms (11% relative).
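
Of the test-time tricks listed above, horizontal flipping is the easiest to show in code. The sketch below is a generic illustration, not the FCIS code: `predict` stands for any hypothetical function that maps an image (a nested list) to a dense score map of the same size.

```python
def hflip(grid):
    """Flip a 2D nested list horizontally."""
    return [row[::-1] for row in grid]


def flip_tta(predict, image):
    """Average the prediction on an image with the prediction on its mirror.

    predict: hypothetical callable, image -> score map of the same shape
    """
    direct = predict(image)
    # Predict on the mirrored image, then mirror the result back so that
    # both score maps are aligned with the original pixel grid.
    mirrored_back = hflip(predict(hflip(image)))
    return [
        [(a + b) / 2.0 for a, b in zip(r1, r2)]
        for r1, r2 in zip(direct, mirrored_back)
    ]


# A deliberately left/right-biased stand-in model, for demonstration only.
def biased_model(img):
    return [[v * (x + 1) for x, v in enumerate(row)] for row in img]


print(flip_tta(biased_model, [[1, 1]]))
```

Averaging cancels the left/right bias of the stand-in model, which is the intuition behind the trick. Multi-scale testing follows the same pattern, with resizing in place of flipping.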

Mask R-CNN_ICCV17

#FAIR, #5290cited, #BestPaper, #Taeoh's GoodSlide, #he's iccv17tutorial

1. Implication

  • Mask R-CNN = Faster R-CNN with FCN on RoIs
  • Outperforms previous models on all COCO challenge tasks (instance segmentation, bounding-box object detection, person keypoint detection).
  • Proposes RoIAlign (vs. RoIPool) to preserve exact spatial locations during training, which improves accuracy.
  • Decouples mask prediction from class prediction by predicting a binary mask for each class independently, without competition among classes, which also improves accuracy.
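
The difference between RoIPool and RoIAlign comes down to how a real-valued coordinate on the feature map is read. The sketch below is heavily simplified and uses hypothetical names: it compares a single quantized lookup with a single bilinear sample, whereas the real operators pool several such samples per output bin.

```python
import math


def bilinear(feat, y, x):
    """RoIAlign-style read: bilinear interpolation at a real-valued (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def quantized(feat, y, x):
    """RoIPool-style read: snap the coordinate to the integer grid first."""
    return feat[int(round(y))][int(round(x))]


# A horizontal ramp: the feature value equals the x coordinate.
feat = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
print(bilinear(feat, 1.0, 1.25))   # keeps the sub-pixel offset
print(quantized(feat, 1.0, 1.25))  # loses the 0.25 offset
```

On this ramp the quantized read is off by exactly the fractional part of the coordinate. That kind of misalignment matters little for classification but hurts pixel-accurate masks.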

2. Network Architecture

• Mask R-CNN = Faster R-CNN with FCN on RoIs

3. Result

3.1 Ablation Study: Multinomial vs Binary Masks

from He's slide

3.2 Ablation Study: RoIPool vs RoIAlign

from He's slide

3.3 Instance Segmentation Results in COCO

from He's slide
from He's slide


Mask Scoring R-CNN_CVPR19

1. Implication

Comparisons of Mask R-CNN and our proposed MS R-CNN. (a) shows the results of Mask R-CNN, the mask score has less relationship with MaskIoU. (b) shows the results of MS R-CNN, we penalize the detection with high score and low MaskIoU, and the mask score can correlate with MaskIoU better. (c) shows the quantitative results, where we average the score between each MaskIoU interval, we can see that our method can have a better correspondence between score and MaskIoU.

  • Previous methods including Mask R-CNN treat the confidence of instance classification the same as the mask quality (measured with IoU, Intersection-over-Union) although they are usually not well correlated.
  • The new method uses a network to learn the quality of the predicted instance masks via regression (measured with a MaskIoU score) and then penalizes the instance mask score if the classification score is high while the actual mask quality is low.
  • Mask Scoring R-CNN demonstrates new SOTA results, consistently outperforming Mask R-CNN on the COCO benchmark for instance segmentation.
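
The rescoring idea can be sketched as follows. The sketch assumes that the final mask score is the product of the classification score and the predicted MaskIoU, which is my understanding of the paper. The MaskIoU regression target is computed between the binarized predicted mask and the ground-truth mask. The function names and the 0.5 binarization threshold are illustrative assumptions.

```python
def mask_iou(pred_prob, gt_mask, thresh=0.5):
    """IoU between a binarized predicted mask and a 0/1 ground-truth mask.

    This value would serve as the regression target of a MaskIoU head.
    """
    inter = union = 0
    for prob_row, gt_row in zip(pred_prob, gt_mask):
        for p, g in zip(prob_row, gt_row):
            b = 1 if p > thresh else 0
            inter += b & g
            union += b | g
    return inter / union if union else 0.0


def mask_score(cls_score, predicted_mask_iou):
    """Rescore an instance: confident class but poor mask gives a low score."""
    return cls_score * predicted_mask_iou


pred = [[0.9, 0.8], [0.2, 0.7]]
gt = [[1, 0], [0, 1]]
print(mask_iou(pred, gt))     # 2 pixels overlap out of 3 in the union
print(mask_score(0.9, 0.5))   # high class confidence, mediocre mask
```

With this rescoring, two detections with the same classification score are ranked by how good their masks are expected to be.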

2. Network Architecture

The network architecture of Mask Scoring R-CNN. The input image is fed into a backbone network to generate RoIs via RPN and RoI features via RoIAlign. The RCNN head and Mask head are standard components of Mask R-CNN. For predicting MaskIoU, we use the predicted mask and RoI feature as input. The MaskIoU head has 4 convolution layers (all have kernel=3 and the final one uses stride=2 for downsampling) and 3 fully connected layers (the final one outputs C classes MaskIoU.)

3. Result

Comparing different instance segmentation methods on COCO 2017 test-dev.

  • The results show that, regardless of the backbone network, MS R-CNN outperforms Mask R-CNN by more than one percentage point.

HTC: Hybrid Task Cascade for Instance Segmentation_CVPR19

#MMDet

1. Implication

  • The key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation, which is what Hybrid Task Cascade (HTC) is designed to do.
  • Won 1st place in the COCO 2018 Challenge Object Detection Task

2. Network Architecture

The architecture evolution from Cascade Mask R-CNN to Hybrid Task Cascade.

  • Hybrid Task Cascade (HTC) is a new cascade architecture for instance segmentation. It interweaves box and mask branches for joint multi-stage processing and adopts a semantic segmentation branch to provide spatial context.
  • This framework progressively refines mask predictions and integrates complementary features together in each stage.

3. Result: COCO test-dev

Comparison with state-of-the-art methods on COCO test-dev dataset.


InstaBoost: Boosting Instance Segmentation via Probability Map Guided Copy-Pasting_ICCV19

#augmentation

(a) original image with ground truth mask label (b) The result of random InstaBoost (c) appearance consistency heatmap of this image (d) the result of appearance consistency heatmap guided InstaBoost

1. Implication

  • This paper proposes random InstaBoost, an augmentation technique that pastes objects in the neighborhood of their original positions.
  • It also proposes appearance consistency heatmap guided InstaBoost, which uses a probability map representing reasonable placements that align with real-world experience.
  • The method is simple to implement, does not increase computational complexity, and is easily integrated into the training pipeline of any instance segmentation model.
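
A minimal sketch of the random variant is given below, under strong simplifications: the image is a small nested list, only translation is jittered, and the vacated pixels are filled with a constant instead of being inpainted. All function names and the `max_shift` parameter are hypothetical.

```python
import random


def shift_instance(image, mask, dy, dx, fill=0):
    """Move the masked object by (dy, dx) and return (new_image, new_mask).

    The vacated pixels are set to `fill`, a crude stand-in for inpainting.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    new_mask = [[0] * w for _ in range(h)]
    # First erase the object at its original location.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                out[y][x] = fill
    # Then paste it at the shifted location, dropping out-of-bounds pixels.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    out[ny][nx] = image[y][x]
                    new_mask[ny][nx] = 1
    return out, new_mask


def random_jitter(image, mask, max_shift=1, rng=random):
    """Paste the object near its original position with a small random offset."""
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    return shift_instance(image, mask, dy, dx)


image = [[0, 0, 0], [0, 7, 0], [0, 0, 0]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(shift_instance(image, mask, 0, 1))
```

The mask moves together with the pixels, so the augmented sample stays correctly labeled. The heatmap-guided variant would replace the uniform offset with a draw from the appearance consistency map.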

2. Result

2.1 InstaBoost-Demo
2.2 Instance Segmentation result on COCO test-dev

2.3 Object Detection results on COCO test-dev


TensorMask: A Foundation for Dense Object Segmentation_ICCV19

#FAIR

1. Implication

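  • TensorMask establishes the first dense sliding-window instance segmentation system that achieves results close to Mask R-CNN.
  • Because the output at every spatial location is itself a mask with its own spatial dimensions, dense instance segmentation is treated as a prediction task over 4D tensors. TensorMask explicitly captures this geometry and enables novel operators on 4D tensors.
  • Enabled by the TensorMask framework, the authors develop a pyramid structure over a scale-indexed list of 4D tensors, which they call a tensor bipyramid.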

2. Comparison with Mask R-CNN for instance segmentation on COCO test-dev

These results demonstrate that dense sliding-window methods can close the gap to ‘detect-then-segment’ systems.


CenterMask: Real-Time Anchor-Free Instance Segmentation_CVPR20

Accuracy-speed Tradeoff.

1. Implication

2. Network Architecture

3. Result

3.1 CenterMask instance segmentation and detection performance on COCO test-dev2017

3.2 CenterMask with other backbones on COCO val2017.


Panoptic Segmentation(network, module)


PointRend: Image Segmentation as Rendering_arxiv

#FAIR

1. Implication

2. Network Architecture

3. Result


Panoptic Feature Pyramid Networks_CVPR19

#FAIR

1. Implication

2. Network Architecture

3. Result


UPSNet: A Unified Panoptic Segmentation Network_CVPR19

#Uber ATG

1. Implication

2. Network Architecture

3. Result


SOGNet: Scene Overlap Graph Network for Panoptic Segmentation_arxiv

1. Implication

2. Network Architecture

3. Result