【SeSL】 Details of Interactive segmentation & PseudoSeg


RITM for interactive segmentation

Paper: ritm_interactive_segmentation [paper] [code]

  • abstract
    • Goal: a feedforward-only model with no backward passes at inference (backpropagation-based refinement is hard to deploy on mobile)
    • Finding: the choice of training dataset (a combination of COCO and LVIS) greatly impacts performance.
  • Methods
    • Model: no need to reinvent the segmentation model → DeepLabV3+ (efficient) / HRNet+OCR (high-resolution output, thus preferable)
    • Click encoding: disks with a small radius (local effect), instead of the distance transform (global effect)
    • RGB → N channel input: Conv1E, Conv2S
    • Training strategy (=interactive sampling strategy):
      1. mislabelled regions → clicks sampled from eroded versions of these regions
      2. random and iterative point sampling
    • Incorporating Masks From Previous Steps (Input = RGB + positives + negatives + masks from previous steps)
    • Binary cross entropy (BCE) → Normalized Focal Loss (NFL) for fast training
    • Zoom in
      • f-BRS-Rethinking (Sec. 4) / test-time augmentations
      • Starting from the third click, the image is cropped to the bounding box of the predicted object, and interactive segmentation is applied only to this Zoom-In area. The bounding box is extended by 40% along each side to preserve context. If a user clicks outside the bounding box, the zoom-in area is expanded.
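The click-encoding bullet above can be sketched in a few lines; `encode_clicks_as_disks` and the 5-channel concatenation are illustrative names, not the authors' code:

```python
import numpy as np

def encode_clicks_as_disks(clicks, hw, radius=5):
    """Encode (y, x) clicks as a binary map of filled disks.

    RITM-style disk encoding: each click affects only a small
    neighborhood (local effect), unlike a distance transform,
    which spreads a click's influence over the whole image.
    """
    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.zeros((h, w), dtype=np.float32)
    for cy, cx in clicks:
        m[(ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2] = 1.0
    return m

# Positive and negative click maps become two extra input channels
# concatenated with the RGB image (fused into the backbone, e.g. Conv1E).
pos = encode_clicks_as_disks([(20, 30)], (64, 64))
neg = encode_clicks_as_disks([(50, 10), (5, 60)], (64, 64))
rgb = np.zeros((64, 64, 3), dtype=np.float32)  # placeholder image
x = np.concatenate([rgb, pos[..., None], neg[..., None]], axis=-1)  # H×W×5
```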
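The BCE → NFL switch can be illustrated with a minimal NumPy version. Normalizing by the sum of the focal modulating weights is a common formulation of NFL and is an assumption here, not copied from the paper:

```python
import numpy as np

def normalized_focal_loss(p, target, gamma=2.0, eps=1e-8):
    """Sketch of Normalized Focal Loss (NFL) on per-pixel foreground
    probabilities p. ASSUMPTION: dividing the focal term by the sum of
    the modulating weights (1 - p_t)^gamma keeps the total gradient
    magnitude comparable to BCE even as predictions improve, which is
    what speeds up training relative to plain focal loss.
    """
    p_t = np.where(target > 0.5, p, 1.0 - p)   # prob. of the true class
    w = (1.0 - p_t) ** gamma                   # focal modulating weight
    return float(-(w * np.log(p_t + eps)).sum() / (w.sum() + eps))
```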
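A minimal sketch of the Zoom-In cropping rule described above; the function name and the (y0, x0, y1, x1) box convention are mine:

```python
def zoom_in_box(mask_box, click, img_hw, expand=0.4):
    """Expand the predicted-object box by 40% per side (expand=0.4)
    to preserve context, then grow it further if the user clicked
    outside the current zoom-in area. Result is clamped to the image."""
    y0, x0, y1, x1 = mask_box
    dy, dx = (y1 - y0) * expand, (x1 - x0) * expand
    y0, x0, y1, x1 = y0 - dy, x0 - dx, y1 + dy, x1 + dx
    cy, cx = click  # a click outside the box expands the zoom-in area
    y0, x0 = min(y0, cy), min(x0, cx)
    y1, x1 = max(y1, cy), max(x1, cx)
    h, w = img_hw
    return (max(0, int(y0)), max(0, int(x0)),
            min(h, int(y1) + 1), min(w, int(x1) + 1))

# A 20×20 box grows by 8 px per side; the outside click at x=40
# pushes the right edge further out.
box = zoom_in_box((10, 10, 30, 30), (5, 40), (100, 100))
```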
  • Datasets
    • Semantic Boundaries Dataset, Pascal VOC « OpenImages, LVIS (the highest annotation quality, allowing higher prediction quality)
    • COCO segmentation images (more common and general objects)
    • COCO* + LVIS: COCO samples whose masks duplicate LVIS ones are excluded (COCO and LVIS share the same set of images).
    • Table 3 of the paper reports model performance depending on the data used for training.
  • Evaluation
    • Dataset: GrabCut, Berkeley, SBD, DAVIS (use each of instance masks separately, not full segmentation mask)
    • For evaluation, user interaction is simulated with random clicks drawn from specific probability distributions [see CVPR 2018 paper]
    • 400 x 400 resolution images
    • Training details: crop and other augmentations / 55 epochs
    • Evaluation uses Zoom-In [see fbrs_IS] and averages predictions from the original and flipped images.
    • NoC@90: the number of clicks needed to reach 90% IoU
    • NoC_{100}: the number of images that fail to reach 90% IoU within 20–100 clicks
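The NoC metrics can be made concrete with a small helper; `noc_at` is a hypothetical name, and the click simulation itself is omitted:

```python
def noc_at(iou_per_click, thr=0.90, max_clicks=20):
    """NoC@90-style metric for one instance: given the IoU after each
    simulated click, return the number of clicks needed to first reach
    `thr`, counting failures as `max_clicks`."""
    for k, iou in enumerate(iou_per_click, start=1):
        if iou >= thr:
            return k
    return max_clicks

# Averaged over the dataset; instances that never reach thr count as max_clicks.
ious = [[0.5, 0.8, 0.92], [0.7, 0.95]]
mean_noc = sum(noc_at(seq) for seq in ious) / len(ious)
```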
  • Experiments
    • HRNet-18 + Conv1S is the best, although it has fewer parameters.
    • Optimal iteratively sampled clicks for training = 3

PseudoSeg

Paper: PseudoSeg [paper][code]


  • Abstract and Introduction
    • In semi-supervised learning (SeSL) for classification, the trend is to combine consistency training (FixMatch, CVPR20, BMVC20) with pseudo-labeling.
    • For the more challenging segmentation task, (1) a “well-calibrated structured pseudo-label” strategy (agnostic to network structure) together with (2) strongly augmented data improves (3) consistency training for segmentation.
    • Similar works
      • A multi-stage training strategy: additional saliency estimation model (Oh et al., 2017; Huang et al., 2018; Wei et al., 2018), utilizing pixel similarity to propagate the initial score map (Ahn & Kwak, 2018; Wang et al., 2020), or mining and co-segment the same category of objects across images (Sun et al., 2020; Zhang et al., 2020b)
      • In contrast, this work needs none of these extra models or stages; it uses only a small number of fully-annotated images together with unlabeled data.
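As a reference point, the FixMatch-style hard pseudo-labeling used in classification looks roughly like this per pixel (illustrative sketch; PseudoSeg itself produces soft, calibrated labels rather than thresholded hard ones):

```python
import numpy as np

def one_hot_pseudo_labels(probs, tau=0.9):
    """FixMatch-style pseudo-labeling on an H×W×C probability map from
    the weakly-augmented view: take the per-pixel argmax as a hard
    label, and keep only pixels whose confidence exceeds tau. The
    strongly-augmented view is then trained against these labels.
    Names are illustrative, not from either paper's code."""
    conf = probs.max(axis=-1)        # H×W per-pixel confidence
    labels = probs.argmax(axis=-1)   # H×W hard pseudo labels
    mask = conf >= tau               # pixels that contribute to the loss
    return labels, mask
```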
  • Methods
    • Refinement of Grad-CAM results.
    • Calibration fusion.
    • Additional regularization losses.
    • Augmentation.
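A loose sketch of the calibration-fusion idea: fuse two per-pixel score maps (the decoder output and a Grad-CAM-refined map) into one well-calibrated soft label. The log-space weighted sum below is a simplifying assumption, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def calibrated_fusion(p_decoder, p_cam, w=(0.5, 0.5)):
    """ASSUMED sketch: combine the decoder's class probabilities and a
    Grad-CAM-derived probability map as a weighted sum in log space,
    then re-apply softmax so the fused pseudo label is a proper,
    better-calibrated distribution over classes."""
    logit_d = np.log(p_decoder + 1e-8)
    logit_c = np.log(p_cam + 1e-8)
    return softmax(w[0] * logit_d + w[1] * logit_c)
```

The fused distribution stays between its two sources: a confident decoder prediction is softened by a flatter Grad-CAM map, which is the calibration effect the notes refer to.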

© All rights reserved By Junha Song.