【SeSL】 Details of Interactive segmentation & PseudoSeg
RITM for interactive segmentation
Paper: ritm_interactive_segmentation [paper] [code]
- abstract
- Goal: a simple feedforward-only model with no backward passes at inference time (backpropagation-based refinement is hard to deploy on mobile)
- Finding: the choice of training dataset (a combination of COCO and LVIS) greatly impacts model performance.
- Methods
- Model: no need to reinvent the segmentation model → DeepLabV3+ (efficient) / HRNet+OCR (high-resolution output, thus more preferable)
- Click encoding: disks with a small radius (local effect), rather than the distance transform (global effect)
- RGB → N channel input: Conv1E, Conv2S
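The disk encoding above can be sketched as follows. This is an illustrative assumption of how clicks become extra input channels (the Conv1E/Conv1S names refer to how these channels are fused with RGB features; that fusion is not shown here):

```python
import numpy as np

def encode_clicks(clicks, height, width, radius=5):
    """Encode a list of (y, x) clicks as a binary disk map.

    Sketch: each click becomes a filled disk of small radius,
    giving a local-effect encoding (instead of a global
    distance-transform encoding).
    """
    ys, xs = np.mgrid[:height, :width]
    m = np.zeros((height, width), dtype=np.float32)
    for cy, cx in clicks:
        m[(ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2] = 1.0
    return m

# Positive and negative clicks each get their own channel,
# concatenated with the RGB image along the channel axis.
pos = encode_clicks([(10, 10)], 32, 32)
neg = encode_clicks([(25, 25), (5, 28)], 32, 32)
```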
- Training strategy (=interactive sampling strategy):
- mislabelled regions → eroded regions (so sampled clicks stay away from object boundaries)
- random and iterative point sampling
- Incorporating Masks From Previous Steps (Input = RGB + positives + negatives + masks from previous steps)
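The iterative sampling above can be sketched as below. This is a hedged simplification: a simulated click is drawn from the current prediction error (the actual strategy additionally erodes the error region before sampling):

```python
import numpy as np

def sample_next_click(pred, gt, rng=None):
    """Sample the next simulated click from the mislabelled region.

    Sketch of iterative click sampling: a point is drawn uniformly
    from the prediction error. A false negative yields a positive
    click, a false positive yields a negative click.
    """
    rng = rng or np.random.default_rng(0)
    error = pred != gt
    ys, xs = np.nonzero(error)
    if len(ys) == 0:
        return None  # prediction already matches ground truth
    i = rng.integers(len(ys))
    y, x = int(ys[i]), int(xs[i])
    is_positive = bool(gt[y, x])
    return (y, x), is_positive
```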
- Binary cross entropy (BCE) → Normalized Focal Loss (NFL) for fast training
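A minimal sketch of NFL for the binary case, assuming the common formulation that divides the focal terms by the sum of focal weights (the exact normalization in the paper's code may differ):

```python
import numpy as np

def normalized_focal_loss(p, target, gamma=2.0, eps=1e-8):
    """Normalized Focal Loss, binary case (illustrative sketch).

    Focal loss down-weights easy pixels with (1 - p_t)^gamma, which
    also shrinks the total gradient as predictions improve; dividing
    by the sum of the focal weights keeps the loss magnitude
    comparable to BCE, which speeds up training.
    """
    p_t = np.where(target == 1, p, 1.0 - p)
    w = (1.0 - p_t) ** gamma
    fl = -w * np.log(p_t + eps)
    return fl.sum() / (w.sum() + eps)
```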
- Zoom in
- f-BRS-Rethinking (Sec. 4) / test-time augmentations
- Starting from the third click, the image is cropped to the bounding box of the predicted mask and interactive segmentation is applied only to this Zoom-In area. The bounding box is extended by 40% along the sides in order to preserve context. If a user provides a click outside the bounding box, the zoom-in area is expanded.
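The crop-box rule above can be sketched as follows (a simplified illustration of the 40% expansion; the actual implementation may clip and resize differently):

```python
import numpy as np

def zoom_in_box(mask, expand=0.4):
    """Compute the Zoom-In crop box: the predicted object's bounding
    box, expanded by 40% along each side to preserve context and
    clipped to the image borders.
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    dy = int((y1 - y0 + 1) * expand)
    dx = int((x1 - x0 + 1) * expand)
    h, w = mask.shape
    return (max(0, y0 - dy), min(h - 1, y1 + dy),
            max(0, x0 - dx), min(w - 1, x1 + dx))
```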
- Datasets
- Semantic Boundaries Dataset, Pascal VOC ≪ OpenImages, LVIS (LVIS has the highest annotation quality, enabling the best prediction quality)
- COCO segmentation images (more common and general objects)
- Use COCO* + LVIS, where COCO* excludes the COCO samples whose masks are identical to LVIS masks (COCO and LVIS share the same set of images)
- Table 3 of the paper reports model performance according to the data used for training.
- Evaluation
- Dataset: GrabCut, Berkeley, SBD, DAVIS (use each of instance masks separately, not full segmentation mask)
- For evaluation, user interaction is simulated with random clicks drawn from specific probability distributions [see CVPR 2018 paper]
- 400 x 400 resolution images
- Training details: crop and other augmentations / 55 epochs
- Evaluation: Zoom-In [see fbrs_IS], averaging predictions from the original and flipped images.
- NoC@90: the number of clicks needed to achieve 90% IoU
- NoC_{100}: the number of images that fail to reach 90% IoU even with the click limit raised from 20 to 100
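The NoC@90 metric above can be computed from a per-click IoU curve like this (a straightforward sketch of the standard protocol):

```python
def noc_at(iou_per_click, threshold=0.90, max_clicks=20):
    """NoC@90: number of clicks needed to first reach 90% IoU.

    iou_per_click[k] is the IoU after click k+1. If the threshold is
    never reached within max_clicks, the run counts as max_clicks.
    """
    for k, iou in enumerate(iou_per_click[:max_clicks]):
        if iou >= threshold:
            return k + 1
    return max_clicks
```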
- Experiments
- HRNet-18 + Conv1S performs best, despite having fewer parameters.
- The optimal number of iteratively sampled clicks for training is 3.
PseudoSeg
Paper: PseudoSeg [paper][code]
- Abstract and Introduction
- In semi-supervised learning (SeSL) for classification, the trend is a combination of consistency training (FixMatch; CVPR'20, BMVC'20) and pseudo-labeling.
- For the more challenging task of segmentation, (1) a "well-calibrated structured pseudo labels" strategy (agnostic to network structure) combined with (2) strongly augmented data improves (3) consistency training for segmentation.
- Similar works
- A multi-stage training strategy: an additional saliency estimation model (Oh et al., 2017; Huang et al., 2018; Wei et al., 2018), utilizing pixel similarity to propagate the initial score map (Ahn & Kwak, 2018; Wang et al., 2020), or mining and co-segmenting the same category of objects across images (Sun et al., 2020; Zhang et al., 2020b)
- In contrast to these multi-stage approaches, PseudoSeg uses only a small number of fully pixel-annotated images (plus unlabeled data) in a single-stage framework.
- Methods
- Refinement of Grad-CAM results.
- Calibration fusion.
- Additional regularization losses.
- Augmentation.
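The calibration-fusion idea can be sketched as below. This is a loose illustration, not the paper's exact recipe: two per-pixel class score maps (the decoder output and the refined CAM) are turned into probabilities, averaged with equal weights (an assumption), and sharpened to produce a pseudo label:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_pseudo_label(decoder_logits, cam_logits, temperature=0.5):
    """Illustrative fusion of two (C, H, W) score maps into a
    sharpened per-pixel pseudo label. Equal weights and the
    sharpening temperature are assumptions for the sketch.
    """
    p = 0.5 * softmax(decoder_logits) + 0.5 * softmax(cam_logits)
    return softmax(np.log(p + 1e-8) / temperature)
```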