【SSL】 Survey on Self-supervised learning

This post is a research note for a survey on SSL.

0. Research key points

  1. After MoCo
  2. Relationship: CL-based & non-CL-based
  3. Self-supervised adversarial robustness

1. Reference

  1. Presentation materials on SSL: Roadmap by RCV lab

  2. Awesome Self-Supervised Learning

  3. mmselfsup

  4. facebookresearch/vissl

  5. DINO

2. To Read List

  1. RCV presentation materials, my past posts related to SSL
  2. Survey: Self-supervised Learning: Generative or Contrastive, arXiv
  3. Towards Understanding and Simplifying MoCo, CVPR22 (not the method itself, only the parts related to the key points below)
  4. SEER2021: Self-supervised Pretraining of Visual Features in the Wild
  5. SEER2022: Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
  6. DINO: Emerging Properties in Self-Supervised Vision Transformers
  7. Improving Contrastive Learning by Visualizing Feature Transformation, ICCV oral 2021 (from 2021 awesome SSL)
  8. How Well Do Self-Supervised Models Transfer?, CVPR21
  9. Rethinking pre-training and self-training, 2020, Google Brain

2.1 RCV presentation

  1. MoCo (Momentum Contrast)
    • Contrastive learning can be interpreted as a dictionary look-up task.
    • End-to-end & memory bank « Momentum Contrast, due to its consistent key representations (see the MoCo-style sketch after this list).
    • MoCo outperforms supervised pre-training (setting: freeze the encoder and train a task-specific head with the ground truth).
    • MoCo v2 (using SimCLR ideas): MLP projection head, blur and color-distortion augmentation.
  2. CMC (Contrastive Multiview Coding)
    • View-invariant representations (the L and ab color channels of one image form a positive pair).
    • Or, with more than two views (+ depth, segmentation) → higher representation quality of the model.
  3. PIRL (Pretext-Invariant Representation Learning)
    • The goal of SSL: to construct semantically meaningful image representations, i.e., representations that are invariant under transformations (augmentations).
    • The backbone takes the patches individually; the patch features are concatenated and linearly projected to 128 dimensions.
    • PIRL shows performance competitive with MoCo on the classification task.
  4. SimCLR (a simple framework for contrastive learning of visual representations)
    • Multiple data augmentation & Non-linear projection head & Not a memory bank
    • Contrastive loss (InfoNCE) over 2N images (N images + N augmented views; per anchor, 1 positive sample and 2(N-1) negative samples; see the NT-Xent sketch after this list).
  5. BYOL (Bootstrap your own latent)
    • Motivation: an iteratively updated target network.
    • Two MLPs (projection, prediction) & stop-gradient & non-CL loss (considers similarity only; see the stop-gradient sketch after this list).
    • Augmentation & momentum-updated target network.
  6. SwAV (Swapping Assignments between multiple Views)
    • Conventional limitations: (1) images from the same class are treated as different instances; (2) not all combinations of augmentations are used.
    • Clustering-based approach (with prototypes), swapped-prediction learning (prediction → code).
  7. SimSiam (Simple Siamese)
    • To prevent networks from collapsing: (1) SimCLR: negative pairs, (2) momentum encoder, (3) online clustering.
    • A Siamese network with none of the above works well, using a shared encoder, a predictor, and stop-gradient (see the stop-gradient sketch after this list).
  8. MoCo v3
    • From BYOL: prediction head and symmetrized loss.
    • Implements a ViT backbone.
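
The MoCo notes above (dictionary look-up, a momentum-updated key encoder, and a queue of negative keys) can be condensed into a short sketch. Below is a minimal, hedged PyTorch reconstruction, not the official implementation: the class name `MoCoSketch`, the `encoder_fn` factory, and the default hyper-parameters are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCoSketch(nn.Module):
    """Query encoder trained by backprop, key encoder updated by EMA, FIFO queue of negatives."""

    def __init__(self, encoder_fn, dim=128, queue_size=4096, momentum=0.999, temperature=0.07):
        super().__init__()
        self.encoder_q = encoder_fn()                      # updated by gradients
        self.encoder_k = encoder_fn()                      # updated only by momentum (EMA)
        self.encoder_k.load_state_dict(self.encoder_q.state_dict())
        for p in self.encoder_k.parameters():
            p.requires_grad = False
        self.m, self.t = momentum, temperature
        self.register_buffer("queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Consistent key representations: the key encoder drifts slowly toward the query encoder.
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        bs, ptr = keys.shape[0], int(self.ptr)
        self.queue[:, ptr:ptr + bs] = keys.T               # assumes queue_size % bs == 0
        self.ptr[0] = (ptr + bs) % self.queue.shape[1]

    def forward(self, im_q, im_k):
        q = F.normalize(self.encoder_q(im_q), dim=1)       # queries: (N, dim)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(im_k), dim=1)   # keys: (N, dim), no gradient
        l_pos = (q * k).sum(dim=1, keepdim=True)           # (N, 1) positive logits
        l_neg = q @ self.queue.clone().detach()            # (N, K) negative logits from the queue
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, labels)             # InfoNCE as a (K+1)-way classification
        self._enqueue(k)
        return loss
```

The dictionary look-up interpretation is explicit here: each query is classified against one positive key and K queued negatives, and the slowly moving key encoder is what keeps those queued keys consistent.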
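
The SimCLR item mentions an InfoNCE loss over 2N images, where each anchor has 1 positive (the other view of the same image) and 2(N-1) negatives. A minimal sketch of that NT-Xent loss follows; the function name `nt_xent_loss` and the default temperature are assumptions for illustration, not SimCLR's exact code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, d) projections of two augmented views of the same N images."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2N, d)
    sim = z @ z.T / temperature                                  # (2N, 2N) scaled cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                   # exclude self-similarity
    # The positive of row i is the other view of the same image: i <-> i + n.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)                         # softmax over 1 positive + 2(N-1) negatives

# Usage sketch (assumed names): z1 = head(backbone(aug(x))); z2 = head(backbone(aug(x)))
# loss = nt_xent_loss(z1, z2)
```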
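
BYOL and SimSiam both avoid negative pairs by combining an asymmetric predictor with stop-gradient and a similarity-only loss. The sketch below shows a SimSiam-style symmetrized negative-cosine objective (BYOL's MSE between l2-normalized vectors is equivalent up to a constant, plus a momentum-averaged target network); `negative_cosine` and `simsiam_step` are assumed helper names, not code from either paper.

```python
import torch.nn.functional as F

def negative_cosine(p, z):
    """Similarity-only loss: only the positive pair is pulled together, no negatives."""
    p = F.normalize(p, dim=1)
    z = F.normalize(z.detach(), dim=1)   # stop-gradient on the target branch
    return -(p * z).sum(dim=1).mean()

def simsiam_step(backbone, projector, predictor, x1, x2):
    """Symmetrized loss over two augmented views x1, x2 of the same batch."""
    z1, z2 = projector(backbone(x1)), projector(backbone(x2))
    p1, p2 = predictor(z1), predictor(z2)
    return 0.5 * (negative_cosine(p1, z2) + negative_cosine(p2, z1))
```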

2.2 Survey on SSL: Generative or Contrastive (8~13p)

  • SSL aims at recovering (the input or parts of it), which is still within the paradigm of supervised settings. In contrast, unsupervised learning covers a broader scope, e.g., clustering and community discovery.
  • Contrastive SSL
    1. Basic
      • Contrastive learning aims at "learning to compare" through Noise Contrastive Estimation (NCE). NCE can be extended to InfoNCE by involving more dissimilar pairs.
      • Notation: (1) SSL with Generative model = trained on ImageNet. (2) SSL with Discriminative model = trained with InfoNCE.
    2. Context-Instance Contrast (Before MoCo)
      • Predict Relative Position: jigsaw, rotation
      • Maximize Mutual Information: CPC, AMDIM, CMC
    3. Instance-Instance Contrast
      • Basic
        1. Discarding mutual information [129], CL studies the relationships between different samples' instance-level local representations (the image itself) rather than context-level representations (everything within one image, e.g., dog and grass).
      • Cluster Discrimination
        1. DeepCluster [17] uses K-means clustering to pull similar images close together in the embedding space. It is time-consuming due to its two-stage training and performs relatively poorly.
        2. SwAV
          • Assignments as codes (computed against prototypes/centroids); see the swapped-prediction sketch after this list.
          • Works with small models & small batch sizes.
          • Upgraded version: SEER, trained on Instagram images.
      • Instance Discrimination
        1. CMC
          • However, since it samples only one negative for each positive, it seems to be constrained by Deep InfoMax.
        2. MoCo
          • Substantially increases the number of negative samples.
          • The momentum encoder prevents fluctuation in loss convergence.
          • Auxiliary techniques: (1) batch shuffling, (2) temperature.
        3. PIRL
          • MoCo's positive pairs are too simple, without any transformation or augmentation, so PIRL adopts a jigsawed image as the positive counterpart.
        4. SimCLR
          • Constructs positive samples through data augmentation.
          • Tries to handle the large-scale negative-sample problem.
        5. InfoMin
          • The views should share only the label information. (Augmented views that share information beyond the label are not an optimal dataset for contrastive learning.)
        6. BYOL
          • Discards negative sampling, based on experimental motivation.
          • Cross-entropy → MSE.
          • Batch size is no longer a critical factor, compared to MoCo and SimCLR.
        7. SimSiam
          • Demonstrates the importance of 'stop-gradient'.
        8. ReLIC (ICLR 21)
          • Adds an extra KL-divergence regularizer & shows an analysis of generalization ability and robustness.
    4. Self-supervised Contrastive Pre-training for Semi-supervised Self-training
      1. Rethinking Pre-training and Self-training: the model with joint pre-training and self-training performs best.
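
For the SwAV entry above (assignments turned into codes against a set of prototypes, then a swapped prediction between views), here is a hedged sketch: `sinkhorn`, `swav_loss`, and the iteration/temperature defaults are illustrative assumptions rather than the official implementation, and both the features and the prototypes are assumed to be l2-normalized.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Turn prototype scores (B, K) into soft, roughly equi-partitioned codes (B, K)."""
    q = torch.exp(scores / eps).T              # (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= K    # normalize over prototypes
        q /= q.sum(dim=0, keepdim=True); q /= B    # normalize over samples
    return (q * B).T                           # each row is a soft assignment summing to 1

def swav_loss(z1, z2, prototypes, temperature=0.1):
    """z1, z2: (B, d) normalized features of two views; prototypes: (K, d) normalized."""
    s1, s2 = z1 @ prototypes.T, z2 @ prototypes.T   # (B, K) prototype scores
    q1, q2 = sinkhorn(s1), sinkhorn(s2)             # codes computed without gradients
    p1 = F.log_softmax(s1 / temperature, dim=1)
    p2 = F.log_softmax(s2 / temperature, dim=1)
    # Swapped prediction: view 1 predicts view 2's code, and vice versa.
    return -0.5 * ((q2 * p1).sum(dim=1) + (q1 * p2).sum(dim=1)).mean()
```

Because the codes come from the no-grad Sinkhorn step, gradients flow only through the softmax predictions, which is what makes the "prediction → code" swap in the notes above a learnable objective.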

2.3 Towards Understanding and Simplifying MoCo, CVPR22

Papers analyzing SimSiam: [2, 5, 60]

A Survey on Self-supervised Learning in Computer Vision

3. Survey papers contents

@ Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, 19.02.16, 670 cited

0. Abstract
1. Introduction
	* motivation
	* term definition
2. Formulation of different learning
	* supervised
	* semi-supervised
	* weakly supervised
3. Common deep network architectures
	* image features (AlexNet, VGG, ResNet, ...)
	* video features
4. Commonly used pretext and downstream tasks
5. Dataset
6. Image Feature learning
	* Generation-based Learning
	* Context-Based Learning
7. Video feature learning
8. Performance comparison
9. Future directions
10. Conclusion

@ Contrastive Representation Learning: A Framework and Review, 20.10.10 ~ 20.10.27, 212 cited

0. Abstract
1. Introduction
2. what is contrastive learning
	* representation learning
	* contrastive representation learning
3. A taxonomy for contrastive learning
	* CRL framework
	* a taxonomy of similarity
		- multisensory signals
		- data transformation
		- context-instance relationship
		- sequential coherence and consistency
		- natural clustering
	* a taxonomy of encoders (e.g., dictionary, memory bank)
		- end-to-end encoders
		- online-offline encoders
		- pre-trained encoders (BERT)
	* a taxonomy of transform heads (e.g., projection head)
	* a taxonomy of contrastive loss functions
4. Development of contrastive learning
5. Application (language, vision, graph, audio)
6. Discussion and outlook
7. Conclusion

@ A Survey on Contrastive Self-supervised Learning, 20.10.31 ~ 21.02.07, 135 cited

0. Abstract
1. Introduction
2. Pretext Tasks
3. Architectures
	* end-to-end
	* memory bank
	* momentum encoder
	* clustering
4. Encoders (backbone details)
5. Training
6. Downstream Tasks
7. Benchmarks (datasets)
8. Contrastive learning in NLP
9. Discussions and Future directions
	* lack of theoretical foundation
	* selection of data augmentation and pretext tasks
	* proper negative sampling during training
	* dataset biases
10. Conclusion

@ Self-supervised Learning: Generative or Contrastive, 20.06.15 ~ 21.03.20, 231 cited

4. I am writing the survey paper.


© All rights reserved By Junha Song.