【SSL】 Survey on Self-supervised learning
This post is a research note for a survey on SSL
0. Research key points
- After MoCo
- Relationship: CL-based & non-CL-based
- Self-supervised adversarial robustness
1. Reference
Presentation materials on SSL: Roadmap by RCV lab
2. To Read List
- RCV presentation materials, my past posts related to SSL
- Survey: Self-supervised Learning: Generative or Contrastive, arXiv
- Towards Understanding and Simplifying MoCo, CVPR22 (not the method itself, only the parts relevant to the research key points)
- SEER2021: Self-supervised Pretraining of Visual Features in the Wild
- SEER2022: Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
- DINO: Emerging Properties in Self-Supervised Vision Transformers
- Improving Contrastive Learning by Visualizing Feature Transformation, ICCV oral 2021 (from the 2021 awesome-SSL list)
- How Well Do Self-Supervised Models Transfer?, CVPR21
- Rethinking pre-training and self-training, 2020, Google Brain
2.1 RCV presentation
- MoCo (Momentum Contrast)
- Contrastive learning can be interpreted as a dictionary look-up task.
- End-to-end and memory-bank dictionaries underperform Momentum Contrast, which keeps key representations consistent via the momentum encoder (see the sketch below).
- MoCo outperforms supervised pre-training (setting: freeze the encoder and train a task-specific head with the ground truth).
- MoCo v2 (adopting SimCLR ideas): MLP projection head, blur and color-distortion augmentation.
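A minimal sketch of the two ingredients above, the momentum update of the key encoder and the queue-based dictionary, trained with InfoNCE. The toy linear encoders, the dictionary size (4096), and the values m = 0.999 and tau = 0.07 are placeholder assumptions, not MoCo's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoders standing in for MoCo's ResNet backbone.
dim = 128
encoder_q = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
encoder_k = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
encoder_k.load_state_dict(encoder_q.state_dict())    # start from identical weights
for p in encoder_k.parameters():
    p.requires_grad = False                           # key encoder: momentum updates only

queue = F.normalize(torch.randn(dim, 4096), dim=0)    # dictionary of queued (negative) keys
m, tau = 0.999, 0.07                                  # momentum coefficient, temperature

def moco_step(x_q, x_k):
    """One contrastive step: query vs. its positive key plus all queued negatives."""
    global queue
    q = F.normalize(encoder_q(x_q), dim=1)            # queries: N x dim
    with torch.no_grad():
        # momentum update: theta_k <- m * theta_k + (1 - m) * theta_q
        for pk, pq in zip(encoder_k.parameters(), encoder_q.parameters()):
            pk.mul_(m).add_(pq.detach(), alpha=1 - m)
        k = F.normalize(encoder_k(x_k), dim=1)        # positive keys: N x dim
    l_pos = (q * k).sum(dim=1, keepdim=True)          # N x 1
    l_neg = q @ queue                                 # N x K
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long) # the positive is always index 0
    loss = F.cross_entropy(logits, labels)            # InfoNCE as (K+1)-way classification
    queue = torch.cat([queue, k.T], dim=1)[:, k.size(0):]  # enqueue new keys, dequeue oldest
    return loss

# Stand-ins for two augmented views of the same batch of images.
x_q, x_k = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
print(moco_step(x_q, x_k).item())
```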
- CMC (Contrastive Multiview Coding)
- View-invariant representations: one image is split into color views, e.g. the L and ab channels of Lab space form a positive pair (sketch below).
- With more than two views (+ depth, segmentation), representation quality improves further.
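A small sketch of how a single image yields a positive pair of views in CMC by splitting Lab color space into the L and ab channels, each with its own encoder. The linear encoders and the output dimension are toy placeholders; the real method uses deep encoders and an NCE-style contrastive loss across views.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage import color  # for RGB -> Lab conversion

rgb = np.random.rand(32, 32, 3)          # one image with values in [0, 1]
lab = color.rgb2lab(rgb)                 # H x W x 3 (L, a, b)

# The two "views" of the same image form a positive pair.
view_L  = torch.tensor(lab[..., :1], dtype=torch.float32).permute(2, 0, 1)  # 1 x H x W
view_ab = torch.tensor(lab[..., 1:], dtype=torch.float32).permute(2, 0, 1)  # 2 x H x W

# Each view gets its own encoder (toy linear encoders here).
enc_L  = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32 * 32, 128))
enc_ab = nn.Sequential(nn.Flatten(), nn.Linear(2 * 32 * 32, 128))

z_L  = enc_L(view_L.unsqueeze(0))        # 1 x 128
z_ab = enc_ab(view_ab.unsqueeze(0))      # 1 x 128
# z_L and z_ab are trained to agree (positive pair) while disagreeing with the
# embeddings of other images (negatives), e.g. via an InfoNCE-style loss.
print(z_L.shape, z_ab.shape)
```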
- PIRL (Pretext-Invariant Representation Learning)
- The goal of SSL is to construct semantically meaningful image representations, i.e. representations that are invariant under transformations (augmentations).
- The backbone encodes the jigsaw patches individually; the patch features are concatenated and linearly projected to 128 dimensions (sketch below).
- PIRL shows performance competitive with MoCo on classification tasks.
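A small sketch of the patch pipeline described above: each jigsaw patch is encoded independently, the features are concatenated, and a linear layer projects them to 128 dimensions. The 128-d projection comes from the note above; the patch encoder, patch size, and feature width are toy placeholders.

```python
import torch
import torch.nn as nn

# Toy patch encoder; PIRL uses a ResNet backbone and 9 jigsaw patches.
num_patches, patch_size, feat_dim, out_dim = 9, 32, 64, 128
patch_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * patch_size * patch_size, feat_dim))
projection = nn.Linear(num_patches * feat_dim, out_dim)

patches = torch.randn(num_patches, 3, patch_size, patch_size)  # shuffled jigsaw patches
feats = patch_encoder(patches)        # 9 x feat_dim: each patch is encoded separately
concat = feats.reshape(1, -1)         # concatenate the patch features
z_patches = projection(concat)        # 1 x 128 representation of the jigsawed image
print(z_patches.shape)
```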
- SimCLR (a simple framework for contrastive learning of visual representations)
- Multiple data augmentations & non-linear projection head & no memory bank.
- Contrastive loss (InfoNCE / NT-Xent) over 2N images: N images plus their N augmented views; each anchor has 1 positive and 2(N-1) negatives (see the sketch below).
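A minimal sketch of the NT-Xent (InfoNCE) loss described above, written over 2N embeddings: the positive of each sample is its other augmented view, and the remaining 2(N-1) embeddings act as negatives. The batch size, embedding dimension, and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent / InfoNCE over 2N embeddings; z1[i] and z2[i] are two views of image i."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d
    sim = z @ z.T / tau                                   # 2N x 2N cosine similarities
    sim.fill_diagonal_(float('-inf'))                     # exclude self-similarity
    # the positive of sample i is its other view: i <-> i + n
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)                  # 1 positive, 2(N-1) negatives each

z1 = torch.randn(16, 128)   # projections of N images
z2 = torch.randn(16, 128)   # projections of their augmented views
print(nt_xent(z1, z2).item())
```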
- BYOL (Bootstrap your own latent)
- Motivation: an iteratively updated target network.
- Two MLPs (projection and prediction heads) & stop-gradient & a non-contrastive loss that only considers the similarity of positive views (see the sketch below).
- Augmentation & momentum-updated target network.
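A minimal sketch of the pieces above: an online network with a predictor, a momentum (EMA) target network, stop-gradient, and a similarity-only (normalized MSE) loss. For brevity the encoder and projector are merged into one toy linear layer, and the EMA coefficient 0.99 is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, hid = 128, 256
# Toy networks; BYOL uses a ResNet encoder plus separate projector/predictor MLPs.
online    = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))  # encoder + projector (merged)
predictor = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(), nn.Linear(hid, dim))
target    = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
target.load_state_dict(online.state_dict())
for p in target.parameters():
    p.requires_grad = False            # the target network never receives gradients

@torch.no_grad()
def ema_update(tau=0.99):
    # momentum (EMA) update of the target network from the online network
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.mul_(tau).add_(po.detach(), alpha=1 - tau)

def byol_loss(v_online, v_target):
    p = F.normalize(predictor(online(v_online)), dim=1)   # online prediction
    with torch.no_grad():
        z = F.normalize(target(v_target), dim=1)          # stop-gradient target projection
    # normalized MSE == 2 - 2 * cosine similarity; no negative samples are used
    return (2 - 2 * (p * z).sum(dim=1)).mean()

v1, v2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
loss = byol_loss(v1, v2) + byol_loss(v2, v1)   # symmetrized loss
ema_update()                                   # target slowly follows the online network
print(loss.item())
```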
- SwAV (Swapping Assignments between multiple Views)
- Limitations of previous work: (1) images from the same class are treated as different instances; (2) not all combinations of augmentations are used.
- Clustering-based approach (with prototypes) and swapped prediction: the code (cluster assignment) of one view is predicted from the other view (sketch below).
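A rough sketch of swapped prediction: the (detached) cluster-assignment code of one view supervises the prototype prediction of the other view. Note that SwAV computes codes with the Sinkhorn-Knopp algorithm under an equipartition constraint; the plain softmax used here is a simplification, and the prototype count, dimensions, and temperature are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, dim = 10, 128                             # number of prototypes, feature dimension
prototypes = nn.Linear(dim, K, bias=False)   # trainable prototype vectors (centroids)

def swapped_prediction(z1, z2, tau=0.1):
    """Minimal swapped-prediction loss: predict view 2's code from view 1 and vice versa."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    scores1, scores2 = prototypes(z1), prototypes(z2)     # N x K similarities to prototypes
    with torch.no_grad():
        q1 = F.softmax(scores1 / tau, dim=1)              # "codes" (stand-in for Sinkhorn-Knopp)
        q2 = F.softmax(scores2 / tau, dim=1)
    p1 = F.log_softmax(scores1 / tau, dim=1)
    p2 = F.log_softmax(scores2 / tau, dim=1)
    # swapped: the code of one view supervises the prediction from the other view
    return -0.5 * ((q2 * p1).sum(dim=1).mean() + (q1 * p2).sum(dim=1).mean())

z1, z2 = torch.randn(16, dim), torch.randn(16, dim)   # projections of two augmented views
print(swapped_prediction(z1, z2).item())
```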
- SimSiam (Simple Siamese)
- To prevent representations from collapsing, prior methods use (1) negative pairs (SimCLR), (2) a momentum encoder, or (3) online clustering.
- A Siamese network with none of the above works well, using a shared encoder, a predictor, and stop-gradient (see the sketch below).
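A minimal sketch of the setup above: one shared encoder, a predictor on top, stop-gradient on the target branch, and a symmetrized negative cosine loss, with no negatives, momentum encoder, or clustering. The toy linear encoder and dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, hid = 128, 64
# One shared encoder and one predictor; no negatives, no momentum encoder, no clustering.
encoder   = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))   # toy encoder
predictor = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(), nn.Linear(hid, dim))

def neg_cosine(p, z):
    # stop-gradient on z is the key ingredient that prevents collapse
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()

def simsiam_loss(x1, x2):
    z1, z2 = encoder(x1), encoder(x2)        # the same encoder processes both views
    p1, p2 = predictor(z1), predictor(z2)
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)   # symmetrized

x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
print(simsiam_loss(x1, x2).item())
```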
- MoCo v3
- Adopts a prediction head and a symmetrized loss, as in BYOL.
- Implements a ViT backbone.
2.2 Survey on SSL: Generative or Contrastive (8~13p)
- SSL aims at recovering (part of) the data, which is still within the paradigm of supervised settings. In contrast, unsupervised learning covers a broader scope, e.g. clustering and community discovery.
- Contrastive SSL
- Basic
- Contrastive learning aims to "learn to compare" through Noise Contrastive Estimation (NCE); NCE can be extended to InfoNCE by involving more dissimilar (negative) pairs (see the formula below).
- Notation: (1) SSL with a generative model = trained on ImageNet; (2) SSL with a discriminative model = trained with InfoNCE.
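For reference, the InfoNCE objective in a common notation (query q, one positive key k+, K negative keys k_i^-, temperature tau); the exact symbols vary slightly between the papers above.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\,\mathbb{E}\!\left[
      \log \frac{\exp(q \cdot k^{+} / \tau)}
                {\exp(q \cdot k^{+} / \tau) + \sum_{i=1}^{K} \exp(q \cdot k_{i}^{-} / \tau)}
    \right]
```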
- Context-Instance Contrast (Before MoCo)
- Predict Relative Position: jigsaw, rotation
- Maximize Mutual Information: CPC, AMDIM, CMC
- Instance-Instance Contrast
- Basic
- Discarding mutual information [129], CL studies the relationships between different samples' instance-level representations (the image itself as a single instance) rather than context-level representations (everything within one image, e.g. the dog and the grass).
- Cluster Discrimination
- DeepCluster [17] pulls similar images close together in the embedding space via k-means clustering. It is time-consuming due to its two-stage training and performs relatively poorly.
- SwAV
- Cluster assignments are used as codes (computed against prototypes / centroids).
- Works with small models and small batch sizes.
- Upgraded version: SEER, trained on Instagram images.
- Instance Discrimination
- CMC
- However, since it samples only one negative for each positive, it seems to be constrained by Deep InfoMax.
- MoCo
- Substantially increases the number of negative samples.
- The momentum encoder prevents fluctuation in loss convergence.
- Auxiliary techniques: (1) batch shuffling, (2) temperature.
- PIRL
- MoCo's positive pairs are too simple, without any transformation or augmentation, so PIRL adopts jigsaw-transformed images as positive pairs.
- SimCLR
- Constructs positive samples with data augmentation.
- Tries to handle the problem of requiring large numbers of negative samples.
- InfoMin
- The views should share only the label information. (Augmented images that share information beyond this are not an optimal dataset for contrastive learning; see the formulation below.)
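As a rough formalization (my paraphrase, with x the input, y the label, v1 and v2 the two views, and I the mutual information): the optimal views minimize the information shared between them while each still retains the label information.

```latex
% InfoMin principle (sketch): among views that keep the task-relevant (label)
% information, the best views for contrastive learning share as little
% additional information as possible.
(v_1^{*}, v_2^{*}) \;=\; \arg\min_{v_1, v_2} \; I(v_1; v_2)
\quad \text{s.t.} \quad I(v_1; y) = I(v_2; y) = I(x; y)
```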
- BYOL
- Discards negative sampling, motivated by experiments.
- Cross-entropy → MSE
- Batch size is no longer a critical factor, compared to MoCo and SimCLR.
- SimSiam
- Demonstrates the importance of stop-gradient.
- ReLIC (ICLR 21)
- Adds an extra KL-divergence regularizer & shows an analysis of generalization ability and robustness.
- Self-supervised Contrastive Pre-training for Semi-supervised Self-training
- Rethinking Pre-training and Self-training: the model with joint pre-training and self-training performs best.
- Basic
2.3 Towards Understanding and Simplifying MoCo, CVPR22
Papers that analyze SimSiam: [2, 5, 60]
A Survey on Self-supervised Learning in Computer Vision
3. Survey papers contents
@ Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, 19.02.16, 670 cited
0. Abstract
1. Introduction
* motivation
* term definition
2. Formulation of different learning
* supervised
* semi-supervised
* weakly-supervised
3. Common deep network architectures
* image features (AlexNet, VGG, ResNet, ...)
* video features
4. Commonly used pretext and downstream tasks
5. Dataset
6. Image Feature learning
* Generation-based Learning
* Context-Based Learning
7. Video feature learning
8. Performance comparison
9. Future directions
10. Conclusion
@ Contrastive Representation Learning: A Framework and Review, 20.10.10 ~ 20.10.27, 212 cited
0. Abstract
1. Introduction
2. what is contrastive learning
* representation learning
* contrastive representation learning
3. A taxonomy for contrastive learning
* CRL framework
* a taxonomy of similarity
- multisensory signals
- data transformation
- context-instance relationship
- sequential coherence and consistency
- natural clustering
* a taxonomy of encoders (e.g., dictionary, memory bank)
- end-to-end encoders
- online-offline encoders
- pre-trained encoders (BERT)
* a taxonomy of transform heads (e.g., projection head)
* a taxonomy of contrastive loss functions
4. Development of contrastive learning
5. Application (language, vision, graph, audio)
6. Discussion and outlook
7. Conclusion
@ A Survey on Contrastive Self-supervised Learning, 20.10.31 ~ 21.02.07, 135 cited
0. Abstract
1. Introduction
2. Pretext Tasks
3. Architectures
* end-to-end
* memory bank
* momentum encoder
* clustering
4. Encoders (backbone details)
5. Training
6. Downstream Tasks
7. Benchmarks (datasets)
8. Contrastive learning in NLP
9. Discussions and Future directions
* lack of theoretical foundation
* selection of data augmentation and pretext tasks
* proper negative sampling during training
* dataset biases
10. Conclusion
@ Self-supervised Learning: Generative or Contrastive, 20.06.15 ~ 21.03.20, 231 cited