This post is a research note for a survey on self-supervised learning (SSL).
0. Research key points
- After MoCo
- Relationship: CL-based & non-CL-based
- Self-supervised adversarial robustness
Presentation materials on SSL: Roadmap by RCV lab
Awesome Self-Supervised Learning
2. To Read List
- RCV presentation materials, my past posts related to SSL
- Survey: Self-supervised Learning: Generative or Contrastive, arXiv
- Towards Understanding and Simplifying MoCo, CVPR22 (not the methods, only the parts relevant to the key points)
- SEER2021: Self-supervised Pretraining of Visual Features in the Wild
- SEER2022: Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
- DINO: Emerging Properties in Self-Supervised Vision Transformers
- Improving Contrastive Learning by Visualizing Feature Transformation, ICCV 2021 oral (from 2021 awesome SSL)
- How Well Do Self-Supervised Models Transfer?, CVPR21
- Rethinking pre-training and self-training, 2020, Google Brain
2.1 RCV presentation
- MoCo (Momentum Contrast)
- Contrastive learning can be interpreted as a dictionary look-up task.
- End-to-end & memory bank « momentum contrast, thanks to consistent key representations.
- MoCo outperforms supervised pre-training (setting: freeze the encoder, and train a task-specific head with the ground truth)
- MoCo v2 (using SimCLR ideas): MLP projection head, blur and color-distortion augmentations.
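The two core mechanics above can be sketched in a few lines. This is a toy NumPy sketch under my own naming (the `FeatureQueue` class and `momentum_update` helper are hypothetical illustrations, not from the official code):

```python
import numpy as np

def momentum_update(key_w, query_w, m=0.999):
    """EMA update of the key encoder: it drifts slowly toward the
    query encoder, keeping key representations consistent over time."""
    return m * key_w + (1 - m) * query_w

class FeatureQueue:
    """Fixed-size FIFO dictionary of L2-normalized key features,
    decoupling the number of negatives from the batch size."""
    def __init__(self, dim, size, seed=0):
        rng = np.random.default_rng(seed)
        self.feats = rng.normal(size=(size, dim))
        self.feats /= np.linalg.norm(self.feats, axis=1, keepdims=True)
        self.ptr = 0

    def enqueue(self, keys):
        # keys: (batch, dim), already L2-normalized; oldest entries are replaced
        n = keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.feats.shape[0]
        self.feats[idx] = keys
        self.ptr = (self.ptr + n) % self.feats.shape[0]
```

With m close to 1 the key encoder changes very slowly, which is exactly the "consistent key representations" point above.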
- CMC (Contrastive Multiview Coding)
- View-invariant representations (splitting a color image into L and ab channels gives a positive pair)
- Or, with more than two views (+ depth, segmentation) → higher-quality representations
- PIRL (Pretext-Invariant Representation Learning)
- The goal of SSL: to construct semantically meaningful image representations, i.e., representations that are invariant under transformations (augmentations).
- The backbone takes patches individually. The patch features are concatenated and linearly projected to 128 dimensions.
- PIRL shows performance competitive with MoCo on classification tasks.
- SimCLR (a simple framework for contrastive learning of visual representations)
- Multiple data augmentations & non-linear projection head & no memory bank
- Contrastive loss (InfoNCE) over 2N images (N images + their N augmentations; each anchor has 1 positive and 2(N-1) negatives)
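The 2N-image InfoNCE (NT-Xent) loss can be sketched as follows; a minimal NumPy version, assuming embeddings are arranged so that rows 2i and 2i+1 form a positive pair (not the official SimCLR code):

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N embeddings; rows 2i and 2i+1 are a positive pair.
    Each anchor sees 1 positive and 2(N-1) negatives (itself excluded)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)          # a sample is never its own pair
    pos = np.arange(z.shape[0]) ^ 1         # 0<->1, 2<->3, ... partner index
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(z.shape[0]), pos].mean()
```

Sanity check: well-aligned pairs should give a lower loss than mismatched ones.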
- BYOL (Bootstrap your own latent)
- Motivation for an iteratively updated target network.
- Two MLPs (projection, prediction) & stop-gradient & non-CL loss (considers only similarity)
- Augmentation & momentum-updated target network
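BYOL's non-CL loss ("only consider similarity") is an MSE between L2-normalized vectors, equivalent to 2 − 2·cosine. A minimal sketch, assuming the target input comes from the momentum network with stop-gradient applied (function and argument names are mine):

```python
import numpy as np

def byol_loss(p_online, z_target):
    """MSE between L2-normalized online prediction and target projection.
    z_target is treated as a constant (stop-gradient); no negatives used."""
    p = p_online / np.linalg.norm(p_online, axis=1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    return ((p - z) ** 2).sum(axis=1).mean()   # = 2 - 2 * cosine similarity
```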
- SwAV (Swapping Assignments between multi views)
- Conventional limitations: (1) images from the same class are treated as different instances, (2) not all combinations of augmentations are used.
- Clustering-based approach (with prototypes); swapped prediction (predict the other view's code)
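The swapped-prediction idea can be sketched as below. Note this is a simplified NumPy illustration: the real SwAV computes the codes q with a Sinkhorn-Knopp equipartition step, which is omitted here (the plain softmax stand-in is my simplification):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def swapped_prediction(z1, z2, prototypes, tau=0.1):
    """Predict view 2's code from view 1's prototype scores and vice versa.
    p: softened prototype assignments; q: codes (Sinkhorn omitted here)."""
    p1 = softmax(z1 @ prototypes.T / tau)
    p2 = softmax(z2 @ prototypes.T / tau)
    q1, q2 = p1, p2                      # stand-in for Sinkhorn codes
    ce = lambda q, p: -(q * np.log(p + 1e-12)).sum(axis=1).mean()
    return ce(q2, p1) + ce(q1, p2)       # swap: one view's code, the other's scores
```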
- SimSiam (Simple Siamese)
- To prevent the network from collapsing, prior work uses (1) SimCLR: negative pairs, (2) a momentum encoder, (3) online clustering.
- A Siamese network with none of the above works well, using a shared encoder, a predictor, and stop-gradient.
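The stop-gradient point above is the key trick; a minimal NumPy sketch of the symmetrized loss (in the real model each z is detached from the graph, which NumPy makes implicit since nothing is differentiated):

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z) = -cos(p, z); z plays the stop-gradient (constant) branch."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized SimSiam loss: each predictor output p matches the
    other view's (stop-gradient) projection z. No negatives, no momentum."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```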
- MoCo v3
- From BYOL: prediction head, symmetrized loss
- Implements a ViT backbone
2.2 Survey on SSL: Generative or Contrastive (8~13p)
- SSL aims at recovering (part of the input), which is still within the paradigm of supervised settings. In contrast, unsupervised learning covers a broader scope, e.g., clustering and community discovery.
- Contrastive SSL
- Contrastive learning aims at "learn to compare" through Noise Contrastive Estimation (NCE). NCE can be extended to InfoNCE by involving more dissimilar pairs.
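For reference, the standard InfoNCE form (query q, positive key k_+, temperature τ, and K negatives):

```latex
\mathcal{L}_{\text{InfoNCE}} = -\log
\frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_{i}/\tau)}
```

NCE corresponds to the single-negative special case; InfoNCE generalizes it to K negatives.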
- Notation: (1) SSL with Generative model = trained on ImageNet. (2) SSL with Discriminative model = trained with InfoNCE.
- Context-Instance Contrast (Before MoCo)
- Predict Relative Position: jigsaw, rotation
- Maximize Mutual Information: CPC, AMDIM, CMC
- Instance-Instance Contrast
- Discarding mutual information, CL studies the relationships between different samples' instance-level representations (an image itself) rather than context-level ones (everything within an image, e.g., dog and grass).
- Cluster Discrimination
- DeepCluster pulls similar images close in the embedding space (K-means). It is time-consuming due to two-stage training, and performs poorly.
- Assignment as codes (centroids)
- small model & small batch-size
- Upgraded version: SEER, trained on Instagram images.
- Instance Discrimination
- However, since it samples only one negative for each positive, it seems to share the same constraint as Deep InfoMax.
- Substantially increases the number of negative samples.
- The momentum encoder prevents fluctuations in loss convergence.
- Auxiliary tech: (1) batch shuffling (2) temperature
- MoCo's positive pairs are too simple, without any transformation or augmentation; PIRL instead adopts jigsawed images as positive pairs.
- Illustrate a positive sample with data augmentation
- Try to handle the large-scale negative samples problem.
- The views should share only the label information. (Augmented images that share more than the label are not an optimal dataset for contrastive learning.)
- Discards negative sampling, motivated experimentally.
- Cross-entropy → MSE
- Batch-size is no longer a critical point, compared to MoCo and SimCLR.
- Demonstrates the importance of 'stop gradient'
- ReLIC (ICLR 21)
- Add an extra KL-divergence regularizer. & Show analysis of generalization ability and robustness
- Self-supervised Contrastive Pre-training for Semi-supervised Self-training
- Rethinking pre-training and self-training: the model with joint pre-training and self-training is the best.
2.3 Towards Understanding and Simplifying MoCo, CVPR22
Papers analyzing SimSiam: [2, 5, 60]
A Survey on Self-supervised Learning in Computer Vision
3. Survey papers contents
@ Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, 19.02.16, 670cited
0. Abstract
1. Introduction
    * motivation
    * term definition
2. Formulation of different learning
    * supervised
    * semi-supervised
    * weakly..
3. Common deep network architectures
    * image features (AlexNet, VGG, ResNet ..)
    * video features
4. Commonly used pretext and downstream tasks
5. Dataset
6. Image feature learning
    * Generation-based learning
    * Context-based learning
7. Video feature learning
8. Performance comparison
9. Future directions
10. Conclusion
@ Contrastive Representation Learning: A Framework and Review, 20.10.10 ~ 20.10.27, 212cited
0. Abstract
1. Introduction
2. What is contrastive learning
    * representation learning
    * contrastive representation learning
3. A taxonomy for contrastive learning
    * CRL framework
    * a taxonomy of similarity
        - multisensory signals
        - data transformation
        - context-instance relationship
        - sequential coherence and consistency
        - natural clustering
    * a taxonomy of encoders (e.g., dictionary, memory bank)
        - end-to-end encoders
        - online-offline encoders
        - pre-trained encoders (BERT)
    * a taxonomy of transform heads (projection head ..)
    * a taxonomy of contrastive loss functions
4. Development of contrastive learning
5. Application (language, vision, graph, audio)
6. Discussion and outlook
7. Conclusion
@ A Survey on Contrastive Self-supervised Learning, 20.10.31 ~ 21.02.07, 135cited
0. Abstract
1. Introduction
2. Pretext Tasks
3. Architectures
    * end-to-end
    * memory bank
    * momentum encoder
    * clustering
4. Encoders (backbone details)
5. Training
6. Downstream Tasks
7. Benchmarks (datasets)
8. Contrastive learning in NLP
9. Discussions and Future directions
    * lack of theoretical foundation
    * selection of data augmentation and pretext tasks
    * proper negative sampling during training
    * dataset biases
10. Conclusion
@ Self-supervised Learning: Generative or Contrastive, 20.06.15 ~ 21.03.20, 231cited