
Self-supervised visual feature learning with deep neural networks: A survey #17


Abstract

  • To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods (a subset of unsupervised learning) are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels.
  • The survey covers terminology, DNN architectures, evaluation metrics, and quantitative performance comparisons.

Introduction

  • General pipeline of Self-supervised learning (SSL)
    • ConvNets trained with pretext tasks can learn kernels that capture both low-level and high-level features that are helpful for other downstream tasks.
  • Term Definition
    • Pretext tasks: pre-designed tasks for networks to solve; visual features are learned by optimizing the objective functions of the pretext tasks.
    • Downstream tasks: computer vision applications used to evaluate the quality of features learned by self-supervised learning. These applications greatly benefit from pre-trained models when training data are scarce.
    • Pseudo labels: the labels used in pretext tasks, generated automatically based on the structure of the data (see the rotation-prediction sketch after this list).
      • Since no human annotations are needed to generate pseudo labels during self-supervised training, a main advantage of self-supervised learning methods is that they can easily scale to large-scale datasets at very low cost.
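
A concrete instance of pseudo-label generation is the rotation-prediction pretext task (Gidaris et al., 2018), one of the tasks the survey covers. The sketch below is a minimal PyTorch illustration, not the survey's own code; `make_rotation_batch` is a hypothetical helper name.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """Generate pseudo labels from the data structure alone (no human labels).

    images: (B, C, H, W) tensor. Each image is rotated by 0/90/180/270
    degrees, and the rotation index serves as the pseudo label P_i.
    """
    rotated, pseudo_labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        pseudo_labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(pseudo_labels)

# Usage: 8 unlabeled images -> 32 (image, pseudo-label) training pairs.
x = torch.randn(8, 3, 32, 32)
x_rot, p = make_rotation_batch(x)  # x_rot: (32, 3, 32, 32), p: (32,)
```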

Self-Supervised Learning

  • Formulation
    • SSL is likewise trained with data $X_i$ along with its pseudo label $P_i$, where $P_i$ is automatically generated for a pre-defined pretext task without involving any human annotation.
  • Given a set of $N$ training data $D = \{P_i\}_{i=0}^{N}$, the training loss function is defined as:

    $$\text{loss}(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \text{loss}(X_i, P_i)$$
  • Architecture for learning image features
    • AlexNet, VGG, ResNet, GoogLeNet, DenseNet, RNN

Commonly used Pretext and Downstream tasks

  • Self-supervised visual feature learning schema.

  • The ConvNet is trained by minimizing the error between the pseudo labels P and its predictions O. Since the pseudo labels are generated based on the structure of the data, no human annotations are involved during the whole process (see the sketch below).
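
A minimal sketch of this schema, reusing the `make_rotation_batch` helper from the earlier sketch; the tiny `backbone` and `pretext_head` here are hypothetical stand-ins for the architectures listed above (AlexNet, ResNet, etc.).

```python
import torch
import torch.nn as nn

# Hypothetical tiny backbone standing in for AlexNet / ResNet / etc.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
pretext_head = nn.Linear(16, 4)    # 4 rotation pseudo-classes
criterion = nn.CrossEntropyLoss()  # error between predictions O and pseudo labels P
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(pretext_head.parameters()), lr=0.01)

images = torch.randn(8, 3, 32, 32)      # stand-in for unlabeled data X
x, p = make_rotation_batch(images)      # pseudo labels from the data itself
o = pretext_head(backbone(x))           # predictions O of the ConvNet
loss = criterion(o, p)                  # loss(X_i, P_i), averaged over the batch
optimizer.zero_grad()
loss.backward()
optimizer.step()
```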

Pretext Task

  • Generation-based methods, context-based methods, free semantic label-based methods, cross modal-based methods

Downstream Task

  • Fine-tuned
  • Semantic Segmentation, Object detection, Action recognition
  • Image classification
    • When image classification is the downstream task used to evaluate the quality of features learned by self-supervised methods, the self-supervised model is applied to each image to extract features, which are then used to train a classifier such as an SVM. Classification performance on the test data is compared across self-supervised models to evaluate the quality of the learned features (see the sketch after this list).
    • Qualitative evaluation: visualization methods for judging the quality of self-supervised features
      • Kernel visualization, feature map visualization, nearest neighbor retrieval
  • Transfer-learning performance on these high-level vision tasks demonstrates the generalization ability of the learned features.
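
A minimal sketch of this evaluation protocol, assuming the frozen `backbone` from the earlier training sketch and scikit-learn's `LinearSVC`; the data tensors are random stand-ins for a real labeled downstream dataset.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_features(model, images: torch.Tensor) -> np.ndarray:
    model.eval()  # frozen pre-trained model; no fine-tuning here
    return model(images).cpu().numpy()

# Stand-in tensors; in practice these come from a labeled downstream dataset.
train_x, test_x = torch.randn(64, 3, 32, 32), torch.randn(16, 3, 32, 32)
train_y, test_y = np.random.randint(0, 10, 64), np.random.randint(0, 10, 16)

clf = LinearSVC().fit(extract_features(backbone, train_x), train_y)
acc = clf.score(extract_features(backbone, test_x), test_y)
print(f"downstream linear-probe accuracy: {acc:.3f}")
```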

Image feature learning

Generation-based image feature learning

  • Image generation with GANs, inpainting, super-resolution, and colorization (a minimal colorization sketch follows)
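
As a rough illustration of the colorization pretext, the sketch below produces (input, target) pairs by graying out RGB images. Real colorization methods such as Zhang et al.'s predict quantized ab channels in Lab space; plain RGB regression is used here only for brevity.

```python
import torch

def make_colorization_pair(images: torch.Tensor):
    """Generation-based pretext: recover color from a grayscale input.

    images: (B, 3, H, W) RGB in [0, 1]. The grayscale version is the
    network input; the original image is the pseudo (regression) target.
    """
    # ITU-R BT.601 luminance weights for RGB -> gray.
    w = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    gray = (images * w).sum(dim=1, keepdim=True)  # (B, 1, H, W)
    return gray, images
```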

Context-based image feature learning

  • Learning with context similarity and with spatial context structure (see the patch-pair sketch below)
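
A minimal sketch of a spatial-context pretext in the spirit of Doersch et al.'s relative patch position task: sample a center patch and one of its eight neighbors, and let the neighbor's position index serve as the pseudo label. `make_patch_pair` and the fixed grid layout are simplifying assumptions.

```python
import torch

# Offsets of the 8 neighbors around the center patch; the index is the pseudo label.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def make_patch_pair(image: torch.Tensor, patch: int = 16):
    """Context-based pretext: classify where a neighbor patch lies
    relative to the center patch (8-way classification).

    image: (C, H, W) with H, W >= 3 * patch.
    """
    cy = cx = patch                           # top-left of the center patch
    center = image[:, cy:cy + patch, cx:cx + patch]
    label = torch.randint(0, 8, (1,)).item()  # pseudo label P
    dy, dx = OFFSETS[label]
    ny, nx = cy + dy * patch, cx + dx * patch
    neighbor = image[:, ny:ny + patch, nx:nx + patch]
    return center, neighbor, label
```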

Free semantic label-based image feature learning

  • Learning with labels generated by game engines and with labels generated by hard-coded programs (sketched below)
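
As a rough sketch of labels generated by a hard-coded program, the snippet below uses OpenCV's Canny detector to produce an edge map that can serve as a free pseudo label; the threshold values are arbitrary examples.

```python
import cv2
import numpy as np

def make_edge_label(image_bgr: np.ndarray) -> np.ndarray:
    """'Free' pseudo label from a hard-coded program: a Canny edge map.

    A network can then be trained to predict this map from the raw image,
    so no human annotation is involved.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)  # low/high hysteresis thresholds
```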

Summary

  • Performance: comparable to supervised methods on some downstream tasks.
  • Reproducibility: most networks use AlexNet as the base network, pre-train on the ImageNet dataset, and then evaluate on the same downstream tasks, which makes quality comparisons straightforward.
  • Evaluation metrics: more metrics are needed to evaluate the quality of the learned features at different levels; the current practice is to use downstream-task performance as a proxy for feature quality.