
Self-supervised visual feature learning with deep neural networks: A survey #17


Abstract

  • To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods (a subset of unsupervised learning) are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels.
  • The survey covers terminology, DNN architectures, evaluation metrics, and quantitative performance comparisons.

Introduction

  • General pipeline of Self-supervised learning (SSL)
    • ConvNets trained with pretext tasks can learn kernels that capture both low-level and high-level features that are helpful for other downstream tasks.
  • Term Definition
    • Pretext tasks: pre-designed tasks for networks to solve; visual features are learned by optimizing the objective functions of the pretext tasks.
    • Downstream tasks: computer vision applications used to evaluate the quality of features learned by self-supervised learning. These applications greatly benefit from pre-trained models when training data are scarce.
    • Pseudo labels: the labels used in pretext tasks, generated automatically based on the structure of the data (see the rotation-prediction sketch after this list).
      • Since no human annotations are needed to generate pseudo labels during self-supervised training, a main advantage of self-supervised learning methods is that they can easily scale to large-scale datasets at very low cost.
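
A concrete instance of pseudo-label generation is the rotation-prediction pretext task (Gidaris et al., 2018), one of the tasks the survey covers. The sketch below is a minimal PyTorch illustration, not the survey's own code; `make_rotation_batch` is a hypothetical helper name.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """Generate pseudo labels from the data structure alone (no human labels).

    images: (B, C, H, W) tensor. Each image is rotated by 0/90/180/270
    degrees, and the rotation index serves as the pseudo label P_i.
    """
    rotated, pseudo_labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        pseudo_labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(pseudo_labels)

# Usage: 8 unlabeled images -> 32 (image, pseudo-label) training pairs.
x = torch.randn(8, 3, 32, 32)
x_rot, p = make_rotation_batch(x)  # x_rot: (32, 3, 32, 32), p: (32,)
```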

Self-Supervised Learning

  • Formulation
    • SSL is likewise trained with data $X_i$ along with its pseudo label $P_i$, where $P_i$ is automatically generated for a pre-defined pretext task without involving any human annotation.
  • Given a set of $N$ training data $D = \{P_i\}_{i=0}^{N}$, the training loss function is defined as:

    $$\text{loss}(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \text{loss}(X_i, P_i)$$
  • Architecture for learning image features
    • AlexNet, VGG, ResNet, GoogLeNet, DenseNet, RNN

Commonly used Pretext and Downstream tasks

  • Self-supervised visual feature learning schema.

  • The ConvNet is trained by minimizing the error between the pseudo labels P and its predictions O. Since the pseudo labels are generated based on the structure of the data, no human annotations are involved during the whole process (see the sketch below).
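
A minimal sketch of this schema, reusing the `make_rotation_batch` helper from the earlier sketch; the tiny `backbone` and `pretext_head` here are hypothetical stand-ins for the architectures listed above (AlexNet, ResNet, etc.).

```python
import torch
import torch.nn as nn

# Hypothetical tiny backbone standing in for AlexNet / ResNet / etc.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
pretext_head = nn.Linear(16, 4)    # 4 rotation pseudo-classes
criterion = nn.CrossEntropyLoss()  # error between predictions O and pseudo labels P
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(pretext_head.parameters()), lr=0.01)

images = torch.randn(8, 3, 32, 32)      # stand-in for unlabeled data X
x, p = make_rotation_batch(images)      # pseudo labels from the data itself
o = pretext_head(backbone(x))           # predictions O of the ConvNet
loss = criterion(o, p)                  # loss(X_i, P_i), averaged over the batch
optimizer.zero_grad()
loss.backward()
optimizer.step()
```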

Pretext Task

  • Generation-based methods, context-based methods, free semantic label-based methods, cross modal-based methods

Downstream Task

  • Fine-tuned
  • Semantic Segmentation, Object detection, Action recognition
  • Image classification
    • When image classification is the downstream task used to evaluate the quality of features learned by self-supervised methods, the self-supervised model is applied to each image to extract features, which are then used to train a classifier such as an SVM. Classification performance on the test data is compared across self-supervised models to evaluate the quality of the learned features (see the sketch after this list).
    • Qualitative evaluation: visualization methods for judging the quality of self-supervised features
      • Kernel visualization, feature map visualization, nearest neighbor retrieval
  • Transfer-learning performance on these high-level vision tasks demonstrates the generalization ability of the learned features.
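
A minimal sketch of this evaluation protocol, assuming the frozen `backbone` from the earlier training sketch and scikit-learn's `LinearSVC`; the data tensors are random stand-ins for a real labeled downstream dataset.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_features(model, images: torch.Tensor) -> np.ndarray:
    model.eval()  # frozen pre-trained model; no fine-tuning here
    return model(images).cpu().numpy()

# Stand-in tensors; in practice these come from a labeled downstream dataset.
train_x, test_x = torch.randn(64, 3, 32, 32), torch.randn(16, 3, 32, 32)
train_y, test_y = np.random.randint(0, 10, 64), np.random.randint(0, 10, 16)

clf = LinearSVC().fit(extract_features(backbone, train_x), train_y)
acc = clf.score(extract_features(backbone, test_x), test_y)
print(f"downstream linear-probe accuracy: {acc:.3f}")
```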

Image feature learning

Generation-based image feature learning

  • Image generation with GANs, inpainting, super-resolution, and colorization (a minimal colorization sketch follows)
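
As a rough illustration of the colorization pretext, the sketch below produces (input, target) pairs by graying out RGB images. Real colorization methods such as Zhang et al.'s predict quantized ab channels in Lab space; plain RGB regression is used here only for brevity.

```python
import torch

def make_colorization_pair(images: torch.Tensor):
    """Generation-based pretext: recover color from a grayscale input.

    images: (B, 3, H, W) RGB in [0, 1]. The grayscale version is the
    network input; the original image is the pseudo (regression) target.
    """
    # ITU-R BT.601 luminance weights for RGB -> gray.
    w = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    gray = (images * w).sum(dim=1, keepdim=True)  # (B, 1, H, W)
    return gray, images
```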

Context-based image feature learning

  • Learning with context similarity and with spatial context structure (see the patch-pair sketch below)
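
A minimal sketch of a spatial-context pretext in the spirit of Doersch et al.'s relative patch position task: sample a center patch and one of its eight neighbors, and let the neighbor's position index serve as the pseudo label. `make_patch_pair` and the fixed grid layout are simplifying assumptions.

```python
import torch

# Offsets of the 8 neighbors around the center patch; the index is the pseudo label.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def make_patch_pair(image: torch.Tensor, patch: int = 16):
    """Context-based pretext: classify where a neighbor patch lies
    relative to the center patch (8-way classification).

    image: (C, H, W) with H, W >= 3 * patch.
    """
    cy = cx = patch                           # top-left of the center patch
    center = image[:, cy:cy + patch, cx:cx + patch]
    label = torch.randint(0, 8, (1,)).item()  # pseudo label P
    dy, dx = OFFSETS[label]
    ny, nx = cy + dy * patch, cx + dx * patch
    neighbor = image[:, ny:ny + patch, nx:nx + patch]
    return center, neighbor, label
```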

Free semantic label-based image feature learning

  • Learning with labels generated by game engines and with labels generated by hard-coded programs (sketched below)
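
As a rough sketch of labels generated by a hard-coded program, the snippet below uses OpenCV's Canny detector to produce an edge map that can serve as a free pseudo label; the threshold values are arbitrary examples.

```python
import cv2
import numpy as np

def make_edge_label(image_bgr: np.ndarray) -> np.ndarray:
    """'Free' pseudo label from a hard-coded program: a Canny edge map.

    A network can then be trained to predict this map from the raw image,
    so no human annotation is involved.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)  # low/high hysteresis thresholds
```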

Summary

  • Performance: comparable to supervised methods on some downstream tasks.
  • Reproducibility: most networks use AlexNet as the base network, pre-train on the ImageNet dataset, and then evaluate on the same downstream tasks, which makes quality comparisons straightforward.
  • Evaluation metrics: more metrics are needed to evaluate the quality of the learned features at different levels; the current practice is to use downstream-task performance as a proxy for feature quality.