<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" class="gr__shape2prog_csail_mit_edu"><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<script src="./head.js"></script> <meta name="viewport" content="width=device-width, initial-scale=1"> <link rel="shortcut icon" href="http://shape2prog.csail.mit.edu/images/favicon.ico">
<meta name="description" content="What makes for good views for contrastive learning">
<meta name="keywords" content="MIT,Contrastive Learning,Views,Unsupervised Learning,Transfer Learning">
<title>Decorr</title>
<link rel="stylesheet" href="./font.css">
<link rel="stylesheet" href="./main.css">
</head>
<body data-gr-c-s-loaded="true">
<div class="outercontainer">
<div class="container">
<div class="content project_title">
<h1>On Feature Decorrelation in Self-Supervised Learning</h1>
</div>
<div class="content project_headline">
<center><h2>
<font size="3"><a href="https://patrickhua.github.io/">Tianyu Hua</a><sup>*12</sup></font>
<font size="3"><a href="https://scholar.google.com/citations?user=hn0u5VgAAAAJ">Wenxiao Wang</a><sup>*1</sup></font>
<font size="3"><a href="https://sherryxzh.github.io/">Zihui Xue</a><sup>23</sup></font>
<font size="3"><a href="https://oliverrensu.github.io/">Sucheng Ren</a><sup>24</sup></font>
<font size="3"><a href="https://people.csail.mit.edu/yuewang/">Yue Wang</a><sup>5</sup></font>
<font size="3"><a href="http://people.csail.mit.edu/hangzhao/">Hang Zhao</a><sup>12</sup></font>
</h2></center>
<center><h2>
<font size="3"><sup>1</sup>Tsinghua University</font>
<font size="3"><sup>2</sup>Shanghai Qi Zhi Institute</font> </br>
<font size="3"><sup>3</sup>UT Austin</font>
<font size="3"><sup>4</sup>South China University of Technology</font>
<font size="3"><sup>5</sup>MIT</font>
<font size="3">* Equal contribution</font>
</h2></center>
<div class="content project_title">
<a href="https://link.zhihu.com/?target=https%3A//docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRfaTmsNweuaA0Gjyu58H_Cx56pGwFhcTYII0u1pg0U7MbhlgY0R6Y-BbK3xFhAiwGZ26u3TAtN5MnS/pubhtml">
<h2>(Accepted to ICCV2021 as an oral presentation!)</h2></a>
</div>
<!-- <center></center> -->
<center><h2>
<br>
<br>
<!-- https://youtu.be/zJfNcF1SgAQ
-->
<iframe width="820" height="515"
src="https://www.youtube.com/embed/zJfNcF1SgAQ">
</iframe>
<br>
<br>
<br>
</h2></center>
</div>
<div class="content project_headline">
<div class="img" style="text-align:center">
<img class="img_responsive" src="./teaser.png" alt="Teaser" style="margin:auto;max-width:90%">
</div>
<div class="text">
<p>Figure 1:
An overview of the key components of this work: (a) is a sketch of the framework used in this work; (b) and (c) are two reachable collapse patterns in self-supervised settings; (d) is an illustration of the goal of feature decorrelation.
<!-- Constructing views for contrastive learning is important. But what are good views? We hypothesize good views should be those that only share label information w.r.t the downstream task, while throwing away nuisance factors, which we call <i>InfoMin</i> principle. -->
</p>
</div>
</div>
<div class="content">
<div class="text">
<h3>Abstract</h3>
<p>
In self-supervised representation learning, a common idea behind most of the state-of-the-art approaches is to enforce the robustness of the representations to predefined augmentations.
A potential issue of this idea is the existence of completely collapsed solutions (<i>i.e.</i>, constant features), which are typically avoided implicitly by carefully chosen implementation details.
In this work, we study a relatively concise framework containing the most common components from recent approaches.
We verify the existence of <b>complete collapse</b> and discover another reachable collapse pattern that is usually overlooked, namely <b>dimensional collapse</b>.
We connect dimensional collapse with strong correlations between axes and consider such connection as a strong motivation for <b>decorrelation</b> (<i>i.e.</i>, standardizing the covariance matrix).
The capability of correlation as an unsupervised metric and the gains from feature decorrelation are verified empirically to highlight the importance and the potential of this insight.
</p>
</div>
</div>
<div class="content">
<div class="text">
<h3>Publication</h3>
<ul>
<li>
<div class="title"><a name="infomin">On Feature Decorrelation in Self-Supervised Learning</a></div>
<div class="authors">
<a href="https://patrickhua.github.io/">Tianyu Hua</a>,
<a href="https://scholar.google.com/citations?user=hn0u5VgAAAAJ">Wenxiao Wang</a>,
<a href="https://sherryxzh.github.io/">Zihui Xue</a>,
<a href="https://oliverrensu.github.io/">Sucheng Ren</a>,
<a href="https://people.csail.mit.edu/yuewang/">Yue Wang</a>, and
<a href="http://people.csail.mit.edu/hangzhao/">Hang Zhao</a>
</div>
<div>
<span class="tag"><a href="https://arxiv.org/abs/2105.00470.pdf">Paper</a></span>
<span class="tag"><a href="https://arxiv.org/abs/2105.00470">arXiv</a></span>
<span class="tag"><a href="bib.txt">BibTeX</a></span>
</div>
</li>
</ul>
</div>
</div>
<div class="content">
<div class="text">
<h3>Different types of mode collapse when no negative sample is used</h3>
<div class="content project_headline">
<div class="img" style="text-align:center">
<img class="img_responsive" src="./collapse_patterns.png" alt="Teaser" style="margin:auto;max-width:90%">
</div>
<div class="text">
<p>Figure 2:
Direct visualization of 2-dimensional projection spaces on CIFAR-10. Different colors correspond to different classes. Figures (a), (b), and (c) are from our framework. For completeness, we visualize the 2-dimensional projection spaces of SimCLR (by setting the output dimension of the projector to 2) and a supervised baseline (by letting the penultimate layer contain 2 neurons) in Figures (d) and (e).
<!-- Schematic of contrastive representation learning with a learned view generator. An input image is split into two views using an invertible view generator. To learn the view generator, we optimize the losses in yellow: minimizing information between views while ensuring we can classify the object from each view. The encoders used to estimate mutual information are always trained to maximize the InfoNCE lower bound. After learning the view generator, we reset the weights of the encoders, and train with a fixed view generator without the additional supervised classification losses. -->
</p>
</div>
</div>
<!-- <h3>Normally, people </h3> -->
<h4>(a) Complete collapse</h4>
<p> Normally, complete collapse is what people refer to when they talk about collapse. In the case of complete collapse, all features reside in a tiny region with negligible variance.
<!-- We verify the existence of complete collapse in self-supervised settings and address it successfully by standardizing variance. -->
<!-- we can use an unsupervised adversarial objective to minimize the mutual information between views, while using an supervised objective (only on small amount of labeled data) to retain relevant information. -->
</p>
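<p>
As a minimal, hypothetical sketch (not from the authors' code), complete collapse can be spotted directly from a feature batch: if every dimension has near-zero standard deviation, the encoder maps all inputs to essentially the same point. The function name and threshold below are illustrative assumptions.
</p>
<pre><code>import torch

def is_completely_collapsed(z: torch.Tensor, eps: float = 1e-4) -> bool:
    """z: (N, D) batch of features from the encoder/projector."""
    per_dim_std = z.std(dim=0)               # standard deviation of each feature dimension
    return bool((per_dim_std &lt; eps).all())   # all dimensions (nearly) constant
</code></pre>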
<h4>(b) Dimensional collapse</h4>
<p>Standardizing variance is a typical solution to avoid complete collapse. After doing so, we discovered another collapse pattern, dimensional collapse, in which the feature space is stretched diagonally so that it satisfies the standard-deviation requirement while maximizing feature correlation.
<!-- satisfies the feature normalization requirement but still -->
<!-- After normalizing the feature space, we discover another reachable collapse pattern ignored by existing works, namely dimensional collapse. -->
</p>
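<p>
To make dimensional collapse concrete, the following hypothetical diagnostic (an illustration, not the paper's implementation) standardizes the variance of each axis and then averages the absolute off-diagonal entries of the correlation matrix; values close to 1 indicate that the batch occupies a nearly one-dimensional subspace despite having unit variance per axis.
</p>
<pre><code>import torch

def mean_offdiag_correlation(z: torch.Tensor) -> float:
    """z: (N, D) feature batch; returns the mean |correlation| between distinct axes."""
    z = (z - z.mean(dim=0)) / (z.std(dim=0) + 1e-8)   # standardize each dimension
    corr = (z.T @ z) / z.shape[0]                     # (D, D) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))    # zero out the diagonal
    d = corr.shape[0]
    return (off_diag.abs().sum() / (d * (d - 1))).item()
</code></pre>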
<h4>(c) Decorrelated feature space</h4>
<p>
Instead of standardizing variance, we standardize covariance (<i>i.e.</i> feature decorrelation), which naturally mitigates the issue of dimensional collapse.
Using feature decorrelation techniques, we achieve results comparable to those of recent representative methods.
</p>
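<p>
For intuition, here is a rough sketch of batch whitening in the spirit of decorrelated batch normalization: instead of dividing by per-dimension standard deviations, the batch is multiplied by the inverse square root of its covariance matrix so that the output covariance is approximately the identity. This is an assumption-laden illustration, not the authors' implementation; grouping, running statistics, and the iterative whitening used in practice are omitted.
</p>
<pre><code>import torch

def whiten(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """ZCA-style whitening of an (N, D) feature batch: output covariance ~ identity."""
    z = z - z.mean(dim=0, keepdim=True)                        # center
    cov = (z.T @ z) / (z.shape[0] - 1)                         # (D, D) covariance
    cov = cov + eps * torch.eye(z.shape[1], device=z.device)   # numerical stability
    eigvals, eigvecs = torch.linalg.eigh(cov)                  # symmetric eigendecomposition
    inv_sqrt = eigvecs @ torch.diag(eigvals.clamp(min=eps).rsqrt()) @ eigvecs.T
    return z @ inv_sqrt                                        # decorrelated features
</code></pre>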
<div class="content project_headline">
<div class="img" style="text-align:center">
<img class="img_responsive" src="./results.png" alt="Teaser" style="margin:auto;max-width:90%">
</div>
<div class="text">
<p>Figure 3:
Top-1 accuracies (%) of DBN and Shuffled-DBN under linear evaluation with 200-epoch pretraining. For completeness and reference, we include results of some representative methods from our reproduction. For a fair comparison, we use the same projector and augmentations as described in Section 4.1 for all methods in the reproduction.
<!-- The augmentation that we manually designed following the principle of InfoMin. As can be see from the left figure, lower I<sub>NCE</sub> typically results in higher accuracy before we touch a turning point (which we might haven't touched yet). -->
</p>
</div>
</div>
</div>
</div>
</div>
</div>
<div id="download_plus_animation"></div></body></html>