Abstract
contrastive learning as dictionary look-up
Unsupervised learning trains encoders to perform dictionary look-up: an encoded "query" should be similar to its matching key and dissimilar to others.
Why can this be viewed as a "dictionary look-up"?
Because contrastive learning, during training, essentially trains an encoder to find, within a dynamic dictionary (a pool of samples {k0, k1, k2, ...}), the key that best matches a given query: the encoded "query" should be similar to one key in the dictionary (the positive) and dissimilar to all the other keys (the negatives).
So the whole process is analogous to being handed a query and then searching a "dictionary" for its positive key.
Note: the "dictionary" here means the set of representations obtained by encoding n data samples, where each representation is called a key.
Also, "dictionary" is just a generic name for this kind of memory bank; the implementation does not use a literal dict, and container structures such as a queue can be used instead.
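Concretely, the paper instantiates this look-up with the InfoNCE contrastive loss. For an encoded query $q$, its positive key $k_+$, and $K$ negative keys drawn from the dictionary, with temperature $\tau$:

$$
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}
$$

The sum runs over one positive and $K$ negative keys; intuitively this is a $(K{+}1)$-way softmax classifier that tries to classify $q$ as $k_+$.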
How does building a large and consistent dictionary on-the-fly facilitate contrastive unsupervised learning?
- large
- More samples are drawn from the continuous, high-dimensional visual space. The larger the dictionary, the richer the visual information it represents; this helps the model learn more essential features and, when contrasting against the query, more features that actually discriminate between objects. It also better reflects the distribution of the underlying dataset, which benefits generalization.
- If the dictionary is small and has few keys, the model may learn a shortcut solution and fail to generalize.
- consistent
- The keys {k0, k1, ...} in the dictionary should be produced by the same or a similar encoder.
- If the keys come from very different encoders, then when the query q does its dictionary look-up it is likely to match the key produced by the same (or a similar) encoder as the query, rather than a semantically similar key. This is another form of shortcut solution.
- Why is the dictionary dynamic, and how does the dynamic update work? That is, are the keys of a newly sampled batch positives or negatives, and when is the queue updated?
- Without updates, the representations in the dictionary never change: the keys currently being compared in the queue might have been produced by an encoder from long ago, while the newly produced positive key comes from the current encoder, leading to inconsistency (that is my understanding).
- From the official source code: each sample drawn from the Dataset goes through transform/augmentation to produce q and k, i.e. the anchor and the positive; the entries in the queue serve as the negative keys, and the loss is computed on top of that.
- How the queue is updated: the keys produced from the new samples, i.e. this iteration's positive keys, are enqueued, and the oldest ones are dequeued (see the sketch below).
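Here is a minimal PyTorch-style sketch of the per-iteration flow described above. It is an illustrative reconstruction, not the official code: `encoder_q`, `encoder_k`, the `(C, K)` queue tensor, and the temperature are assumed to be provided by the caller.

```python
import torch
import torch.nn.functional as F

def moco_iteration(encoder_q, encoder_k, x_q, x_k, queue, temperature=0.07):
    """One contrastive iteration (sketch). x_q / x_k are two augmented views of
    the same mini-batch; `queue` holds K negative keys as a (C, K) tensor."""
    q = F.normalize(encoder_q(x_q), dim=1)                  # queries        (N, C)
    with torch.no_grad():
        k = F.normalize(encoder_k(x_k), dim=1)              # positive keys  (N, C)

    # One positive logit per query; negative logits against every key in the queue.
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)    # (N, 1)
    l_neg = torch.einsum("nc,ck->nk", q, queue)             # (N, K)

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits, labels)                  # positives sit at index 0

    # Dynamic dictionary update: enqueue the new keys, drop the oldest ones.
    queue = torch.cat([k.T.detach(), queue], dim=1)[:, : queue.size(1)]
    return loss, queue
```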
My own thinking
Large: allows the model to learn more detail and therefore obtain better representations.
Consistent: within one training step, the features in the dictionary being queried should all come from an encoder with the same (or nearly the same) parameters. If not, the same negative sample passed through different encoders could be judged both positive and negative; just as in binsim, the embedding space has shifted, so running the look-up comparison against it is no longer appropriate. The encoders should therefore be kept as close as possible. There are two approaches:
- Encode every key with the same encoder as the current batch, i.e. the end-to-end approach. But we also need the dictionary, and hence the batch, to be large, and since end-to-end training must push the whole batch through the GPU, it is limited by GPU memory.
- Pre-compute the representations in the dictionary offline, so that the dictionary can be large regardless of the batch size; this is roughly the memory-bank idea, here realized with a queue as the dictionary. Say batch_size is 64 and the queue size is 65536: at the start of training the queue is pre-filled with keys computed by the same encoder; then a batch is sampled and passed through the encoder, contrastive learning is performed against the dictionary, gradients are propagated, and the key encoder is updated with momentum. The momentum update guarantees that the next batch's embeddings live in almost the same embedding space as the previous one (or the previous few), with no large drift, which preserves consistency.
Method
We present Momentum Contrast (MoCo) as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss.
maintain the dictionary as a queue of data samples & momentum-based encoder
Queue
Benefit of using a queue as the dictionary: The queue decouples the dictionary size from the mini-batch size, allowing it to be large.
Dynamic dictionary update: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued.
- Queue data structure: frees the dictionary size from the GPU-memory constraint, decoupling it from the batch size the model uses in each forward pass.
- A large dictionary (tens of thousands of keys) would mean feeding in a huge number of images; running them as a single forward-pass batch is limited by GPU memory.
- The current mini-batch is enqueued and the oldest mini-batch is dequeued. From the consistency point of view, the keys of the earliest-computed mini-batches are the most outdated, so compared with the freshly computed mini-batch they are the least consistent; removing them first is therefore beneficial. A fixed-size-buffer sketch of this queue follows below.
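A common way to realize this decoupling, in the spirit of the official implementation but simplified and with illustrative names, is a fixed-size buffer plus a write pointer: enqueuing a mini-batch of keys simply overwrites the oldest slots, so the dictionary size K is independent of the mini-batch size.

```python
import torch
import torch.nn.functional as F

class KeyQueue:
    """Fixed-size FIFO dictionary of keys, decoupled from the mini-batch size.
    Sketch only: a (C, K) buffer plus a write pointer; K can be far larger than
    anything that would fit through the encoder in one forward pass."""

    def __init__(self, dim: int, K: int = 65536):
        self.K = K
        self.buffer = F.normalize(torch.randn(dim, K), dim=0)  # random initial keys
        self.ptr = 0

    @torch.no_grad()
    def dequeue_and_enqueue(self, keys: torch.Tensor):
        # keys: (N, C); for simplicity, assume N divides K evenly.
        batch_size = keys.shape[0]
        assert self.K % batch_size == 0
        self.buffer[:, self.ptr:self.ptr + batch_size] = keys.T  # overwrite oldest slots
        self.ptr = (self.ptr + batch_size) % self.K
```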
momentum-based encoder
Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.
Question: Using a queue can make the dictionary large, but it also makes it intractable to update the key encoder by back-propagation (the gradient should propagate to all samples in the queue).
It cannot simply be left un-updated: the key encoder starts from random initialization, and the whole goal is to train a good encoder. But how should it be updated??? Requirement: the updated encoder should stay as close as possible to the previous one; the difference must not be too large!
From ChatGPT 👏:
When the dataset is very large, updating parameters by back-propagation runs into two main problems: computation and memory.
- First, back-propagation must compute the derivative of the loss with respect to every parameter in order to update it. With a large dataset, every data point influences the final weight values, so a huge amount of computation is required.
- The other problem is memory. Back-propagation needs to keep intermediate results in memory, e.g. each layer's activations and their gradients. If the dataset is very large, the memory required may exceed what the machine can provide, leading to out-of-memory failures or extremely slow computation.
One way to mitigate these problems is to use stochastic gradient descent (SGD) or other stochastic optimization methods, which use only a subset of the data points for back-propagation and parameter updates.
So we cannot obtain the updated parameters by back-propagating through the key encoder itself; what do we do instead? 🤔🤔🤔
The authors first mention a naive approach: at the end of each training iteration, simply copy the freshly updated query encoder parameters f_q into the key encoder f_k. Although this looks simple, they point out that it is not ideal: a rapidly changing key encoder reduces the consistency of the representations of all keys in the queue.
Explanation: suppose we have a queue where new elements enter from the left and old elements leave from the right. If the mini-batch size is 1, only one key is replaced at a time, so k1, k2, k3, k4, k5 were each produced by a different encoder. Because those encoders change rapidly, consistency across all keys degrades.
MoCo: with the queue, only the current mini-batch's features come from the current encoder; the earlier keys were extracted by encoders from different moments in time. To keep them consistent, the momentum encoder is proposed.
With a large momentum parameter m (0.999 by default in the paper), the key encoder updates slowly and does not depend too heavily on the current query encoder, i.e. it does not change rapidly along with it; this keeps the keys in the dictionary generated by similar encoders as far as possible, preserving feature consistency.
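In the paper's notation, only the query encoder parameters $\theta_q$ are updated by back-propagation; the key encoder parameters $\theta_k$ follow with a momentum (moving-average) update:

$$
\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q, \qquad m \in [0, 1)
$$

With m close to 1, $\theta_k$ evolves much more smoothly than $\theta_q$, so successive mini-batches of keys are encoded by nearly the same encoder.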
Pretext task: instance discrimination
Instance discrimination: a query and a key match if they come from different views of the same image, i.e., different crops.
The instance discrimination task treats every individual instance as its own class.
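As an illustration, a (query, positive key) pair can be produced simply by applying two independent random augmentations to the same image. The recipe below is a generic torchvision sketch, not the paper's exact augmentation settings.

```python
from PIL import Image
from torchvision import transforms

# Two independent random views of the same image form a (query, positive-key) pair;
# views of different images serve as negatives. The recipe is illustrative only.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

img = Image.open("example.jpg").convert("RGB")
view_q, view_k = augment(img), augment(img)   # same instance, two different crops
```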
Relations to previous mechanisms
end-to-end
The current batch itself serves as the dictionary, which guarantees very good consistency, but the dictionary size is limited by GPU memory.
Memory bank
In the memory bank approach, the new encoder generates new features for the samples at the original positions of k1, k2 and k3. These new keys (shown in blue) are then written back into positions 1, 2 and 3 of the memory bank, and that is how the memory bank gets updated. The process repeats in subsequent updates with further keys (e.g. 4, 5 and 6).
The problem is that these features are generated by encoders from different moments in time, and those encoders are updated rapidly through gradient back-propagation, so the resulting features lack consistency. Moreover, since the memory bank stores a feature for every image, we have to wait a full epoch before the whole memory bank is refreshed.
When the next epoch starts, if keys 1, 5 and 8 are selected, their features were all computed during the previous epoch, but at which exact point is no longer clear, so there are very large feature discrepancies between query and key, and between key and key. In summary, the memory bank method sacrifices feature consistency in exchange for the ability to build a large dictionary.
MoCo
queue: solves the GPU-memory limitation.
momentum-based encoder: solves the consistency problem.
Algorithm
The paper's pseudocode is extremely concise yet very clear!!!
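For reference, here is PyTorch-like pseudocode in the spirit of the paper's Algorithm 1, paraphrased and simplified (f_q / f_k are the query / key encoders, queue the (C, K) dictionary, m the momentum, t the temperature); it is schematic rather than runnable.

```python
# PyTorch-like pseudocode following the spirit of the paper's Algorithm 1.
f_k.params = f_q.params                        # initialize key encoder from query encoder
for x in loader:                               # load a mini-batch of N samples
    x_q, x_k = aug(x), aug(x)                  # two randomly augmented views
    q = f_q.forward(x_q)                       # queries: (N, C)
    k = f_k.forward(x_k).detach()              # keys:    (N, C), no gradient to f_k

    l_pos = bmm(q.view(N, 1, C), k.view(N, C, 1))   # positive logits: (N, 1)
    l_neg = mm(q.view(N, C), queue.view(C, K))      # negative logits: (N, K)
    logits = cat([l_pos, l_neg], dim=1)             # (N, 1 + K)

    labels = zeros(N)                               # the positive is the 0-th class
    loss = CrossEntropyLoss(logits / t, labels)     # InfoNCE loss

    loss.backward()
    update(f_q.params)                              # SGD update of the query encoder
    f_k.params = m * f_k.params + (1 - m) * f_q.params   # momentum update of the key encoder

    enqueue(queue, k)                               # enqueue the current mini-batch's keys
    dequeue(queue)                                  # dequeue the oldest mini-batch's keys
```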
Conclusion
The original motivation behind MoCo's design: build a large dictionary so that positive and negative samples can be contrasted more effectively, providing a stable self-supervised signal with which to train the model.
Within the contrastive learning framework, the paper proposes MoCo. The core idea is to train encoders that map images into query vectors and key vectors: for image pairs that match, the query vector and key vector should be as similar as possible, and otherwise as dissimilar as possible. A single retrieval then behaves like a dictionary look-up: the query image is encoded into a query vector and compared against the key vectors in the dictionary to find the key vector most similar to it.
From the analysis above, building the dictionary that stores the key vectors is crucial to the whole contrastive learning setup. The paper summarizes the conditions a good dictionary must satisfy: large and consistent. Existing contrastive learning methods could only attend to one of these two aspects, so the paper proposes MoCo as a way of building a large and consistent dictionary.
newbing chat log
user: You are now a senior deep learning expert, and I need to understand the principles and usage of the following content in detail. Please explain in depth:Why building a large and consistent dictionary on-the-fly promotes contrastive unsupervised learning.
bing: [1]: https://dl.acm.org/doi/abs/10.1145/3539597.3570458 "Unbiased and Efficient Self-Supervised Incremental Contrastive Learning ..."
[2]: https://www.mdpi.com/1660406 "Sensors | Free Full-Text | Adaptive Contrastive Learning with Label ..."
[3]: https://authors.library.caltech.edu/120303/1/2022.11.09.515746v1.full.pdf "ConstrastivePose: A contrastive learning approach for self-supervised ..."
[4]: https://paperswithcode.com/method/moco "MoCo Explained | Papers With Code"
[5]: https://arxiv.org/abs/1911.05722 "Momentum Contrast for Unsupervised Visual Representation Learning"
[6]: https://ui.adsabs.harvard.edu/abs/2019arXiv191105722H/abstract "Momentum Contrast for Unsupervised Visual Representation Learning"
[7]: https://ieeexplore.ieee.org/document/9157636 "Momentum Contrast for Unsupervised Visual Representation Learning ..."
Contrastive unsupervised learning (CUL) is a self-supervised approach that trains encoders to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others[^1^][4]. Learning is formulated as minimizing a contrastive loss. Building a large and consistent dictionary on-the-fly promotes CUL because it increases the diversity and quality of the keys that the query can be compared with[^2^][5]. A large dictionary can cover more possible variations of the data, and a consistent dictionary can ensure that the keys are updated with the latest encoder parameters. One way to build such a dictionary is to use a queue and a moving-averaged encoder, as proposed by MoCo[^2^][5].
user: I don't understand what you mean by 'a consistent dictionary can ensure that the keys are updated with the latest encoder parameters.' Can you please explain it in more detail?
bing: [1]: https://arxiv.org/abs/1911.05722 "Momentum Contrast for Unsupervised Visual Representation Learning"
[2]: https://ieeexplore.ieee.org/document/9879059/ "Consistent Explanations by Contrastive Learning | IEEE Conference ..."
[3]: https://ieeexplore.ieee.org/document/9157636 "Momentum Contrast for Unsupervised Visual Representation Learning ..."
[4]: https://arxiv.org/abs/2302.01541 "[2302.01541] Contrastive Learning with Consistent Representations"
A consistent dictionary means that the keys are encoded with the same or similar encoder parameters as the queries[^1^][1]. This is important because contrastive learning relies on comparing the similarity between queries and keys, and if the encoder parameters change too much, the similarity measure may not be reliable. One way to achieve a consistent dictionary is to use a moving-averaged encoder, which updates its parameters slowly by averaging the current encoder parameters and the previous ones[^1^][1]. This way, the keys are updated with the latest encoder parameters, but not too drastically.
user: I don't understand what you mean by if the encoder parameters change too much, the similarity measure may not be reliable
bing: [1]: https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html "Advances in Semantic Textual Similarity – Google AI Blog"
[2]: https://www.e2enetworks.com/blog/autoencoders-for-image-similarity-a-functional-guide "Autoencoders for Image Similarity: A Functional Guide"
[3]: https://developers.google.com/machine-learning/clustering/similarity/measuring-similarity "Measuring Similarity from Embeddings | Machine Learning - Google Developers"
[4]: https://link.springer.com/article/10.1007/s11760-022-02330-z "Cross-type image similarity measurement | SpringerLink"
[5]: https://engineering.nyu.edu/mechatronics/Control_Lab/Criag/Craig_RPI/SenActinMecha/S&A_Optical_Encoders.pdf "Optical Encoders - New York University"
Encoder parameters are the weights and biases of the neural network that transforms the input data into a lower-dimensional representation, also called an embedding[^1^][2]. The similarity measure is a function that takes two embeddings and returns a number measuring their similarity[^2^][3]. For example, cosine similarity is a common similarity measure that computes the angle between two vectors. If the encoder parameters change too much, the embeddings of the same input data may also change significantly, which can affect the similarity measure. For instance, two vectors that were previously close to each other may become far apart after the encoder parameters change, resulting in a lower similarity score. This can make the contrastive learning less effective, as it relies on comparing the similarity between queries and keys. Therefore, it is important to keep the encoder parameters consistent across different batches of data.
user: Since that's the case, why do we still need to update the key encoder? Wouldn't it be better to keep it unchanged?
bing: [1]: https://hammer-wang.github.io/5cents/representation-learning/moco/ "Momentum Contrast for Unsupervised Visual Representation Learning ..."
[2]: https://sh-tsang.medium.com/review-moco-momentum-contrast-for-unsupervised-visual-representation-learning-99b590c042a9 "Review — MoCo: Momentum Contrast for Unsupervised Visual ... - Medium"
[3]: https://ai.googleblog.com/2021/06/extending-contrastive-learning-to.html "Extending Contrastive Learning to the Supervised Setting"
[4]: https://arxiv.org/abs/2207.13532 "Contrastive Masked Autoencoders are Stronger Vision Learners"
[5]: https://www.sciencedirect.com/topics/computer-science/contrastive-learning "Contrastive Learning - an overview | ScienceDirect Topics"
We still need to update the key encoder because we want it to learn from the latest data and improve its representation quality[^1^][1]. If we keep it unchanged, it may become outdated and less effective over time. However, we don't want to update it too frequently or too drastically, because that would make the keys inconsistent with the queries[^1^][1]. Therefore, we use a momentum update rule, which updates the key encoder slowly by averaging the current encoder parameters and the previous ones[^1^][1]. This way, we can balance between consistency and improvement of the key encoder.
Reference
- Author: huhu
- Link: https://blog.mwwlzz.top/article/learn_moco
- Notice: this article is released under the CC BY-NC-SA 4.0 license; please credit the source when reposting.