
Few-shot learning

Published: 2022/11/14 23:14:06

Study Notes

Contents

  • Generals
    • Supervised vs. few-shot learning
    • Siamese net
    • Triplet loss
    • Cosine similarity
    • Entropy Regularisation
      • Why does the mean entropy of a probability distribution need to be small?
    • Cosine similarity + softmax
    • 1-branch vs. 2-branch
      • 1-branch
      • 2-branch
  • Process
  • Data
  • Train
  • Inner loop for MAML ProtoNet
    • gradients
  • AirDet: Few-Shot Detection without Fine-tuning for Autonomous Exploration
    • Support object
    • Why does AirDet not need fine-tuning?
      • Why fine-tune at all?
    • How does the SCS module extract multi-scale features from cross-scale relations?
    • Detection head
      • Components
      • Purpose
        • Where is the class prototype used in the head?
        • How is the class prototype obtained from multi-shot support features?
    • Shot aggregation
    • What location information do the proposals (SCS) and support features (GLR) carry?
      • How is the cross-relation built?
      • How is BBox regression done?
  • [6] Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector
    • Contributions
    • Takeaways
      • FSOD
      • Attention-Based Region Proposal Network
        • Similarity map (attention map)
  • [18] SiamRPN++
    • Contribution to AirDet
    • Challenges
      • Problem 1
      • Solution for problem 1
      • Problem 2
      • Solution for problem 2
    • Contributions

Generals

Supervised vs. few-shot learning

  • In supervised learning, the inference data is test data.
  • The supervised training data contains the categories encountered at inference.

  • In few-shot learning, the inference data is query data.
  • The few-shot training data does not necessarily contain the categories encountered at inference.
  • k-way: the support set (training data) contains k categories.
  • n-shot: every category has n samples.
  • Training set example: 10 training images of cats and 10 of dogs gives k = 2, n = 10.
  • In general, detection accuracy drops as the number of categories grows: with more classes, hitting the right target becomes more of a "needle in a haystack".
  • The key is computing similarities: learn a similarity function from a large dataset.

Siamese net

  • Uses sets of positive and negative samples.
  • Uses a binary loss.

Triplet loss

  • A triplet consists of an anchor, a positive, and a negative sample. The anchor should come from the same class as the positive.
  • The negative and positive are not compared to each other directly, feature to feature; both are compared through the anchor:
    $d^+ = \|f(x^+) - f(x^a)\|_2^2$
    $d^- = \|f(x^-) - f(x^a)\|_2^2$
  • This pulls features of the same class together and pushes different classes apart.
  • Ideally, $d^-$ increases while $d^+$ decreases.
  • A user-chosen hyperparameter $\alpha > 0$ serves as the margin.
  • If $d^- \geq d^+ + \alpha$, there is no loss (the separation is correct, exactly as it should be).
    The loss function is:
    $Loss(x^a, x^+, x^-) = \max\{0,\; d^+ + \alpha - d^-\}$
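The triplet loss above can be sketched in a few lines. A minimal NumPy illustration (the toy embeddings and margin value are invented for the example):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Squared Euclidean distances from positive/negative to the anchor
    d_pos = np.sum((positive - anchor) ** 2)   # d+
    d_neg = np.sum((negative - anchor) ** 2)   # d-
    # Hinge: zero loss once d- >= d+ + alpha
    return max(0.0, d_pos + alpha - d_neg)

a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.1, 0.0])   # same class, close to the anchor
n = np.array([1.0, 1.0])   # different class, far away
print(triplet_loss(a, p, n))  # 0.0 — margin already satisfied
```

Moving the negative closer to the anchor than d+ + α makes the loss positive, which is exactly the gradient signal that pushes classes apart.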

Cosine similarity

  • The result must lie in the range [-1, +1]. If the two vectors are not already unit-length, normalise them first.


  • The larger the angle between the two vectors, the more different they are, and the smaller the similarity (x → w): the projection of x onto w is shorter. Note that "similarity" here runs opposite to both the intuitive notion of "distance" and the angle.
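The normalise-then-dot computation above can be sketched directly. A minimal NumPy illustration (the example vectors are invented):

```python
import numpy as np

def cosine_similarity(x, w):
    # Normalise both vectors first so the result lies in [-1, +1]
    x = x / np.linalg.norm(x)
    w = w / np.linalg.norm(w)
    return float(np.dot(x, w))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 1.0  (angle 0°)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   # 0.0  (angle 90°)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0 (angle 180°)
```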

Entropy Regularisation

  • Prevents overfitting (used with Transformers out of confidence that self-attention is strong enough, so entropy regularisation is applied to the label assignment).


Why does the mean entropy of a probability distribution need to be small?

This explains why the entropy-regularisation term should be small.
Reason:
(Left figure) When the softmax probabilities are all equal, the model cannot distinguish the image, and the entropy is high.

(Right figure) When the softmax probabilities differ, with one clearly dominant value (p2), the model can distinguish the image, and the entropy is low.

  • High entropy: the distribution is spread out and chaotic.
  • Low entropy: the distribution is concentrated.
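The high- vs. low-entropy contrast above can be checked numerically. A minimal NumPy sketch (the two example distributions are invented):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    # Convention: 0 * log(0) contributes 0
    return float(-np.sum(np.where(p > 0, p * np.log(p), 0.0)))

uniform = [0.25, 0.25, 0.25, 0.25]   # model cannot tell the classes apart
peaked  = [0.01, 0.97, 0.01, 0.01]   # model is confident about p2
print(entropy(uniform))  # high (= ln 4 ≈ 1.386)
print(entropy(peaked))   # low  (≈ 0.168)
```

Minimising the mean entropy therefore pushes the model toward confident, peaked predictions.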

Cosine similarity + softmax

Improves accuracy.

1-branch vs. 2-branch

1-branch:

Representative: YOLO
Core idea: one softmax pipeline straight through
Pros: works well on large datasets
Cons: prone to overfitting on small datasets

2-branch

Representative: R-CNN
Core idea: based on a Siamese network, doing object detection via metric learning. "Feature alignment", GCNs, and Transformers are used to compute the support-query similarity.
Pros: strong generalisation by learning a class-agnostic model
Cons: support-query interactions are restricted to the detection head

Process

  1. Build the meta-learning training set from the training data (100 classes):

    • train data: classes [0, 80)
    • val data: classes [80, 90)
    • test data: classes [90, 100)
      From the 80 train-data classes, learn to recognise the 10 val-data classes as well as a small number of classes from the test data.
  2. Sample data in batches (two random draws are needed):

    • First random draw: in each iteration, randomly select a few of the 80 classes, then randomly sample some data from those selected classes — this forms the so-called support set.
    • Second random draw: as in the previous step, randomly sample a second batch from the same selected classes — this forms the so-called query set (used to evaluate the model trained on the support set).
    • This training process simulates the original few-shot setup: every batch gives the model a fresh support/query few-shot episode to learn from.
  3. Set the iterations and batches:

    • N_way: number of classes per batch
    • K_shot: number of samples per class in each batch
    • batches_per_class: since data is plentiful, estimate how many K-shot chunks each class's data can be split into: $\text{batches\_per\_class} = \frac{\text{total \# of samples in class X}}{K\_shot}$
    • iterations: $\text{iterations} = \frac{\sum \text{batches\_per\_class}}{N\_way}$
  4. Build the Prototypical Network:

    • Idea: similar to KNN — use a distance metric to discriminate, then apply $\text{softmax}(\text{distance similarity})$ to decide which class the sample is closest to.
    • A DenseNet can be used directly to extract features from the support set; the feature vectors obtained for each class are then averaged. Each "average" serves as that class's prototype.
    • Each query-set feature vector is then compared against every class prototype using the squared Euclidean distance. The resulting distances are turned into class predictions via softmax + cross-entropy.
    • The cross-entropy error trains the network, and the distance metric must be differentiable (e.g. squared Euclidean distance).
    • Classification is usually framed generatively via Bayes' rule: $p(y=c|x) = \frac{p(y=c)\,p(x|y=c)}{p(x)}$. Here Bayes is replaced by a softmax over negative distances:

    $p(y=c\mid\mathbf{x})=\text{softmax}(-d_{\varphi}(f_{\theta}(\mathbf{x}),\mathbf{v}_c))=\frac{\exp\left(-d_{\varphi}(f_{\theta}(\mathbf{x}),\mathbf{v}_c)\right)}{\sum_{c'\in\mathcal{C}}\exp\left(-d_{\varphi}(f_{\theta}(\mathbf{x}),\mathbf{v}_{c'})\right)}$

    • Finally, apply cross_entropy_loss(pred, target).
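The prototype averaging and distance-softmax classification of step 4 can be sketched as follows. A minimal NumPy illustration (the toy embeddings and episode sizes are invented for the example; in practice the features would come from the backbone network):

```python
import numpy as np

def prototypes(support_feats, support_labels, n_way):
    # Class prototype = mean of that class's support embeddings
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(n_way)])

def classify(query_feats, protos):
    # Squared Euclidean distance to every prototype, softmax over -distance
    d = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy 2-way, 2-shot episode with 2-d embeddings
feats = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(feats, labels, n_way=2)
probs = classify(np.array([[0.1, 0.1]]), protos)
print(probs.argmax())  # 0 — the query lies near the class-0 prototype
```

Since the softmax output is a valid probability distribution, cross-entropy against the query labels trains the backbone end to end.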

Data

    N_WAY = 5
    K_SHOT = 4
    support_imgs = 20 x 3 x 32 x 32
    query_imgs = 20 x 3 x 32 x 32
    train_data_loader =
    val_data_loader

Train


Inner loop for MAML ProtoNet

output_weight = (output_weight - init_weight).detach() + init_weight. While this line does not change the value of output_weight, it adds a dependency on the prototype initialization init_weight. Thus, if we call .backward on output_weight, we automatically compute the first-order gradients with respect to the prototype initialization in the original model.

    # Optimize inner-loop model on the support set
    for _ in range(self.hparams.num_inner_steps):
        # Determine loss on the support set
        loss, _, _ = self.run_model(local_model, output_weight, output_bias,
                                    support_imgs, support_labels)
        # Calculate gradients and perform inner-loop update
        loss.backward()
        local_optim.step()
        # Update output layer via SGD
        output_weight.data -= self.hparams.lr_output * output_weight.grad
        output_bias.data -= self.hparams.lr_output * output_bias.grad
        # Reset gradients
        local_optim.zero_grad()
        output_weight.grad.fill_(0)
        output_bias.grad.fill_(0)

At every inner step, the weight and bias gradients are reset to zero. The SGD update is: updated weight = previous weight − learning_rate × gradient of the previous weight.

gradients

Because this is multi-task detection, each support set produces its own gradients in every optimisation round; these per-task gradients must be summed to form the final gradients.

preds = F.linear(feats, output_weight, output_bias)
loss = F.cross_entropy(preds, labels)

Note that the weights here are kept distinct from the features.

When the second-order gradient is inconvenient to compute, the following trick is used:

  1. Formula: final weight = (new weight − initial weight).detach() + initial weight
  2. This estimates the second-order model with a first-order one.
  3. (new weight − initial weight).detach() is treated as the fine-tuning result.
  4. The final result must include the initial weight plus the fine-tuned delta.
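The first-order trick above can be demonstrated end to end on a toy one-parameter model. A minimal PyTorch sketch (the toy loss, learning rates, and step count are invented for the example):

```python
import torch

# Toy one-parameter "model"
init_weight = torch.tensor([1.0], requires_grad=True)

# Inner loop: plain SGD on a detached copy (first order: no graph is kept)
w = init_weight.detach().clone().requires_grad_(True)
for _ in range(3):
    support_loss = (2.0 * w - 1.0).pow(2).sum()
    grad, = torch.autograd.grad(support_loss, w)
    w = (w - 0.1 * grad).detach().requires_grad_(True)

# The trick: re-attach the fine-tuned value to the initialization.
# The value is unchanged, but gradients now flow to init_weight.
output_weight = (w - init_weight).detach() + init_weight

query_loss = (2.0 * output_weight - 1.0).pow(2).sum()
query_loss.backward()
print(init_weight.grad is not None)  # True — the init receives a gradient
```

The query-loss gradient with respect to output_weight is passed through unchanged to init_weight (the Jacobian of the re-attachment is the identity), which is exactly the first-order MAML approximation.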

AirDet: Few-Shot Detection without Fine-tuning for Autonomous Exploration

https://arxiv.org/abs/2112.01740

Support object

Template images used later to help extract information. The paper applies a class-agnostic relation with the support images in all the modules of AirDet to fully exploit the support-image information. The main processing steps belong to SCS, GLR, and PRE.

Why does AirDet not need fine-tuning?

Offline fine-tuning is quite expensive and intolerable in robotic scenarios.
Offline fine-tuning is limited when detecting small objects.
The idea comes from [6].

Why fine-tune at all?

Most current methods rely on a class-specific model design.

How does the SCS module extract multi-scale features from cross-scale relations?

Detection head

Components

Purpose

Where is the class prototype used in the head?

How is the class prototype obtained from multi-shot support features?

Shot aggregation

What location information do the proposals (SCS) and support features (GLR) carry?

How is the cross-relation built?

How is BBox regression done?


[6]Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector

Fan, Q., Zhuo, W., Tang, C.K., Tai, Y.W.: Few-Shot Object Detection
with Attention-RPN and Multi-Relation Detector. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Contributions

"We introduce a novel attention RPN and a detector with multi-relation modules to produce an accurate parsing between the support and potential boxes in the query."
The attention module replaces fine-tuning.
The early stage improves the quality of proposals.
The later stage filters out false detections against confusing backgrounds.

Takeaways

FSOD

Because few-shot performance depends heavily on the "generalization ability of the pertinent model" when presented with novel categories, data that is diverse and rich in categories is highly beneficial.

Attention-Based Region Proposal Network

The RPN must:
not only distinguish between objects and non-objects,
but also filter out negative objects not belonging to the support category.
For N-way training:
the network is extended by adding N − 1 support branches, where each branch has its own attention RPN and multi-relation detector with the query image.
For K-shot training:
all support features are obtained through the weight-shared network.
The support feature comes from the average feature across all supports belonging to the same category.
RPN advantages:

  • It uses support information to filter out most background boxes and boxes in non-matching categories.
  • Thus a smaller and more precise set of candidate proposals is generated, with high potential of containing target objects.

Similarity map (attention map)

The similarity between the support feature map and the query feature map is computed in a depth-wise cross-correlation manner and is used to build the proposal generation.
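The depth-wise cross-correlation referred to here can be sketched with grouped convolution. A minimal PyTorch illustration (the shapes and random feature maps are invented for the example; the support feature map plays the role of a per-channel kernel):

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(query_feat, support_feat):
    # query_feat:   [C, Hq, Wq] feature map of the query image
    # support_feat: [C, Hs, Ws] feature map of the support image, used as kernel
    c = query_feat.shape[0]
    x = query_feat.unsqueeze(0)    # [1, C, Hq, Wq]
    k = support_feat.unsqueeze(1)  # [C, 1, Hs, Ws]
    # groups=C correlates each channel with its own kernel (depth-wise)
    return F.conv2d(x, k, groups=c).squeeze(0)  # [C, Hq-Hs+1, Wq-Ws+1]

q = torch.randn(4, 8, 8)  # toy query feature map
s = torch.randn(4, 3, 3)  # toy support feature map
print(depthwise_xcorr(q, s).shape)  # torch.Size([4, 6, 6])
```

Because each channel is correlated independently, the output is a per-channel similarity map rather than a single scalar response, which is what the attention RPN builds its proposals on.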

The global feature can provide a good object prior for objectness classification.

The attention RPN, with its loss L_rpn, is trained jointly.


[18] SiamRPN++

https://arxiv.org/pdf/1812.11703v1.pdf

Contribution to AirDet

The channel relation is inspired by the observation that features of different classes are often stored in different channels.

Challenges

Problem 1

  • Even replacing AlexNet with a complex ResNet brought no improvement to the model.

Solution for problem 1

The problem arises from the requirement of strict translation invariance:
(1) the target may appear anywhere in the search ROI;
(2) template learning should be more diverse, achieving spatial invariance.
* Condition: only the zero-padding variant of AlexNet satisfies this.

Problem 2

In addition, an interesting phenomena is observed that objects in the same categories have high response on the same channels while responses of the rest channels are suppressed. The orthogonal property may also improve the tracking performance.

Solution for problem 2

By analyzing the Siamese network structure for cross-correlations, we find that its two network branches are highly imbalanced in terms of parameter number; therefore we further propose a depth-wise separable correlation structure which not only greatly reduces the parameter number in the target template branch, but also stabilizes the training procedure of the whole model.

Contributions:

The accuracy of Siamese trackers degrades, possibly because the "strict translation invariance" is broken.

Removing the "spatial invariance restriction" helps train ResNet-based models.

Benefiting from the ResNet architecture, a layer-wise feature aggregation structure can be built for the "cross-correlation operation", helping the tracker predict the similarity map from features learned at multiple levels.

Benefiting from the ResNet architecture, the "cross-correlation" is improved into a depth-wise separable correlation structure, producing multiple similarity maps associated with different semantic meanings.
