FUNIT

0. Preface

paper: FUNIT: Few-Shot Unsupervised Image-to-Image Translation
code: code

The full code has not been released yet; it will take some time before all of it can be downloaded.

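This paper looks somewhat similar to Zhedong Zheng's Joint Discriminative and Generative Learning for Person Re-identification. Great minds really do think alike. Actually, wait, both papers are from NVIDIA, haha.

The idea is unusual and the implementation is simple.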

1. Introduction

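Conventional GAN-based image translation methods need many images in both the source and target classes at training time (the source provides the content, the target provides the class). Here, content can be loosely understood as class-independent information such as shape, expression, attributes, or pose, while class refers to class-dependent information. In addition, test images must come from the source and target classes used during training. The proposed method, in contrast, trains only on source classes and needs just a few images of a target class at test time, so the target class can be one that was never seen during training.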

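The underlying assumption is that when humans see a new object (target class), they can infer its other poses from past experience. For example, someone who has seen cats standing and lying down can, on seeing a tiger, automatically imagine that tiger standing and lying down.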

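Question: If the model has seen cats standing and lying down, what happens when it sees a snake? A snake does not really stand or lie down. Does the translation account for this case, or would it generate a standing snake? Haha.

Question: What would happen at test time if images from the target dataset were used as content images?

Question: Are the authors hiding an assumption that the information or attributes provided by the content image are shared by all objects?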

2. Few-shot Unsupervised Image Translation

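First, a few definitions:

There are two datasets: the source dataset (source class images) and the target dataset (target class images). Training uses only the source dataset; testing uses both the source dataset and the target dataset.
The model input consists of one content image and several class images. The content image provides information such as pose; the class images provide class-related information.
During training, both the content image and the class images come from the source dataset. At test time, the content image comes from the source dataset and the class images come from the target dataset.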

The generator G takes a content image $x$ and a set of K class images $\lbrace y_1,\ldots,y_K \rbrace$ as input and produces the output image $\bar{x}$:

$\bar{x}=G(x, \lbrace y_1,\ldots,y_K \rbrace)$

Here the content image belongs to object class $c_x$ and the K class images belong to object class $c_y$. In general, K is small and $c_x \neq c_y$, and G is called a few-shot image translator. The output resembles $c_y$ in appearance and resembles the content image (class $c_x$) in pose and similar aspects.

Let $\mathbb{S}$ and $\mathbb{T}$ denote the set of source classes and the set of target classes. During training, two classes $c_x, c_y \in \mathbb{S}$ are sampled at random and their images are fed to G. At test time, the class images come from $\mathbb{T}$ and the content image comes from $\mathbb{S}$.
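The sampling rule above can be sketched in a few lines. This is only an illustration under assumptions: the `images` dictionary layout, the function name `sample_episode`, and the toy class names are all hypothetical and not taken from the paper or its code.

```python
import random

def sample_episode(images, content_classes, class_classes, k, rng):
    """Pick one content image and k class images from two different classes.

    images: dict mapping class name -> list of image ids (hypothetical layout).
    content_classes: classes the content image may come from.
    class_classes: classes the k class images may come from.
    """
    c_y = rng.choice(sorted(class_classes))
    # The content class must differ from the class that supplies the class images.
    c_x = rng.choice(sorted(c for c in content_classes if c != c_y))
    x = rng.choice(images[c_x])
    ys = rng.sample(images[c_y], k)
    return c_x, x, c_y, ys

images = {
    "cat": ["cat_0", "cat_1", "cat_2"],
    "dog": ["dog_0", "dog_1", "dog_2"],
    "tiger": ["tiger_0", "tiger_1", "tiger_2"],
}
S = {"cat", "dog"}   # source classes
T = {"tiger"}        # target classes, never used in training

rng = random.Random(0)
# Training: both the content image and the class images come from S.
train_episode = sample_episode(images, S, S, k=1, rng=rng)
# Testing: content image from S, class images from T.
test_episode = sample_episode(images, S, T, k=2, rng=rng)
```

The only difference between the two calls is where the class images are drawn from, which is the whole point of the few-shot setting.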

2.1 Few-shot Image Translator

G: generator, consists of a content encoder $E_x$, a class encoder $E_y$, and a decoder $F_x$
$E_x: x \to z_x$: class-invariant latent representation, determines the local structure
$E_y: \lbrace y_1,\ldots,y_K \rbrace \to z_y$: class-specific latent representation, controls the global look; K can be large or small
$F_x: (z_x, z_y)\to \bar{x}$
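To make the data flow concrete, here is a toy sketch of how the three parts compose. Everything in it is an assumption for illustration: random linear maps stand in for the real networks, the dimensions are arbitrary, the K per-image class codes are averaged into one $z_y$ (one simple way to accept any K), and the decoder just concatenates the two codes.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, CONTENT_DIM, CLASS_DIM = 12, 6, 4  # toy sizes, not from the paper

# Random linear maps stand in for the real convolutional networks.
W_content = rng.normal(size=(CONTENT_DIM, IMG_DIM))
W_class = rng.normal(size=(CLASS_DIM, IMG_DIM))
W_decode = rng.normal(size=(IMG_DIM, CONTENT_DIM + CLASS_DIM))

def content_encoder(x):
    """E_x: content image -> content code z_x."""
    return W_content @ x

def class_encoder(ys):
    """E_y: K class images -> one class code z_y (assumed: mean of per-image codes)."""
    return np.mean([W_class @ y for y in ys], axis=0)

def decoder(z_x, z_y):
    """F_x: (z_x, z_y) -> translated image (concatenation is a stand-in)."""
    return W_decode @ np.concatenate([z_x, z_y])

def generator(x, ys):
    """G(x, {y_1..y_K}) = F_x(E_x(x), E_y({y_1..y_K}))."""
    return decoder(content_encoder(x), class_encoder(ys))

x = rng.normal(size=IMG_DIM)
ys = [rng.normal(size=IMG_DIM) for _ in range(5)]
x_bar = generator(x, ys)       # K = 5
x_bar_1 = generator(x, ys[:1])  # K = 1 works with the same code
```

Because the class code is an average, the same generator handles any K, and the order of the class images does not matter.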

2.2 Multi-task Adversarial Discriminator

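D: discriminator, trained on multiple adversarial binary classification tasks
D takes one image as input and produces $|\mathbb{S}|$ outputs
When the input is a real image of source class $c_x$, the $c_x$-th output should be true
When the input is a fake image, i.e. a translation output $\bar{x}$ of class $c_y$, the $c_y$-th output should be false
For an image of class $c_x$, the outputs for the other classes $(\mathbb{S}\setminus \lbrace c_x \rbrace)$ are ignored, whether they say true or false
When updating G, G is penalized only if the $c_y$-th output for its translation is false, i.e. G wants that single output to be true
The authors found empirically that $|\mathbb{S}|$ binary classifiers work better than a single $|\mathbb{S}|$-class classifier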
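A minimal sketch of the "one output per class, only one of them counts" idea follows. It assumes D returns one logit per source class and uses standard binary cross-entropy terms, which is a common choice and not necessarily the exact loss in the authors' code; the function names are made up for this note.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_loss(real_logits, c_x, fake_logits, c_y):
    """Discriminator loss for one real image of class c_x and one translation of class c_y.

    *_logits: arrays of shape (|S|,), one real/fake logit per source class.
    Only the c_x-th entry (real) and the c_y-th entry (fake) are used.
    """
    p_real = sigmoid(real_logits[c_x])
    p_fake = sigmoid(fake_logits[c_y])
    return -np.log(p_real) - np.log(1.0 - p_fake)

def g_loss(fake_logits, c_y):
    """Generator loss: only the c_y-th output matters, and G wants it to say 'real'."""
    return -np.log(sigmoid(fake_logits[c_y]))

real = np.array([2.0, -1.0, 0.5])   # D outputs for a real image of class 0
fake = np.array([0.3, -2.0, 1.0])   # D outputs for a translation aimed at class 1
base = d_loss(real, 0, fake, 1)

# Changing the outputs for the other classes does not change the loss.
real_other = real.copy()
real_other[1], real_other[2] = 100.0, -100.0
same = d_loss(real_other, 0, fake, 1)
```

Since the loss indexes a single entry, the remaining outputs receive no gradient for that image, which is what "not caring" about the other classes means in practice.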

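Question: I had not thought of setting up the classifier this way, and I do not know where the authors got the idea. Multiple binary classifiers seem easier to extend than one multi-class classifier: to add a class, you can add one more binary classifier without touching the existing ones, whereas a multi-class classifier would have to be replaced entirely.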

2.3 Learning

GAN loss:

$L_{GAN}(G,D)=E_x[-\log D^{c_x}(x)]+E_{x,\lbrace y_1,\ldots,y_K \rbrace}[\log (1-D^{c_y}(\bar{x}))]$

content reconstruction loss:

$L_{R}(G)=E_x[\parallel x-G(x, \lbrace x \rbrace ) \parallel _1^1]$
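As a small sketch of this loss: the same image is used as both the content image and the single class image, and the output is compared with the input under an L1 norm. The `generator` argument is any callable with the signature `generator(x, ys)`; the two toy generators below are hypothetical stand-ins, not real models.

```python
import numpy as np

def reconstruction_loss(generator, xs):
    """Average over images of the L1 distance between x and G(x, {x})."""
    return float(np.mean([np.abs(x - generator(x, [x])).sum() for x in xs]))

x = np.array([1.0, 2.0, 3.0, 4.0])

# A generator that reproduces its content image has zero loss.
perfect = lambda x, ys: x
# A generator that shifts every pixel by 0.5 pays 0.5 per pixel.
shifted = lambda x, ys: x + 0.5

loss_perfect = reconstruction_loss(perfect, [x])
loss_shifted = reconstruction_loss(shifted, [x])
```

In other words, when the class image is the content image itself, G is asked to act as an autoencoder.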

Question: Is this loss meant to constrain $\bar{x}$ to be similar to $x$?

feature matching loss:

Let $D_f$ denote the feature extractor, i.e. D with its final classification layer removed.

$L_F(G)=E_{x,\lbrace y_1,\ldots,y_K \rbrace}[ \parallel D_f(\bar{x})-\sum _k \frac{D_f(y_k)}{K} \parallel _1^1]$
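A short sketch of the feature matching term, assuming `d_features` is any callable that maps an image to a feature vector (the stand-in below is the identity, purely for illustration): the features of the translated image are compared with the average feature of the K class images.

```python
import numpy as np

def feature_matching_loss(d_features, x_bar, ys):
    """L1 distance between D_f(x_bar) and the mean of D_f(y_k) over the K class images."""
    target = np.mean([d_features(y) for y in ys], axis=0)
    return float(np.abs(d_features(x_bar) - target).sum())

identity_features = lambda img: img  # hypothetical stand-in for D_f

ys = [np.ones(3), 3.0 * np.ones(3)]  # K = 2 class images, mean feature is 2
loss_far = feature_matching_loss(identity_features, np.zeros(3), ys)
loss_match = feature_matching_loss(identity_features, 2.0 * np.ones(3), ys)
```

Note that the subtraction happens between feature vectors, after $D_f$ has been applied to each image separately.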

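Question: In principle, the classification (adversarial) loss should already constrain $\bar{x}$ to resemble $\lbrace y_1,\ldots,y_K \rbrace$. Does adding feature matching make a noticeable difference?

Question: How is $\bar{x}$ constrained to resemble $x$ in pose and similar aspects?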

4. Experiments

Training uses $K=1$; testing uses $K=1, 5, 10, 15, 20$.

Baseline: the authors split the baselines into fair (unavailable) and unfair (available), depending on whether target class images are available during training.

Fair: StarGAN-Fair-K
Unfair: StarGAN-Unfair-K, CycleGAN-Unfair-K, UNIT-Unfair-K, MUNIT-Unfair-K

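I will not list the other GAN metrics here: first, the results are surely good, and second, I do not fully understand what these metrics mean or how hard they are to achieve.

The more source classes used in training, the better the results.

When translating across species, the results are also poor: only the color of the content image changes, and its class stays the same.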